From Wikimedia Commons, the free media repository
Jump to: navigation, search
The Wizard of Oz, 1903 (Library of Congress)


This page explains some of the technology and decisions made by for batch uploading from the Library of Congress (LoC) image collections. Other upload projects are listed at User:Fæ/Project_list.

Search tips[edit]

The collections are in individual categories, listed further down this page, but some categories may have thousands of files to surf through. Wikimedia Commons has lots of options available for smart searching, here are some practical example searches with links where following the format may help you to narrow down images of interest:

  • Search Matson Collection for West Bank
    incategory:"Matson Collection in the Library of Congress" insource:"West Bank"
  • Search all LOC uploads for TIFFs with Bethlehem in title but not in HABS as these are taken in the US
    insource:"" intitle:Bethlehem -HABS filemime:"image/tiff"
  • Search all LOC uploads for TIFFs with "boy" on the page with resolution over 36 megapixels and not in the HABS uploads
    insource:"" insource:"boy" -intitle:HABS filemime:"image/tiff" fileres:>6000
    search ('fileres' uses the square root of height times width)
  • Search all LOC uploads for TIFFs of posters with dogs or cats with an image width greater than 3,000 pixels, and not in HALS or HABS, using a regular expression search to avoid matches to "category" or URLs with "cat_" in them
    insource:"" poster insource:/ (dog|cat)[s\ \.,]/ -intitle:HABS -intitle:HALS filemime:"image/tiff" filewidth:>3000

Technology and basic options[edit]

Uploads in 2014 were run using the Glamwiki toolset, refer to Commons:Batch uploading/Library of Congress.

Uploads more recently are run using a Pywikibot script running in Python. Since 2014 Pywikibot has been released in a new version (pywikibot-core), Python itself has been released in new versions and the Library of Congress site has changed in various ways. All of these changes mean that it is not possible to release an upload tool or script that can be reliably reused a year or more after development, any upload process needs maintenance to work well, especially as Wikimedia Commons has gradual changes in templates, working practices and may have significant format changes due to further integration with Wikidata.

LoC image pages may include several links to images in jpeg and TIFF these look like: JPEG (54kb) | JPEG (155kb) | TIFF (179.8mb). The upload script finds all images and selects the largest jpeg and largest TIFF based on the filesize given on the record, not by testing the file itself.

The default behaviour is to upload a jpeg and a TIFF version. A discussion on the Village Pump established this was the best way forward to cater for the poor thumbnail rendering of TIFFs, consequently the jpeg version is better for illustration of Wikipedia articles or galleries on Commons. Though the poor rendering of TIFFs has been discussed for several years, a solution has yet to be deployed by WMF development.

Based on previous correspondence with LoC staff, it is likely that automated access to LoC servers is limited to 15 accesses/enquiries per minute. Running faster than this may trigger the IP to be blocked or throttled, in 2014 a faster run was blocked for an hour after exceeding these advised limits. It is not known whether throttle limits apply to certain address types and not others.

Code detail[edit]

Example operator's terminal view showing different messages during batch upload to aid monitoring.

Pywikibot's standard site.upload is used to upload files. This uses a default chunked uploading. Though in theory ignore_warnings can be set as an array to ignore certain types of failure, this has never worked (for me) and remains a code bug. As a work around to enable both jpegs and TIFFs to use the same root file name, the upload module is first run with ignore_warnings = False, then if a returned warning is "exists-normalized", the module automatically runs again with ignore_warnings = True and assumes there are no other errors.

A duplicate check for digitally identical copies is part of the code, this relies on the ability to create a SHA1 value based on a URL to the image file hosted by the LoC to compare with the standard Commons API SHA1 search. No attempt is made to check for existing matches to LCCN identity or similar, on the basis that there may well be existing copies on Wikimedia Commons which are inferior resolution. In 2017 the LoC appears to have blocked bot access, resulting in Error 403 (Forbidden) to a python script request to read image files to generate a SHA1 hash. A work around is to simulate a browser header on access, it's cheating a little, but uploading to Commons is part of why the LoC release the images in the first place, however the outcome is identical to running a full local download, so to limit home broadband hogging, SHA1 duplicate checks are only run for files less than 20 MB in size. Start and end times for SHA1 generation are shown in-terminal for small enough TIFFs, to help keep an eye on whether size limits are being set in a practical way. SHA1s are generated and used to check for duplicate jpegs, just not flagged as worth monitoring.

API flagged errors such as the file having been previously deleted, the same file existing but having been overwritten (such as by being cropped) or the TIFF failing the WMF server fileinfo check, are respected and should cause the upload to be skipped.

Code is run from the command line in a local terminal. A parameter in the call is a URL to the grid search for a LoC image project, for example shows the WW1 poster collection. Options include the ability to skip to an image in the batch upload sequence and to upload a single image by LCCN. Based on the setting for "co=" in the URL, a bucket category for the collection is added to every image; i.e. the code has to be customized for each planned upload batch.

BeautifulSoup is used to help process the LoC webpage. Prior to 2017, metadata was also pulled from a mods page, but these have become unavailable for unknown reasons.


From 2018 onwards, cross-linking of jpeg and TIFF versions are done using the <gallery> tag. Where previous versions or crops may exist using the same digital identities, these are searched for using the normal Commons search engine and displayed in the gallery. For example this colour poster includes an earlier black and white scan in its gallery.

If the earlier uploads use identical file names, then they will be overwritten if the largest LOC file is at least 10% larger in file size than the one currently on Commons.


Author is hard to extract from the LoC record or MODS record, with this either being unidentified or mixed in with a description rather than a separate field. Often the nearest to author is given as publisher or engraver. For these reasons a default of the collection name is placed in the Author field.

Fortunately swapping to JSON records makes contributor_names available. This virtually solves the Author problem.

IP blocks for[edit]

On the 1 February onwards LCCN queries became "forbidden", apparently based on IP blocking. Presumably the number of queries for LCCN based uploading had triggered the automated server security, or the old systems are being phased out. Uploads of NCLC images continued to run, as these rely on MARC metadata rather than LCCN. Rather than swapping IPs to work around this, a code rewrite seems timely.

There are alternatives, such as using the LC labs data services, however since first running batch uploads the main site has made available a JSON service which seems the most relevant solution with the presumption that this is not subject to low throttle limits.

As of 3 February, rather than using the sites, the more recent facility of pulling JSON versions of the records is being used. The URLs look like These are not identical to the old records, with some new metadata options and others missing or possibly renamed such as 'Abstract'.

Known errors[edit]

  • Automated uploading defaults to picking the largest jpeg and largest TIFF versions by filesize. On rare occasions this gives a poor choice as different scans may include different crops with frames, or massive borders. For example includes both an album page and a cropped image with just the photograph, and picking the largest TIFF gives the whole album page rather than the better quality crop, see LCCN00651623. Manual overwrite, or uploading as alternates when found, may be the only solution for these exceptional cases.
  • Metadata structure for "resources" is inconsistent. In the van collection this was especially apparent. Rather than following a { files: [] } format, there is a flat dictionary with URLs against some of the keys. The browser version of the JSON data does not match the version produced with Python calls. In these cases "resources" exists but contains no data, there is no known explanation for why this happens. A work around is to fall back to scraping the website to check for download links, ignoring that part of the JSON record. Example poster, it is possible that all these cases are caused by files with copyright restrictions.
  • TIFFs without matching jpegs get uploaded, these only happen when the jpeg version has already been uploaded under a different filename. Where jpegs exist with no matching TIFF, they are skipped. These are extremely rare. Example The Last supper.
  • Some apparently valid TIFFs fail tiffinfo verification when uploaded to the WMF server. This has been awaiting WMF dev investigation since January 2016. Refer to phab:T124662.
  • Some scans have no proper LOC record and have a temporary statement about being a "Digital display record". An example is LCCN2014637246. It is unknown how rare these are, hopefully very rare. This search lists those needing renaming and descriptions.
  • Outages - for periods of minutes up to days, may be offline. Sometimes the collections are unavailable with read failure errors (example), sometimes they time out to a "technical difficulties" web page or "planned outage" statement. Long term upload scripts have to detect and wait for the server to return, or ask the operator to use their human brain to decide what to do.


Year ranges are checked to calculate a maximum date value, which is presumed a terminus ante quem for copyright.

The default copyright check is basic, the licences used for uploads created after 22 January 2018 are:

max date > 1923: Skip file unless a rule for the collection applies
unknown date: {{PD-old-70-1923}}
max date >= 1890: {{PD-old-70-1923}}
max date <  1890: {{PD-old-100-1923}}

Collections with special templates or with different copyright provisions, such as US Government works which are public domain regardless of date of publication, may have custom copyright settings. These are added at the same point as when a collection category is identified based on collection short name.

The wrapper of {{PD-Art}} or similar is not applied as the objects may be both art or photographs.


Batch uploads before 2018 included:

  • Frank and Frances Carpenter Collection
  • World's Transportation Commission
  • African American Photographs Assembled for 1900 Paris Exposition
  • American 18th and 19th century cartoons
  • Sergey Prodkudin-Gorsky

Popular Graphic Arts (pga)[edit]

28,038 R Popular Graphic Arts

Based on a Village Pump request in Jan 2018, the 'pga' collection was added to the upload script.

Due to uploads moving to chunked uploads, the previous issues with uploading 100MB+ TIFFs no longer exist, and the relevant log and test options were stubbed out.

Jpeg versions as hosted by the LoC are uploaded directly, these are normally of a much lower resolution than the available TIFF. This might be suitable for housekeeping to 'upgrade' the jpeg size, if there is demand.

From this batch, the upload comment has been improved to follow the format:

Library of Congress <date> LCCN <LCCN> <image format> #<batch upload sequence number>

This tweet announced the start of the upload and was retweeted by the @wikicommons twitter stream. ✓ Done 14 February 2018.

Civil War Glass Negatives (cwp)[edit]

14,434 R Civil War Glass Negatives


A few hundred existing copies of the glass negatives in this collection exist on Commons, though it's unclear in many cases if the original source was the LoC. The estimate is between 600 to 900 images may already be uploaded, probably in jpeg formats in various resolutions. The LoC collection shows a potential of 8,710 photographs available.

Upload run started 22 January 2018. Copyright release set at {{PD-old-100-1923}}. ✓ Done

National Child Labor Committee Collection (nclc)[edit]

10,021 R National Child Labor Committee Collection

This is an example of an upload where LCCNs do not exist for the collection. Metadata mining had to swap to using the MARC record.

Filenames take the format:

<title> LOC <reproduction number>.(jpg|tif)

Upload run started 23 January 2018. Photographs go up to 1924, so those after 1922 are {{PD-USGov}}.

31 January 2018. Due to a flaw in how images were selected, it was not always the largest being uploaded. New array sorting and size checks are re-uploading where needed. Examples: LOC cph.3a27326 and LCCN00650089.

✓ Done

Carl Van Vechten (van)[edit]

Category:Photographs by Carl Van Vechten

The LOC collection has 1,395 photographs, while the collection on Commons before this upload was ~500. The total uploaded in this batch can be discovered using this search. The upload is likely to create many duplicates, however many of these will be due to earlier scans or the use of non-LOC sources which may have changed the EXIF data. Photographs use the custom license {{PD-Van Vechten}}.

1 February 2018, Upload run started. ✓ Done at least 1,353 photographs added.

Yanker Poster Collection (yan)[edit]

Category:Yanker Poster Collection

The listed collection has 1,322 images listed, though many may be considered in copyright and all those from 1978 onwards are skipped. No posters appear to have been uploaded to Commons previous to this batch upload, possibly due to the complexity of copyright.

Copyright has been justified by the Library of Congress as public domain for most cases due to publication before or during 1977 without a copyright notice. These are uploaded with a license of {{PD-US-no notice}} and on the normal PD-1923 license for any before 1923.

1 February 2018, upload run started. ✓ Done ~160 distinct posters uploaded, a small proportion due to dates and license filters.

Performing Arts Posters (var)[edit]

Category:Performing Arts Posters

Copyright is left as the default filter, so any published after 1923 will be skipped. ✓ Done ~2,043 posters uploaded.

Artist Posters (pos)[edit]

2,818 R Library of Congress Artist Posters collection

This is a large collection which includes posters in other collections. However only 2,700 are available online and many of these will be dated after 1923.

From this upload an additional field of "partof" is added to the information box. If posters are in sub-collections this will be shown, and the images will be relatively easy to re-categorize on Commons using a source search. For example LCCN2001695176 is given a link to LOC's "posters: world war i posters" collection, which can be used to move it to World War I posters in the Library of Congress.

Default copyright filter applies, so only up to 1923 publications will be uploaded. ✓ Done

PH Filing Series Photographs (ph)[edit]

Category:PH Filing Series Photographs

Though over 3,000 images are public, filtering the collection to be before 1923 reduces the numbers significantly. Of all the LOC collections, this group may have debatable quality. Age and photographer names should make all the photographs of some educational value and the LOC state they are selected for "special aesthetic, technical, or historic importance".

Uploads started 7 February 2018. ✓ Done ~480 unique photographs added.

World War I posters (wwipos)[edit]

Category: World War I posters in the Library of Congress

This is a refresh run, the last run being in 2014.

Extra rules were required as there may be doubt on PD status in some countries depending on author death dates. Though the LOC is correct in hosting these as PD in the USA, many war time materials were seized and under US (or UK) law became government property, including any intellectual property these represent. However, in the country of publication (Germany or allies) there never was a revocation of the rights of the artist and Wikimedia Commons respects the law of the country of first publication as well as US law.

A general filter to test whether partof includes wwi posters and then does a check for location. If location is empty, the image is skipped, this turns out to be the case for many items in the collection. Where location is not given, but subject includes american, location is forced to be "united states". Allowed locations include:

  • united states
    • new york
    • new jersey
    • minneapolis
    • lancaster county
    • pennsylvania
  • south america
  • australia
  • canada
  • great britain
    • england
  • italy
  • india
  • france
    • ecorse
    • brittany
    • calvados
    • aisne
    • alsace
  • africa
    • algeria
  • serbia

Uploads started 13 February 2018. ✓ Done

Gladstone Collection of African American Photographs (gld)[edit]

648 R Gladstone Collection of African American Photographs

This is a relatively small collection of ~350 photographs with some cartoons relating to slavery. Most are civil war period, all are of African Americans. ✓ Done

National Photo Company Collection (npco)[edit]

43,801 R National Photo Company Collection

This is a large collection of photographs dating between 1909 to 1932 and over 69,000 are returned by the search view. Photographs have no known copyright restriction, so can be uploaded as {{PD-old-70-1923}} for pre-1923 photographs and {{PD-US-no notice}} for on or after 1923. Where no date is discovered, the 'no notice' option is taken as a cautionary default.

Some of this collection have no links to images in the JSON records, a special fallback had to be written to scrape the displayed webpage records and use MARC index. It is unclear why the inconsistency exists, possibly because they were mistakenly treated as not visible outside of LOC. Example Roosevelt at Union Station LOC npcc.00046.jpg

Matson Collection (matpc)[edit]

43,546 R Matson Collection in the Library of Congress

Primarily photographs of Palestine up until 1946. A special template already exists {{PD-Matson}} which is added to all images. Uploaded as {{PD-old-70-1923}} for pre-1923 photographs and {{PD-US-no notice}} for on or after 1923. ✓ Done

Liljenquist Family Collection of Civil War Photographs[edit]

4,099 R Liljenquist Family Collection of Civil War Photographs

Many photographs from this collection have been previous uploaded to Commons under various categories. The collection is being added to, so updates are needed in the long term.

All photographs are dated to the American Civil War, so are all public domain. ✓ Done

Harris & Ewing Collection (hec)[edit]

39,580 R Harris & Ewing Collection

2,443 photographs had been uploaded from this collection in advance of batch uploading (though only ~160 in the main category). All uploads use {{PD-Harris-Ewing}} and the PD-old-70-1923 or PD-US-no notice releases depending on date.

Upload run start 16 February 2018.

Bain Collection (ggbain)[edit]

45,381 R George Grantham Bain Collection

The collection includes many portraits of notible early 20th Century figures, including politicians, socialites and sports winners. Around 10,000 images from the collection were uploaded before this run. 41,000 photographs are available at LOC. {{PD-Bain}} is added to all photographs.

Miscellaneous Items in High Demand (cph)[edit]

13,912 R Miscellaneous Items in High Demand, PPOC, Library of Congress

The "Miscellaneous Items" represents a variety of the Prints & Photographs Division's photographic, print, drawing, and architectural holdings. These items were singled out for description because copy photographs or digital copies were requested for a publication, exhibition, or other special project that increased demand for the pictures.

Uploads are limited to pre-1923 dated works.

Lawrence & Houseworth (lawhou)[edit]

1,591 R Lawrence & Houseworth

The images date from 1862 to 1867. Just over 900 images are in the collection. Before this upload, four images had been uploaded to Wikimedia Commons. ✓ Done


LOT 4339[edit]

121 R Japanese and Chinese portraits 1863-1877, Library of Congress LOT 4339

A series of 19th century portraits by two different photographers. Special request by Animalparty. ✓ Done


Adding more matches to galleries[edit]

The batch uploads using LCCN as the primary identifier, only search for matches at time of upload using the LCCN. A housekeeping task is checking for the older digital ID or pnp code, where in some cases versions have been uploaded to Wikimedia Commons before a LCCN was allocated by the library.

Where reasonable to do so, this housekeeping task adds galleries to matching files which were not uploaded in the batch upload project. For more information refer to User talk:Fæ#locpnp.

Collections checked:

  • pga
  • cwp
  • nclc
  • van