Commons:Batch uploading/AucklandMuseumCCBY

From Wikimedia Commons, the free media repository



March 2017 request

Noideawhatiamdoing (talk) 02:21, 7 March 2017 (UTC)

1960s replica of RAF badge, categorised under Air warfare.

General notes

There was a Wikimedian in Residence over 5 weeks in the summer when some uploads were done. AM blog post

At the time of upload, the API returned 70,789 objects marked with a CCBY license. This resulted in just over 101,000 photographs being uploaded.

Technical notes

Files are in the format File:<title> (AM <accession number>-<sequential image number>).jpg

If there is only one image for the object, then no sequential image number is used. All images are added to a gallery using the "other versions" parameter of the Artwork template. Should any one image fail to load, the gallery will not be fixed automatically.
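The naming rule above can be sketched as follows. This is a minimal illustration, not the batch script itself: the function names are hypothetical, numbering from 1 is an assumption, and any cleaning of titles for characters not allowed in file names is left out.

```python
def commons_filename(title, accession, image_number=None):
    """Build 'File:<title> (AM <accession>-<n>).jpg'.

    The '-<n>' part is omitted when image_number is None, which is how
    single-image objects are named.
    """
    suffix = f"-{image_number}" if image_number is not None else ""
    return f"File:{title} (AM {accession}{suffix}).jpg"


def filenames_for_object(title, accession, image_count):
    """Return the file names for every image of one object."""
    if image_count == 1:
        return [commons_filename(title, accession)]
    # Assumption: sequential image numbers start at 1.
    return [commons_filename(title, accession, n)
            for n in range(1, image_count + 1)]
```

The list returned by filenames_for_object is also what would populate the "other versions" gallery, which is why a failed upload of one image leaves a broken entry there.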

AM API

The API's Swagger page appears never to have been created, so API usage is suck-it-and-see. Requests to the API are specified as limited to 1000 per day, though in practice this has yet to be an issue.

The API path for discovering the URIs of the data needed is:

(search) -> (objects) -> (media or dimensions)

Though the (search) can return many objects, the object requests have to be one at a time. It is unclear if the API throttling actually counts all different types of request.

Searches are arbitrarily set to return 100 objects at a time (originally 10); each object may have many images. The upload comment shows the search page, the number of the object within that page, and the sequential number of the image for that object. This is for debugging or restarts and has no long-term value. From day to day, the API search appears to return the objects in different orders. It is unclear why this happens, but it is probably because the search returns the most recently changed catalogue entries first. This means that a restarted batch upload should be run from the first page, rather than jumping to the last used position.

The API structure is complex, with hierarchical families of relationships. For the most part these appear redundant for Commons' purposes, for example identities of curators or massive detail of material composition, when the descriptive text probably suffices.

Metadata mapping

Dates may be difficult to extract from the metadata. Field names may not be obvious and there is no dictionary to reference. For example, "period" dates that look precise may be estimates even though they are not marked as such, and the "exact" date appears to be used for the date of acquisition even when it sits against the object's "made" metadata. It is not possible to tell whether this is by design or whether practical usage has ended up mapping things differently from the literal descriptions.

Questions may arise from the date parameter including several dates, such as "20th century; 1935-1949; 1996". In these cases the metadata is picking up different ways of describing the object's creation period, plus the year the object was donated to the museum, which will always be the most recent year given. For copyright purposes, the earliest definitive date is the most useful and the most accurate.
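As a rough illustration of picking the earliest definitive date from such a string, the sketch below treats any explicit four-digit year as "definitive" and ignores centuries and decades. That interpretation, and the function name, are assumptions made for the example; the batch upload may have handled dates differently.

```python
import re

# A four-digit year between 1000 and 2099 that is not part of a longer
# number and is not a decade such as "1850s".
YEAR = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d|s)")


def earliest_year(date_text):
    """Return the earliest explicit year in a date string, or None."""
    years = [int(match.group(1)) for match in YEAR.finditer(date_text)]
    return min(years) if years else None


print(earliest_year("20th century; 1935-1949; 1996"))
```

For the example string this picks 1935, skipping both the century label and the 1996 donation year.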

Description is based on the object's content field. Where notes exist against the object, these are added separately, except that a note whose text exactly matches the content text is skipped.
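A minimal sketch of that rule, assuming the content and the notes have already been pulled out of the object record as plain strings (the function name is hypothetical):

```python
def description_parts(content, notes):
    """Return the content followed by each note that adds something new.

    A note is skipped only when its text exactly matches the content;
    no fuzzy or partial matching is attempted.
    """
    parts = [content]
    for note in notes:
        if note != content:
            parts.append(note)
    return parts
```

How the parts are then laid out in the Artwork template is not covered here.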

Dimensions were added late in the batch run. These have to be detected as a property of the object, then a dictionary of URIs has to be interrogated. This may add to the API throttle count. Dimensions look like (type, value) where type can be anything like {length, height, note, string of multiple dimensions}.

WW2 period tin of honey, reused to store buttons. By analysis of the government established production company, the design is pre-1946 and so public domain.
1980s Anti-nuclear movement badge, unknown designer. Being simple geometric shapes this is likely to be too simple to raise copyright issues.

Should any question of copyright arise, such as when photographs contain other copyright statements (example), it can be confirmed that all files in this batch upload were found by searching the API with the parameter &q=copyright:CC, as discussed in Auckland Museum's API guidelines. This only returns objects marked by the museum with a license of "CCBY". Example search link includes "copyright":["© Auckland Museum CC BY"] against each returned object. The copyright metadata is returned per object in the search results and is not present in the object metadata. For each media item under the object, however, the rights can be checked later under that media URI. The request header must set "accept" to "application/json" to get the metadata rather than the image; free plug-ins are available for most browsers to do this.

Example of the rights metadata returned for a media URI:

"am:record_score" : [ {
    "type" : "int",
    "value" : "0"
  } ],
  "rdf:type" : [ {
    "type" : "uri",
    "value" : "ecrm:E30_Rights"
  } ],
  "rdf:value" : [ {
    "type" : "string",
    "value" : "© Auckland Museum CC BY"
  } ]
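The same check can be scripted instead of using a browser plug-in. The sketch below is an illustration only: it prepares a request with the "accept" header set and reads the rights string out of a record shaped like the fragment above. Where that fragment sits inside the full response is not shown here, so the caller is assumed to have already located the dictionary holding "rdf:value"; the function names are hypothetical.

```python
import urllib.request


def metadata_request(media_uri):
    """Prepare (but do not send) a request asking for JSON, not the image."""
    return urllib.request.Request(
        media_uri, headers={"Accept": "application/json"}
    )


def rights_statements(record):
    """Collect the string values listed under 'rdf:value' in one record."""
    return [
        item["value"]
        for item in record.get("rdf:value", [])
        if item.get("type") == "string"
    ]
```

Passing the prepared request to urllib.request.urlopen and decoding the body with json.loads would give the data to search for such records.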

Some objects returned include modern era works, such as lacework, toys or clothing. Though the photograph is CC-BY, it is possible that some of these will have sufficient artwork content to be counted as non-utilitarian objects and potentially be copyrightable, at which point copyright can often be determined by age. Relevant guidelines are Threshold of originality and Copyright rules by subject matter.

Most modern objects appear to be mass-produced designs rather than drawings or other art. It is presumed that files selected by the museum do not represent any obvious copyright issue, including photographs of objects from the 1960s or later. Initial uploads show that some images will be problematic under Commons' policies, for example modern political badges with drawings, or product labels later than the 1940s with potentially copyrighted imagery. As these appear to make up less than 1% of uploads and are not obvious from searching metadata at source, they can be weeded out as housekeeping. Uploading the files to Commons poses no risk to the uploader, host or reuser: relying on the demonstrably good accuracy of prior publication by the museum, backed up by both volunteer manual review and a procedure for responding to take down requests, meets the legal sense of having taken reasonable precaution to respect potential copyright claims. Naturally, this explanation is not legal advice.

Copyright housekeeping

A retrospective (slow) housekeeping job is checking specific image rights (rather than just the object rights), taking action on those that do not explicitly match "CC BY". The rights statements are not of a predictable format, as the entry appears to be free text and may include typos. Actions are:

  1. Add Category:Images from Auckland Museum marked as copyright undetermined to files with rights of "Copyright undetermined - untraced rights owner" or similar.
  2. Add Category:Images from Auckland Museum marked with cultural permissions to files marked with "Cultural Permissions".
  3. Add Category:Images from Auckland Museum marked with All Rights Reserved to files marked "All Rights Reserved".

In theory this extra copyright check could happen at upload, but the potential discrepancy between the object level metadata being CC BY while individual photographs of the object may have a different copyright was raised late in the upload. The numbers involved appear to be under 1% and most of those still have an appropriate copyright release for Wikimedia Commons, so the batch upload is completing in a consistent way, then checks are run as housekeeping, possibly raising non-controversial speedy deletions as needed.
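The three housekeeping actions can be sketched as a lookup from a free-text rights statement to a category. This is an illustration under stated assumptions: matching is by case-insensitive substring after collapsing whitespace, which covers "or similar" wording but not typos, and the function name is hypothetical.

```python
import re

# (pattern, category) pairs for the three actions listed above.
CATEGORY_RULES = [
    (re.compile(r"copyright undetermined", re.I),
     "Category:Images from Auckland Museum marked as copyright undetermined"),
    (re.compile(r"cultural permissions", re.I),
     "Category:Images from Auckland Museum marked with cultural permissions"),
    (re.compile(r"all rights reserved", re.I),
     "Category:Images from Auckland Museum marked with All Rights Reserved"),
]


def housekeeping_category(rights):
    """Return the category to add for a rights statement, or None."""
    text = " ".join(rights.split())  # collapse runs of whitespace
    for pattern, category in CATEGORY_RULES:
        if pattern.search(text):
            return category
    return None
```

Statements that return None are either plain "CC BY" or something unrecognised, so anything that is not an explicit "CC BY" would still need a manual look.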

Reference deletion requests
Failure types
  • Image not retrievable, e.g. gives Error 500 (Internal server error). Image is skipped after failing. Unfortunately these are inconsistent for multiple photograph batches of the same object, which may lead to blank thumbnails in the cross-reference gallery on the image page.
  • Page not found, e.g. gives Error 400, skipped.
  • No am:accessionNumber found. The accession number is replaced with the database object number. As accession numbers always appear to contain periods, this should not lead to any confusion. The root cause is likely to be that the object is in a non-AM collection; in these cases the am:creditLine gives more detail on alternate collection reference numbers.
  • Blank images exist. These are detected and skipped if the object has one image and the image is 800x800 (the default blank card size).
  • Corrupt downloads. Because a header field needs to be passed to the API with any image download request, a custom Python method of opening the file has been used, which is less reliable for larger binary files than Python's urlretrieve function. In a handful of cases this unexpectedly led to partly corrupted downloads, missing some of the last sets of pixels. This might have been an intermittent issue with a poor home internet connection. An additional test which "looks" at the last row of the downloaded image is used (relying on PIL modules), and the download is reattempted a few times, which should stop this recurring.
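The retry-on-truncation idea can be illustrated without PIL. The sketch below is not the last-row pixel test described above; it swaps in a cruder check that the downloaded bytes begin with the JPEG start-of-image marker and end with the end-of-image marker. The fetch argument is a hypothetical callable standing in for whatever performs the download with the required header.

```python
def looks_complete_jpeg(data):
    """True if the bytes start with the SOI marker and end with EOI."""
    return data.startswith(b"\xff\xd8") and data.endswith(b"\xff\xd9")


def download_with_retries(fetch, attempts=3):
    """Call fetch() until it returns a complete-looking JPEG.

    Returns the image bytes, or None if every attempt looked truncated.
    """
    for _ in range(attempts):
        data = fetch()
        if looks_complete_jpeg(data):
            return data
    return None
```

Some valid JPEG files carry extra bytes after the end marker, so this check can reject a good download; the cost is only a wasted retry, but it is one reason a pixel-level test is the more robust choice.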

Because the failure types vary, no attempt is made to work out which images will fail to upload before the predicted galleries are added to image pages. Instead, a housekeeping process goes through the collection looking for broken links. When a broken link is in a gallery and the image was uploaded at least a day ago, the link is removed. Example diff. Invariably these missing images arise because the catalogue entry in the API lists them as media but the URL returns an Internal server error. The associated JSON is returned correctly, indicating that the images are missing from the source database rather than the error being due to a server outage or other glitch.

Also see #Copyright housekeeping.

Opera glasses, unknown date, acquired 1955. Automatically categorised in Personal artifacts.

The main/bucket category is Category:Images from Auckland Museum where the institution template is displayed and sub-categories are added.

Type category

New categories are created based on the object type. Examples:

Some of these may turn out to be not great, but it's a starting point.

Date category

Where the date fields include "19th century" or some other match to the regular expression "\d?\d(st|th) century", then a century category of the format Category:19th century in Auckland Museum is added. As the date fields are flexibly used, this will miss many items with periods or ranges, such as "George V (1910 - 1936)/House of Windsor/English reign".

When a date field matches a decade, like 1850s, or another match to "\d{4}s", then a decade category of the form Category:1850s in Auckland Museum is added. These are initially placed in the top category, but should be manually moved under the century as they arise.
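Both rules can be sketched with the two regular expressions quoted above. The function name is hypothetical, and the sketch only builds category names; it does not create or place the categories.

```python
import re

CENTURY = re.compile(r"\d?\d(st|th) century")
DECADE = re.compile(r"\d{4}s")


def date_categories(date_text):
    """Return the century and decade categories implied by a date field."""
    found = [m.group(0) for m in CENTURY.finditer(date_text)]
    found += [m.group(0) for m in DECADE.finditer(date_text)]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return [f"Category:{label} in Auckland Museum"
            for label in dict.fromkeys(found)]
```

Note that the century pattern as quoted only accepts "st" and "th", so labels such as "2nd century" or "3rd century" would not be matched.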


Assigned to:
Progress:
  • 24 Nov 2017 Started
  • 26 Nov 2017 Metadata usage improved, such as date categories, credit line and dimensions. Not retrospective.
  • 6 Dec 2017 Housekeeping of broken links in galleries created.
  • 7 Dec 2017 Changed to 100 objects per page rather than 10, speeding up restarts after recent outage problems with WMF servers.
  • 10 Dec 2017
    • Where a value is set for am:onDisplayFlag, this is taken as the current display location and shown under exhibition history.
    • Restarts now jump to a results page rather than iterating through them. This presumes returned pages are always the expected perpage value.
  • 17 Jan 2018 Upload run complete. There are remaining doubts about whether some images are individually marked with an NC restriction that is not apparent from the object-level license; this may require later housekeeping.
    • Housekeeping based on rights metadata started, but paused while a better re-run of the upload completes.
  • 2019 October
    • Refresh run per Village Pump discussion, after significant numbers of new images were released by the museum on CCBY licenses.
Bot name: NA
Category: Images from Auckland Museum

Specific search