User talk:Fæ/Städel museum

From Wikimedia Commons, the free media repository
High resolution photograph of a portrait of Anna von Holzhausen, 1535. 3,176 × 4,303 pixels, 19 MB.
Very high resolution photograph of the Torgauer Altar, 1509. 20,314 × 11,811 pixels, 288 MB.

Introduction

This batch upload is for all photographs of artworks supplied with suitable copyright from https://sammlung.staedelmuseum.de/en (the digital collection of the Städel). These are high resolution images in a lossless png format. The "Staedel Museum" is also referred to as the Städelsches Kunstinstitut und Städtische Galerie. See Village Pump notice.

The Museum's press release (August 2020) states that 22,000 artworks have free access under a CC BY-SA 4.0 license. The number available to upload may be significantly lower, as some of the records relate to videos, which will not be used, or to other more complex entries which are not for an individual work. The initial presumption is that only files given in 'png' format will be of interest, and in all these cases the record contains a title.

The Museum catalogue is available in German and English; both languages are used to create the Commons image pages.

Example searches:

Configuration

A credit template of {{Städel museum}} is applied for images in this project. Example usage:

{{Städel museum|rights = CC BY-SA 4.0|link = https://creativecommons.org/licenses/by-sa/4.0/}}

The naming convention applied is:

<title> (SM <accession number (Objektnummer)>).png

The title is the English version provided in the metadata. In many cases the "English" version is actually identical to the German title. To avoid filenames hitting the 255 character limit, titles are truncated to the nearest whole word under a maximum length of 190 characters. This may result in some oddly abrupt sentence endings, but most titles are well under this limit.
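The naming rule can be sketched as follows. This is a minimal illustration, not the code used in the upload run: the function name `build_filename`, the whitespace clean-up and the stripping of trailing punctuation are assumptions; only the filename pattern and the 190 character limit come from the description above.

```python
def build_filename(title, accession, max_title_len=190):
    """Return '<title> (SM <accession>).png', truncating long titles at a word boundary."""
    title = " ".join(title.split())  # collapse runs of whitespace (assumed clean-up step)
    if len(title) > max_title_len:
        cut = title[:max_title_len]
        # If the cut falls inside a word, drop the partial trailing word.
        if title[max_title_len] != " " and " " in cut:
            cut = cut.rsplit(" ", 1)[0]
        title = cut.rstrip(" ,;:-")  # tidy dangling punctuation (assumed)
    return "{} (SM {}).png".format(title, accession)
```

For a short title the function only applies the pattern, so `build_filename("Portrait of Anna", "1234")` gives `Portrait of Anna (SM 1234).png`.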

As lido data has no specific description field, the set of "subject concepts" is used as a suitably descriptive substitute, even though it is presented as a list. Depicted persons and places are also added where available. The artist/creator is checked via the http://d-nb.info/ database where this link is supplied. If the link does not resolve, the default descriptive text is used instead; mostly, however, it resolves to a handy set of identities, including a Wikidata link, which is then double-checked to see whether a Commons Creator template has been made. This seems to be mostly working out.

All uploads use the {{Artwork}} template. Where available in the metadata, the type of work, materials, people depicted, places depicted, subject concepts and notes about inscriptions are added.

Sets of photographs

On the website, some entries are shown as multiple photographs, such as the back of a painting. This is not currently addressed in the upload run and only the primary photograph will be handled.

Accession numbers

The curator's choices of museum accession numbers can be used to help identify an object series or other types of connections between artworks. Examples

As well as being displayed next to each other in the bucket category, thanks to sorting by accession number, these can be listed using narrow title searches:

Categories

All images uploaded in this project are added to Category:Images from Städel archives, where they are sorted by accession number. The parent category Städel already has a useful breakdown of sub-categories, including artwork types and dates. For this reason only the 'bucket' category is certain to be added; later housekeeping by volunteers should be able to easily copy files into the existing hierarchy using the extensive image page metadata, in particular subject concepts using the museum's glossary.

Where a Wikidata entry is found for the materials section, the following category mapping is applied:

		# Fragment from the upload script; needs "import re" at module level.
		# Map Wikidata item IDs found in the materials section to Commons categories.
		if re.search('Q3305213', wikidata):
			cats.append(u"Paintings in the Städel")
		if re.search('Q93184', wikidata):
			cats.append(u"Drawings in the Städel")
		if re.search('Q11633', wikidata):
			cats.append(u"Photographs in the Städel")
		if re.search('Q860861', wikidata):
			cats.append(u"Sculptures in the Städel")
		if re.search('Q11060274', wikidata):
			cats.append(u"Prints in the Städel")
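A table-driven variant of the same mapping might look like the sketch below. It is not the project's code: the names `QID_TO_CATEGORY` and `categories_for` are hypothetical, and matching whole item IDs (so that a longer ID such as Q931840 does not trigger the Q93184 rule) is an assumption about the intended behaviour, since the fragment above matches substrings.

```python
import re

# Wikidata item ID -> Commons category, copied from the mapping above.
QID_TO_CATEGORY = {
    "Q3305213": "Paintings in the Städel",
    "Q93184": "Drawings in the Städel",
    "Q11633": "Photographs in the Städel",
    "Q860861": "Sculptures in the Städel",
    "Q11060274": "Prints in the Städel",
}

def categories_for(wikidata):
    """Return the categories for every mapped item ID present as a whole ID."""
    # The greedy \d+ captures complete IDs, so Q931840 is never read as Q93184.
    found = set(re.findall(r"Q\d+", wikidata))
    return [cat for qid, cat in QID_TO_CATEGORY.items() if qid in found]
```

Adding a further artwork type then means adding one dictionary entry rather than another `if` block.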

Copyright

Photo of a photo of a 3D work (a statue). Both the original publication and the statue shown are out of copyright by age.

At upload, files are automatically checked as having a CC BY-SA 4.0 release with any other type of license statement rejected. This is the default license used by the museum, matching the 2020 press release.

The automatic copyright check is done with rights == "CC BY-SA 4.0" in the Python code; anything else is skipped. The rights statement is extracted by parsing the lido format record with lido.find('lido:rightstype').find('lido:term').string. Although all the works are expected to be released under this license, the check is still performed before each upload to provide additional confidence in the process.
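The check can be illustrated with a small self-contained sketch. To stay within the standard library it uses xml.etree.ElementTree rather than the BeautifulSoup call quoted above, so element names keep their original case (lido:rightsType). The namespace URI, the trimmed sample record and the function names are assumptions made for illustration, not taken from the museum's data.

```python
import xml.etree.ElementTree as ET

LIDO_NS = {"lido": "http://www.lido-schema.org"}  # assumed LIDO namespace URI
ACCEPTED_RIGHTS = "CC BY-SA 4.0"

# Hypothetical, heavily trimmed record used only to exercise the functions.
SAMPLE = """<lido:lido xmlns:lido="http://www.lido-schema.org">
  <lido:rightsResource>
    <lido:rightsType><lido:term>CC BY-SA 4.0</lido:term></lido:rightsType>
  </lido:rightsResource>
</lido:lido>"""

def extract_rights(lido_xml):
    """Return the text of the first lido:rightsType/lido:term, or None if absent."""
    root = ET.fromstring(lido_xml)
    term = root.find(".//lido:rightsType/lido:term", LIDO_NS)
    if term is None or term.text is None:
        return None
    return term.text.strip()

def may_upload(lido_xml):
    """Accept only an exact CC BY-SA 4.0 statement; anything else is skipped."""
    return extract_rights(lido_xml) == ACCEPTED_RIGHTS
```

A record with a missing or different rights term fails the comparison and would be skipped, which matches the reject-by-default behaviour described above.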

Legally, there is no impediment to volunteers changing these to public domain alternative licenses if the photograph is equivalent to a simple 'scan' or faithful two-dimensional reproduction. Templates of potential use are {{PD-Scan}} and {{PD-Art}}, which explain in detail why the photograph can be reused as public domain in the USA. However, images such as the triptych displayed on this page are of artworks which are not fully two-dimensional, as the photograph includes the hinges and frame. Consequently the photographer's moral right of attribution, in this case for the Staedel Museum, should be respected, and it remains legally meaningful in cases where 3D elements are in the photograph.

Where public domain licenses are applied later by volunteers, it is a convention for GLAM related photographs on Commons to retain information about the original copyright release, and to retain information about the source and credit details so that reusers are aware of the context and can retain original source information if published elsewhere.

The metadata, such as the descriptions, pulled via the OAI (Open Archives Initiative) is stated to be released as CC0; consequently no attribution is required and there is no limit on reuse.

Technical

This is a fairly challenging upload project because different types of queries have to be mixed together. The metadata is in Lightweight Information Describing Objects (lido) format, an XML schema which requires special parsing, while the records are indexed in a separate OAI format which does not use lido. In addition, downloads are restricted to active browser sessions that pass back tokens, and these tokens cannot be deduced purely from the OAI/lido records, so the download must be done through an open browser rather than headless Python HTTP requests.

The upload is run using Pywikibot, which queries the museum's metadata catalogue to get lido format records for each object, then opens a download session by visually clicking the download button using Selenium with an instance of Firefox to save the PNG file locally, and then uploads it to Commons. The passive metadata requests use BeautifulSoup to parse the XML; although the OAI responses can be parsed with the standard 'lxml' parser, the lido pages must be read using 'html.parser' for the schema to work.

Once the file is downloaded locally, the upload is spawned into its own parallel processing thread, so that the next download does not have to wait. In practice this means that two or more files are uploading at the same time. The complex upload/download process and the PNG sizes add significantly to overall processing time, which ranges from around four minutes to over twenty minutes for a largish file. In addition, background tasks on the ancient ThinkPad X220 being used have to be paused when it is needed for videoconferencing or similar. Though the project uploads started mid-September, the run may not be completed until December 2020 (around 70 days of run time).

There is an alternate dataset in Dublin Core format, but it is significantly more limited. See this example record in lido format; you may need to view the source, depending on your browser. In theory Staedel requires you to set up an account before accessing the lido data, but this is only enforced by not showing the links to the 'sammlung' subdomain on the main website; in practice it is all public without a login.
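The overlap between downloading and uploading can be sketched with the standard threading module. The `download` and `upload` functions below are hypothetical stand-ins that only sleep; they are not the Selenium or Pywikibot calls used in the run, and the structure of `run_batch` is an assumption about how such a loop could be arranged.

```python
import threading
import time

def download(record_id):
    """Hypothetical stand-in for the browser-driven PNG download."""
    time.sleep(0.01)
    return "{}.png".format(record_id)

def upload(path, done):
    """Hypothetical stand-in for the Commons upload; records the finished path."""
    time.sleep(0.05)
    done.append(path)

def run_batch(record_ids):
    """Download sequentially, handing each finished file to its own upload thread."""
    done, threads = [], []
    for record_id in record_ids:
        path = download(record_id)  # downloads stay one at a time
        worker = threading.Thread(target=upload, args=(path, done))
        worker.start()              # upload continues while the next download runs
        threads.append(worker)
    for worker in threads:
        worker.join()               # wait for any uploads still in flight
    return done
```

Because each upload outlasts a download in this toy timing, several upload threads are alive at once, which mirrors the "two or more files uploading at the same time" behaviour. The order of the returned list is not guaranteed.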

Bugs, known errors, trip hazards

  • Browser prompts: For unknown reasons Selenium's instance of Firefox does not successfully set options with set_preference("browser.helperApps.neverAsk.saveToDisk", <mimes>). The work-around is to manually set the automatic download checkbox every time the driver is launched, which by design is once per batch run.
  • Source choices: The lido schema used provides download links in lido:resourceSet. Initially it was presumed that only one download link would exist, but by the time the 800th+ image was checked, records had appeared with a first link to an extra-large thumbnail image/url (1024px by something) in addition to the tokenized full size PNG. Consequently the code now calls lido.find_all('lido:linkresource') rather than taking just the first match, and then checks those links for a suitable download link.
  • Conception or production: A 'conception' event, recording when the artist made the work, is often included by default, but if that does not exist a 'production' event may include the artist name (example).
  • Popups: Selenium may not always detect the download popup (dsRemoteLayer). This may be worked around by manually clicking on the download icon, or by restarting the session, as it may be a slow browser refresh issue.
  • Naming local files: During the first ~1,500 uploads there was potential for the wrong downloaded image to be picked up by the uploading thread. This was a coding bug due to the differences between file download names and how the museum accession number is represented, for example sg_1734v_z versus sg1734vz, and by default any PNG matching "sg.*" was selected. The fix was to check timestamps more carefully, so that only the most recent file is added to the queue. Consequently some files like SM sg1734vz have required manual re-uploading, which in turn means that a re-run is then free to upload the original image with its correct text and a correct filename without running into duplicate errors from the API.
  • 502 Bad Gateway: This internet glitch has been seen and is not currently handled (there are lots of ways connections can drop out). Instead the image is skipped, but it can be picked up again during a re-run, presuming the error is unrelated to the specific image.
  • application/x-msdownload: A few stashfailed errors occur on PNG upload attempts. The attempts are repeated several times before abandoning the upload thread. It is unknown what it is about these images that triggers the detection of a "dangerous file type". As this is rarely seen, it is unlikely to be fixed, but it is a permanent issue for those files.
  • Duplicates at source: These seem rare but exist, for example 15699z and 15699vz are digitally identical. They get rejected by the Commons API after being uploaded to the WMF server.
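The local file naming problem described in the list above can be illustrated with a short sketch. It assumes that a download name and an accession number are equal once punctuation and case are ignored (as with sg_1734v_z and sg1734vz) and that the newest modification time identifies the right file. Both function names are hypothetical and this is not the project's actual fix.

```python
import os
import re

def normalise(name):
    """Lower-case a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def newest_matching_png(folder, accession):
    """Return the most recently modified PNG matching the accession number, or None."""
    wanted = normalise(accession)
    candidates = []
    for entry in os.listdir(folder):
        stem, ext = os.path.splitext(entry)
        # Compare the whole normalised stem, not a loose pattern such as "sg.*".
        if ext.lower() == ".png" and normalise(stem) == wanted:
            path = os.path.join(folder, entry)
            candidates.append((os.path.getmtime(path), path))
    return max(candidates)[1] if candidates else None
```

Matching on the full normalised name rather than a prefix keeps a file for a neighbouring accession number out of the upload queue, even if it was downloaded more recently.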

Copyright amendment

Hi Fæ, I am glad that you discovered the Digital Collection and took on this project. Since June 2021 the new copyright law (https://www.gesetze-im-internet.de/englisch_urhg/englisch_urhg.html) has been implemented in Germany; it transposes the EU directive for the Digital Single Market (https://eur-lex.europa.eu/eli/dir/2019/790/oj?locale=en) into German law. Under this law, § 68 puts all reproductions of paintings, drawings and prints whose artists have been deceased for at least 70 years into the public domain (https://creativecommons.org/publicdomain/mark/1.0/deed.de), which is a great chance for us to give more access to images and drop some licenses. This copyright amendment means that the works you have uploaded are now also in the public domain and the copyright statement CC BY-SA 4.0 is therefore no longer valid. Would it be possible to update the designation in the uploaded images? --Städel Museum (talk) 15:20, 27 May 2022 (UTC)

@: , @Raymond: could you do so? --Oursana (talk) 20:34, 24 October 2022 (UTC)
@Oursana This needs someone with a bot account, I think. I do not operate bots. Raymond 06:02, 25 October 2022 (UTC)
Thanks @Raymond: , do you know anyone? --Oursana (talk) 11:11, 25 October 2022 (UTC)
@Steinsplitter: , @Reinhard Kraasch: could you do this?
This is how Raymond did it for smb.
@Gnom: it can be done as above, can't it? --Oursana (talk) 11:22, 25 October 2022 (UTC)
No objections. Gnom (talk) 14:46, 25 October 2022 (UTC)
Hi Oursana, I think this just means the contributions of Städel Museum have to be modified, which can be done very well with c:VisualFileChange. I don't see that a bot is necessary. --Reinhard Kraasch (talk) 20:23, 25 October 2022 (UTC)
@Reinhard Kraasch: Can we do this together? --Oursana (talk) 00:47, 26 October 2022 (UTC)
Hello Oursana, happy to do it together at the Kontor, though not this Thursday, as that is the "Digitaler Themenstammtisch". --Reinhard Kraasch (talk) 12:40, 26 October 2022 (UTC)
Perfect, Reinhard Kraasch, next Thursday then, and thanks. Oursana (talk) 14:00, 26 October 2022 (UTC)
I obviously underestimated the sheer number of files in question; doing the job with VisualFileChange would have been a pain in the fingers. My bot worked all night, removed all CC BY-SA information from the description pages and put in the new Template:Städel museum CC instead of Template:Städel museum (I could have used the old one but wanted to see whether there are still transclusions of the old template). --Reinhard Kraasch (talk) 12:25, 4 November 2022 (UTC)
@Reinhard Kraasch: This is fantastic news, thank you so much for your effort and work, we are totally impressed and grateful. Städel Museum (talk) 16:03, 9 November 2022 (UTC)