Commons talk:GLAMwiki Toolset Project

From Wikimedia Commons, the free media repository

GLAM Upload System

I am quite curious about the GLAM Upload System and how it is going to overcome some of the challenges of a successful upload. I had some email exchanges with Valentine Charles from Europeana about it. Some of the challenges I see:

  1. Mapping from museum metadata to Commons templates. Commons has several specialized templates for different types of objects: {{Artwork}} for 2D and 3D museum objects, {{Photograph}} for historic photograph collections, {{Book}} for scans of books, etc. The GLAM Upload System should pick the best template for the job. If some templates need to be modified to accommodate GLAMToolset needs, please start discussions on those templates' talk pages.
  2. In most cases when the artwork in an image is attributed to a known artist, the image metadata will include a creator template and category. Similarly, the metadata will also include an institution template for the institution that owns or displays the artwork. One of the challenges of large uploads is to match long lists of artists or institutions with Commons templates.
  3. Commons tries to show the image metadata translated into the language of the viewer. To do that we use a bunch of "localization" templates. It might be quite a challenge to use them properly.

--Jarekt (talk) 05:42, 29 January 2013 (UTC)

Hi Jarek,

And first of all, apologies for the delay in replying! I'm quite the Wikipedia noob so didn't figure out until yesterday how to subscribe to a watchlist. BTW, if you want to try out the tool go here and send us a mail to get access. Right now it's an unstable development build but if you have some XML representing the metadata of media objects to start off from you should be able to give it a try!

Point 1: We've abandoned the idea of creating our own Cultural Heritage Object template, and the system will work with the standard templates for Artwork, Book, Musical work, and Photograph. We should probably add support for the generic information template as well, to be used for example for ethnographic recordings or films from audio-visual archives where there seems to be no "natural" template to map to.
Point 2: Yes, it's in our backlog to do this but we haven't started working on it yet and so haven't really discussed design or implementation. The database you link to will hopefully come in handy!
Point 3: Not sure if it's the same thing (?) but we are working on adding support in the tool for language-tagging fields, in the same way as in this video, where there's one description in Dutch and one in English. --DivadH (talk) 08:23, 26 February 2013 (UTC)
Thanks for the reply. I do not think I need access to the beta version of the tool yet, but thanks for the offer. I am coming more from the perspective of someone involved in two large batch uploads (here and here) and am curious about how your software might overcome some of the challenges that consumed 95% of the time during those uploads. It will also be one of the first things people ask when someone proposes to upload a batch of files. So far most of the mass uploads were done by quite experienced users. Your software has the potential to allow people not so familiar with the Commons landscape to mass upload files. However, for that to happen your software will have to act as a middleware layer, making some of the quirks of Commons invisible to users who are not Commons experts.
For example, check File:Albertus Brondgeest - Girl Standing by a Fence - WGA03242.jpg from the WGA upload, then check the same file here as it would be shown to a user accessing the file from Polish Wikipedia (or a user with Polish as preferred language). All template field names are in Polish and so is most of the metadata. Also, all the information about the artist and institution can be displayed by clicking on their templates. Ideally uploads created by your tool would do the same. That was accomplished by linking to existing creator and institution templates (or creating new ones) and by using metadata to fill Commons templates. For example, the dimension string "24.8 x 17.9 cm" was converted to the template call {{size|unit=cm|height=24.8|width=17.9}}, which displays "Height: 24.8 cm (9.8 in). Width: 17.9 cm (7 in). " to an English speaker and "hauteur: 24,8 cm. largeur: 17,9 cm. " to a French speaker. The same was done with the Medium, Inscriptions and Object history fields, and is often done with date and other fields, like here.
I can help you get familiar with those templates. Also, the templates are not set in stone. They can be modified, and many will likely go through a period of changes after programming in Lua goes live on Commons, sometime this spring, and after templates are allowed to pull metadata from Wikidata. --Jarekt (talk) 14:41, 27 February 2013 (UTC)
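The dimension conversion described in the comment above could be sketched roughly as follows. This is only a minimal illustration, not how the WGA upload or the toolset actually did it: the function name is hypothetical, and it assumes every dimension string has the form "height x width unit", as in the "24.8 x 17.9 cm" example.

```python
import re

# Assumption: dimension strings look like "<height> x <width> <unit>",
# as in the "24.8 x 17.9 cm" example from the discussion.
_DIMENSIONS = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*$",
    re.IGNORECASE,
)


def dimensions_to_size_template(text):
    """Return a {{size}} call for an 'H x W unit' string, or None if it does not match."""
    match = _DIMENSIONS.match(text)
    if match is None:
        return None  # leave unrecognised strings for manual review
    height, width, unit = match.groups()
    return "{{size|unit=%s|height=%s|width=%s}}" % (unit.lower(), height, width)


print(dimensions_to_size_template("24.8 x 17.9 cm"))
# prints {{size|unit=cm|height=24.8|width=17.9}}
```

Returning None for anything that does not match keeps odd strings out of the template call, so they can be handled by hand rather than silently mangled.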

Timeframe?

What's the timeframe for Goal 1? When do you think everything there will be up and running? /Axel Pettersson (WMSE) (talk) 08:31, 25 February 2013 (UTC)

Hi Axel, the plan is that Goal 1 should be completed in June. --DivadH (talk) 08:03, 26 February 2013 (UTC)
Update: We've now been informed that integration into Wikimedia Commons is a technical process that could take months. So while we still estimate we'll have a working first version in June actual availability in Wikimedia Commons is very difficult for me to predict. I would hope somewhere in late Summer or early Autumn. --DivadH (talk) 11:19, 12 May 2013 (UTC)

What would you suggest in the meanwhile?

Hi, we are receiving the metadata for the picture upload by Zentralbibliothek Zürich in XML format (MARC21). How would you suggest we proceed in the meanwhile, until the Goal 1 tools are available? Should we just upload a minimalist set of metadata for now, and then the full metadata at a later stage? How about the data that gets edited on Commons in between? Or is there a way that we could participate in the beta-testing of the tools so we could upload the full metadata earlier than June? Doing the mapping of the metadata fields between MARC21 and Commons ourselves wouldn't make much sense, would it? Beat Estermann (talk) 17:00, 25 March 2013 (UTC)

Hi Beat, It looks like you're already busy uploading via a scripted bot? If so it might be easiest to simply continue to do so. I know now that the integration process of the tool into Wikimedia Commons is a long and complex one. This means that the tool will not be available as early as June. My estimate now is September at the earliest.

You have been made a user of the GLAMwiki toolset on Labs so if you want to test it please go ahead. Alternatively maybe you could upload or send me sample MARC data from the Zentralbibliothek? And I could then try it out and get back to you.

Cheers, David--DivadH (talk) 11:25, 12 May 2013 (UTC)

Hi David, Thanks for your response. The "problem" with the Zentralbibliothek Zürich XML is that it is nested XML (MARC21), and as such it cannot be handled within phase 1 of the toolset development. So we are continuing with our script. In any case, Emmanuel Engelhart, who is taking care of the script, has also started to follow your project meetings, so coordination could take place there. Beat Estermann (talk) 07:01, 21 May 2013 (UTC)

Filenames

Thanks for coming and demonstrating the software at GLAM wiki last weekend. This is a wonderful and exciting project, but I have a number of observations and suggestions.

Firstly on filenames, I think you need to advise potential uploaders to read Commons:Naming conventions especially "Files can be named in any language. The name of the file should be descriptive". It should be possible for the software to be used to map a descriptor field followed by the institution's unique ID to the filename field, but as that could get rather long it would be sensible if there was an option to make that the first x bytes of the descriptor field followed by the URL or a sequential number. My worry is that some uploaders might read "should be descriptive" as "this is optional", not "this is optional for newbies and small uploads but expected for batch uploads". But the good thing is that whatever the language used by the uploading institution there is no need to translate to create filenames. Jonathan Cardy (WMUK) (talk) 15:59, 17 April 2013 (UTC)

Hi Jonathan,
We've designed the upload tool to force the user to create unique identifiers. The convention we want to encourage is to combine a human readable title or name of a work with an accession number/object identifier and by combining them create unique page names in Commons that are also descriptive. An example: http://gwtoolset.wmflabs.org/index.php/File:Bamboe_en_mus-RP-P-1956-762-RM0001.COLLECT.45979.jpg
We haven't gotten around to adding help texts to the tool but will do so, and will then make sure to reinforce the importance of this and link to the Naming conventions. Another point I'd like to make is that while the GLAMwiki toolset would make it possible for a GLAM curator to independently make a batch upload, I think best practice should always be that they do so as part of an explicit Commons partnership, with a contact from the appropriate chapter. That contact should be an experienced Commons user who can guide the GLAM concerning naming conventions, rules of categorisation, translations etc.
Cheers,
David--DivadH (talk) 11:15, 12 May 2013 (UTC)
Thanks David. Does this mean that you can include the first x bytes of a metadata field within the filename? My feeling is that some fields might otherwise be too long. One small point: the logical partner might not be from the chapter; for a specialist museum or archive it could easily be someone from the relevant WikiProject. Jonathan Cardy (WMUK) (talk) 14:00, 15 May 2013 (UTC)
Many batch upload projects adopted the "AUTHOR - TITLE - XXXNNNNNNNN.ext" naming convention, where XXX stands for the institution name acronym and NNNNN stands for the accession number or a similar ID. Usually only one author is mentioned and the title might need to be trimmed intelligently if too long, but otherwise this approach has worked well in the past. Other approaches, like starting with the institution name, do not sort well in large categories and might not show the most useful part if only a small portion of the filename is visible (under icons, etc.) --Jarekt (talk) 14:22, 15 May 2013 (UTC)
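As a rough illustration of the "AUTHOR - TITLE - XXXNNNNNNNN.ext" convention with a title trimmed to the first x bytes, something like the sketch below could work. The function names and the 100-byte default are assumptions made for the example; they are not taken from the toolset or from any Commons policy.

```python
def trim_to_bytes(text, max_bytes):
    """Trim text to at most max_bytes of UTF-8, preferring to cut at a word boundary."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character that the cut split in half.
    trimmed = encoded[:max_bytes].decode("utf-8", errors="ignore")
    head, sep, _ = trimmed.rpartition(" ")
    return (head if sep else trimmed).rstrip(" ,;:-")


def build_filename(author, title, acronym, accession, ext, max_title_bytes=100):
    """Build 'AUTHOR - TITLE - XXXNNNN.ext'; max_title_bytes is an arbitrary illustrative limit."""
    short_title = trim_to_bytes(title, max_title_bytes)
    return "%s - %s - %s%s.%s" % (author, short_title, acronym, accession, ext)


print(build_filename("Albertus Brondgeest", "Girl Standing by a Fence", "WGA", "03242", "jpg"))
# prints Albertus Brondgeest - Girl Standing by a Fence - WGA03242.jpg
```

Trimming by bytes rather than characters matters for titles in non-Latin scripts, where one character can take several bytes.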

Categorisation

Historically categorisation has been one of the more problematic areas of mass uploading. At worst existing categories get swamped with thousands of additional entries and the contents of mass uploads get underused because they are poorly categorised. One upload has even been suspended due to categorisation problems.

Commons doesn't greatly help here because, apart from the use of Latin for species names, categories are almost entirely in English, and our categorisation isn't always compatible with other people's metadata. In the longer term one possibility would be to increase the use of category redirects on Commons so that more languages could be supported, even if they simply redirected to a category in English. But for this project we need to find a way to map from the source institution's metadata to Commons categories. A few institutions such as botanical gardens, natural history museums and zoos will have an easy bridge by outputting their species field as a category, but in most cases we will need to create some sort of lookup table to map the uploader's metadata to our category structure.

The version demonstrated at GLAM wiki 2013 had similar category functionality to the default uploader, with the ability to put all images in an upload into the same set of categories (though the delimiter needs changing from a comma to something else as a lot of our existing categories contain commas). That is perfectly OK for uploads of fifty images and even sometimes somewhat larger volumes, but if we are going to upload thousands of images we need to find a way to generate categories using metadata.

I appreciate that getting hotcat style predictive texting for category names may be difficult to code, but it would be good to have, as would something that verified whether the categories existed and translated redirected ones to the target category. Jonathan Cardy (WMUK) (talk) 15:59, 17 April 2013 (UTC)


Hi Jonathan, First of all my apologies for not responding sooner. I remember well the chat we had at GLAMwiki London about categorization and we have made some changes and improvements thanks to that discussion. Maybe not as much as you would like but at least some steps in the right direction. The updates to the tool we've made are:

  1. We have on your advice removed the use of the comma as a delimiter when a user adds categories that are globally applied to the dataset being uploaded. The interaction has been changed so that the user doesn't have to enter any delimiter.
  2. We have added a way to add categories that are based on metadata values in the uploaded objects themselves. With Wikimedia Commons allowing categories in English only, its use would be limited to English-language metadata. I'm afraid doing something more ambitious in the language area is out of scope for the project given its current level of funding. Personally, I think the future solution would be one based on every Commons category having a corresponding Wikidata entity. That would allow treating categories as true concepts, with multilingual labels available for each concept.

I think a workaround for these limitations is to either pre-process the data to be uploaded or to process it post-upload via bot. An example would be if we were to upload the Rijksmuseum collections with the tool. Then we would pre-process the dataset and split it into smaller datasets each matching a category at a decent level of specificity. Examples could be: Jewellery in the Rijksmuseum, Paintings in the Rijksmuseum, Prints in the Rijksmuseum, etc. This could be done fairly simply as Rijksmuseum objects are classified (in Dutch though, so we'd need to do a translation table) at that level. As an alternative the same could be done by bot post-upload.

Cheers, David--DivadH (talk) 11:03, 12 May 2013 (UTC)
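The pre-processing workaround described above (split the dataset into smaller datasets, one per category, using a translation table) might look something like this sketch. The Dutch terms, the "object_type" field name and the function name are hypothetical placeholders; only the three category names come from the discussion.

```python
from collections import defaultdict

# Hypothetical translation table: the Dutch terms are illustrative guesses at
# source classifications; the category names are the examples given above.
CATEGORY_BY_OBJECT_TYPE = {
    "sieraad": "Jewellery in the Rijksmuseum",
    "schilderij": "Paintings in the Rijksmuseum",
    "prent": "Prints in the Rijksmuseum",
}


def split_by_category(records, type_field="object_type"):
    """Group records by Commons category; unmatched records go under None."""
    batches = defaultdict(list)
    for record in records:
        key = record.get(type_field, "").strip().lower()
        batches[CATEGORY_BY_OBJECT_TYPE.get(key)].append(record)
    return dict(batches)


records = [
    {"id": "1", "object_type": "Prent"},
    {"id": "2", "object_type": "schilderij"},
    {"id": "3", "object_type": "meubel"},  # no entry in the table
]
for category, batch in split_by_category(records).items():
    print(category, [r["id"] for r in batch])
```

Records with no match are collected under None instead of being dropped, so they can be reviewed before upload or categorised later by bot.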

Thanks David, good to hear about the delimiter and the mapping of categories. What is the test plan for this, and when will there be an opportunity to test it? As for non-English categories, I might try to find out why the limitation exists. Jonathan Cardy (WMUK) (talk) 13:13, 15 May 2013 (UTC)
You can sign up for the GLAMwiki tool here http://gwtoolset.wmflabs.org/index.php/GWToolset and test the categorisation features we have. As mentioned, advanced categorisation is not within the scope of the first year of the project, so what is there is what we have been able to do with a small investment of effort. --DivadH (talk) 15:56, 21 May 2013 (UTC)
I think that it is an illusion that you can make proper redirects for the almost 3 million categories in Commons in all languages, especially since many languages have overlapping terms (Bergen/mounts/city, grave/city, ...) . Anyway, even in one single language, for example in Google art, we see sometimes 3 to 4 different spellings for the same artist. So a pre-processing of a lookup/translation table might be the best solution. --Foroa (talk) 08:11, 16 May 2013 (UTC)
Wikidata should be very useful for this: <some word in some language> -> Qxxx -> Commons Category. Multichill (talk) 16:24, 16 May 2013 (UTC)

Format issues from NYPL maps upload

I have run into two issues running the GWT to complete the upload of Commons:Batch uploading/NYPL Maps:

  1. "; " to join fields: The upload relies on multiple additional fields being added to the standard {{information}} template (neither the map nor the artwork template is a good fit for the metadata). This means that several items are added to the "other_fields" and "other_fields1" parameters. As the GWT default is that multiple items on the same parameter are joined with the string "; ", I am getting these as a series of odd semicolons at the top of the page. See this 1700s map of Prussia. I am repairing these with a housekeeping script after upload, but this might be something to have special or optional behaviours for depending on the chosen template, as post-upload tweaking is probably poor practice. Once we allow custom templates, this might be resolved by the way ingestion templates could be designed.
  2. XML validation: I have had some difficulties in pre-parsing the XML file. The source metadata text (esp. 16th to 17th century quotations) makes use of the standalone "&" symbol, which I am changing to "&amp;", and forms of "etc" such as "&c.", which again create XML errors. The problem is not pre-parsing these (a small headache on its own) but the fact that the GWT happily runs and starts uploading files, but just does not complete the batch, with no error being reported to the user. It would be much better if XML validation failures were highlighted to the user automatically, preferably pointing to where in the source file the parsing error occurred, as tracking this down with no error messages is like playing Sherlock Holmes.

Anyway, I'm working around these problems, so this is making for an interesting case study of 20,000-ish maps. -- (talk) 08:02, 20 April 2014 (UTC)
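For the XML validation point, a pre-upload check along the following lines could both repair stray ampersands and report where a parsing error occurs. This is only a sketch using Python's standard library; it is not part of the GWT, and the function names are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# An "&" that does not start a named, decimal or hexadecimal entity reference.
_BARE_AMPERSAND = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_text):
    """Replace stray '&' with '&amp;' while leaving existing entities alone."""
    return _BARE_AMPERSAND.sub("&amp;", xml_text)


def find_xml_error(xml_text):
    """Return None if xml_text is well-formed, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return (line, column, str(err))
    return None


raw = "<maps>\n<title>Prussia & Poland, &c.</title>\n</maps>"
print(find_xml_error(raw))                          # reports an error on line 2
print(find_xml_error(escape_bare_ampersands(raw)))  # prints None
```

Running a check like this before starting a batch would surface the line and column of the first problem, which is the information that was missing when the batch silently stopped.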