Commons:Bots/Requests/SLQbot

From Wikimedia Commons, the free media repository

SLQbot (talk · contribs)

Operator: John Vandenberg (chat)

Bot's tasks for which permission is being sought: Uploading 50,000 images and adding categories to existing uploads based on community provided categories.

Automatic or manually assisted: supervised for the first three days

Edit type (e.g. Continuous, daily, one time run): continuous

Maximum edit rate (eg edits per minute): as many as allowed

Bot flag requested: (Y/N): Y

Programming language(s): Python

John Vandenberg (chat) 03:41, 17 December 2010 (UTC)

Discussion

I had a look at some of the initial uploads and I think they look quite good. The main thing I'm missing is categorization. One way to add them could be to use the current subject tags and add them in the form "SLQ subject <subject tag>". You might want to add them to Category:Black and white photographs as well. --  Docu  at 05:51, 17 December 2010 (UTC)

I'm building the capacity for SLQ dc:subject to be mapped to Commons categories on a collaboratively edited Commons wiki page which the bot periodically reads. See Commons:State Library of Queensland/Subjects
Almost all of the images are black and white; however, a few are not, and there are B&W newspaper clippings as well. Would it be OK if I added Category:Black and white photographs to everything, and let the community remove the category from the odd photo which isn't?
A lot of the images do not have author and date metadata. John Vandenberg (chat) 06:37, 17 December 2010 (UTC)
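The subject-to-category mapping described above could be sketched roughly as follows. This is only an illustration, not the bot's actual code: the dictionary contents, the category names and the function name `categories_for` are hypothetical, and the sketch assumes the wiki page has already been parsed into a plain dictionary.

```python
# Hypothetical mapping, as it might look after parsing the
# collaboratively edited subjects page (names are illustrative only).
SUBJECT_TO_CATEGORIES = {
    "verandas": ["Verandas in Queensland"],
    "ships": ["Ships"],
}


def categories_for(subjects, mapping):
    """Return (categories, unmapped) for a list of dc:subject values.

    Subjects are matched case-insensitively after trimming whitespace.
    Unmapped subjects are returned separately so that humans can add
    them to the mapping page later.
    """
    categories = []
    unmapped = []
    for subject in subjects:
        targets = mapping.get(subject.strip().lower())
        if targets is None:
            unmapped.append(subject)
            continue
        for category in targets:
            if category not in categories:  # keep order, avoid duplicates
                categories.append(category)
    return categories, unmapped
```

Returning the unmapped subjects alongside the categories is a design choice of this sketch; it makes it easy to report which subjects still need a human decision.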
I explicitly asked you not to run your bot before consulting the community. Blocked the bot pending this request. Besides the incomplete metadata, the lack of proper categories, there's also the problem that the titles of the images are bad. Multichill (talk) 10:22, 17 December 2010 (UTC)
  • I think Multichill is exaggerating. We would have asked you to do a test upload anyways. Even if the number is fairly high, it's still manageable. It's not like you uploaded 10000 images just to test as Multichill did .. --  Docu  at 10:04, 18 December 2010 (UTC)
Thanks; I will do any renaming required. It is easy to do. John Vandenberg (chat) 23:48, 18 December 2010 (UTC)
Great; I've added that as a category for all uploads. John Vandenberg (chat) 23:48, 18 December 2010 (UTC)
I have not run the bot since the initial trial, which was before your email. Please explain your concerns, and note that the lack of categories is being handled with Commons:State Library of Queensland/Subjects. John Vandenberg (chat) 10:30, 17 December 2010 (UTC)
I also think that the title is bad. The library code could be kept for reference purposes in {{Information}} or in a dedicated template for files that come from this library (like {{Fotothek-Description}}). Draft categorization is also a good idea. --EugeneZelenko (talk) 15:44, 17 December 2010 (UTC)
I am afraid I have to agree with Multichill and Eugene here. Titles are bad, and it can be a real mess to fix (renaming and delinking 350 images, needs filemover rights, etc.). Though crowdsourcing categorisation looks like an interesting idea, I am quite skeptical that absolutely no categorisation can be inferred from the 124 MB (!) of metadata supposedly available. I18n templates, such as {{Other date}}, are also missing. Jean-Fred (talk) 01:07, 18 December 2010 (UTC)
The categorisation can be inferred, however it is important that we build the mapping between SLQ subjects and Commons categories, to aid in future collaborations. I will provide more information about the titles shortly. John Vandenberg (chat) 01:11, 18 December 2010 (UTC)
i18n templates are used. Could you explain what is missing?
The bot converts the dates provided in the metadata into w:ISO 8601, which are localised according to user preference. I can use a template if that is preferable. John Vandenberg (chat) 01:17, 18 December 2010 (UTC)
[Image: SLQ batch1 title length.png]

With regard to the titles, it is a simple matter for me to append a cleansed dc:title field to the asset ID. The image on the right shows the lengths of the dc:title. IMO, the average length (47) is too long for Commons filenames, but I welcome feedback on that. A large percentage of titles will need to be shortened, either before upload or afterwards. John Vandenberg (chat) 02:02, 18 December 2010 (UTC)

Commons file names can bear fairly long titles. The limit is rather high. On the other hand, if you want to keep the naming scheme used in the test upload this isn't that big a problem, as we have other series named in a similar way. If the images already have titles, it's preferable to use them. You could append them to the current format.
For the subjects: you could map any ship name to "<shipname> (ship)", and any person's name or "lastname, firstname (Portrait)" to the form "firstname lastname". --  Docu  at 10:04, 18 December 2010 (UTC)
I would prefer to use just the asset ID as the filename.
A large percentage of the images have a dc:title that would need to be manually revised. dc:subject could be used for titles in some cases, however each image has many dc:subject, so I need to create rules to pick the best one, where possible. If the community insists on descriptive titles, I can create rules to skip images which have titles which will need to be manually revised. John Vandenberg (chat) 23:57, 18 December 2010 (UTC)
Could you please explain why these titles need to be manually revised? Multichill (talk) 00:05, 19 December 2010 (UTC)
I see that Multichill has now started a poll to prevent these images from being uploaded without descriptive titles. John Vandenberg (chat) 00:50, 19 December 2010 (UTC)
You make it sound like I started the poll because of this request. This edit triggered me. Having descriptive filenames has always been a requirement here. This poll is just a way to formalize things. Could you please explain why these titles need to be manually revised? Multichill (talk) 01:04, 19 December 2010 (UTC)
There are a lot of reasons why the provided titles would not be the best names for the files. I am not going to answer your every question before I can start uploading again. If descriptive titles are necessary, I will ensure that the bot only uploads images for which I can derive appropriate descriptive titles. John Vandenberg (chat) 01:49, 19 December 2010 (UTC)
The dc:titles are now being used where they are less than 40 characters long. A longer dc:title results in the record being skipped. John Vandenberg (chat) 10:50, 19 December 2010 (UTC)
I'm sorry I pissed you off John, that was not my intention. Unblocked the bot.
As for the title, just copy one of the title functions from https://fisheye.toolserver.org/browse/multichill/bot/ . The way I would do it: Set a max length (n for now). See if dc:title is not longer than n: "<dc:title> - StateLibQld<id>.jpg". If dc:title is longer than n, see whether you can find the first sentence ("^([^(\.\s)]+)\.\s" or something like that) and if that works, call the title function again. If that didn't work, just truncate at n and you'll get "<truncated dc:title> - StateLibQld<id>.jpg".
Is the metadata accessible somewhere online?
How do you parse the date? See for example this edit. I got this from the website you're linking to (love the handles btw). Is the date in your metadata? If so, if you can't parse it could you please just insert the unparsed date?
Could you make a list of top authors with number of occurrences? Would be nice to make {{Creator}} templates for the top authors.
I really wonder if the category approach will work. I hope it will; categorization is always a problem. Could you make a list of top subjects? That's what I did with the Tropenmuseum and that really helped to get categorization started.
Do you have the source code online somewhere?
Multichill (talk) 16:26, 18 December 2010 (UTC)
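The title procedure outlined in the comment above (use the whole dc:title if it fits, otherwise retry with the first sentence, otherwise truncate) might look something like the sketch below. It is an assumption-laden illustration rather than either bot's real code: the names `clean_title` and `make_filename`, the default limit of 40, the set of characters that get replaced, and the `StateLibQld_<id>` spelling of the suffix are all choices made for this example.

```python
import re


def clean_title(raw):
    """Replace characters that are awkward in file names and collapse
    runs of whitespace. The character set here is an assumption."""
    cleaned = re.sub(r"[#<>\[\]|{}:/]", " ", raw)
    return re.sub(r"\s+", " ", cleaned).strip()


def make_filename(dc_title, asset_id, max_len=40):
    """Build '<title> - StateLibQld_<id>.jpg' from a dc:title.

    1. If the cleaned title fits within max_len, use it as is.
    2. Otherwise try the first sentence (text before the first '. ')
       and run the same procedure on that.
    3. If there is no first sentence, truncate at max_len.
    """
    title = clean_title(dc_title)
    if len(title) > max_len:
        match = re.match(r"^(.+?)\.\s", title)
        if match:
            # The first sentence is strictly shorter, so this terminates.
            return make_filename(match.group(1), asset_id, max_len)
        title = title[:max_len].rstrip()
    return f"{title} - StateLibQld_{asset_id}.jpg"
```

Truncating in the middle of a word is the simplest fallback; a real upload run would probably want to cut at a word boundary or skip the record for manual review instead.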
I have created Commons:State Library of Queensland/Creators for creators with 10 or more images, and Commons:State Library of Queensland/Subjects/Common for subjects with more than 10 uses.
The metadata is not available. I am waiting for Tim Starling to install a new ssh key so I can check in my code.
John Vandenberg (chat) 01:35, 19 December 2010 (UTC)
Were all photos taken around Queensland? Some topics in the list link to main categories and others to Queensland-specific categories. It would be good to do them all the same way. If you can somehow make a distinction between location categories (Brisbane) and topic categories (Houses) you can use the deep intersection toy I made for Geograph to put the images in the right intersected category at upload (Houses in Brisbane). Multichill (talk) 10:28, 19 December 2010 (UTC)
A high percentage of images were taken from around Queensland. Commons doesn't have categories for the vast majority of these topical areas. That is why the bot will need to go back and re-categorise these images as new categories are created. I am currently skipping any image which does not have at least three categories in addition to "B&W photographs" (and now "portraits" as it is so common). John Vandenberg (chat) 10:44, 19 December 2010 (UTC)
I updated most of the top categories. This isn't that hard. You're currently flooding main categories. You don't want to do that. Multichill (talk) 11:09, 19 December 2010 (UTC)
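As a rough illustration of the topic/location intersection idea discussed above, the sketch below combines a topic category and a location category into "<topic> in <location>" when such a category is known to exist, and otherwise falls back to the separate categories. This is not the Geograph tool itself; the function name `intersect_categories` and the idea of passing in a pre-fetched set of existing category names are assumptions made for the example.

```python
def intersect_categories(topics, locations, existing):
    """Prefer '<topic> in <location>' categories when they exist.

    topics, locations: lists of category names (without 'Category:').
    existing: set of category names known to exist on the wiki.
    Topics and locations that could not be intersected are kept as is.
    """
    result = []
    used_topics = set()
    used_locations = set()
    for topic in topics:
        for location in locations:
            candidate = f"{topic} in {location}"
            if candidate in existing:
                result.append(candidate)
                used_topics.add(topic)
                used_locations.add(location)
    # Fall back to the plain categories for anything left over.
    result.extend(t for t in topics if t not in used_topics)
    result.extend(loc for loc in locations if loc not in used_locations)
    return result
```

For instance, with topics "Houses" and "Bridges", location "Brisbane", and only "Houses in Brisbane" existing, the result is "Houses in Brisbane" plus the plain "Bridges" category, which keeps images out of the main "Houses" category.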
Thanks for identifying the date ranges as in File:StateLibQld_1_101052.jpg. That date range is in the metadata. 'YYYY-YYYY' and 'YYYY-YY' are two date formats I was not picking up. That has now been corrected. John Vandenberg (chat) 01:54, 19 December 2010 (UTC)
And what happens with dates you can't parse now? Multichill (talk) 10:28, 19 December 2010 (UTC)
I am now skipping images which I can't parse the date for, or where the date is explicitly Undated. John Vandenberg (chat) 10:34, 19 December 2010 (UTC)
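The date handling discussed in this thread (plain years, the 'YYYY-YYYY' and 'YYYY-YY' ranges, "ca." dates, and skipping undated or unparseable records) could be sketched as below. The function name `parse_slq_date` and the tuple it returns are hypothetical, and the actual metadata may contain formats this sketch does not cover. Note that 'YYYY-YY' is ambiguous with a year-month form; following the discussion, the sketch treats it as a year range and rejects it when the end would precede the start.

```python
import re


def parse_slq_date(raw):
    """Return (start_year, end_year, circa) or None to skip the record.

    Handles 'YYYY', 'YYYY-YYYY', 'YYYY-YY', each optionally prefixed
    with 'ca.' or 'circa'. 'Undated' and anything unrecognised give None.
    """
    text = raw.strip()
    if not text or text.lower() == "undated":
        return None

    circa = False
    prefix = re.match(r"^(?:ca\.?|circa)\s*(.*)$", text, re.IGNORECASE)
    if prefix:
        circa = True
        text = prefix.group(1)

    single = re.fullmatch(r"(\d{4})", text)
    if single:
        year = int(single.group(1))
        return (year, year, circa)

    full_range = re.fullmatch(r"(\d{4})\s*-\s*(\d{4})", text)
    if full_range:
        start, end = int(full_range.group(1)), int(full_range.group(2))
        return (start, end, circa) if end >= start else None

    short_range = re.fullmatch(r"(\d{4})\s*-\s*(\d{2})", text)
    if short_range:
        start = int(short_range.group(1))
        # Reuse the century of the start year for the two-digit end.
        end = start // 100 * 100 + int(short_range.group(2))
        return (start, end, circa) if end >= start else None

    return None
```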

My thoughts:

  • Can the text stored in the {{Handle}} be localised ?
    • We can build a template for the common case, "Item is held by John Oxley Library, State Library of Queensland." John Vandenberg (chat) 11:26, 19 December 2010 (UTC)
  • « ca. 1902 » should be {{Other date|circa|1902}} → « circa 1902 »
    • ✓ Done
  • If the date is unknown then it should use {{Other date|unknown}} → « Unknown date »
    • I am currently skipping all images that have an unknown date. John Vandenberg (chat) 11:26, 19 December 2010 (UTC)
  • If the author is unknown, then it should use {{Unknown}}
    • Could we defer this until we have a larger sample to analyse? John Vandenberg (chat) 11:26, 19 December 2010 (UTC)
  • Shouldn’t the files use an institution template, namely Institution:State Library of Queensland ?
  • Shouldn't we use a Category:To be checked category ?
  • I noticed that one where the date parsing seems to have failed
  • I see the description field concatenates « Title » and « Description » − maybe that should be kept separate.
    • I expect that humans will merge them in due course. John Vandenberg (chat) 11:26, 19 December 2010 (UTC)
  • I see that some images have Geoloc information − this definitely needs to be parsed and mapped in {{Location}}
    • Not yet. The geocodes need a lot of work before they can be added.
      • This seems to work just fine. Multichill (talk) 11:45, 19 December 2010 (UTC)
        • You have a sample size of one. I am not going to add geocodes for tens of thousands of images until I have analysed the data. John Vandenberg (chat) 11:49, 19 December 2010 (UTC)
        • From the metadata that I've seen (which is not the complete set), a lot of images have a rather useless "Brisbane" coordinate set that is right in the middle of town. It's useful I suppose if you don't know where Brisbane is, but not every image is on Queen Street, outside the GPO. Some of them may be useful but there's no easy way to work that out without checking, I suppose. Lankiveil (talk) 06:25, 21 December 2010 (UTC).
  • We could expand {{Technique}} to render « Black and white photograph »
    • The images are not all of the same type. John Vandenberg (chat) 11:26, 19 December 2010 (UTC)
      • Because you don't elaborate on what you need to do a lot of work on I guessed it was the syntax. Apparently it's something else. Is the data incorrect or don't you trust it? Multichill (talk) 23:29, 19 December 2010 (UTC)
  • Consider using date information to add Category:1933 in Australia & al categories.
    • That is a good idea, however due to the code structure, I would like to defer that until a bit later. John Vandenberg (chat) 11:26, 19 December 2010 (UTC)
      • Actually, it was easy enough to fix, and it will be a very useful categorisation, so it is ✓ Done. John Vandenberg (chat) 11:39, 19 December 2010 (UTC)
  • Overall, if metadata is sufficient, please consider using the multi-purpose {{Artwork}}
    • I don't think it is sufficient; I have not seen any that would call for that template. John Vandenberg (chat) 11:26, 19 December 2010 (UTC)

Jean-Fred (talk) 10:56, 19 December 2010 (UTC)
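Two of the suggestions in the list above, emitting {{Other date}} for circa and unknown dates and adding a "<year> in Australia" category, could be sketched like this. The function names are hypothetical, and leaving circa dates out of the year category is an assumption of this example, not something decided in the discussion.

```python
def date_wikitext(year, circa=False):
    """Render a year for the date field of the file description.

    None is treated as an unknown date; circa dates use the
    {{Other date}} template so they can be localised.
    """
    if year is None:
        return "{{Other date|unknown}}"
    if circa:
        return "{{Other date|circa|%d}}" % year
    return str(year)


def year_category(year, circa=False):
    """Return 'Category:<year> in Australia', or None when the year is
    unknown or only approximate (an assumption of this sketch)."""
    if year is None or circa:
        return None
    return "Category:%d in Australia" % year
```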

I see you started uploading again. Could you please address the concerns raised before starting to upload again? Multichill (talk) 11:09, 19 December 2010 (UTC)
Of course. I am doing small batches at the moment, and fixing any issues that arise. John Vandenberg (chat) 11:13, 19 December 2010 (UTC)

The use of Category:Architectural elements in Australia by this bot needs addressing. The bot has flooded what should to my mind be a meta-cat. Unless the specific architectural element is identified (veranda, balustrade, fence, facade etc.) I don't think this is a useful categorisation for these images. -- Mattinbgn/talk 04:29, 21 December 2010 (UTC)

This category was being added because subject "architectural features" is mapped to "Architectural elements in Australia" at Commons:State Library of Queensland/Subjects/Common. I have removed this mapping.[1] Would you like me to remove the category from all of the StateLibQld uploads? (The code to remove a category is ready; I just don't want the bot to interfere with your work). John Vandenberg (chat) 04:55, 21 December 2010 (UTC)
I think it would be a good idea to remove this category from the already-uploaded pictures. I suspect that many of the "architectural elements" are verandas (not surprising historically in Queensland) which are generally picked up in the sub-category, and most of the other photographs are of whole buildings or streetscapes (rather than elements) and generally the image descriptions don't allude to elements (doors, columns, arches etc). Melburnian (talk) 07:27, 21 December 2010 (UTC)
✓ Done cat removed John Vandenberg (chat) 22:20, 21 December 2010 (UTC)
Thanks for the response. You may wish to take a look at Category:Views as well. Again, this is too vague a category to be useful for individual images. -- Mattinbgn/talk 22:41, 21 December 2010 (UTC)
Top populated categories. Multichill (talk) 10:18, 22 December 2010 (UTC)

During the Upload patrol, I came across a lot of SLQbot's uploads because the account doesn't have a bot flag and isn't autopatrolled. Given the issues that had to be resolved first, this is not a bad thing, since it shouldn't be autopatrolled until it's good and ready. However, what about the 3,000 uploads that have been made so far? Are those going to be fixed retroactively? For example, the file page mentioned here (and others like it) is still broken. The same applies to the ones uploaded before {{other date}} came into the picture and that have "Contributors:" hardcoded in the author field: [2], [3], [4]. For now I've added a temporary exception in Commons Upload Patrol to pretend "SLQbot" is autopatrolled so that they don't have to be patrolled/fixed by humans. –Krinkletalk 15:23, 21 December 2010 (UTC)

The lack of {{otherdate}} applies to the first 350 uploads, and the date range parsing problem is limited to about 200 uploads after that. I will go back through the first ~800 and fix any issues (inc. renaming the first batches of images).
There is a lot of variance in dc:creator vs dc:contributor, and I have added 'Contributor(s):' so that the distinction between the two is visible on the Commons pages. We'll need to investigate this in more detail, and talk to the library staff where we would like more information than is contained in the metadata. The most common set of contributors is major Australian newspapers, and these should probably go through a separate workflow in order to have these images located in the newspaper pagescans (which are being brought online by the national library[5]).
As a general rule, I will be retrospectively applying any improvements that the community recommends. John Vandenberg (chat) 20:53, 21 December 2010 (UTC)

One of the open items on Commons talk:State Library of Queensland/Subjects#Patterns is the question about the approach to use for the many subjects for portraits, e.g. the approach used on Subjects/M. --  Docu  at 05:05, 23 December 2010 (UTC)

  • Please change the black and white category to "State Library of Queensland Black and white photographs", which should be a subcategory of the black and white category. Thanks. --Diaa abdelmoneim (talk) 22:00, 25 February 2011 (UTC)
    • Please don't intersect source (State Library of Queensland) with topic (Black and white photographs) categories. Multichill (talk) 10:18, 26 February 2011 (UTC)

Moving right along

It's been a number of days since there was any update here; have all concerns been properly addressed? Lankiveil (talk) 02:44, 29 December 2010 (UTC).

The bot is still not able to find super/sub-category relationships reliably, as the category tree is a frightful mess (mostly circular). I will continue to solve this riddle (by adding stop categories), but as a temporary measure I am doing pattern matching to guess supercat/subcat relationships, which is working quite well. Category:Media from SLQ Public domain needing categories has been emptied out except for cases where the subject headings don't provide useful categorisation, and humans will be far better at addressing the problem. I am manually controlling the recategorisation task to limit the number of times the bot works against the humans (re-adding categories removed by humans). Adding in useful geocoords data is an important 'todo', but I still have analysis and coding to do before this is ready to roll. I hope that these are not impediments to being able to run with the 'bot' flag, as it should be obvious that this iterative process will be a net positive and not result in excessive human labour (other than my own). John Vandenberg (chat) 07:14, 29 December 2010 (UTC)
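One way to check super/sub-category relationships in a category graph that contains cycles is a breadth-first walk over parent links with a visited set, a depth limit, and a set of stop categories that are never expanded. The sketch below illustrates that general technique only; it is not the bot's code, and the function name `is_ancestor`, the `parents` dictionary (category name to list of parent names) and the default depth limit are assumptions.

```python
from collections import deque


def is_ancestor(ancestor, category, parents,
                stop_categories=frozenset(), max_depth=10):
    """Return True if `ancestor` is reachable from `category` by
    following parent links.

    The visited set guards against circular category trees; stop
    categories are never walked through, and max_depth bounds the walk.
    """
    seen = {category}
    queue = deque([(category, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for parent in parents.get(current, ()):
            if parent == ancestor:
                return True
            if parent in seen or parent in stop_categories:
                continue
            seen.add(parent)
            queue.append((parent, depth + 1))
    return False
```

A check like this could be used to avoid adding a parent category to a file that already sits in one of its subcategories, which is one source of over-populated main categories.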

I think until you get things sorted you should continue to test, without a flag, so we can find your tests marginally more easily... Advise of any change in things ok? ++Lar: t/c 18:12, 6 January 2011 (UTC)

Any new news? ++Lar: t/c 23:33, 2 February 2011 (UTC)

Category:Lattices versus Category:Trellis

Hello, this bot mixes up Category:Lattices, used for mathematical/crystallographic lattices, and Category:Grilles or Category:Trellis, used for architectural grilles and garden trellis. Both have parent Category:Grids. Can you clean up Category:Lattices and train the bot to differentiate? --Havang(nl) (talk) 15:52, 18 April 2011 (UTC)

In Australia we call architectural grilles and garden trellis "lattice"; you will find the bot is using what is in the metadata. Bidgee (talk) 07:24, 19 April 2011 (UTC).
It is like that in many languages. Clearly the photos are not about crystals; the bot should be trained to direct "lattice" to Category:Grilles, which will do for grilles and trellis. --Havang(nl) (talk) 11:05, 19 April 2011 (UTC).
No need to be aggressive about it. I'm sure John will fix it. Bidgee (talk) 11:16, 19 April 2011 (UTC)
??? Maybe I am not sufficiently good at English to sense aggression in my text. --Havang(nl) (talk) 11:44, 19 April 2011 (UTC)

Thank you for reporting this. It is now fixed.[6] Sorry about the delay; I've been traveling. --John Vandenberg (chat) 02:50, 20 April 2011 (UTC)

Thank you, and have a good Easter. --Havang(nl) (talk) 07:10, 23 April 2011 (UTC)

✓ Approved No major issues for over a year. I am familiar with its work, and all seems in order. --99of9 (talk) 12:59, 10 May 2012 (UTC)