Commons:Requests for comment/Batch categorization requirements

From Wikimedia Commons, the free media repository

Should we continue to allow batch uploads where categorization is not complete at the time of upload and rely on check or 'burn-down' categories?

Many upload projects require months of manual teamwork from volunteers after the upload, working through large categories with tools such as HotCat to improve categorization step by step, often by checking whether close category matches exist for keywords copied from the source metadata into the image page description. Uploads frequently start off either with a single basic top-level category that needs diffusion by interested volunteers, or with a backlog category that can be managed and reported on as part of the project.

I propose that batch upload guidelines be added to Commons:Batch uploading to clarify best practice. These should include a requirement that if a large batch upload (more than 5,000 images) depends on backlog categories, a project page must exist that explains how the backlog will be worked on.

Examples
  1. Category:Images from MoD uploaded by Fæ - a moderate-sized continuing batch upload of just under 3,000 high-quality photos, with no automated categorization and only a check category for tracking purposes, but with informal cooperation from the MILHIST wikiproject and emails on record with the Ministry of Defence about using their API as a source of metadata.
  2. Commons:Batch uploading/Los Angeles County Museum of Public Art - a medium-sized project of 22,000 high-quality art history photographs, with an initial category for each image based on the LACMA metadata. This project page was created to gain consensus on format and approach, and the discussion took a couple of months before the batch upload started.
  3. Category:Airliners.net_photos_(check_needed) - a large project, currently at 40,000 images, with plans to continue to more than 100,000 photographs of aircraft. In cooperation with transport enthusiasts, and due to the nature of the templates, around 2/3 start off with a low-level category for the aircraft, but those missing one need to have a category created by an avionics wizard. A project page is being considered to show progress, list project members and explain how the cropping and OTRS releases have been organized.
Endnotes
  • The "preserve" aspect of the aims of this project means that uploading with some categorization or the use of project backlog categories may be an adequate response in order to preserve human knowledge given the risk that these assets often do become inaccessible to the public over time, as a priority over other considerations. A recent example is the threat this month by the Library of Congress to remove public access through websites due to a shortage of funding.[1]
  • We should recognize that fully automated categorization has rarely been successful in the past when categories are deduced from external metadata such as keywords or place names, though in some cases practical heuristics have evolved after learning from substantial manual categorization.

-- (talk) 09:52, 29 September 2013 (UTC)

Comment I have nothing against batch uploads, or against temporary categories used as holding pens for such uploads. However, in the case of very large uploads, files can become lost in limbo. Hidden counting coup categories should not be added to files if they prevent the categorisation bots from recognising that files are uncategorised or have unchecked categorisations and placing them in either Category:Media needing categories or Category:Media needing category review, these being the established venues for such files.--KTo288 (talk) 22:01, 30 September 2013 (UTC)

But on the other hand, it would be better for them not to appear in those venues for a while, because they may flood that space which is normally occupied by the uploads of newbies. If there's a working system for their categorization, it might be better not to show them in that category unless they do actually get "lost in limbo". --99of9 (talk) 04:13, 21 October 2013 (UTC)
I'm not clear on current practice, but if the workdown category appeared there as a subcategory it should be workable. Maybe a custom template? Dankarl (talk) 16:43, 29 October 2013 (UTC)
As one of the most active current batch uploaders, I do not know of any best practice or norm established about this. In practice, two of my projects have categories appearing under Category:Media needing category review rather than using the {{Unc}} template directly. It would be nice if there were a strong enough community view to establish guidelines on when this is expected and when it is not. I am on the Steering Group of the GLAM toolset project, and guidelines like this might be in a position to influence the batch upload of many millions of high-quality images next year. Hopefully most will be well categorized and side-step this problem, but there may be cases where the source metadata is too weak or unreliable and needs manual checks. -- (talk) 16:53, 29 October 2013 (UTC)
  • As a first step, it would be worth setting up a category for all batch upload temporary holding categories, so that we can spot any that don't go down to zero at the expected time. --99of9 (talk) 04:13, 21 October 2013 (UTC)

Comment I still think it is incumbent on the batch uploader to make sure there is enough information about each image to describe and categorize it, or at least to research it further. Otherwise, don't bother to upload it, as images of unidentified subjects are unlikely to be of educational value. (Yes, there are exceptions, but I do not think they are suitable for batch upload.) Dankarl (talk) 16:44, 29 October 2013 (UTC)