User:Multichill/Imagecopy

From Wikimedia Commons, the free media repository

Wikimedia Commons wants to be the central place for free images and other media. That's a nice goal, but before we try to conquer the world, we'd better start with the Wikimedia universe. The different Wikimedia wikis still contain a lot of free files which should be moved to Commons (stats). Not all files can be moved, but a lot of them can. Take, for example, en:Category:All free media, which contains 480,000 files. This is a huge job: it can't be done entirely by hand, and it can't be done fully automatically either. A semi-automatic approach is the way to go.

Requirements

  • Accurate: errors get files deleted, and we don't want that
  • To the point: we don't want to drown in redundant information
  • Fast: there is a lot to move
  • Easy to use: so that more users can help out

New bot

To meet these requirements I wrote a bot based on the lessons learned from previous tools. The bot is called imagecopy_self.py and is part of pywikipedia. The first version of the bot focuses on self-published works; it won't work on other files. That is over 300,000 files, so it should keep us busy for a while. The bot will try to figure out the fields of {{Information}} based on the current information template and will fall back to the free text, the first uploader and the first upload date. Category suggestions come from CommonSense, with some filtering applied. The bot is currently in beta, so still full of bugs (no, not really, just be careful ;-) ).
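The fallback behaviour described above can be sketched roughly as follows. This is only an illustration and not the actual code of imagecopy_self.py: the function name `build_information_fields`, the dictionary keys and the `default_source` parameter are all made up for this example.

```python
def build_information_fields(template_fields, free_text, first_uploader,
                             first_upload_date, default_source=""):
    """Fill the {{Information}} fields, falling back to defaults.

    template_fields: dict with whatever the existing information template
    provided (hypothetical keys: description, date, source, author).
    default_source is a hypothetical parameter; the page does not say
    which default the bot uses for the source field.
    """
    defaults = {
        "description": free_text,
        "date": first_upload_date,
        "source": default_source,
        "author": first_uploader,
    }
    result = {}
    for name, default in defaults.items():
        # Treat missing and whitespace-only values as empty.
        value = (template_fields.get(name) or "").strip()
        result[name] = value if value else default
    return result


fields = build_information_fields(
    {"description": "A bridge", "date": "  "},
    free_text="Some free text",
    first_uploader="ExampleUser",
    first_upload_date="2010-08-07",
)
```

Here the description comes from the existing template, while the date and author fall back to the first upload date and the first uploader.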

Approaches

300,000 files is a lot of work. For each subset you have two approaches:

  • Cherry picking: Just do the good and easy files. If a file takes too much time or something is wrong with it, just skip it. According to the Pareto principle, this approach should let us move a lot of files in not too much time.
  • Complete: When you're done the files are either moved to Commons, marked as non-free or deleted. This takes a lot of time and you might have to deal with some upset users.

Subsets

To be able to cope with the large number of files it's good to work on certain subsets:

Issues

Open

  • Date should maybe be internationalized (i18n)
  • Source should maybe be internationalized (i18n)
  • Use a regex to fuzzy extract a date when only upload date is available (example)
  • {{pd-self}} isn't caught correctly, see en:File:BHF-177 structure.png. Multichill (talk) 21:54, 8 August 2010 (UTC)
    • Looks like the re.IGNORECASE flag is ignored. According to the Python 2.7 manual this should function correctly. Multichill (talk) 21:58, 8 August 2010 (UTC)
      • Found it and fixed in r8814. Multichill (talk) 19:32, 30 December 2010 (UTC)
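Two of the items above involve regular expressions, so here is a minimal sketch of both: case-insensitive detection of {{pd-self}} and fuzzy extraction of a date from free text. The patterns and the names `has_pd_self` and `extract_date` are assumptions for illustration, not what the bot actually uses.

```python
import re

# Matches {{pd-self}}, {{PD-self}}, {{ Pd-Self |date=...}} and similar.
PD_SELF_RE = re.compile(r"\{\{\s*pd-self\s*(\|[^{}]*)?\}\}", re.IGNORECASE)

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Loose date patterns, tried in this order at each position:
# ISO date, "8 August 2010", "August 2010", bare year.
DATE_RE = re.compile(
    r"\b(\d{4}-\d{2}-\d{2}"
    r"|\d{1,2}\s+(?:%s)\s+\d{4}"
    r"|(?:%s)\s+\d{4}"
    r"|(?:19|20)\d{2})\b" % (MONTHS, MONTHS),
    re.IGNORECASE,
)


def has_pd_self(wikitext):
    """Return True if the text contains a pd-self template, in any case."""
    return PD_SELF_RE.search(wikitext) is not None


def extract_date(free_text, upload_date):
    """Return the first date-like string in free_text, else upload_date."""
    match = DATE_RE.search(free_text)
    return match.group(1) if match else upload_date
```

Compiling the pattern with the flag avoids one well-known pitfall: the fourth positional argument of `re.sub` is `count`, not `flags`, so a flag passed there is silently not applied as a flag. Whether that is what was fixed in r8814 is not stated on this page.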

Fixed

  • Should pre fetch descriptions and put it in a queue
  • Should put files to upload in a queue
  • Always fetch default fields
    • Now fetching it if the information based fields are empty. Multichill (talk) 16:46, 7 August 2010 (UTC)
  • If date is left empty, use default field
    • Now fetching it if the information based fields are empty. Multichill (talk) 16:46, 7 August 2010 (UTC)
  • If source is left empty, use default field
    • Now fetching it if the information based fields are empty. Multichill (talk) 16:46, 7 August 2010 (UTC)
  • If author is left empty, use default field
    • Now fetching it if the information based fields are empty. Multichill (talk) 14:44, 8 August 2010 (UTC)
  • If filename already exists, don't lose all filled-out fields
  • Add a line in the code saying "This is still in test - please report any errors to xxx".
  • Enwp seems to use a location field in {{Information}}
    • Is now added to description if found. Multichill (talk) 16:26, 8 August 2010 (UTC)
  • Get fields from current information template with pywikipedia code (not regex)
  • Remove 1= in cases like this.
    • Looks like ==Licensing:== was causing this and not the 1=. Made the regex more flexible. Multichill (talk) 16:26, 8 August 2010 (UTC)
  • If author=<uploader>, use default field
  • Should not add images to Category:People by name
  • It seems that templates like "cite book" in the information template make the bot create a lot of blank fields.
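The first two fixed items (prefetching descriptions and queueing uploads) amount to a small producer/consumer pipeline around the human reviewer. A minimal sketch using Python 3's standard `queue` and `threading` modules could look like this. The helpers `fetch_description`, `prefetch_worker`, `upload_worker` and `run_pipeline` are hypothetical stand-ins; the real bot's structure may well differ.

```python
import queue
import threading


def fetch_description(title):
    """Hypothetical stand-in for fetching a file description page."""
    return "description of %s" % title


def prefetch_worker(titles, prefetched):
    """Fetch descriptions ahead of the reviewer and queue them."""
    for title in titles:
        prefetched.put((title, fetch_description(title)))
    prefetched.put(None)  # sentinel: nothing more to review


def upload_worker(uploads, done):
    """Process queued uploads in the background."""
    while True:
        item = uploads.get()
        if item is None:  # sentinel: reviewer is finished
            break
        done.append(item[0])  # stand-in for the actual upload


def run_pipeline(titles, review):
    """Review prefetched files; approved ones go to the upload queue."""
    prefetched = queue.Queue(maxsize=10)  # bounded, so we don't fetch too far ahead
    uploads = queue.Queue()
    done = []
    fetcher = threading.Thread(target=prefetch_worker, args=(titles, prefetched))
    uploader = threading.Thread(target=upload_worker, args=(uploads, done))
    fetcher.start()
    uploader.start()
    while True:
        item = prefetched.get()
        if item is None:
            break
        title, description = item
        if review(title, description):  # the human decision, here a callback
            uploads.put((title, description))
    uploads.put(None)
    fetcher.join()
    uploader.join()
    return done


moved = run_pipeline(["A.png", "B.png", "C.png"],
                     lambda title, description: title != "B.png")
```

The point of the two queues is that the reviewer never waits on the network: the next description is already there, and uploads happen while the next file is being looked at.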

Forget for now