Commons:Bots/Work requests

From Wikimedia Commons, the free media repository



SpBot archives all sections tagged with {{Section resolved|1=~~~~}} after 5 days.

Detect identical duplicates in file namespace

Is it possible to query the database for files with identical checksums? If so, could this be made a regular query with an output page similar to Commons:Database reports/Largely duplicative file names? The most interesting would be files with 1 or 2 duplicates, as these are most likely unintended. --Denniss (talk) 20:43, 7 January 2014 (UTC)

That might be a pretty complicated query.  Doing… --Zhuyifei1999 (talk) 05:10, 8 January 2014 (UTC)
WTF 'image' table has no id column? @Hazard-SJ: Can you do this? --Zhuyifei1999 (talk) 05:24, 8 January 2014 (UTC)
Bump --Denniss (talk) 14:27, 29 January 2014 (UTC)
I've just started to test some things, but so far I've received no results. Of course, we can't only rely on duplicate sha1 values, so I'm testing with other fields to see if I can get anything.  Hazard SJ  04:07, 7 February 2014 (UTC)
I'm hoping that something like this will eventually become a special page (gerrit:85446. Only pending review for 5 months now...). Feel free to copy the query used there. I doubt many of the other fields would be useful (img_metadata possibly, but that would be super slow). From a longer-term perspective, I think it would be interesting to have a log of recently uploaded files that are dupes of old and deleted images. As for the structure of image/oldimage/filearchive, yeah, they suck (from what I understand the article-related tables got fixed to something saner in MediaWiki 1.5, but nobody ever fixed up the image table). Bawolff (talk) 02:27, 14 February 2014 (UTC)
@Bawolff: If I'm not mistaken, that query just brings up files with the same sha1 values, correct? It is possible for different files to have the same value, so we wouldn't want to rely on that alone.  Hazard SJ  03:31, 14 February 2014 (UTC)
Well, such files certainly exist in theory (by the pigeonhole principle), but as far as I know nobody has found any examples for SHA-1 thus far. If you manage to find two different files with the same SHA-1 (excluding cases of DB corruption or integrity issues; I know there was at least one bug a while back where a file had the incorrect sha1 in the DB, but I think that has all been fixed), it would be a very big deal (to crypto folks, anyway). Bawolff (talk) 05:20, 14 February 2014 (UTC)
SELECT count(*) as "Numb dupes", MIN(img_name) as "Image name" from image group by img_sha1 HAVING count(*) > 1 order by count(*) desc Gives results like [1] Bawolff (talk) 20:45, 14 February 2014 (UTC)
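For readers without database access, the same grouping idea can be sketched in Python. This is only an illustration under assumptions: the function name `find_duplicates` and the in-memory `files` mapping are hypothetical, and the real report runs as a GROUP BY over the img_sha1 column rather than over file contents.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files):
    """Group file names by the SHA-1 of their content.

    `files` maps a file name to its raw bytes (a hypothetical input).
    Returns a dict mapping each SHA-1 hex digest shared by two or more
    files to the sorted list of names that share it.
    """
    by_hash = defaultdict(list)
    for name, content in files.items():
        by_hash[hashlib.sha1(content).hexdigest()].append(name)
    # Keep only digests that occur more than once, like HAVING count(*) > 1.
    return {h: sorted(names) for h, names in by_hash.items() if len(names) > 1}
```

Files with unique content are dropped from the result, which mirrors the HAVING clause in the query above.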
tldr. Is User:RLuts/duplicates the wanted report? --McZusatz (talk) 19:35, 16 February 2014 (UTC)
@Denniss: is that (Bawolff's list) okay? If so, would you still want a database reports page for it? It could link to Special:FileDuplicateSearch such as Special:FileDuplicateSearch/Abhandlung_von_den_Zähnen_(Pfaff)_219.jpg.  Hazard SJ  01:17, 18 February 2014 (UTC)
User:RLuts/duplicates is almost exactly what I want, just as a regularly updated (weekly?) database report ordered from bottom to top (fewest duplicates first) and limited to ~500 entries to keep the page size within reasonable limits. The format of Bawolff's list is also OK; it seems to list only one file per set of dupes and not all existing dupes. If you can transform it into proper wikilinks it would be easier to handle than the pure text. Both pages show a hell of a lot of dupes.--Denniss (talk) 10:34, 18 February 2014 (UTC)
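The report layout requested here (fewest duplicates first, capped at about 500 entries, with wikilinks) could look roughly like the sketch below. The function name `format_duplicate_report`, the `# N files:` line layout, and the `[[:File:...]]` link style are assumptions made for illustration, not the format of any existing Commons database report.

```python
def format_duplicate_report(groups, limit=500):
    """Return wikitext lines for duplicate groups, fewest duplicates first.

    `groups` is a list of non-empty lists of file names whose content is
    identical. At most `limit` groups are included in the output.
    """
    # Sort by group size, then by first name so the order is stable.
    ordered = sorted(groups, key=lambda names: (len(names), names[0]))
    lines = []
    for names in ordered[:limit]:
        links = ", ".join("[[:File:%s]]" % name for name in names)
        lines.append("# %d files: %s" % (len(names), links))
    return "\n".join(lines)
```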
I'm not really familiar with how the formatting of commons db reports work. I put my list through a quick awk script which yielded [2]. Bawolff (talk) 22:17, 18 February 2014 (UTC)
@Denniss: Would that work?  Hazard SJ  02:51, 23 February 2014 (UTC)
Yes. --Denniss (talk) 07:19, 23 February 2014 (UTC)
Could someone integrate this report into the Database report overview site? --Denniss (talk) 12:45, 17 March 2014 (UTC)
Could someone please create a regular Bot job for this? Bawolff's page was last updated on March 12. --Denniss (talk) 09:12, 1 April 2014 (UTC)
@Steinsplitter: Try this. You're good at dynamic DBRs. --Zhuyifei1999 (talk) 10:53, 1 April 2014 (UTC)
@Bawolff: is there an ETA for your Special:ListDuplicatedFiles? --McZusatz (talk) 12:59, 1 April 2014 (UTC)
toollabs:steinsplitter/dupe.php (updated every day, ~13:00 UTC) --Steinsplitter (talk) 14:37, 1 April 2014 (UTC)
Special:ListDuplicatedFiles should be available on April 4 (It was deployed today, but cached special pages only update every 3 days). Sorry about the script on my tool labs dying. Tool labs moved clusters, and decided not to transfer cron scripts, so all my repetitive tasks stopped working. Bawolff (talk) 21:13, 1 April 2014 (UTC)
The special page finally works! I was half expecting something bad to happen to prevent it from working. Bawolff (talk) 14:16, 4 April 2014 (UTC)

PD images from Flickr with a non-free license


There are many public domain images on Flickr which have a non-free license. They can't be uploaded with the Upload Wizard. I wonder if it would be possible to use a bot. Examples:

What information do you need to upload all this with a bot? Yann (talk) 13:59, 30 November 2013 (UTC)

@Yann: you mean to use a bot to upload them? And how does the bot show that the image is not a copyvio? --Zhuyifei1999 (talk) 07:29, 8 December 2013 (UTC)
@Zhuyifei1999: I would check that. That's what I'm asking: if I make a list, could a bot upload them? Sorry if I was not clear. Actually, I was also asking if the bot could make a list: upload all images where one of these photographers is mentioned in the description, and add a category "Files to be checked". Yann (talk) 09:36, 8 December 2013 (UTC)
@Yann: If there's a list of photos (or even better, URLs), yes. If there isn't, or the bot has to google it, I'm afraid it would be hard for me to do so. Not sure about other bot operators, though. --Zhuyifei1999 (talk) 09:11, 9 December 2013 (UTC)
OK, I am going to make a list at User:Yann/PD images from Flickr with a non-free license. Yann (talk) 20:12, 11 December 2013 (UTC)
@Zhuyifei1999: OK, I made a list: User:Yann/PD images from Flickr with a non-free license. Could you start with that? And I found more photographers with the same case. Thanks in advance, Yann (talk) 19:28, 30 January 2014 (UTC)
Will do. --Zhuyifei1999 (talk) 06:21, 31 January 2014 (UTC)
I completed the first part: all photographers I had initially selected above. Yann (talk) 19:19, 31 January 2014 (UTC)
@Yann: How to bypass {{Flickrreview}} and show the image really is PD? --Zhuyifei1999 (talk) 06:55, 8 February 2014 (UTC)
 Doing… --Zhuyifei1999 (talk) 07:01, 8 February 2014 (UTC)

@Zhuyifei1999: Ok, thanks, tell me when you are done. Regards, Yann (talk) 12:28, 13 February 2014 (UTC)

Sorry, will take some time. I'm just busy with everything. --Zhuyifei1999 (talk) 11:10, 14 February 2014 (UTC) Timestamp (no archiving): Zhuyifei1999 (talk) 11:00, 6 March 2014 (UTC)
@Zhuyifei1999: Hi, What's the status of this request? Regards, Yann (talk) 14:43, 16 March 2014 (UTC)
It's mostly done, but technical problem with User:Yann/PD_images_from_Flickr_with_a_non-free_license, I'll get this done before or on Friday (and a test run and a BRFA after that). --Zhuyifei1999 (talk) 15:02, 16 March 2014 (UTC)
OK, thanks. Yann (talk) 04:11, 17 March 2014 (UTC)
@Yann: Commons:Bots/Requests/YiFeiBot_(17) Also what filename would you like for ? --Zhuyifei1999 (talk) 12:52, 21 March 2014 (UTC)

@Yann: All listed ones ✓ Done except if it is detected as a duplicate. --Zhuyifei1999 (talk) 08:53, 28 March 2014 (UTC)

Uncategorized Bundesarchiv images

The famous photo

Hello all. I think it would be reasonable to make a category of Bundesarchiv images that have no categories except hidden ones, like this one. Obviously this is work for a bot. Thank you in advance! Ain92 (talk) 10:53, 18 March 2014 (UTC)

✓ Done See catscan2 report. -- (talk) 10:28, 21 March 2014 (UTC)
  • Thank you. It seems that it would be more convenient for me to use Media needing category review as of 19 March 2014 and Media needing category review as of 20 March 2014, and I've already found some nice photos. BTW, note these dupes from IWM: two Churchills giving the V-sign on the right, HU59487 and HU 059765, HU59487 and HU 059487. Ain92 (talk) 09:07, 22 March 2014 (UTC)
    • Yes, duplicate checking through the API remains slightly crap. There is also a specific problem with IWM images as the IWM appear to have re-scanned some of the images but not marked the images to indicate this. Consequently the SHA1 checks show what at first glance seems the same image at the same resolution (800px max on their website) to be different images. There are no easy solutions, though I do use any unique IDs as a check which stops most duplication. Feel free to use {{duplicate}} on any that you think are obvious. -- (talk) 09:35, 22 March 2014 (UTC)


Hi there,

it would be nice if somebody could generate a list of all svg files (current version) that contain the string "xlink:href="data:image". Anybody up for that?

Cheers --Cwbm (commons) (talk) 11:54, 21 March 2014 (UTC)

This will likely take weeks/months/years --Zhuyifei1999 (talk) 03:01, 22 March 2014 (UTC)

It would also be helpful if a bot could check the recent uploads and mark them with {{BadSVG}}. --Cwbm (commons) (talk) 08:36, 24 March 2014 (UTC)

Could you explain why this would be valuable and (considering the bandwidth cost) if there are any other non-controversial checks that could be usefully added at the same time? -- (talk) 08:53, 24 March 2014 (UTC)
1) There are always new users who think that uploading raster files in an SVG container is a good idea, and users who convert existing PNGs by uploading them in an SVG container. Having a bot check the recent uploads would make it possible to inform the uploader in a timely manner. 2) One could also check if the file is corrupt or if the SVG is valid. But I don't know if that's reasonable. --Cwbm (commons) (talk) 11:25, 24 March 2014 (UTC)
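As a rough idea of what the second check might involve, the sketch below tests only whether a file parses as XML with an `svg` root element. It is a minimal well-formedness check, not full SVG validation, and the helper name `is_well_formed_svg` is hypothetical. The standard-library parser used here is not hardened against maliciously crafted XML, so a real bot would want extra care with untrusted uploads.

```python
import xml.etree.ElementTree as ET

# Namespace prefix that ElementTree puts on tags of namespaced SVG documents.
SVG_NS = "{http://www.w3.org/2000/svg}"


def is_well_formed_svg(svg_bytes):
    """Return True if the bytes parse as XML and the root element is <svg>."""
    try:
        root = ET.fromstring(svg_bytes)
    except ET.ParseError:
        # Corrupt or truncated files end up here.
        return False
    return root.tag in ("svg", SVG_NS + "svg")
```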

As a method of filtering out suspect files, the following table was generated in pywikipediabot by looking at the latest 2,000 files uploaded and filtering out those with "svg" extensions and those over 50,000 bytes in size. It took under 10s to run. Do you think if we only checked SVGs of this size for further possible problems, this would be useful? A bot running this once an hour might be sufficient. -- (talk) 13:27, 24 March 2014 (UTC)

You are going to miss some files, particularly logos which can be under 10 kb even as embedded raster graphics. But it's better than nothing. --Cwbm (commons) (talk) 13:37, 24 March 2014 (UTC)

I have dropped the old report from this discussion. A recent run is at User:Faebot/SandboxS. This goes through the last hour of uploads, to a maximum of 2,000 images (which appears plenty), picks out the SVG files over 5k in size, and then retrieves these files and examines their contents. If they match the regex "xlink..?href.?=.?.?data:image" then {{BadSVG}} is added to the image page. It seems a bit silly to load each matching SVG if this is the only test. It would be neat if there were several things to test while the bot is looking.
I am running the script locally rather than on WMF labs, which means it would probably be faster later. At the moment this run took 4 minutes to do but this represented the last 3 hours worth of uploaded images. So, say, running every 60 minutes back through the recently uploaded log, up until the point where it last ran to a maximum of 2,000 files in each slurp, would probably cope quite easily. -- (talk) 21:55, 28 March 2014 (UTC)
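The check described above can be sketched as follows. The regular expression is the one quoted in the comment; everything else (the function name `needs_bad_svg_tag`, reading "5k" as 5,000 bytes, and passing the file text in directly instead of fetching it) is an assumption for illustration and not the actual Faebot code.

```python
import re

# The pattern quoted in the discussion, used unchanged.
EMBEDDED_RASTER = re.compile(r"xlink..?href.?=.?.?data:image")

# Assumed reading of "over 5k in size", in bytes.
MIN_SIZE = 5000


def needs_bad_svg_tag(filename, size, svg_text):
    """Return True if an upload looks like an SVG with embedded raster data.

    Only files with an .svg extension and a size above MIN_SIZE are
    examined; smaller files are skipped without looking at their content.
    """
    if not filename.lower().endswith(".svg"):
        return False
    if size <= MIN_SIZE:
        return False
    return EMBEDDED_RASTER.search(svg_text) is not None
```

The size threshold is a trade-off: as noted earlier in the thread, small logos with embedded raster data would slip through it.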

First test run—running hourly, updating User:Faebot/SandboxS, I expect to leave this as a soak test over the weekend:
Images marked as BadSVG:
  1. 15:16, 31 March 2014 File:Cilician Armenia-hy.svg
  2. 15:16, 31 March 2014 File:KHW logo color.svg
  3. 00:17, 31 March 2014 File:Cannabis laws.svg
  4. 23:17, 30 March 2014 File:MH370 data graphs.svg
  5. 23:16, 30 March 2014 File:Biml logo.svg
  6. 20:16, 30 March 2014 File:IndoParthianKingdom.svg
  7. 17:16, 30 March 2014 File:Roman Empire 125 general map.SVG
  8. 16:16, 30 March 2014 File:Blason ville fr Érize-la-Brûlée (Meuse).svg
  9. 13:16, 30 March 2014 File:Roman Empire 125 political map.svg
  10. 09:16, 30 March 2014 File:Arms of the Viscounts of Villamur.svg
  11. 09:16, 30 March 2014 File:Greco-BactrianKingdomMap-es.svg
  12. 00:17, 30 March 2014 File:2014 Latakia Offensive Map.svg
  13. 23:17, 29 March 2014 File:Gundam Sentinel Wikipedia espanol.svg
  14. 23:16, 29 March 2014 File:Rif Aleppo2.svg
  15. 22:17, 29 March 2014 File:Battle of Qalamoun.svg
  16. 22:16, 29 March 2014 File:Einsatz Pu-12M im Regiment 9K33.svg
  17. 22:16, 29 March 2014 File:Battle of Daraa City.svg
  18. 20:17, 29 March 2014 File:IndoGreekKingdomAndCampaigns.svg
  19. 15:17, 29 March 2014 File:Battle of Marathon Initial Situation (it).svg
  20. 15:16, 29 March 2014 File:2.5 Russian road sign.svg
  21. 12:16, 29 March 2014 File:Major-Cultural-spheres.svg
  22. 11:16, 29 March 2014 File:South East Asia location-Naja-siamensis.svg
  23. 10:17, 29 March 2014 File:HUN Visegrád Címer.svg
  24. 23:16, 28 March 2014 File:Vostok Spacecraft Diagram.svg
✓ Done

I think this request can be marked as done for the moment. I will think about moving my working script over to labs at some point, though it will need a little rewriting before then. The data volumes involved and the number of images that might be templated each day are relatively small for this specific check. As it seems non-controversial, I will keep the bot running like this once an hour (with an average run time of under 60s). -- (talk) 14:51, 31 March 2014 (UTC)

Thanks a lot for your effort. Cheers --Cwbm (commons) (talk) 06:35, 1 April 2014 (UTC)

Fixing typo "transfered" to "transferred" 2

Apologies for resurrecting a two-year-old thread, but since this issue with the Upload Bot has been fixed, can someone contact User:Schlurcher on their dewiki page, if they're still active, and let them know they've got the go-ahead to run the bot to fix the typos, or perhaps write up a bot yourself for this task? TeleComNasSprVen (talk) 10:19, 30 March 2014 (UTC)

@TeleComNasSprVen: {{transferred from}} should be used instead of simple text. --Ricordisamoa 11:05, 30 March 2014 (UTC)
I guess this would be more in the scope of an internationalization issue then --TeleComNasSprVen (talk) 17:06, 30 March 2014 (UTC)
Then, as SchlurcherBot 1, SamoaBot 3 and OgreBot 3. --Ricordisamoa 21:56, 30 March 2014 (UTC)

split category from user template

I have uploaded lots of files on which I use the templates user:Thryduulf/cc-by-sa-all or user:Thryduulf/copyleft which categorises the files into Category:Photos by Chris McKenna. I now wish to subcategorise some of my image uploads. The easiest way I can think to do this is if someone could go through all images tagged with those templates and explicitly add them to my category so I can subcategorise using cat-a-lot as I get time. When the run is complete (please advise me here or on my talk page) I will remove the categories from the templates. Thanks. Thryduulf (talk) 19:28, 31 March 2014 (UTC)
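The per-page edit such a run would make is small. The sketch below appends the category only when the page text does not already carry it explicitly; the function name `add_category` is hypothetical, and as a simplification it does not treat underscores and spaces in category names as equivalent the way MediaWiki does.

```python
import re


def add_category(wikitext, category):
    """Append [[Category:<category>]] unless it is already present."""
    # Match an existing explicit link, with or without a sort key.
    pattern = re.compile(
        r"\[\[\s*Category\s*:\s*%s\s*(\|[^\]]*)?\]\]" % re.escape(category),
        re.IGNORECASE,
    )
    if pattern.search(wikitext):
        return wikitext
    return wikitext.rstrip("\n") + "\n[[Category:%s]]\n" % category
```

Because a category supplied by a template does not appear in the page text, pages tagged only through the user templates would receive the explicit link, and running the function a second time leaves them unchanged.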

Anyone? Thryduulf (talk) 09:53, 7 April 2014 (UTC)
✓ Done You may want to play around with AWB if you don't want to install Pywikipediabot or find a work-around for large categories using VisualFileChange.js. I think we need more people with these tool skills. -- (talk) 15:55, 8 April 2014 (UTC)
Thank you. I don't think I'm the right person for doing this sort of task - as a Linux user AWB isn't a realistic option (I've never successfully used wine for anything), I know no Python, don't feel comfortable with the idea of running a bot and much of the VisualFileChange page is gibberish to me. Sorry. Thryduulf (talk) 20:39, 8 April 2014 (UTC)

Spaces before file extensions

Someone suggested on Commons:Village pump/Archive/2014/03#Spaces before file extensions in titles that a bot go looking around for filenames with spaces before the file extension and rename them to eliminate the space, while still retaining the redirect for reusers linking back to the file just in case. As a subtask, can a bot also look for instances of %E2%80%8E (hidden left-to-right mark U+200E) in filenames, or perhaps upload a test file containing the hidden character? TeleComNasSprVen (talk) 01:24, 9 April 2014 (UTC)
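The renaming rule itself is easy to state in code. The sketch below only computes the new title; it does not perform the move or create the redirect. The function name `fix_space_before_extension` is hypothetical. It treats underscores like spaces because page titles are stored with underscores in the database, and it assumes extensions consist of letters and digits only.

```python
import re

# One or more spaces/underscores directly before a final ".ext".
_SPACE_BEFORE_EXT = re.compile(r"[ _]+(\.[A-Za-z0-9]+)$")


def fix_space_before_extension(title):
    """Return the title without whitespace before the extension.

    Returns None when the title needs no change, so callers can skip it.
    """
    fixed, count = _SPACE_BEFORE_EXT.subn(r"\1", title)
    return fixed if count else None
```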

Sounds good to me. --99of9 (talk) 04:12, 9 April 2014 (UTC)
To clarify the subtask, I'd like to see if we can query the database to see if we have any left-to-right marks in either filenames or category names and if possible fix them. TeleComNasSprVen (talk) 06:08, 9 April 2014 (UTC)
Uploading such files is impossible as far as I know. Database query:  Doing… --Zhuyifei1999 (talk) 10:06, 9 April 2014 (UTC)
Query SELECT * FROM page WHERE page_title LIKE CONCAT("%", CHAR(0x200e USING utf8), "%"); → Empty set (37.66 sec) --Zhuyifei1999 (talk) 10:32, 9 April 2014 (UTC)
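An independent cross-check on a list of titles (for example from a dump) can be sketched in Python. The helper names `titles_with_lrm` and `strip_lrm` are hypothetical. One possible caveat, offered as a reading of MySQL's CHAR() function and not as something verified against the Commons replica: CHAR(0x200e USING utf8) may produce the two bytes 0x20 0x0E instead of the three-byte UTF-8 sequence E2 80 8E, in which case the empty result above would not by itself rule the character out.

```python
from urllib.parse import quote

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, invisible in most renderings


def titles_with_lrm(titles):
    """Return the titles that contain a left-to-right mark."""
    return [title for title in titles if LRM in title]


def strip_lrm(title):
    """Return the title with every left-to-right mark removed."""
    return title.replace(LRM, "")


# The percent-encoded form matches the one mentioned in the request.
LRM_PERCENT_ENCODED = quote(LRM)
```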

TIF files with jpeg compression

The image thumbnail generating process seems to have problems with .tif files using .jpeg compression (distorted colors). Is there an easy way to detect them and re-upload them as uncompressed version? --Denniss (talk) 14:39, 16 April 2014 (UTC)

Making a list... The first 2 TIFF images with JPEG compression are grayscale (no color problems) --Zhuyifei1999 (talk) 08:24, 17 April 2014 (UTC)
SELECT img_name FROM image WHERE img_minor_mime = "tiff" AND img_metadata LIKE '%s:11:"Compression";i:7;%' → User:Zhuyifei1999/sandbox (in fact, most, even colored photos look good) --Zhuyifei1999 (talk) 08:57, 17 April 2014 (UTC)
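The same property can be read from a file directly. The sketch below parses the header of a classic TIFF and returns the value of the Compression tag (259) from the first image directory, where 7 denotes JPEG compression, the same value the query above matches on. The function name `tiff_compression` is hypothetical, and the sketch assumes a well-formed classic TIFF (not BigTIFF) and looks only at the first directory.

```python
import struct

COMPRESSION_TAG = 259   # TIFF tag number for Compression
JPEG_COMPRESSION = 7    # tag value for JPEG-compressed image data


def tiff_compression(data):
    """Return the Compression value from the first IFD of a TIFF byte string."""
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for index in range(entry_count):
        # Each directory entry is 12 bytes: tag, type, count, value/offset.
        entry_offset = ifd_offset + 2 + 12 * index
        (tag,) = struct.unpack_from(endian + "H", data, entry_offset)
        if tag == COMPRESSION_TAG:
            # A single SHORT value sits in the first 2 bytes of the value field.
            (value,) = struct.unpack_from(endian + "H", data, entry_offset + 8)
            return value
    return 1  # the TIFF default when the tag is absent: no compression
```

A caller would compare the result with `JPEG_COMPRESSION` to pick out candidates for re-upload.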
File:AloisianumLinzMaerz.tif looks fine but check the other thumb sizes; File:Alte Schule mit Pausenhalle, Eschelbronn.tif looks completely wrong in color. See Commons:Village_pump#Change_of_colours_after_a_move and the linked bugzilla report. Anyway, I don't think it's a good idea to have TIF images using JPEG compression; the original intention of TIF is lossless. --Denniss (talk) 10:51, 17 April 2014 (UTC)