User talk:Dominic

From Wikimedia Commons, the free media repository

DPLA bot uploading hundreds of duplicates

Hi. It looks like the bot has been reuploading hundreds, if not thousands, of images that it already uploaded a few years ago. File:Forsythe Post, Toledo, O. - DPLA - 38a36309eb80ebb59906851d1558dd68 (page 1).jpg from a week ago and File:Forsythe Post, Toledo, O. - DPLA - dfaf04e4a46a5703bd52b228170c6335 (page 1).jpg from 2020 are two of many examples. Is there something you can do about it, and/or a way to deal with the duplicate files it has already uploaded? Thanks. Adamant1 (talk) 06:44, 4 March 2025 (UTC)

@Adamant1: Yikes, that's not great. This is a little more complicated than it seems. We already prevent upload of images with a matching hash. It looks like at some point since the first one was uploaded 5 years ago, the item was assigned a new identifier (the original catalog link on the first one is also broken). But the newly uploaded file is also not an exact duplicate in a technical sense, which you can see from the different file sizes. So there was no way to detect or prevent this beforehand: if an image has both a different identifier and a different image hash, it is indistinguishable from a truly distinct item. Maybe this is isolated to this one institution, and something caused it, like a change across their whole site, but I am still investigating. 16:10, 4 March 2025 (UTC)
I could figure out all the bad identifiers and we could delete those files, but those are the ones that were uploaded first, and some might be used in articles. We'd still want to know where to redirect them. I think I can figure out a large number of these by making certain assumptions. For example, if there are exactly two files with the same name (disregarding the identifier), and one was uploaded before this year and one was uploaded this year, that is almost certainly a duplicate. There are about 20,000 of these. The main problem arises with certain titles that are very common in the data (e.g. Special:PrefixIndex/File:Block_Card_(address_not_identified)), and I am not sure yet how to reliably figure out what duplicates what. There are close to 1000 of these. Dominic (talk) 17:59, 4 March 2025 (UTC)
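The pairing assumption described above could be sketched roughly as follows. This is only an illustration, not the bot's actual code: the function names, the input shape (a list of title and upload-year pairs) and the regular expression for the identifier are all assumptions based on the two example file names in this thread.

```python
import re
from collections import defaultdict

# Assumed title shape, taken from the examples in this thread:
# "<name> - DPLA - <32 hex characters> (page N).<ext>"
DPLA_ID = re.compile(r" - DPLA - [0-9a-f]{32}")


def name_key(title):
    """Return the file title with the DPLA identifier removed (hypothetical helper)."""
    return DPLA_ID.sub("", title)


def likely_duplicates(files, this_year):
    """Pair up files that are almost certainly duplicates.

    files: iterable of (title, upload_year) tuples.
    Returns a list of (older_title, newer_title) pairs.
    """
    groups = defaultdict(list)
    for title, year in files:
        groups[name_key(title)].append((title, year))

    pairs = []
    for members in groups.values():
        # Only exact pairs qualify; very common titles produce larger
        # groups, which stay ambiguous and are skipped here.
        if len(members) != 2:
            continue
        (old_title, old_year), (new_title, new_year) = sorted(
            members, key=lambda m: m[1]
        )
        # One upload before this year, one upload this year.
        if old_year < this_year and new_year == this_year:
            pairs.append((old_title, new_title))
    return pairs
```

Each returned pair lists the older file first, so it could serve as a starting point for deciding which title to redirect to which. Titles with more than two matches are left for manual review.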
Hmm, I've only noticed it in relation to postcards in the Ken Levin Toledo Postcard Collection. I was thinking it might be because they changed the hashes for files on their end, but there are instances where the bot is overwriting existing images instead of just uploading separate files. So I don't know. Unfortunately it isn't an area that I know much about. Thanks for looking into it and dealing with it though. --Adamant1 (talk) 20:07, 4 March 2025 (UTC)
File:"I am Looking Forward to Dictating Peace to the United States in the White House of Washington." - NARA - 514556.tif has been listed at Commons:Deletion requests so that the community can discuss whether it should be kept or not. We would appreciate it if you could go to voice your opinion about this at its entry.

If you created this file, please note that the fact that it has been proposed for deletion does not necessarily mean that we do not value your kind contribution. It simply means that one person believes that there is some specific problem with it, such as a copyright issue. Please see Commons:But it's my own work! for a guide on how to address these issues.

Please remember to respond to and – if appropriate – contradict the arguments supporting deletion. Arguments which focus on the nominator will not affect the result of the nomination. Thank you!

The Squirrel Conspiracy (talk) 20:53, 12 March 2025 (UTC)