User:Multichill/Scaled-down duplicates

From Wikimedia Commons, the free media repository

People sometimes transfer a thumbnail to Commons instead of the original image from a Wikipedia. This page describes a bot that spots and marks these kinds of mistakes.

Process

Find pairs to work on

First we need pairs of images to work on. These pairs can be found in several ways:

  1. On Commons we have an image, and on some Wikipedia there is an image with the same name but a different hash.
  2. On Commons we have an image with a name of the form <number>px-<name>.<extension>, where an image <name>.<extension> exists on some Wikipedia or on Commons.
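The second kind of pair can be recognised from the file name alone. The following is a minimal sketch, not the bot's actual code: the helper name `original_name` and the regular expression are assumptions made for illustration, and checking that the original really exists on a wiki is left out.

```python
import re

# Matches thumbnail-style names such as "120px-Example.jpg".
THUMB_NAME = re.compile(r"^(?P<width>\d+)px-(?P<name>.+)\.(?P<ext>[^.]+)$")


def original_name(filename):
    """Return (width, original file name) if filename looks like a thumbnail, else None.

    Hypothetical helper: it only inspects the name and does not query any wiki.
    """
    match = THUMB_NAME.match(filename)
    if match is None:
        return None
    original = "{}.{}".format(match.group("name"), match.group("ext"))
    return int(match.group("width")), original
```

A name that passes this check is only a candidate; the original still has to be looked up and compared with the checks described in the following sections.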

The search should probably be split into two kinds of runs:

  1. Batch runs to find old duplicates
  2. A daily run to find yesterday's duplicates

Match duplicates

Each pair is then checked to see whether the two images match.

Size

One of the images should be smaller than the other. This is the image that may be marked in the end.

Aspect ratio

The images should have about the same aspect ratio. For example, with a 20% margin: 0.8 < (height of image A / width of image A) / (height of image B / width of image B) < 1.2
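The aspect ratio test can be written as a small function. This is a sketch under the assumption that widths and heights are positive pixel counts; the function name and the `margin` parameter are illustrative and not taken from the bot.

```python
def similar_aspect_ratio(width_a, height_a, width_b, height_b, margin=0.20):
    """Return True if the two aspect ratios differ by less than the given margin.

    Assumes all dimensions are positive; margin=0.20 corresponds to the
    20% example in the text.
    """
    ratio_a = height_a / width_a
    ratio_b = height_b / width_b
    relative = ratio_a / ratio_b
    return (1 - margin) < relative < (1 + margin)
```

For example, an 800x600 image and a 200x150 image have the same ratio and pass, while an 800x600 image and a 200x200 image do not.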

Histogram

Histograms are the core of the matching. First the larger image has to be scaled down to the same size as the smaller one. It's probably best to make several histograms:

  • Whole images
  • Top left part of the images
  • Top right part of the images
  • Bottom left part of the images
  • Bottom right part of the images
  • Central part of the images

Each pair of histograms will match to a certain percentage. If this percentage is above a chosen threshold, we have a match.
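The region-by-region comparison could look roughly like this. The sketch makes several assumptions that are not in the text above: images are plain lists of rows of 0-255 grayscale values that have already been scaled to the same size, histograms use 16 bins, similarity is measured by histogram intersection, the six scores are simply averaged, and the threshold is 0.9. The actual bot may make different choices on every one of these points.

```python
def histogram(pixels, bins=16):
    """Count 0-255 grayscale values into equally wide bins."""
    counts = [0] * bins
    for row in pixels:
        for value in row:
            counts[value * bins // 256] += 1
    return counts


def overlap(hist_a, hist_b):
    """Histogram intersection as a fraction between 0 and 1."""
    total = sum(hist_a)
    if total == 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(hist_a, hist_b)) / total


def regions(pixels):
    """Yield (name, sub-image) for the whole image, the four corners and the centre."""
    h, w = len(pixels), len(pixels[0])
    boxes = {
        "whole": (0, 0, w, h),
        "top left": (0, 0, w // 2, h // 2),
        "top right": (w // 2, 0, w, h // 2),
        "bottom left": (0, h // 2, w // 2, h),
        "bottom right": (w // 2, h // 2, w, h),
        "centre": (w // 4, h // 4, w - w // 4, h - h // 4),
    }
    for name, (x0, y0, x1, y1) in boxes.items():
        yield name, [row[x0:x1] for row in pixels[y0:y1]]


def match_score(pixels_a, pixels_b):
    """Average histogram overlap over all six regions (images must be the same size)."""
    scores = [
        overlap(histogram(part_a), histogram(part_b))
        for (_, part_a), (_, part_b) in zip(regions(pixels_a), regions(pixels_b))
    ]
    return sum(scores) / len(scores)


def is_match(pixels_a, pixels_b, threshold=0.9):
    """Return True if the averaged overlap reaches the (assumed) threshold."""
    return match_score(pixels_a, pixels_b) >= threshold
```

Comparing the corners and the centre separately helps to reject images that happen to have the same overall brightness distribution but a different layout.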

Mark duplicates

The lower-quality image of the match should be marked with a template containing:

  • The location of the higher quality image
  • The size of this image and the other image
  • The height of this image and the other image
  • The width of this image and the other image
  • Maybe aspect ratio
  • The results of the histogram calculations
  • The match percentage
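As an illustration of how such a marker could be assembled, the sketch below builds the wikitext for a template call. The template name and all field names are placeholders invented for this example, not an existing Commons template, and the per-region histogram results are left out for brevity.

```python
def build_marker(original, small, large, match_percentage,
                 template="Scaled-down duplicate"):
    """Return wikitext for a template call.

    The template name and the field names are hypothetical placeholders.
    small and large are dicts with "size", "width" and "height" keys.
    """
    fields = [
        ("original", original),
        ("size", small["size"]),
        ("original size", large["size"]),
        ("width", small["width"]),
        ("original width", large["width"]),
        ("height", small["height"]),
        ("original height", large["height"]),
        ("match", "{:.1f}".format(match_percentage)),
    ]
    body = "".join("|{}={}".format(key, value) for key, value in fields)
    return "{{" + template + body + "}}"
```

The resulting string would be placed on the description page of the smaller image, so that a reviewer can see all the figures at a glance.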

Implementation

The first implementation is available in the pywikipedia package and is called match_images.py (source).