Commons:OpenRefine/Adding structured data with OpenRefine

From Wikimedia Commons, the free media repository
If you prefer watching a video, this tutorial during Wikidata Lab XXXIV (9 June 2022) also gives an introduction to editing Wikimedia Commons with OpenRefine. (Approximately 1 hour 20 minutes)

Below you will find step-by-step instructions on how to (batch) add structured data to (existing) Wikimedia Commons files with OpenRefine.

Download and install OpenRefine (version 3.6 or later!)

Download and install OpenRefine on your computer. To edit files on Wikimedia Commons, you need OpenRefine 3.6 or newer. You can download OpenRefine for Windows, macOS and Linux from https://openrefine.org/download.html.

Get a list of file names that you want to work with

Get a list of the file names (on Wikimedia Commons) that you want to add structured data to. You can obtain such a list with various tools; one example is PetScan. The guide below gives detailed instructions on how to do this with PetScan:

A step-by-step guide on retrieving a list of Commons file names using the PetScan tool

First of all, launch the PetScan tool.

[Screenshot: PetScan Commons category selection.png] You will start the tool inside the first tab ('Categories').
  • Make sure you select Wikimedia Commons here, by clicking on Commons.
  • Categories: Type or paste one or more Commons category names here which contain the file names you want to retrieve. Omit the Category: prefix.
    • You can indicate the depth with which you want to retrieve files from the category tree. In the example shown in the screenshot, we are retrieving files that are directly in the category Uploaded with iNaturalist2Commons AND that have the category Lepidoptera of Australia (or one of its subcategories, up to three levels deep).
  • Combination: if you select the radio button 'Intersection', you will only retrieve those files that are in all your chosen categories - usually a smaller number of files. If you select the radio button 'Union', you will retrieve a larger number of files that are in either of the categories you entered. Usually 'Intersection' is the logical option.

[Screenshot: PetScan Commons choose only files.png] Go to the next tab in the tool ('Page properties').
  • Namespaces: Deselect the first (unnamed) checkbox and select the File checkbox. This indicates that you only want to retrieve file names (not categories, not gallery page titles, etc).

[Screenshot: PetScan Commons do it button.png] If you like, you can now already click on the 'Do it!' button to verify that you are indeed retrieving the right file names.

[Screenshot: PetScan Commons output plain text.png] In some cases, it is convenient to retrieve the file names as plain text, or in another format. You can adjust that in the last PetScan tab ('Output').
  • Format: for instance, select the 'Plain text' radio button.
  • Scroll down and click on the 'Do it!' button again. You will now see the list of file names as plain text.

[Screenshot: PetScan Commons plain text file output.png] If you chose 'Plain text' as output format, you will get a list of file names in plain text.

The following link gives you the above-shown example, with HTML output: https://petscan.wmflabs.org/?psid=22129478

PetScan's full manual is available on meta.wikimedia.org.
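
The difference between the 'Intersection' and 'Union' options of PetScan's Combination setting can be pictured with ordinary set operations. The Python sketch below is only an illustration of that idea, not PetScan's implementation; the file names and category contents are invented for the example.

```python
# Hypothetical contents of two Commons categories (invented file names).
inaturalist_uploads = {"File:Moth A.jpg", "File:Moth B.jpg", "File:Beetle C.jpg"}
lepidoptera_of_australia = {"File:Moth B.jpg", "File:Butterfly D.jpg"}

# 'Intersection': only files that are in all chosen categories.
intersection = inaturalist_uploads & lepidoptera_of_australia

# 'Union': files that are in at least one of the chosen categories.
union = inaturalist_uploads | lepidoptera_of_australia

print(sorted(intersection))
print(sorted(union))
```

With these invented sets, the intersection holds one file and the union holds four, which matches the remark that 'Intersection' usually returns the smaller list.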

Start an OpenRefine project

Usually, you will start editing Wikimedia Commons files in OpenRefine from a list of file names from Commons. The section above described how to retrieve such a list via the PetScan tool, but you can also obtain this list in other ways (e.g. from the Wikimedia Commons or Wikidata query service, or via another method of your choosing). The format of the file names can vary. The following formats (and possibly more) will work well in OpenRefine:

  • Just the file name - Nv-diniltsxo.mp3
  • With File: prefix - File:PDP-UY_-_Orquesta_Estable_del_Teatro_Colon_-_Himno_Nacional_Uruguayo_-_instrumental_-_Debali_-_Victor-79694a-a79694a.flac
  • Or various URL formats:
    • http://commons.wikimedia.org/wiki/Special:FilePath/3NT13DanJ.gif
    • https://commons.wikimedia.org/entity/M93838388
    • https://commons.wikimedia.org/wiki/File:Aalscholver_op_paal-4676900.webm
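
OpenRefine accepts these formats as they are, so no cleanup is required. If you nevertheless want a uniform list before importing (for example to spot duplicates), the formats above can be normalized with a few lines of Python. This is a minimal sketch under stated assumptions: the function name normalize_commons_reference is hypothetical, underscores are treated as spaces (as MediaWiki does for page titles), and entity URLs are reduced to their M-id because the file name cannot be derived from the URL alone.

```python
import re
from urllib.parse import unquote, urlparse


def normalize_commons_reference(value):
    """Return 'File:<name>' for file names and file URLs, or the M-id for entity URLs."""
    value = value.strip()
    if value.startswith(("http://", "https://")):
        path = unquote(urlparse(value).path)
        # Entity URL: .../entity/M93838388 -> keep the M-id unchanged.
        entity = re.fullmatch(r"/entity/(M\d+)", path)
        if entity:
            return entity.group(1)
        # Special:FilePath URL or regular file page URL.
        file_path_prefix = "/wiki/Special:FilePath/"
        if path.startswith(file_path_prefix):
            value = path[len(file_path_prefix):]
        elif path.startswith("/wiki/"):
            value = path[len("/wiki/"):]
    # Drop an optional 'File:' prefix, then treat underscores as spaces.
    if value.lower().startswith("file:"):
        value = value[len("file:"):]
    return "File:" + value.replace("_", " ")


examples = [
    "Nv-diniltsxo.mp3",
    "http://commons.wikimedia.org/wiki/Special:FilePath/3NT13DanJ.gif",
    "https://commons.wikimedia.org/entity/M93838388",
    "https://commons.wikimedia.org/wiki/File:Aalscholver_op_paal-4676900.webm",
]
for example in examples:
    print(normalize_commons_reference(example))
```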

You may have just a list of file names, or a larger spreadsheet or dataset with extra data about the files. Both are good starting points in OpenRefine.

Depending on the data format you have, you can now enter this data into OpenRefine and start a project with it.

Indicate that you will work with Wikimedia Commons (not Wikidata)

In order to make edits to Wikimedia Commons possible, start by adding the Wikimedia Commons manifest to OpenRefine. This manifest is a kind of 'settings' file that provides OpenRefine with all the information it needs to be able to edit Wikimedia Commons. Do this as follows:

  • In the Wikidata extension menu at the top right of your OpenRefine project, choose Select Wikibase instance.... Click Add Wikibase. You will be prompted to paste either a manifest URL, or paste the JSON directly. Wikimedia Commons' manifest URL is: https://raw.githubusercontent.com/OpenRefine/wikibase-manifests/master/wikimedia-commons-manifest.json
  • After adding this URL, you should now see Wikimedia Commons in your list of Wikibase instances. Click Wikimedia Commons to activate it. You can now close this dialog window by clicking the Close button.
  • Adding the Wikimedia Commons manifest in OpenRefine will also automatically add the Wikimedia Commons reconciliation service, which you will need a bit later in the process.

Reconcile the file names with Wikimedia Commons

Once you have loaded (at least) a list of files from Wikimedia Commons into OpenRefine, you can use the Wikimedia Commons Reconciliation Service as a starting point to begin batch editing these files. This step makes sure that OpenRefine recognizes these files, links them to their M-ids on Wikimedia Commons, and can edit them later. You start the reconciliation process by selecting Reconcile → Start reconciling... in the file column's menu. Then select the Wikimedia Commons reconciliation service and click the Start reconciling... button.

If you don’t have the Wikimedia Commons reconciliation service installed in OpenRefine yet, click the button Add standard service... and paste https://commonsreconcile.toolforge.org/en/api there. If you prefer working with properties and labels in a different language, you can replace the en string in that URL with the two-letter language code of your choice.
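
The language switch in the service URL, and the kind of request a reconciliation client sends, can be sketched in Python as follows. OpenRefine does all of this for you; the sketch only shows the idea. The helper names are hypothetical, and the batch format (a form parameter named queries holding a JSON object of queries) follows the generic reconciliation API convention; that the Commons service accepts exactly this form is an assumption here. No network request is made.

```python
import json
from urllib.parse import urlencode

ENDPOINT_TEMPLATE = "https://commonsreconcile.toolforge.org/{lang}/api"


def reconciliation_endpoint(lang="en"):
    """Return the service URL for a two-letter language code."""
    return ENDPOINT_TEMPLATE.format(lang=lang)


def build_queries(file_names):
    """Build a batch of queries keyed q0, q1, ... (the keys are arbitrary labels)."""
    return {f"q{i}": {"query": name} for i, name in enumerate(file_names)}


queries = build_queries(["File:Nv-diniltsxo.mp3"])
# Form-encoded body, as a reconciliation client would typically send it (assumption).
form_body = urlencode({"queries": json.dumps(queries)})

print(reconciliation_endpoint("nl"))
print(form_body)
```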

Extract Wikitext and structured data

This step is optional, but may be very useful. Existing files on Wikimedia Commons are always described with Wikitext, which usually contains information about the file's creator, license, and one or more Wikimedia Commons categories. It will often make sense to parse this Wikitext in OpenRefine, retrieving valuable bits of data from it which can be converted to structured data in a next step. Good examples of such data may include:

  • The file's description, which you can convert to a file caption
  • The file's creator
  • The file's source
  • Things depicted in the file may be mentioned in the file's categories
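
To give an impression of what such parsing involves, here is a minimal Python sketch that pulls a few fields and the categories out of a Wikitext string with regular expressions. The sample Wikitext, the author name and the helper names are invented for illustration. The sketch assumes single-line template parameters; real Wikitext often contains nested templates and multi-line values that need more careful handling.

```python
import re

# Invented sample Wikitext, shaped like a typical file description page.
SAMPLE_WIKITEXT = """=={{int:filedesc}}==
{{Information
|description={{en|1=A moth on a leaf}}
|date=2021-03-04
|source=https://www.inaturalist.org/photos/123
|author=Jane Example
}}

[[Category:Lepidoptera of Australia]]
[[Category:Uploaded with iNaturalist2Commons]]
"""


def information_field(wikitext, name):
    """Return the value of a single-line '|name=value' parameter, or None."""
    pattern = rf"^[ \t]*\|[ \t]*{re.escape(name)}[ \t]*=[ \t]*(.*?)[ \t]*$"
    match = re.search(pattern, wikitext, flags=re.IGNORECASE | re.MULTILINE)
    return match.group(1) if match else None


def categories(wikitext):
    """Return the category names linked in the Wikitext, without the prefix."""
    found = re.findall(r"\[\[\s*Category\s*:\s*([^\]|]+)", wikitext, flags=re.IGNORECASE)
    return [name.strip() for name in found]


print(information_field(SAMPLE_WIKITEXT, "author"))
print(information_field(SAMPLE_WIKITEXT, "source"))
print(categories(SAMPLE_WIKITEXT))
```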

In order to create one or more new columns with Wikitext (and structured data statements) from your column of reconciled file names, select Edit column → Add columns from reconciled values... in the file column's menu. You will get a dialog window in which you can select one or more of the following options:

  • Wikitext: will create a column with the (full) Wikitext of each file
  • Various structured data statements; the dialog window suggests several common ones, but you can use the search functionality to search for any property that you are interested in
  • You can retrieve file captions by typing the capital letter C, followed by the two-letter language code (e.g. Cen for English file captions, Cja for Japanese file captions).

Reconcile other columns with Wikidata

Structured data on Commons describes files on Commons by using (multilingual) items and properties from Wikidata.

Perhaps some of your columns correspond to Wikidata items. You will need to reconcile these, so that OpenRefine can link their values to the corresponding Wikidata items.

You will reconcile these columns against the Wikidata reconciliation service, in English or another language that may be relevant (English usually works fine). The English Wikidata reconciliation service is installed by default in OpenRefine.

Reconciled columns have a header that is underlined with a dark green stripe; values in the column are blue hyperlinks which point to Wikidata items.

Create your editing schema

Finally, you will build a schema in OpenRefine, to model the Wikimedia Commons edits that OpenRefine will perform for each row in your project.

Click on the Schema tab in the blue bar above your dataset, or go to the Wikidata/Wikibase extension menu and select Edit Wikibase schema. You will get an empty schema window at first. Verify that the info text on top mentions Wikimedia Commons; if it mentions Wikidata, then you need to switch your Wikibase instance to Wikimedia Commons via the Select Wikibase instance... menu item in the Wikidata/Wikibase extension menu.

Click on the blue + add media link. Several fields will appear.

You can now type, and/or drag and drop all the info you want to include in the files' metadata.
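
To make concrete what one row of such a schema amounts to, the Python sketch below assembles the data for a single file in the general shape of Wikibase's entity JSON: a file caption is stored as a label, and a 'depicts' statement uses property P180 with a Wikidata item as its value. This only illustrates the data model. OpenRefine builds and sends its edits itself, the function name is hypothetical, and the caption and Q-id in the example are placeholders.

```python
import json


def media_info_edit(caption_lang, caption, depicts_qid):
    """Build entity data with one caption and one 'depicts' (P180) statement."""
    return {
        # File captions are stored as labels, keyed by language code.
        "labels": {caption_lang: {"language": caption_lang, "value": caption}},
        "claims": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P180",  # depicts
                    "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {
                            "entity-type": "item",
                            "id": depicts_qid,
                            "numeric-id": int(depicts_qid[1:]),
                        },
                    },
                },
                "type": "statement",
                "rank": "normal",
            }
        ],
    }


# Placeholder values: an invented caption and an arbitrary Q-id.
edit = media_info_edit("en", "A moth on a leaf", "Q12345")
print(json.dumps(edit, indent=2))
```

In OpenRefine, the caption and the Q-id would come from columns of your project (the latter from a column reconciled against Wikidata), one such set of values per row.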

Preview and upload your edits to Wikimedia Commons

You can preview your edits by clicking the Preview tab on top of your schema. The Issues tab will inform you about errors that may be present in your data or schema, so that you can fix them.

When you are ready to upload your edits, select Upload edits to Wikibase... in the Wikidata/Wikibase extension menu, and log in with your Wikimedia Commons credentials. OpenRefine will encourage you to use a bot password, but if you like, you can ignore this warning. Provide a descriptive edit summary. There is no need to change the maxlag value. Click Upload edits and your batch edit will start.

You will see your recently edited files in your own edit history on Wikimedia Commons.

Oops! Made a mistake?

When checking your user contributions, you will see your recent Wikimedia Commons edits done with OpenRefine. Each OpenRefine edit displays a (details) hyperlink after the edit summary, which links to the edit batch in the EditGroups tool.

In EditGroups, entire batches can be easily undone, in case some mistakes have been made.

All Wikimedia Commons batches with OpenRefine are listed at https://editgroups-commons.toolforge.org/?tool=OR.