Commons:OpenRefine/Adding structured data with OpenRefine
Below you will find step-by-step instructions on how to (batch) add structured data to (existing) Wikimedia Commons files with OpenRefine.
Download and install OpenRefine (version 3.6 or later!)
Download and install OpenRefine on your computer. To edit files on Wikimedia Commons, you need OpenRefine 3.6 or newer. You can download OpenRefine for Windows, macOS and Linux from https://openrefine.org/download.html.
Tips:
- Do you use Windows? Then choose the Windows with embedded JRE (Java Runtime Environment) version - it is more likely to run well on your system.
- If you have used OpenRefine before and have important projects that you don't want to lose, then it is wise to back up your OpenRefine project directory.
- You can find more download and installation instructions in OpenRefine’s documentation: https://docs.openrefine.org/manual/installing and https://docs.openrefine.org/manual/running
Get a list of file names that you want to work with
Get a list of the file names (on Wikimedia Commons) that you want to add structured data to. You can obtain such a list with various tools. One example is the PetScan tool.
The following link opens an example PetScan query, with HTML output: https://petscan.wmflabs.org/?psid=22129478
PetScan's full manual is available on meta.wikimedia.org.
Start an OpenRefine project
Usually, you will start editing Wikimedia Commons files in OpenRefine from a list of file names from Commons. The section above described how to retrieve such a list via the PetScan tool, but you can also obtain this list in other ways (e.g. from the Wikimedia Commons or Wikidata query service, or via another method of your choosing). The format of the file names can vary. The following formats (and possibly more) will work well in OpenRefine:
- Just the file name: Nv-diniltsxo.mp3
- With the File: prefix: File:PDP-UY_-_Orquesta_Estable_del_Teatro_Colon_-_Himno_Nacional_Uruguayo_-_instrumental_-_Debali_-_Victor-79694a-a79694a.flac
- Or various URL formats:
  - http://commons.wikimedia.org/wiki/Special:FilePath/3NT13DanJ.gif
  - https://commons.wikimedia.org/entity/M93838388
  - https://commons.wikimedia.org/wiki/File:Aalscholver_op_paal-4676900.webm
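OpenRefine accepts all of the formats listed above as they are, so no pre-processing is required. If you nevertheless want to tidy a mixed list before importing it (for example to spot duplicates), the formats can be told apart roughly as in the following Python sketch. The function name `classify_file_reference` is hypothetical and only covers the formats shown above.

```python
from urllib.parse import unquote, urlparse


def classify_file_reference(value: str) -> tuple[str, str]:
    """Return (kind, identifier) for one file reference.

    kind is "mid" for an entity URL (identifier is the M-id) and
    "filename" otherwise (identifier is the file name without "File:").
    """
    value = value.strip()
    if value.startswith(("http://", "https://")):
        # Decode percent-escapes so that names compare equal across formats.
        path = unquote(urlparse(value).path)
        if path.startswith("/entity/"):
            return "mid", path[len("/entity/"):]
        if path.startswith("/wiki/Special:FilePath/"):
            return "filename", path[len("/wiki/Special:FilePath/"):]
        if path.startswith("/wiki/File:"):
            return "filename", path[len("/wiki/File:"):]
        raise ValueError(f"unrecognised Commons URL: {value}")
    if value.startswith("File:"):
        value = value[len("File:"):]
    return "filename", value


if __name__ == "__main__":
    for example in (
        "Nv-diniltsxo.mp3",
        "http://commons.wikimedia.org/wiki/Special:FilePath/3NT13DanJ.gif",
        "https://commons.wikimedia.org/entity/M93838388",
    ):
        print(classify_file_reference(example))
```

The sketch deliberately leaves underscores and spaces untouched; the reconciliation step described further below is what actually matches names to files.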
You may have just a list of file names, or a larger spreadsheet or dataset with extra data about the files. Both are good starting points in OpenRefine.
Depending on the data format you have, you can now enter this data into OpenRefine and start a project with it.
Indicate that you will work with Wikimedia Commons (not Wikidata)
In order to make edits to Wikimedia Commons possible, start by adding the Wikimedia Commons manifest to OpenRefine. This manifest is a kind of 'settings' file that provides OpenRefine with all the information it needs to be able to edit Wikimedia Commons. Do this as follows:
- In the Wikidata extension menu at the top right of your OpenRefine project, choose Select Wikibase instance... and click Add Wikibase. You will be prompted to paste either a manifest URL, or paste the JSON directly. Wikimedia Commons' manifest URL is: https://raw.githubusercontent.com/OpenRefine/wikibase-manifests/master/wikimedia-commons-manifest.json
- After adding this URL, you should now see Wikimedia Commons in your list of Wikibase instances. Click Wikimedia Commons to activate it. You can now close this dialog window by clicking the Close button.
- Adding the Wikimedia Commons manifest in OpenRefine will also automatically add the Wikimedia Commons reconciliation service, which you will need a bit later in the process.
Reconcile the file names with Wikimedia Commons
Once you have loaded (at least) a list of files from Wikimedia Commons into OpenRefine, you can use the Wikimedia Commons Reconciliation Service as a starting point to begin batch editing these files. This step makes sure that OpenRefine recognizes these files, links them to their M-ids on Wikimedia Commons, and can edit them later. You start the reconciliation process by selecting Reconcile → Start reconciling... in the file column's menu. Then select the Wikimedia Commons reconciliation service and click the Start reconciling... button.
Watch a short (3'26") demo video of Wikimedia Commons reconciliation in OpenRefine
If you don’t have the Wikimedia Commons reconciliation service installed in OpenRefine yet, click the button Add standard service... and paste https://commonsreconcile.toolforge.org/en/api there. If you prefer working with properties and labels in a different language, you can replace the en string in that URL with the two-letter language code of your choice.
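The language substitution described above amounts to a simple string replacement, sketched below in Python. The second helper shows what a batch of reconciliation queries could look like; its shape follows the generic Reconciliation Service API convention (a JSON object with keys such as `q0`, each holding a `query` string), and whether this particular service expects exactly that payload is an assumption. Both function names are hypothetical. OpenRefine builds and sends these requests for you, so the sketch is only meant to show what happens behind the dialog.

```python
import json

# URL pattern taken from the instructions above; only the language part varies.
ENDPOINT_TEMPLATE = "https://commonsreconcile.toolforge.org/{lang}/api"


def reconciliation_endpoint(lang: str = "en") -> str:
    """Return the reconciliation service URL for a two-letter language code."""
    if len(lang) != 2 or not (lang.isascii() and lang.isalpha()):
        raise ValueError(f"expected a two-letter language code, got {lang!r}")
    return ENDPOINT_TEMPLATE.format(lang=lang.lower())


def build_queries(file_names: list[str]) -> str:
    """Serialise file names as a batch of reconciliation queries.

    Assumption: the service follows the generic reconciliation API,
    where each query is keyed q0, q1, ... and carries a "query" string.
    """
    queries = {f"q{i}": {"query": name} for i, name in enumerate(file_names)}
    return json.dumps(queries)


if __name__ == "__main__":
    print(reconciliation_endpoint("ja"))
    print(build_queries(["Nv-diniltsxo.mp3", "3NT13DanJ.gif"]))
```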
Extract Wikitext and structured data
This step is optional, but may be very useful. Existing files on Wikimedia Commons are always described with Wikitext, which usually contains information about the file's creator, license, and one or more Wikimedia Commons categories. It will often make sense to parse this Wikitext in OpenRefine, retrieving valuable bits of data from it which can be converted to structured data in a later step. Good examples of such data may include:
- The file's description, which you can convert to a file caption
- The file's creator
- The file's source
- Things depicted in the file, which may be mentioned in the file's categories
In order to create one or more new columns with Wikitext (and structured data statements) from your column of reconciled file names, select Edit column → Add columns from reconciled values... in the file column's menu. You will get a dialog window in which you can select one or more of the following options:
- Wikitext: will create a column with the (full) Wikitext of each file
- Various structured data statements; the dialog window suggests several common ones, but you can use the search functionality to search for any property that you are interested in
- You can retrieve file captions by typing the capital letter C, followed by the two-letter language code (e.g. Cen for English file captions, Cja for Japanese file captions).
Reconcile other columns with Wikidata
Structured data on Commons describes files on Commons by using (multilingual) items and properties from Wikidata.
Perhaps some of your columns correspond to Wikidata items. You will need to reconcile these, to help OpenRefine understand that it will need to make the link to these Wikidata items. Examples include:
- Creators (if they have a Wikidata item)
- Copyright statuses and licenses
- Depicted things, artworks, places, species, people…
You will reconcile these columns against the Wikidata reconciliation service, in English or another language that may be relevant (English usually works fine). The English Wikidata reconciliation service is installed by default in OpenRefine.
Reconciled columns have a header that is underlined with a dark green stripe; values in the column are blue hyperlinks which point to Wikidata items.
Create your editing schema
Finally, you will build a schema in OpenRefine, to model the Wikimedia Commons edits that OpenRefine will perform for each row in your project.
Click on the Schema tab in the blue bar above your dataset, or go to the Wikidata/Wikibase extension menu and select Edit Wikibase schema. You will get an empty schema window at first. Verify that the info text on top mentions Wikimedia Commons; if it mentions Wikidata, then you need to switch your Wikibase instance to Wikimedia Commons via the Select Wikibase instance... menu item in the Wikidata/Wikibase extension menu.
Click on the blue + add media link. Several fields will appear.
You can now type, and/or drag and drop all the info you want to include in the files' metadata.
- In the main field (which says "type entity or drag reconciled column here"), you will drag your reconciled column of file names (see the instructions above). Note: that column must have a green line (as a result of the reconciliation).
- Captions: if you have created columns with file captions, then you can drag them here. Make sure to add the corresponding language.
- Statements: click + add statement to add structured data statements, one by one. You can type values that are the same for all your files, or drag (reconciled) columns. Good basic statements are:
  - inception (P571) (creation date of the file, not of the artwork in the file) - see https://commons.wikimedia.org/wiki/Commons:Structured_data/Modeling/Date for some guidelines
  - creator (P170) (of the file) - see https://commons.wikimedia.org/wiki/Commons:Structured_data/Modeling/Author for some guidelines
  - source of file (P7482) - see https://commons.wikimedia.org/wiki/Commons:Structured_data/Modeling/Source for some guidelines
  - copyright status (P6216) (of the file) - see https://commons.wikimedia.org/wiki/Commons:Structured_data/Modeling/Copyright
  - copyright license (P275) (of the file) - see https://commons.wikimedia.org/wiki/Commons:Structured_data/Modeling/Copyright
  - depicts (P180), and for artworks also main subject and digital representation - see https://commons.wikimedia.org/wiki/Commons:Structured_data/Modeling/Depiction and https://commons.wikimedia.org/wiki/Commons:Structured_data/Modeling/Depiction#Works_of_art
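For readers who want to know what a statement from the list above looks like once it reaches Wikimedia Commons, the following Python sketch builds one item-valued statement in the shape of the Wikibase JSON data model, as far as that model is publicly documented. OpenRefine generates and uploads these structures itself, so nothing here needs to be written by hand. The function name `item_statement` is hypothetical, and Q146 (house cat) is used only as a placeholder value.

```python
import json


def item_statement(property_id: str, item_id: str) -> dict:
    """Build one statement whose value is a Wikidata item.

    Suitable for properties such as depicts (P180) or
    copyright license (P275), whose values are items.
    """
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "id": item_id},
            },
        },
    }


if __name__ == "__main__":
    # Placeholder value: "depicts (P180) -> house cat (Q146)".
    print(json.dumps(item_statement("P180", "Q146"), indent=2))
```

Date-valued statements such as inception (P571) use a different value type and are not covered by this sketch.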
Preview and upload your edits to Wikimedia Commons
You can preview your edits by clicking the Preview tab on top of your schema. The Issues tab will inform you about errors that may be present in your data or schema, so that you can fix them.
When you are ready to upload your edits, select Upload edits to Wikibase... in the Wikidata/Wikibase extension menu, and log in with your Wikimedia Commons credentials. OpenRefine will encourage you to use a bot password, but if you like, you can ignore this warning. Provide a descriptive edit summary. There is no need to change the maxlag value. Click Upload edits and your batch edit will start.
You will see your recently edited files in your own edit history on Wikimedia Commons.
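Besides checking your edit history, you can spot-check the structured data of an individual file through the MediaWiki action API, using its `wbgetentities` module with the file's M-id. This check is not part of the OpenRefine workflow described here; the Python sketch below only builds the request URL (the function name `entity_lookup_url` is hypothetical), which you can then open in a browser.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def entity_lookup_url(mids: list[str]) -> str:
    """Return an API URL that fetches the structured data of the given M-ids."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(mids),  # several ids are separated by a pipe
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


if __name__ == "__main__":
    # M-id taken from the example URL earlier on this page.
    print(entity_lookup_url(["M93838388"]))
```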
Oops! Made a mistake?
When checking your user contributions, you will see your recent Wikimedia Commons edits done with OpenRefine. Each OpenRefine edit displays a (details) hyperlink after the edit summary, which links to the edit batch in the EditGroups tool.
In EditGroups, entire batches can be easily undone, in case some mistakes have been made.
All Wikimedia Commons batches with OpenRefine are listed at https://editgroups-commons.toolforge.org/?tool=OR.