Commons:OpenRefine/Advanced tips and tricks
Advanced tasks - general
Wikimedia Commons functionalities not present? Adding the Wikimedia Commons manifest to OpenRefine
If you don't see Wikimedia Commons as an option for reconciliation or in the schema, then you must still add the Wikimedia Commons manifest to OpenRefine.
This manifest is a kind of 'settings' file that provides OpenRefine with all the information it needs to be able to edit Wikimedia Commons. Do this as follows:
- In the Wikidata extension menu at the top right of your OpenRefine project, choose Select Wikibase instance... and click Add Wikibase. You will be prompted to paste either a manifest URL (this is recommended) or the JSON directly. Wikimedia Commons' manifest URL is: https://raw.githubusercontent.com/OpenRefine/wikibase-manifests/master/wikimedia-commons-manifest.json
- After adding this URL, you should see Wikimedia Commons in your list of Wikibase instances. Click Wikimedia Commons to activate it. You can now close this dialog window by clicking the Close button.
- Adding the Wikimedia Commons manifest in OpenRefine will also automatically add the Wikimedia Commons reconciliation service.
Screenshot: Paste the link to the Wikimedia Commons manifest
Screenshot: Make sure to select (activate) the Wikimedia Commons manifest
- You can read more about Wikibase manifests and their application and usage in OpenRefine's user manual.
- A list of Wikibase manifests (including the one of Wikimedia Commons) is available on GitHub.
Adding the Wikimedia Commons reconciliation service to OpenRefine
If you don't see Wikimedia Commons as an option for reconciliation, then you must still add the Wikimedia Commons reconciliation service to OpenRefine.
In the menu of the column you want to reconcile, select Reconcile → Start reconciling... In the resulting reconciliation dialog window, click the button Add standard service... and paste https://commonsreconcile.toolforge.org/en/api there. If you prefer working with properties and labels in a different language, you can replace the en string in that URL with the two-letter language code of your choice.
More info and documentation about the Commons reconciliation service is available at https://commonsreconcile.toolforge.org/.
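The language substitution described above can be sketched in a few lines of Python. This is only an illustration of the URL pattern given in the text; the helper name `reconciliation_endpoint` is hypothetical, and the sketch does not check whether the service actually supports the chosen language.

```python
import re

# URL pattern taken from the text above; {lang} replaces the "en" segment.
BASE = "https://commonsreconcile.toolforge.org/{lang}/api"


def reconciliation_endpoint(lang: str = "en") -> str:
    """Return the reconciliation service URL for a two-letter language code."""
    code = lang.strip().lower()
    if not re.fullmatch(r"[a-z]{2}", code):
        raise ValueError(f"expected a two-letter language code, got {lang!r}")
    return BASE.format(lang=code)


print(reconciliation_endpoint("nl"))
# prints https://commonsreconcile.toolforge.org/nl/api
```

The resulting URL is what you would paste into the Add standard service... dialog.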
Manually reconciling file names with Wikimedia Commons
If you start OpenRefine projects via OpenRefine's Wikimedia Commons extension, then file names will already be reconciled. They will be blue and clickable, and the file name column will be highlighted with a dark green line.
If you start an OpenRefine project in another way, using a list of Wikimedia Commons files, you will still need to actively use the Wikimedia Commons Reconciliation Service as a starting point to begin batch editing these files. This step makes sure that OpenRefine recognizes these files, links them to their M-ids on Wikimedia Commons, and ensures that OpenRefine can edit them later.
You start the reconciliation process by selecting Reconcile → Start reconciling... in the file column's menu. Then select the Wikimedia Commons reconciliation service and click the Start reconciling... button. (See above on how to add the service if you don't see the Wikimedia Commons option yet.)
Video: a short (3'26") demo of Wikimedia Commons reconciliation in OpenRefine
Screenshot: First step to reconcile a column of file names against Wikimedia Commons
Screenshot: A list of reconciled files. Notice that the file names are now blue hyperlinks.
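To make the M-ids mentioned above more concrete: an M-id is the letter M followed by the page ID of the file page on Wikimedia Commons. The sketch below only illustrates that relationship and is not how OpenRefine or the reconciliation service is implemented. The helper names are hypothetical. The code builds a standard MediaWiki API query URL (which returns page IDs in its JSON response) but makes no network request.

```python
from urllib.parse import urlencode

# Standard MediaWiki API endpoint for Wikimedia Commons.
API = "https://commons.wikimedia.org/w/api.php"


def page_id_query_url(file_names):
    """Build an API URL that asks Commons for the page info of the given files."""
    titles = "|".join(
        name if name.startswith("File:") else "File:" + name
        for name in file_names
    )
    params = {"action": "query", "titles": titles, "format": "json"}
    return API + "?" + urlencode(params)


def mid_from_page_id(page_id):
    """Turn a Commons page ID into the corresponding M-id."""
    return f"M{int(page_id)}"


# Hypothetical file name and page ID, for illustration only.
print(page_id_query_url(["Example.jpg"]))
print(mid_from_page_id(12345))
```

In practice you do not need to do this yourself: reconciling the file name column in OpenRefine establishes the link for you.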
Favorite schemas in OpenRefine
Since OpenRefine version 3.7 it is possible to use, save, share and re-use favorite schemas in OpenRefine.
Watch this video demo:
Working with somevalue / novalue (or unknown value / no value) for Wikibase in OpenRefine
Watch this video demo to discover a way to work with somevalue/novalue Wikibase statements in OpenRefine (this functionality was only partially developed as of late 2023).
Advanced tasks - editing files
Obtain file names with the PetScan tool
If you want to get a list of file names from Wikimedia Commons by some means other than the "categories" approach in OpenRefine's Wikimedia Commons extension, you can retrieve a selection of file names with the PetScan tool.
PetScan gives you many different options to retrieve lists of file names based on various criteria, e.g. usage of specific templates, or using search.
Detailed instructions on how to do this with PetScan are available at Commons:PetScan/Generate list of Commons files.
Other ways to obtain lists of file names to work with
You can also obtain this list in other ways, e.g. from the Wikimedia Commons or Wikidata query service, or via another method of your choosing.
Other ways to start OpenRefine projects with lists of file names
You may have just a list of file names, or a larger spreadsheet or dataset with extra data about the files. Both are good starting points in OpenRefine.
Depending on the data format you have, you can enter this data into OpenRefine and start a project with it. You can use OpenRefine's Clipboard option to paste a list of file names (or a small dataset) from your computer's clipboard. Or, if you have the list of files in a .csv file or spreadsheet, you can open that file in OpenRefine in the usual way.
Screenshot: Starting a project from clipboard. Here, you can (for instance) simply paste a list of file names.
Screenshot: Starting an OpenRefine project by giving it a file on your computer.
You can read more about starting projects (and the settings for various data formats) in OpenRefine's user manual.
Advanced tasks - uploading files
Retrieve EXIF data from files
Sometimes, you have very little metadata about a set of files, but there may be valuable information (e.g. the name of the author, the creation date, a description, geographic coordinates...) in the EXIF data of each file.
OpenRefine itself cannot retrieve this EXIF data, but other tools make this straightforward.
You can use Exiftool to create a csv file with all the EXIF data from a list of files, which you then load in OpenRefine. This YouTube video explains the process nicely. The command that’s used is a variant of: exiftool -csv *.jpg > exifdata.csv
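If the resulting csv file has many more columns than you need, you can trim it before loading it in OpenRefine. The following is a minimal Python sketch under a few assumptions: the function name `select_exif_columns` is hypothetical, the sample row is made up, and the chosen column names (Artist, DateTimeOriginal, ImageDescription) are common EXIF tags that may or may not be present in your own files. SourceFile is the column Exiftool writes for the file path.

```python
import csv
import io

# Columns to keep; adjust to the tags that actually occur in your files.
WANTED = ("SourceFile", "Artist", "DateTimeOriginal", "ImageDescription")


def select_exif_columns(csv_text, wanted=WANTED):
    """Keep only the wanted columns; missing values become empty strings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: row.get(name) or "" for name in wanted} for row in reader]


# Made-up sample in the shape of Exiftool's csv output.
sample = (
    "SourceFile,Artist,DateTimeOriginal,Make\n"
    "photo1.jpg,Jane Doe,2021:05:01 10:00:00,Canon\n"
)

for row in select_exif_columns(sample):
    print(row)
```

For a real file, you would read the text with `open("exifdata.csv", newline="", encoding="utf-8").read()` and pass it to the same function. You can also load the full csv in OpenRefine and remove unneeded columns there.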
GREL recipes for Wikimedia Commons
GREL to extract information from Wikitext
(Wikimedia Commons extension only) Extract values from template parameters: extractFromTemplate
This syntax only works when you have installed the Wikimedia Commons extension in OpenRefine.
Use the following syntax:
extractFromTemplate(value, "BHL", "source")[0]
where you replace BHL with the name of the template (without curly brackets) and source with the parameter from which you want to extract the value. This GREL syntax will return the first (and usually the only) value of said parameter, e.g. https://www.flickr.com/photos/biodivlibrary/10329116385.
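To illustrate what this kind of extraction does, here is a rough Python approximation. It is not the extension's implementation: it is a simplified sketch that only handles flat templates (no nested templates, and no wikilinks containing pipes inside parameter values). The function name `extract_from_template` and the `title` parameter in the sample Wikitext are made up for the example.

```python
import re


def extract_from_template(wikitext, template, parameter):
    """Return all values of `parameter` in flat {{template|...}} calls."""
    pattern = re.compile(
        r"\{\{\s*" + re.escape(template) + r"\s*\|([^{}]*)\}\}"
    )
    values = []
    for match in pattern.finditer(wikitext):
        # Each part looks like " name = value"; split at the first "=".
        for part in match.group(1).split("|"):
            name, sep, val = part.partition("=")
            if sep and name.strip() == parameter:
                values.append(val.strip())
    return values


# Sample Wikitext; the "title" parameter is hypothetical.
wikitext = (
    "{{BHL\n"
    "| source = https://www.flickr.com/photos/biodivlibrary/10329116385\n"
    "| title = Example\n"
    "}}"
)

print(extract_from_template(wikitext, "BHL", "source")[0])
# prints https://www.flickr.com/photos/biodivlibrary/10329116385
```

As in the GREL expression, the function returns a list and `[0]` picks the first value.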
(Wikimedia Commons extension only) Extract Wikimedia Commons categories: value.extractCategories
This syntax only works when you have installed the Wikimedia Commons extension in OpenRefine.
Use the following syntax:
value.extractCategories().join('#')
This GREL syntax will return all categories mentioned in the Wikitext, separated by the # character, which you can then use to split the resulting cell further as needed.
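The join-then-split idea can be sketched in Python as follows. This is an approximation for illustration only, not the extension's code: the function name `extract_categories` is hypothetical, and the sketch returns category names without the "Category:" prefix, which may differ from what the extension returns. The sample Wikitext is made up.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sortkey]].
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def extract_categories(wikitext):
    """Return the category names found in the Wikitext, in order."""
    return [m.group(1) for m in CATEGORY_LINK.finditer(wikitext)]


# Made-up sample Wikitext.
wikitext = (
    "Some description\n"
    "[[Category:Botanical illustrations]]\n"
    "[[Category:Biodiversity Heritage Library|sortkey]]"
)

joined = "#".join(extract_categories(wikitext))
print(joined)             # one cell value, categories separated by #
print(joined.split("#"))  # split back into individual categories
```

The # separator works as long as no category name itself contains that character. In OpenRefine, the equivalent of the last step is splitting the multi-valued cell on #.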