Using database dumps of Commons
The database dumps of Commons (and other Wikis) can be invaluable when looking for information about large numbers of pages. Using them, however, is quite difficult, especially if you're not a MySQL wizard. The information available on the web about the subject, especially the instructions on Meta and Wikipedia, is incomplete and not very helpful. The online tool to perform SQL queries directly on the dump without downloading it is gone now. I spent a long time trying to figure out how to query the dumps on a Windows system. This is the recipe.
Note 1: This describes how to use a MySQL dump (.sql). XML dumps (.xml) can readily be parsed with any XML parser, and many websites give instructions on doing so (e.g. this one). If you want to use the procedure described here to open an XML dump, you can convert it to an SQL dump using Xml2sql (if you do so, it is highly recommended you convert to the mysqlimport format and load the data using
LOAD DATA INFILE, which is about 20 times faster than
INSERTing, but that is not within the scope of this tutorial).
Note 2: These instructions assume a Windows system. With minor modifications, though, they should work on other operating systems as well.
Carrying out these instructions may take a long time (several hours) but what you will have in the end can be well worth it! See the section Querying the database for an appetizer.
One last warning: Using databases from a project as big as Commons means crunching Gigabytes of data (the images database I use has 7.5 GB). If you have doubts that your system is up to the task, then it probably isn't. 4 GB of physical memory and a quadcore or better are the absolute minimum. I won't keep you from trying with less, but don't be disappointed if it doesn't work.
Install the following software (if not installed already):
- Winrar, or any other archive program capable of opening .gz and .bz2 archives (some of the dumps also use the .7z format that can be opened with 7-Zip)
- MySQL Community Server (version used for this tutorial: 5.1.37)
- Choose a Typical install
- After installation:
- Skip the two ad screens
- Check "Configure now", uncheck "Register"
- In the wizard, choose a Standard Configuration
- Deselect "Install as a service"
- Select "Include in PATH"
- If prompted to, choose a new root password (leave the "Current password" field blank)
- In the next screen, click "Execute"
- MySQL GUI Tools (version used for this tutorial: 5.0-r17)
- Choose a Complete install
Getting a dump
The database dumps for all Wikimedia projects can be found here. To get the most recent dumps of Commons, visit that page, search for "commonswiki" using the browser and click on the link that comes up (should be http://download.wikimedia.org/commonswiki/DATEYYYYMMDD/). You will be taken to the Commons dump page, where you can download the dump of your choice. For this tutorial, we will use image.sql.gz, the dump containing Metadata on current versions of uploaded images.
After the download is complete (which can take quite long), extract the .sql file contained inside the archive. For this tutorial, we will extract it to the hard disk root folder, C:\, making the path of the SQL dump C:\commonswiki-DATEYYYYMMDD-image.sql.
Creating the database and importing the dump
- Open a command prompt and execute the following command:
mysqld.exe --max_allowed_packet=500M --innodb_buffer_pool_size=500M --innodb_log_file_size=24MExplanation: This starts the MySQL server. The maximum packet size has to be increased in order to prevent the error "Server has gone away" when importing the HUGE dump file. Buffer pool size and log file size are increased to make importing faster (ideally, the log file size would be much bigger yet; but this caused MySQL to quit unexpectedly on my system).
- Now open another command prompt (the one opened before will be blocked by the MySQL server), go to the same folder and execute:
mysql.exe -u root -p -h localhostExplanation: This connects to the local MySQL server as user root.
- You might be prompted to Enter password: now; if you have chosen a password during installation, enter it, otherwise just press enter.
- The MySQL console will now be displayed and waiting for your commands (signified by the text mysql>). In the console, type the following and execute it by pressing enter:
CREATE DATABASE commonsimages;(Don't forget the semicolon!)
Explanation: As you might have guessed, this creates a MySQL database called "commonsimages".
- Type and execute:
USE commonsimages;Explanation: This switches to the newly created database.
- Now comes the critical step. Execute the following command:
SOURCE C:\commonswiki-DATEYYYYMMDD-image.sql;(Replace with the path to the dump as needed.)
Explanation: This imports the data from the dump file into the newly created database. This can take several hours (three and a half in my case), depending on your system and the size of the dump you downloaded.
When the import is finished, you will once again see your command prompt waiting for your input. No error messages should be displayed. Congratulations! You have completed the hard part, and now have an incredible amount of data at your fingertips.
Querying the database
So what to do with this data? Well, for starters, you can find things (e.g. images) in Commons that would be near-impossible to find otherwise. Before you read further, if you aren't familar with SQL, now is a good time for learning a little about it. SQL is a language that can be used to query databases for specific records in a very powerful and flexible fashion. An introduction to SQL for absolute beginners can be found here.
Ready? Then run MySQL Query Browser (part of the package MySQL GUI Tools you installed earlier). You will see a prompt for connecting to a MySQL server. In this prompt, enter the following:
- Host: "localhost"
- User: "root"
- Password: Password chosen during installation, if none was chosen leave empty
Click "OK". A popup window might appear telling you that you didn't specify a database schema. If it does, just click "Ignore".
Note: If you get an error message, chances are you either entered the wrong password, or the MySQL server isn't running. If this is the case, simply open a command prompt and run:
You should now see the Query Browser main window. The Query Browser allows you to use your database in a visual way, which is much more convenient than using a console.
That's it! You are ready to execute an SQL query of your choice on the database. I will now assume that you are familiar with SQL syntax and give you a few examples of the awesome possibilities such queries afford you.
Warning: Unless you are doing a very specific query, always LIMIT the number of results, or you will grow old before you see them.
For starters, let's find a few photographs with a resolution of 200 Megapixels or more (didn't know there were any on Commons? neither did I!):
Result (name, width, height):
Boyan_and_Tyravska.jpg, 31169, 10001
Column_of_Marcus_Aurelius_detailed_view_03.jpg, 8614, 31044
(If you leave out the "*.jpg" constraint you will end up with a lot of SVGs.)
Let's find some really complex vector images (complexity ~ file size > 20 MB):
Result (name, size):
Find a few photos taken with a Canon camera:
There are many more examples of searches that are impossible to carry out on Commons directly. Database dumps are an invaluable tool for any Commons user interested in looking beneath the surface.