Commons:Monuments database/Harvesting

From Wikimedia Commons, the free media repository

This page describes the harvesting of structured lists into the monuments database. The process needs to be set up once per source; after that, it runs every night.

How does it work?

A bot harvests templates on Wikipedia. The bot is built on the Pywikibot framework (rewrite). It loops over all the sources, and for each source it loops over the pages that contain a row template. For each row template the bot grabs the different fields and inserts them into the database. Each source has a separate table with fields matching the fields in the row template. After all the sources have been harvested, these source tables are merged into one big table using one very big query.
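The per-page step can be illustrated with a minimal, standard-library sketch. It is not the bot's actual code: the real bot uses Pywikibot to fetch and parse pages, while this example parses a hard-coded wikitext string with a regular expression. The names `SOURCE`, `extract_rows` and `to_db_row`, and the fields `objectnaam` and `bouwjaar`, are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical source configuration, shaped like the fields described on this page.
SOURCE = {
    "rowTemplate": "Tabelrij rijksmonument",
    "fields": [
        {"source": "objrijksnr", "dest": "objrijksnr"},
        {"source": "objectnaam", "dest": "objectnaam"},  # hypothetical field
        {"source": "bouwjaar", "dest": ""},  # empty dest: the value is dropped
    ],
}


def extract_rows(wikitext, row_template):
    """Yield one {parameter: value} dict per use of row_template.

    Simplification: assumes values contain no nested templates or pipes.
    """
    pattern = re.compile(
        r"\{\{\s*" + re.escape(row_template) + r"\s*\|(.*?)\}\}", re.DOTALL
    )
    for match in pattern.finditer(wikitext):
        params = {}
        for part in match.group(1).split("|"):
            if "=" in part:
                name, value = part.split("=", 1)
                params[name.strip()] = value.strip()
        yield params


def to_db_row(params, fields):
    """Map template parameters to destination columns, skipping empty dests."""
    return {f["dest"]: params.get(f["source"], "") for f in fields if f["dest"]}


# Made-up list page content with a header template and two row templates.
page_text = """
{{Tabelkop rijksmonumenten}}
{{Tabelrij rijksmonument
| objrijksnr = 12345
| objectnaam = Voorbeeldkerk
| bouwjaar = 1650
}}
{{Tabelrij rijksmonument|objrijksnr=67890|objectnaam=Voorbeeldmolen}}
"""

rows = [
    to_db_row(params, SOURCE["fields"])
    for params in extract_rows(page_text, SOURCE["rowTemplate"])
]
```

Each dict in `rows` corresponds to one record that would be written to the source table; the header template is ignored because only the row template name is matched.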

Configure a new source

Anyone can configure a new source. You'll have to use Gerrit to get the code and submit a change.

  1. You first need structured lists. This guide assumes you have those already. You need to know the names of the header template and the row template.
  2. You'll have to set up Git/Gerrit. The location of the repository is ssh://USERNAME@gerrit.wikimedia.org:29418/labs/tools/heritage.git
  3. Create a branch with a suitable name (bug/<bugid> or something like source/<country>_<lang>)
  4. Open erfgoedbot/monuments_config.py in your favorite editor
  5. Copy the configuration of an existing country and modify it; this is the easiest way to start. All fields are described here.
    • project : This is always wikipedia ('wikipedia')
    • lang : The Wikipedia language code ('nl')
    • headerTemplate : The header template for the lists ('Tabelkop rijksmonumenten')
    • rowTemplate : The row template for the lists ('Tabelrij rijksmonument')
    • commonsTemplate : The template here at Commons to track images ('Rijksmonument'). You can leave this empty.
    • commonsTrackerCategory : The category added by the previous template ('Rijksmonumenten with known IDs'). You can leave this empty.
    • commonsCategoryBase : The base of the category tree at Commons ('Rijksmonumenten'). You can leave this empty.
    • autoGeocode : Whether to do automatic geocoding (False/True). Always start with False.
    • unusedImagesPage : Page on Wikipedia where to report unused images ('Wikipedia:Wikiproject/Erfgoed/Nederlandse Erfgoed Inventarisatie/Ongebruikte foto\'s'). You can leave this empty.
    • imagesWithoutIdPage : Page with a list of images without an identifier template at Commons ('Wikipedia:Wikiproject/Erfgoed/Nederlandse Erfgoed Inventarisatie/Foto\'s zonder id'). You can leave this empty.
    • namespaces : Namespaces to work on at Wikipedia ([0]).
    • table : Name of the table to store everything in (u'monuments_nl_(nl)'). The convention is monuments_<countrycode>_(<lang>). Please keep this in line.
    • truncate : Whether to empty the table on each update (False). You need to enable this if you don't have strong identifiers.
    • primkey : The primary key in the table ('objrijksnr'). This should be the identifier.
    • fields : All the fields the bot can find
      • source : The name of the field in the row template ('objrijksnr')
      • dest : The destination field in the SQL table (u'objrijksnr'). Please keep this ASCII to prevent problems. If you leave it empty, this information will just be dropped.
      • conv : Do we want to do any conversions? Deprecated; please leave empty.
  6. Save the file.
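For illustration, the fields above could come together in an entry like the following. This is a sketch, not a copy of the real file: the dictionary name `countries`, the `(countrycode, lang)` key and the `check_source` helper are assumptions, and the second field (`bouwjaar`) is made up. The other values are the examples given in the field descriptions. Check an existing entry in erfgoedbot/monuments_config.py for the exact layout.

```python
# Sketch of one source entry; dict name and key shape are assumptions.
countries = {
    ('nl', 'nl'): {
        'project': 'wikipedia',
        'lang': 'nl',
        'headerTemplate': 'Tabelkop rijksmonumenten',
        'rowTemplate': 'Tabelrij rijksmonument',
        'commonsTemplate': 'Rijksmonument',
        'commonsTrackerCategory': 'Rijksmonumenten with known IDs',
        'commonsCategoryBase': 'Rijksmonumenten',
        'autoGeocode': False,  # always start with False
        'unusedImagesPage': 'Wikipedia:Wikiproject/Erfgoed/Nederlandse Erfgoed Inventarisatie/Ongebruikte foto\'s',
        'imagesWithoutIdPage': 'Wikipedia:Wikiproject/Erfgoed/Nederlandse Erfgoed Inventarisatie/Foto\'s zonder id',
        'namespaces': [0],
        'table': 'monuments_nl_(nl)',  # monuments_<countrycode>_(<lang>)
        'truncate': False,
        'primkey': 'objrijksnr',
        'fields': [
            {'source': 'objrijksnr', 'dest': 'objrijksnr', 'conv': ''},
            # Hypothetical field with an empty dest: harvested but dropped.
            {'source': 'bouwjaar', 'dest': '', 'conv': ''},
        ],
    },
}


def check_source(key, config):
    """Hypothetical helper: return a list of problems in one source entry."""
    countrycode, lang = key
    problems = []
    expected_table = 'monuments_%s_(%s)' % (countrycode, lang)
    if config['table'] != expected_table:
        problems.append('table should be %s' % expected_table)
    dests = [field['dest'] for field in config['fields']]
    if config['primkey'] not in dests:
        problems.append('primkey is not a destination field')
    return problems


problems = check_source(('nl', 'nl'), countries[('nl', 'nl')])
```

A quick check like `check_source` is optional, but it catches the two mistakes that are easiest to make when copying another country: a table name that breaks the naming convention and a primary key that is not among the destination fields.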

You have now set up the source, but you also want its data in the big shared table.

  1. Edit erfgoedbot/sql/fill_table_monuments_all.sql. This is the MySQL code that puts everything in one big table.
  2. Copy a section and put it in the right location (the sections are sorted alphabetically).
  3. Modify the section to map your source fields to the shared fields (such as id, name, address, municipality, lat, lon, image, source, changed and monument_article).
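The idea behind such a section is a statement that selects from the source table and renames its columns to the shared ones. The sketch below demonstrates this with Python's built-in `sqlite3` module and an in-memory database, not with MySQL, so the exact statement in fill_table_monuments_all.sql may look different. The source columns `objectnaam` and `adres` and the `country` and `lang` columns of the shared table are assumptions made for this example, and only a few of the shared fields are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-ins for one source table and the shared table.
conn.executescript("""
CREATE TABLE "monuments_nl_(nl)" (
    objrijksnr TEXT PRIMARY KEY,
    objectnaam TEXT,
    adres TEXT,
    lat REAL,
    lon REAL
);
CREATE TABLE monuments_all (
    country TEXT,
    lang TEXT,
    id TEXT,
    name TEXT,
    address TEXT,
    lat REAL,
    lon REAL,
    PRIMARY KEY (country, lang, id)
);
INSERT INTO "monuments_nl_(nl)"
VALUES ('12345', 'Voorbeeldkerk', 'Kerkstraat 1', 52.0, 5.0);
""")

# One "section": map the source-specific columns onto the shared columns.
conn.execute("""
INSERT OR REPLACE INTO monuments_all (country, lang, id, name, address, lat, lon)
SELECT 'nl', 'nl', objrijksnr, objectnaam, adres, lat, lon
FROM "monuments_nl_(nl)"
""")

merged = conn.execute(
    "SELECT country, lang, id, name, address FROM monuments_all"
).fetchall()
```

Running the same statement again replaces the existing row instead of duplicating it, which is why a stable identifier in the source table matters.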

Now that everything is set up, submit your patch.

Deploying a new source

  1. Become heritage on Tool Labs.
  2. Update ~heritage on Tool Labs so that the code includes the new patch.
  3. Run python monument_tables.py. This creates the SQL file for the local monuments table, based on the configuration in monuments_config.py.
  4. Run mysql -h tools-db s51138__heritage_p < "sql/create_table_monuments_xx_(yy).sql" to create your table. Here xx is your country code and yy the language; both should match the configuration. Quote the filename, because it contains parentheses.
  5. Run python update_database.py -lang:yy -countrycode:xx -fullupdate and check that nothing strange happens.
  6. Check in SQL that you got about the expected number of items (for example SELECT COUNT(*) FROM `monuments_xx_(yy)`).
  7. Run mysql -h tools-db s51138__heritage_p < sql/fill_table_monuments_all.sql to put everything in the big table.
  8. When done, log your change in #wikimedia-labs: !log tools local-heritage Added source bla die bla
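Because steps 3 to 7 differ only in the country code and the language, the commands can be generated from those two values. The helper below is a hypothetical convenience and not part of the repository. It only builds the command strings from the steps above and does not run anything. The filename containing parentheses is quoted, since parentheses are special characters in most shells.

```python
def deployment_commands(countrycode, lang, database="s51138__heritage_p"):
    """Return the shell and SQL commands for deploying one source, in order."""
    table = "monuments_%s_(%s)" % (countrycode, lang)
    return [
        # Step 3: generate the SQL file for the local table.
        "python monument_tables.py",
        # Step 4: create the table.
        'mysql -h tools-db %s < "sql/create_table_%s.sql"' % (database, table),
        # Step 5: run a full harvest for this source only.
        "python update_database.py -lang:%s -countrycode:%s -fullupdate"
        % (lang, countrycode),
        # Step 6: sanity check on the number of harvested items.
        "SELECT COUNT(*) FROM `%s`" % table,
        # Step 7: merge all source tables into the big table.
        "mysql -h tools-db %s < sql/fill_table_monuments_all.sql" % database,
    ]


commands = deployment_commands("nl", "nl")
for command in commands:
    print(command)
```

Printing the commands first and pasting them one at a time keeps each step visible, which helps when checking that nothing strange happens during the first full update.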