Commons:Monuments database/Harvesting


This page describes the harvesting of structured lists into the monuments database. The process needs to be set up once per source; after that it runs every night.

How does it work?

A bot harvests templates on Wikipedia. The bot is built on the Pywikibot framework (rewrite). It loops over all the sources, and for each source it loops over the pages that contain a row template. For each row template the bot grabs the different fields and inserts them into the database. Each source has a separate table with fields matching the fields in the row template. After all the sources have been harvested, all these source tables are merged into one big table using one very big, automatically generated query.
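The per-source loop described above can be sketched roughly as follows. This is a simplified illustration, not the bot's actual code: it uses a regular expression and an in-memory SQLite database in place of Pywikibot and the real database, and the names SOURCE, PAGES, extract_rows and harvest, as well as the objectnaam and woonplaats fields, are assumptions made for the example.

```python
import re
import sqlite3

# Hypothetical source description; the field names are illustrative only.
SOURCE = {
    "rowTemplate": "Tabelrij rijksmonument",
    "table": "monuments_nl_(nl)",
    "primkey": "objrijksnr",
    "fields": ["objrijksnr", "objectnaam", "woonplaats"],
}

# Stand-in for the wikitext of list pages that the real bot would fetch.
PAGES = [
    """{{Tabelkop rijksmonumenten}}
{{Tabelrij rijksmonument|objrijksnr=1|objectnaam=Kerk|woonplaats=Voorbeeld}}
{{Tabelrij rijksmonument
| objrijksnr = 2
| objectnaam = Molen
| woonplaats = Voorbeeld
}}
|}"""
]


def extract_rows(wikitext, row_template):
    """Yield one dict of field values per row template in the text.

    Simplification: values must not contain nested templates or pipes.
    """
    pattern = re.compile(
        r"\{\{\s*" + re.escape(row_template) + r"\s*\|(.*?)\}\}", re.DOTALL
    )
    for match in pattern.finditer(wikitext):
        row = {}
        for part in match.group(1).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                row[key.strip()] = value.strip()
        yield row


def harvest(conn, source, pages):
    """Store every row template found in pages in the source's own table."""
    table = '"{}"'.format(source["table"])
    columns = ", ".join('"{}"'.format(name) for name in source["fields"])
    conn.execute(
        'CREATE TABLE IF NOT EXISTS {} ({}, PRIMARY KEY ("{}"))'.format(
            table, columns, source["primkey"]
        )
    )
    placeholders = ", ".join("?" for _ in source["fields"])
    insert = "INSERT OR REPLACE INTO {} ({}) VALUES ({})".format(
        table, columns, placeholders
    )
    count = 0
    for wikitext in pages:
        for row in extract_rows(wikitext, source["rowTemplate"]):
            if not row.get(source["primkey"]):
                continue  # a row without an identifier cannot be keyed
            conn.execute(insert, [row.get(name, "") for name in source["fields"]])
            count += 1
    conn.commit()
    return count


conn = sqlite3.connect(":memory:")
harvested = harvest(conn, SOURCE, PAGES)
print("rows harvested:", harvested)
```

Because the rows are keyed on the identifier, running the harvest again updates existing rows instead of duplicating them, which is why a strong identifier matters for nightly runs.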

Configure a new source

Anyone can configure a new source. You'll have to use Gerrit to get the code and submit a change.

  1. You first need structured lists. This guide assumes you have those already. We need the header and row template.
  2. You'll have to set up Git/Gerrit. The location of the repository is ssh://USERNAME@gerrit.wikimedia.org:29418/labs/tools/heritage.git
  3. Branch with a suitable name (bug/<bugid> or something like source/<country>_<lang>)
  4. Look into the monuments_config directory: each JSON file is a dataset
  5. Copy the configuration of an existing country and modify it, filling out the fields for your source; this is the easiest way to start. Most fields are described here.
    • project : This is most likely Wikipedia ('wikipedia')
    • lang : The Wikipedia language code ('nl')
    • headerTemplate : The header template for the lists ('Tabelkop rijksmonumenten')
    • rowTemplate : The row template for the lists ('Tabelrij rijksmonument')
    • commonsTemplate : The template here at Commons to track images ('Rijksmonument'). You can leave this empty.
    • commonsTrackerCategory : The category added by the previous template ('Rijksmonumenten with known IDs'). You can leave this empty.
    • commonsCategoryBase : The base of the category tree at Commons ('Rijksmonumenten'). You can leave this empty.
    • autoGeocode : Whether to do automatic geocoding (False/True). Always start with False.
    • unusedImagesPage : Page on Wikipedia where to report unused images ('Wikipedia:Wikiproject/Erfgoed/Nederlandse Erfgoed Inventarisatie/Ongebruikte foto\'s'). You can leave this empty.
    • imagesWithoutIdPage : Page with a list of images without an identifier template at Commons ('Wikipedia:Wikiproject/Erfgoed/Nederlandse Erfgoed Inventarisatie/Foto\'s zonder id'). You can leave this empty.
    • missingCommonscatPage : Page with a list of monuments where a category about the monument exists on Commons, but no link is in the list yet. ('Wikipedia:Wikiproject/Erfgoed/Nederlandse Erfgoed Inventarisatie/Missende commonscat links'). You can leave this empty.
    • namespaces : Namespaces to work on at Wikipedia ([0]).
    • table : Name of the table to store everything in (u'monuments_nl_(nl)'). The convention is monuments_<countrycode>_(<lang>). Please follow this convention.
    • truncate : Whether to empty the table on every update (False). You need this if you don't have strong identifiers.
    • primkey : The primary key in the table ('objrijksnr'). This should be the identifier.
    • fields : All the fields the bot can find
      • source : The name of the field in the row template ('objrijksnr')
      • dest : The destination field in the SQL table (u'objrijksnr'). Please keep this ASCII to prevent problems. If you leave it empty, this information will just be dropped.
      • conv : Whether to do any conversions. Deprecated; please leave empty.
    • sql_data : The mapping from your source fields to the shared fields (like id, name, address, municipality, lat, lon, image, source, changed & monument_article)
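As an illustration of the fields above, a source configuration might look roughly like the following, written here as a Python dictionary that could be dumped to JSON. The exact layout of the real files in monuments_config may differ, so treat the structure as an assumption; sql_data is left out because its shape is not described here. The objectnaam and bouwjaar fields and the check_config helper are hypothetical and only demonstrate the conventions mentioned in the list.

```python
import json
import re

# Illustrative configuration using the example values from the list above.
config = {
    "project": "wikipedia",
    "lang": "nl",
    "headerTemplate": "Tabelkop rijksmonumenten",
    "rowTemplate": "Tabelrij rijksmonument",
    "commonsTemplate": "Rijksmonument",
    "commonsTrackerCategory": "Rijksmonumenten with known IDs",
    "commonsCategoryBase": "Rijksmonumenten",
    "autoGeocode": False,
    "namespaces": [0],
    "table": "monuments_nl_(nl)",
    "truncate": False,
    "primkey": "objrijksnr",
    "fields": [
        {"source": "objrijksnr", "dest": "objrijksnr", "conv": ""},
        {"source": "objectnaam", "dest": "objectnaam", "conv": ""},
        # An empty dest means the value is harvested but dropped.
        {"source": "bouwjaar", "dest": "", "conv": ""},
    ],
}

# Convention from the list above: monuments_<countrycode>_(<lang>).
TABLE_NAME = re.compile(r"^monuments_[a-z-]+_\([a-z-]+\)$")


def check_config(cfg):
    """Return a list of problems found in a source configuration."""
    problems = []
    if not TABLE_NAME.match(cfg["table"]):
        problems.append("table does not follow monuments_<countrycode>_(<lang>)")
    dests = [field["dest"] for field in cfg["fields"] if field["dest"]]
    if cfg["primkey"] not in dests:
        problems.append("primkey is not among the destination fields")
    for dest in dests:
        if not dest.isascii():  # str.isascii needs Python 3.7 or later
            problems.append("destination field is not ASCII: " + dest)
    return problems


print(json.dumps(config, indent=2))
print(check_config(config))
```

A check like this is cheap to run before submitting a patch, since a wrong table name or a missing primary key otherwise only shows up during deployment.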

Now that everything is set up, submit your patch.

Deploying a new source

[needs update]

  1. Become heritage on Toolforge
  2. Update ~heritage on toolforge so that the code includes the new patch
  3. Run python monument_tables.py, which creates the SQL file for the local monuments table based on the config in the monuments_config.py file
  4. Run mysql -h tools-db s51138__heritage_p < 'sql/create_table_monuments_xx_(yy).sql' (where xx is your country code and yy the language; this should match the configuration) to create your table. The quotes keep the shell from interpreting the parentheses in the file name.
  5. Run python update_database.py -lang:yy -countrycode:xx -fullupdate and check that nothing strange happens
  6. Check in SQL whether you got about the expected number of items (for example SELECT COUNT(*) FROM `monuments_xx_(yy)`)
  7. Run mysql -h tools-db s51138__heritage_p < sql/fill_table_monuments_all.sql to put everything in the big table
  8. When done, log your change in #wikimedia-cloud: !log tools local-heritage Added source bla die bla
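To avoid typing the country code and language by hand in every command, steps 3 to 7 could be generated for a given source, for instance as below. This is only a dry-run sketch: deployment_commands is a hypothetical helper that returns the command lines as strings without running them, and using mysql -e for the count check is an assumption of this example. Paths and queries containing parentheses are quoted because most shells treat unquoted parentheses specially.

```python
import shlex

DATABASE = "s51138__heritage_p"  # database name taken from the steps above


def deployment_commands(countrycode, lang):
    """Return the shell commands for steps 3 to 7 as strings (dry run)."""
    table = "monuments_{}_({})".format(countrycode, lang)
    create_sql = "sql/create_table_{}.sql".format(table)
    count_query = "SELECT COUNT(*) FROM `{}`".format(table)
    return [
        "python monument_tables.py",
        "mysql -h tools-db {} < {}".format(DATABASE, shlex.quote(create_sql)),
        "python update_database.py -lang:{} -countrycode:{} -fullupdate".format(
            lang, countrycode
        ),
        "mysql -h tools-db {} -e {}".format(DATABASE, shlex.quote(count_query)),
        "mysql -h tools-db {} < sql/fill_table_monuments_all.sql".format(DATABASE),
    ]


for command in deployment_commands("nl", "nl"):
    print(command)
```

Printing the commands first and running them one at a time keeps the manual checks in steps 5 and 6 in place.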