Commons:Guide to batch uploading

This page is a work in progress page, not an article or policy, and may be incomplete and/or unreliable.
Please offer suggestions on the talk page.

Batch uploading, or data ingestion, is the automated upload of multiple files. This guide explains how to do it. See also Commons:Batch uploading for more information. For information on forming a relationship with a content partner in order to obtain content to upload, see Commons:Guide to content partnerships.

Before you start: licensing and permissions

Before you even think about batch uploading a set of files, be 100% sure that all the files are free and fall within the project scope. Make sure you know who the artists are and when they died (or at least which century they lived in), and use creator templates so that others can easily verify the year of death. If you need to arrange permission, do so before starting the batch upload. In most cases a batch upload will use a single license for the whole batch. Which license is appropriate depends on a variety of factors, especially whether the images show 2D or 3D objects.

2D works

2D works include paintings, drawings, photographs, and documents. If you intend to use one of the licenses below, make sure the images are actually 2D. Crop the frames of paintings if present. Do not upload photographs of rooms with murals or frescoes if they show architectural features, unless you provide a separate license for the photographer. Some appropriate licenses for 2D works include:

  • {{PD-Art}} - use for images of 2D artworks by artists who died more than 70 years ago.
  • {{Licensed-PD-Art}} - use for images of 2D artworks by artists who died more than 70 years ago when the photographer of the work has explicitly released their photographs under a free license.
  • {{PD-scan}} - use for scans or photocopies of 2D works by authors who died more than 70 years ago.

All of these templates can be passed a sub-license as a parameter. For example, if you are uploading images of paintings and you know that all of the artists died at least 100 years ago, you could use {{PD-Art|PD-old-100}}.

3D works

3D works include sculptures, buildings, paintings with artistic frames, coins, and some textiles. In most cases, two copyrights will be involved in these images: a copyright on the original work and a copyright on the photograph itself. (See Commons:Freedom of panorama for exceptions.) When you upload the images, be sure to specify the licensing for both the photograph and the work depicted in the photograph. See, for example, this photo of a 3D artwork from the Walters Art Museum. 3D works generally don't require specialized licensing tags. You can usually use standard PD-Old tags for the works and Creative Commons tags for the photographs.

Prerequisites

Get the files

Before you upload anything you need to have the files. You can either store the files locally or have URLs pointing to the exact location of the source files. Each URL should deep link directly to the jpg/ogg/... file. If all or some of the images need to be altered, for example to crop the frames of paintings or remove watermarks, it is easier to download all the images first and store them locally. Download scripts can be written in many languages. For example, in the R programming language one can use:

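# Read the list of downloads: URL in column 1, target file name in column 2
X <- read.csv("download.csv", header = TRUE, stringsAsFactors = FALSE)
# Download each file in binary mode; seq_len() also copes with an empty CSV
for (i in seq_len(nrow(X))) download.file(X[i, 1], X[i, 2], mode = "wb")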

Script to download a large number of files based on a CSV file with the URL in the first column and the file name in the second.

Get the metadata

Get the metadata so that you have enough information to construct the filename, description and categories later. For example, in the case of artworks you might need: authors, titles, techniques, dates, institutions holding the artwork, etc. Sources of metadata might include:

  • data provided by the websites themselves, see for example here
  • data provided by the GLAM institutions collaborating in the upload process (for example, if a museum is using TMS, they could export their data as an SQL file)
  • data scraped from the website: page scraping

Process

Open a subpage at Commons:Batch uploading to discuss the upload. On this page you can describe what you're uploading, get feedback and document progress.

Check for duplicates

Before bothering to construct all the information, check whether the file already exists on Commons.

  1. Calculate the SHA-1 hash of the file. (In PHP, you can use the sha1_file() function.)
  2. Ask the API whether a file with that hash exists.
  3. If a file with the same hash already exists, skip the file, or verify that the existing description is in the correct format and add metadata if needed.
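The three steps above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: it assumes the standard MediaWiki API query `list=allimages` with its `aisha1` parameter, and the helper names (`sha1_of_file`, `duplicate_query_url`, `has_duplicate`) are made up for illustration. The actual HTTP request is left out so the block runs offline.

```python
import hashlib
from urllib.parse import urlencode

# Assumed endpoint of the Commons API.
API_URL = "https://commons.wikimedia.org/w/api.php"


def sha1_of_file(path, chunk_size=1 << 20):
    """Step 1: compute the SHA-1 hash of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_query_url(sha1_hex):
    """Step 2: build the API request that lists files with this hash."""
    params = {
        "action": "query",
        "list": "allimages",
        "aisha1": sha1_hex,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def has_duplicate(api_response):
    """Step 3: True if the decoded JSON response lists at least one file."""
    return len(api_response.get("query", {}).get("allimages", [])) > 0
```

In use, you would fetch the URL returned by `duplicate_query_url()` with an HTTP client of your choice, decode the JSON, and pass the result to `has_duplicate()` before deciding whether to skip the file.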

Naming

See also Commons:File naming

Define a file naming convention that makes sense, while making sure filenames are unique and not already used at Commons. You might want to include:

  • Title and/or brief description (up front, so that when names are clipped readers can still figure out what the file is about)
  • Year
  • Source/Institution name
  • Accession number/Record identifier - Adding the unique identifier used by the source institution allows easier linking and increases the probability that the filename is unique.
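A naming convention along these lines could be sketched as below. The function name `build_filename`, the " - " separator, the 100-character title limit and the set of characters replaced are all illustrative assumptions, not rules taken from Commons:File naming; adjust them to the convention agreed for your batch.

```python
import re


def build_filename(title, year, institution, accession, ext, max_title=100):
    """Join metadata fields into a file name, with the title up front."""
    title = title.strip()
    if len(title) > max_title:
        # Keep the title short so the identifier at the end is not clipped.
        title = title[:max_title].rstrip()
    parts = [title, year, institution, accession]
    # Skip empty fields (None or "") instead of leaving dangling separators.
    name = " - ".join(str(part) for part in parts if part)
    # Assumption: replace characters that are awkward in wiki page titles.
    name = re.sub(r"[#<>\[\]|{}:/\\]", "-", name)
    # Collapse runs of whitespace into single spaces.
    name = re.sub(r"\s+", " ", name).strip()
    return name + "." + ext.lstrip(".")


print(build_filename("Auto des Kdo Det Simplon", None, "CH-BAR", "3236933", "tif"))
```

Putting the record identifier last keeps names unique even when two works share a title, while the title at the front remains readable when the name is truncated in listings.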

Descriptions and templates

The file descriptions are derived from the metadata and written in wikitext format.
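As a rough illustration of turning one metadata record into wikitext, the sketch below fills the general-purpose {{Information}} template and appends a license tag. The function name `build_description` and the sample values are hypothetical, and a real batch of artworks may call for a more specialised template agreed on during review.

```python
def build_description(description, date, source, author, license_wikitext, lang="en"):
    """Return file-page wikitext built from one metadata record."""
    lines = [
        "== {{int:filedesc}} ==",
        "{{Information",
        # Wrap the description in a language template such as {{en|1=...}}.
        "|description={{" + lang + "|1=" + description + "}}",
        "|date=" + date,
        "|source=" + source,
        "|author=" + author,
        "}}",
        "",
        "== {{int:license-header}} ==",
        license_wikitext,
    ]
    return "\n".join(lines)


# Hypothetical record, for illustration only.
print(build_description(
    description="Portrait of a woman, oil on canvas",
    date="1850",
    source="Example Museum online collection",
    author="Example Artist",
    license_wikitext="{{PD-Art|PD-old-100}}",
))
```

Keeping the license as a separate argument matches the usual case where one license is chosen once for the whole batch.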

Categories

See also Commons:Categories

The files you're going to upload need to be integrated into the category tree on Commons. Categories are a common way for users to find files. As full a categorization as possible is key to getting your images seen by the widest audience.

Types of categories

Broadly speaking, a file can have two types of categories: categories related to the topic of the file, and categories related to the origin of the file.

Consider the file Auto des Kdo Det Simplon - CH-BAR - 3236933.tif. It is a reproduction of a 1910s photograph of men in a Pic-Pic car, from the collection of the Swiss Federal Archives.

The topic categories for this picture are Pic-Pic vehicles, 1910s photographs and Switzerland in World War I (mechanized vehicles).

The origin categories are CH-BAR Collection First World War Switzerland and Media contributed by the Swiss Federal Archives.

Origin categories can be further divided into tracking categories and source categories. Tracking categories are of little use to 'normal users', but are essential for tracking all content from a source. For example, the BaGLAMa tool uses the tracking category to provide monthly page views.

A tracking category can also indicate that the reproduction is 'officially donated'. Reproductions (especially of public domain works) can be found widely on the internet and can be uploaded by any Commons user. In a batch upload, by contrast, the source is usually 'verified' (e.g. via an API or a data dump). It therefore makes sense to have both a tracking category, such as Media contributed by <institution>, and a source category indicating the collection of that institution, such as Collections of <institution>.

For files that 'belong together' it also makes sense to create an overarching source category, such as Decorative arts in the Louvre - Room 19. This category should be properly placed in the correct hierarchy (e.g. it's a child of Decorative arts in the Louvre).

Note that categories on Commons are far from standardised and can be unpredictable. In general, category names are in English, but there are many cases of inconsistent naming and errors.

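Putting it into action

To summarize: every file you upload should have:

  • A tracking category, such as Media contributed by <institution>
  • A source category, such as Collections of <institution>

Your files can also have: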

  • As many topic categories as possible from the metadata. These may include, as appropriate: date (Category:1905 in France), location, place of creation, artist, type of object, style, material, technique, subject, etc. Ideally use as precise a category as possible within each tree (check what sub-categories there are), and if a large number of files - say over 20, but sometimes fewer - are going to be added to a category, it is often best to either spread them among subcategories, or create a new sub-category for them.
  • A "To check" category for post-upload maintenance (see Category:To be checked)
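The category block for a file page could be assembled along these lines. This is a minimal sketch: the function name `build_category_wikitext` and its parameters are illustrative assumptions, and the category names in the usage example are the ones from the Swiss Federal Archives picture discussed earlier.

```python
def build_category_wikitext(tracking, source, topics, to_check="To be checked"):
    """Return one [[Category:...]] line per category, without duplicates."""
    names = [tracking, source, *topics]
    if to_check:
        # Optional maintenance category for post-upload checks.
        names.append(to_check)
    seen = set()
    lines = []
    for name in names:
        # Underscores and spaces are interchangeable in category names.
        name = name.replace("_", " ").strip()
        if name and name not in seen:
            seen.add(name)
            lines.append("[[Category:" + name + "]]")
    return "\n".join(lines)


print(build_category_wikitext(
    tracking="Media contributed by the Swiss Federal Archives",
    source="CH-BAR Collection First World War Switzerland",
    topics=["Pic-Pic_vehicles", "1910s photographs"],
))
```

The result can be appended to the description wikitext of each file; passing `to_check=None` omits the maintenance category.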

Do a test upload

Upload a few images, and ask for feedback on the Commons:Batch uploading subpage. Reviewers will point out or come up with some crazy and unheard-of templates that you will want to integrate. You will certainly have to go through several iterations before you have the green light. Please be patient: it is better to get it right the first time than to fix uploads afterwards.

If you want to test uploading and experiment with tools or new templates in a safe environment, you can set up an account on the beta cluster. This is a mirror of Wikimedia Commons where, if things go wrong, you will not cause any disruption to the live environment. See http://commons.wikimedia.beta.wmflabs.org and this explanation.

Create a new user for your upload bot

If you do not already have one, you will need to request a bot account at Commons:Bots/Requests.

Do the real upload

Although upload bots can be written in several different languages and with different existing frameworks, most bots so far have been based on the Python Wikipediabot Framework. You can also reuse code shared by other batch uploaders.

Another possibility is the GLAMwiki Toolset (also known as the GWToolset), an on-wiki tool that allows you to batch upload files from a structured data source, such as an XML file.

Other possibilities include:

If your files are too big, or for very large batches, you can request a server-side upload.

Documentation

Set up a Commons page to describe the project: