Commons:Structured data/About/FAQ


This page is a work in progress, not an article or policy, and may be incomplete and/or unreliable.
Please offer suggestions on the talk page.

About the project

What is Structured Data (on Commons)?

Structured Data on Wikimedia Commons (also known more briefly as Structured Data on Commons, or Structured Commons) is a project that provides the technical infrastructure to complement the wikitext, templates and categories on Commons with structured data.

Wikimedia Commons operates on MediaWiki, the same software that powers Wikipedia. MediaWiki was primarily developed for hosting text, as on Wikipedia. So, typically, each media file on Commons is accompanied by plain-text descriptions (wikitext, templates) and categories. These are usually available in only one language (mostly English) and, most importantly, are not consistently machine-readable.

Structured metadata allows data—in this case about media files on Wikimedia Commons—to be accessible in a robust, consistent, structured, and linked format: a format that allows software to understand, on a large scale, what the metadata fields mean (structured) and to connect them to other databases on the internet, putting them in a broader context (linked). Structured metadata is also more granular, thus easier to translate between languages, than unstructured data.

This additional metadata makes it possible to use Commons' media in new ways, and makes the files on Commons much easier to view, search, edit, curate, organize, use and reuse, in many languages.

Why is this important? Why do we put effort and invest resources in this project?

Wikimedia Commons is an immensely valuable place for advancing human knowledge: it is one of the most comprehensive sites on the internet that serves as a repository for entirely free media. Organizing this knowledge better, making it easier to find, and making it fully multilingual, vastly increases its value to the rest of the world.

Structured Data on Commons has many concrete benefits for Wikimedia editors, for the Wikimedia movement at large, for organizations with which we build partnerships, and for people and organizations that wish to reuse content from Commons across the web.


Who is working on this?

Structured Commons is a collaboration between developers, the communities of Wikimedia Commons, Wikidata, Wikipedia, and sister projects, and partners and allies of the Wikimedia movement.

The developer team consists of staff from both the Wikimedia Foundation and Wikimedia Deutschland. Community developers (tool developers, bot operators, developers at partner organizations) can also play a large role in this project. All the developed features are conceptualized, created, tested and improved in close collaboration with the community of active contributors to Commons and Wikidata, as well as Wikipedia and sister projects. We also warmly welcome active feedback from cultural institutions (GLAMs - Galleries, Libraries, Archives and Museums).

Who pays for the work on this project?

In October 2016, the Wikimedia Foundation and Wikimedia Deutschland announced a funding agreement that would provide multi-year support for Wikidata, including backend support for integrating Wikidata into Wikimedia Commons. This funding agreement was supplemented in late 2016 by a $3 million external grant from the Alfred P. Sloan Foundation, which makes it possible to develop structured data functionality in Wikimedia Commons in an accelerated three-year period (2017-2019).

Read more about the grant application and the community consultation that preceded it.

What is this project's timeline?

Development on this project, financed by the Alfred P. Sloan Foundation, takes place between 2017 and the end of 2019. This builds upon many earlier discussions that have already taken place since 2004 (see the project history). A timeline of development on Structured Commons is available on the development page.

Will Structured Data also be made available for other Wikimedia projects (for instance on Wikisource, for fair use images on English Wikipedia...)?

The team focuses on Wikimedia Commons first. This is also the project for which we received funding from the Alfred P. Sloan Foundation.

But we hope to extend it to other projects at some point. The software development for implementing structured data on Commons is based on Wikibase (used for Wikidata) and is being written with compatibility with other Wikimedia projects in mind. Information from Wikidata is already being used for sitelinks in Wikipedia and for some templates. Structured file description information from Commons should, similarly, be available where files from Commons are used on Wikimedia projects and elsewhere. Additional support for Wikimedia projects is out of scope for the specific work done in the Structured Data on Wikimedia Commons project, but we have it on the list of longer-term opportunities.

About metadata and structured data on Commons

How long will it take until Wikimedia Commons is entirely converted to Structured Data?

Quite a bit of time! By the end of 2019, not all media files on Wikimedia Commons will be complemented with structured data yet. The team will work with the Commons community so that the community can add structured data to approximately 5 million files by that point. The further conversion process will likely take more years.

Some numbers to give an idea on low-hanging fruit...

In August 2017, Wikimedia Commons contains 42.5 million files.

Magnus Manske has created an online tool, Commonsedge, which roughly indicates whether the metadata templates of a Commons file can be easily transferred to machine-readable data. According to results from this tool, around 60% of Commons files are ready for conversion, which usually means that they are described with templates that are technically convertible to structured data.

Converting files to structured data will probably be most straightforward for the following types of files.

Newly uploaded files

The Structured Data on Commons project will work to allow newly uploaded files to be enhanced with structured data, by working on/with the most frequently used upload tools so that they support structured data as well: the UploadWizard, cross-wiki upload, upload campaign tools (e.g. for Wiki Loves Monuments), and mass upload tools developed by the volunteer community (Pattypan, GLAMpipe, and others).

Isn't information on Wikimedia Commons structured already?

(answer under construction; examples/screenshots to be added)

Without Structured Data, many files on Wikimedia Commons are indeed already described with some structure: categories and templates (various information templates, license templates, source/creator/institution templates, and more).

However, this is not fully structured data that can be consistently read and understood by machines.

Structured metadata contains a logical structure that is uniform and explicitly expressed. This allows data (in this case, data about media files on Wikimedia Commons) to be accessible in a robust, consistent, structured and linked format: a format that allows software to understand, on a large scale, what the metadata fields mean (structured) and to connect them to other databases on the internet, putting them in a broader context (linked). Structured metadata is also more granular and easier to translate than unstructured data.

Will Commons still have templates? Will templates on Commons disappear?

It will still be possible to work with templates on Wikimedia Commons. The Structured Data on Commons team will not replace wikitext—including templates—on the software/technical side. It is likely that some information in current templates (such as creator and institution info) will easily be converted to structured data from an early stage. When this (structured) information can easily and multilingually be searched via the improved search functionality of Structured Commons, the Wikimedia Commons community will likely need to have a discussion whether it's worthwhile to also keep the same information in a wikitext template.

Will Commons still have categories? Will Commons categories disappear?

The Structured Data on Commons team will NOT remove the ability to create and edit categories. We keep all existing systems in place and it will still be possible to work with categories. What we do want to do is build a system that serves many of the use cases for categories in a different (and profoundly multilingual) way. Structured metadata on Commons will update search on Wikimedia Commons so that it will also be possible to search multilingually across different criteria.

For instance, with structured data and its new search functionality, users will be able to search for a painting (Wikidata/Gemälde/cuadro/peinture/絵画作品/...) that depicts a pheasant (Wikidata/Fasan/faisán/faisan/コウライキジ/...) in any language, leading to the same search results. If someone uploads an additional painting of a pheasant and describes it properly with structured data, that painting will then also immediately show up in those same search results, even if the uploader is not aware of the existence of categories and has not added any.
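The idea behind this example can be sketched in a few lines of Python. This is only an illustration, not the actual search implementation: the concept identifiers `Q_PAINTING` and `Q_PHEASANT`, the `MediaFile` class, and the field names `instance_of` and `depicts` are all hypothetical. The point is that labels in every language resolve to the same concept, so every language yields the same results.

```python
from dataclasses import dataclass, field

# Hypothetical concept identifiers (real Wikidata items have IDs like "Q123").
PAINTING = "Q_PAINTING"
PHEASANT = "Q_PHEASANT"

# Labels per concept, using the words from the example above.
LABELS = {
    PAINTING: {"en": "painting", "de": "Gemälde", "es": "cuadro",
               "fr": "peinture", "ja": "絵画作品"},
    PHEASANT: {"en": "pheasant", "de": "Fasan", "es": "faisán",
               "fr": "faisan", "ja": "コウライキジ"},
}


@dataclass
class MediaFile:
    """A media file described with concept IDs instead of free text."""
    title: str
    instance_of: set = field(default_factory=set)
    depicts: set = field(default_factory=set)


def resolve(term):
    """Map a label in any language to its concept ID, or None if unknown."""
    needle = term.casefold()
    for concept, labels in LABELS.items():
        if any(label.casefold() == needle for label in labels.values()):
            return concept
    return None


def search(files, kind, subject):
    """Return titles of files that are a `kind` and depict `subject`."""
    kind_id, subject_id = resolve(kind), resolve(subject)
    if kind_id is None or subject_id is None:
        return []
    return [f.title for f in files
            if kind_id in f.instance_of and subject_id in f.depicts]


files = [
    MediaFile("Pheasant study.jpg", {PAINTING}, {PHEASANT}),
    MediaFile("Pheasant photo.jpg", set(), {PHEASANT}),
]

# The same query in three languages finds the same file.
english = search(files, "painting", "pheasant")
german = search(files, "Gemälde", "Fasan")
japanese = search(files, "絵画作品", "コウライキジ")
```

A newly uploaded file that carries the same two concept IDs would appear in all three result lists without anyone adding a category to it.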

How can the information in Commons categories be used for structured data on Commons?

Wikimedia Commons contains roughly 6,066,000 categories. (source 1) (source 2) (checked on November 9, 2017)

Until now, without fully structured data, categories on Wikimedia Commons have been the best instrument to 'tag' media files on Commons and to organize them. The Commons category system is multihierarchical (i.e. it is a tree-like structure in which each 'node/branch' can have multiple parents and children).
A lot of (often detailed) information is stored in Commons categories.
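A minimal sketch of what "multihierarchical" means in practice: because a category can have several parents, the structure is strictly speaking a directed graph rather than a simple tree, and finding everything a category belongs to requires following every parent path. The category names and the `PARENTS` mapping below are made up for illustration and do not reflect the real Commons category graph.

```python
# Child category -> list of parent categories (hypothetical example data).
PARENTS = {
    "Bridges in India": ["Bridges by country", "Buildings in India"],
    "Bridges by country": ["Bridges"],
    "Buildings in India": ["India"],
    "Bridges": [],
    "India": [],
}


def ancestors(category, parents=PARENTS):
    """Collect every category reachable by following parent links upwards."""
    seen = set()
    stack = list(parents.get(category, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # guards against revisiting shared parents
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


all_parents = ancestors("Bridges in India")
```

Here `all_parents` contains categories from two separate branches (one about bridges, one about India), which a single-parent tree could not express.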

Commons categories are extremely varied in topic and purpose:

  • they range from single-topic 'tags' (example: George Washington) to highly complex intersection categories (example);
  • they are used in a variety of different 'relationships' to a media file, e.g. to simply describe what is depicted in a media file, or to indicate creators or sources of media files;
  • they sometimes contain copyright-related information (example: CC-BY-SA-2.0), or information about the creation or upload process behind the media file (example: Uploaded with Mobile/Web);
  • and they are sometimes purely used for structural and maintenance purposes (example: Mérimée with PA parameter).

This enormous variation and complexity makes categories problematic to treat as structured data: an API cannot distinguish between these different uses, so that diversity cannot be made reliably machine-readable. It is therefore highly recommended that as much as possible of the information currently contained in categories also be translated to more refined and semantically correct structured data.
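The difference can be illustrated with a small sketch. The `Statement` type and the relation names used here ("depicts", "license", "upload method") are hypothetical and chosen only for this example; they are not the project's actual data model. In the flat category list, software cannot tell which entry describes the subject and which describes the license; with typed statements, each value carries its relationship to the file explicitly.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One explicit claim about a media file (hypothetical shape)."""
    relation: str  # e.g. "depicts", "license", "upload method"
    value: str


# Categories: a flat list in which the role of each entry is implicit.
categories = ["George Washington", "CC-BY-SA-2.0", "Uploaded with Mobile/Web"]

# The same information as typed statements: the role is explicit.
statements = [
    Statement("depicts", "George Washington"),
    Statement("license", "CC-BY-SA-2.0"),
    Statement("upload method", "Mobile/Web"),
]


def values_for(statements, relation):
    """Return all values recorded under one relation."""
    return [s.value for s in statements if s.relation == relation]


depicted = values_for(statements, "depicts")
licenses = values_for(statements, "license")
```

With the statements, a question such as "what does this file depict?" has a direct answer; with the category list, a program would have to guess.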

The Structured Commons team helps the Wikimedia community to work towards this transition by, among other things, assisting in the creation and maintenance of volunteer-driven conversion tools.

Will 'old' ('unstructured') information on Commons be removed?

The Structured Data on Commons team designs the project in such a way that no information is taken away from Wikimedia Commons. The team only adds features that might slowly supplement and perhaps replace existing ones. When to add and replace data is the Commons community's own decision. In the team’s current roadmap:

  • Categories are not touched. Certain features of Structured Data on Commons might, at some point, overlap with categories (most notably combined categories such as "Bridges in India" or "1988 in Lima"). It is up to the Commons community to decide whether structured data, at some point, will make these categories obsolete.
  • File pages will continue to hold wikitext. Structured data is inserted as an addition to this.
  • The {{Information}} template and its many variations ({{Artwork}}, {{Photograph}} and many more) are not touched. Many (if not all) template parameters could be migrated, but deciding how and when is up to the Commons community.
  • The edit history of the file description page is not touched.
  • How the binary files are stored is not affected by this project.
  • The upload history is not touched.

How will usernames be stored in Structured Data?

Media files on Wikimedia Commons will be described with concepts from Wikidata (e.g. people, institutions, species, places). But not every topic on Wikimedia Commons is notable enough to merit its own Wikidata item, and a lot of metadata will be described as strings (plain text), not Wikidata items. This probably applies to most usernames on Wikimedia Commons.

How will dates be stored in Structured Data?

Media files on Wikimedia Commons will be described with concepts from Wikidata (e.g. people, institutions, species, places). Dates of media on Commons will be described with date-related properties from Wikidata, and with the time data type. See the "Time" section at d:Special:ListDatatypes.
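As a rough illustration of the time data type, the sketch below builds a value in the shape that Wikibase uses in its JSON output, as far as we understand it: a timestamp string plus a precision code and a calendar model. The helper name `time_value` is our own, and the field names and precision codes (9 for year, 10 for month, 11 for day) should be checked against the data type documentation before being relied on.

```python
from datetime import date

# Precision codes of the Wikibase time data type (assumed: 9=year, 10=month, 11=day).
PRECISION_YEAR, PRECISION_MONTH, PRECISION_DAY = 9, 10, 11

# Item URI for the proleptic Gregorian calendar.
GREGORIAN = "http://www.wikidata.org/entity/Q1985727"


def time_value(d, precision=PRECISION_DAY):
    """Build a time value dict; parts below the precision are zeroed out."""
    month = d.month if precision >= PRECISION_MONTH else 0
    day = d.day if precision >= PRECISION_DAY else 0
    return {
        "time": f"+{d.year:04d}-{month:02d}-{day:02d}T00:00:00Z",
        "timezone": 0,
        "before": 0,
        "after": 0,
        "precision": precision,
        "calendarmodel": GREGORIAN,
    }


exact_day = time_value(date(2017, 8, 15))
year_only = time_value(date(2017, 8, 15), PRECISION_YEAR)
```

The precision field is what lets a date such as "taken in 2017" be stored honestly, without inventing a month and day that nobody knows.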

About Wikidata

Why is Wikidata used in this project?

Wikidata is the free and open, multilingual knowledge base of the Wikimedia projects, that can be edited, read and re-used by humans and machines. Its data is available under a CC0 license. Wikidata describes, structures and interrelates all the concepts about which there is a Wikipedia article (in any language), and has much more data than that! Wikidata acts as central storage for structured data that can be re-used and improved across Wikimedia projects and beyond.

Wikidata's software, Wikibase, is in fact a set of two extensions to MediaWiki, the software that powers all Wikimedia projects. In the Structured Data on Commons project, Wikibase—with data from Wikidata—is integrated in file pages on Wikimedia Commons.

Where can I learn more about Wikidata?

The best introduction to Wikidata, for absolute beginners, is probably Asaf Bartov's three-hour (!) presentation. It is worth your time!

Short videos (YouTube links)

Text-based introductions include:

If you want to have good examples of the structure of data, check one of the showcase items (10min+)

The Wikidata Query Service offers one of the most powerful ways to search and re-use Wikidata's data.
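To give a feel for the query service, the sketch below builds (but does not send) a request for paintings that depict a given subject. It assumes the public endpoint at `https://query.wikidata.org/sparql`, the properties P31 (instance of) and P180 (depicts), and the item Q3305213 (painting); please verify these identifiers on Wikidata before use. The function names are our own, and `Q42` is just a sample item ID.

```python
import re
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def paintings_depicting(subject_qid, limit=10):
    """Return a SPARQL query for paintings that depict the given item."""
    # Only accept well-formed item IDs, so nothing else ends up in the query.
    if not re.fullmatch(r"Q[1-9]\d*", subject_qid):
        raise ValueError(f"not a valid item ID: {subject_qid!r}")
    return (
        "SELECT ?item WHERE { "
        "?item wdt:P31 wd:Q3305213 ; "      # instance of: painting (assumed ID)
        f"wdt:P180 wd:{subject_qid} . "     # depicts: the requested subject
        "} "
        f"LIMIT {int(limit)}"
    )


def request_url(query):
    """Encode the query as a GET URL asking for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = paintings_depicting("Q42", limit=5)
url = request_url(query)
```

Fetching `url` with any HTTP client should return the matching items as JSON; the same query text can also be pasted into the query service's web interface.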

A few examples of tools that reuse Wikidata data:

Tools, gadgets, bots and workflows

I am a tool / bot developer. What consequences does Structured Commons have for me?

The Structured Data on Commons team designs the project in such a way that no information is taken away from Wikimedia Commons. The team only adds features that might slowly supplement and replace existing ones. This addition and replacement is the Commons community's own decision.

This means that, in a technical sense, existing workflows, bots and tools on Wikimedia Commons should still work with already uploaded files. However, the underlying API of Wikimedia Commons might change considerably, and an increasing number of files on Commons will be enriched with structured data. As soon as Structured Commons is rolled out, many volunteers and partners might start uploading media files to Commons that are only described with structured data, not with wikitext. So, increasingly, it will make sense to create new tools and/or update current ones.

Many volunteer developers are already aware of these changes, are taking updates to their tools into account, and keep an eye on developments. If you are a tool developer / bot operator who is not yet informed and are curious how the changes to Commons might affect you, don't hesitate to ask: help and support are available! The tool and bot page is specifically designed for you; feel free to ask any questions there and to request assistance. Besides volunteer developers from the Wikimedia community, Structured Commons' community liaison Sandra can also support you.

Partners and allies

Is this project interesting for external parties like researchers, tech companies, developers...?

Yes. As soon as a significant number of media files on Wikimedia Commons are described via structured data, a large corpus or knowledge base of media files will become freely available (on Commons and via an API) for building applications, and for re-use and research purposes.

Why do cultural institutions (GLAMs - Galleries, Libraries, Archives and Museums) and heritage organizations play a large role in this project?

The vision statement of the Wikimedia movement - Imagine a world in which every single human being can freely share in the sum of all knowledge - is closely aligned with the mission of public knowledge and cultural institutions. And indeed, cultural organisations (GLAMs) have already donated millions of media files to Wikimedia Commons in the past, and continue to do so. The Wikimedia movement’s wikis are major portals for educational and heritage material, and Wikipedia is very often a first stop for the public when someone is learning and researching. Offering media on Wikimedia Commons stimulates re-use of cultural and heritage collections on Wikipedia and beyond.

Without structured data, Wikimedia Commons does not have the refined APIs and Linked Open Data technology that meet the needs of cultural institutions, who regularly share their collections through structured, open end-points for both reuse and aggregation into hubs like DPLA and Europeana. Even when GLAM organizations, STEM organizations, and other sharers of educational and heritage material do not have the capacity to host content or provide utilities for reuse on their own digital platforms, they do not consistently choose Wikimedia Commons as a site to upload their content. Instead, these organizations choose platforms like Flickr or commercial-vendor-controlled digital platforms. In part, this is because Commons does not provide the kinds of robust data structures and APIs needed for monitoring changes to the data, so that the institutions can benefit from the Wikimedia community’s improvements to that data.

Structured Commons expands Commons with features that are central to cultural heritage and media sharing software, but does so with the focus on openness and collaboration that makes our projects, like Wikidata, widely useful for GLAM communities. By closely consulting cultural heritage organizations, especially those who have been strong partners with us in the past or want to partner with us more, we can make sure that Structured Commons becomes a better platform for encouraging partner collaboration in the future and grows our impact on public knowledge.

For more information about how GLAMs work with Commons and digitized content, see the GLAM portal on

Getting involved and staying up to date

I am a Wikimedia community member. How can I help and be involved in this project?

Check the Get involved page! There are many ways to contribute - by providing feedback, helping others, translating content...

I represent a cultural / educational / research institution. I would like to get involved in Structured Data on Commons. How can I get in touch?

The Get involved page includes information on how to engage with this project as a representative from a cultural or knowledge institution. You can also get directly in touch with Sandra Fauconnier, community liaison and GLAM contact for Structured Data on Commons:

How can I stay up to date?