Commons talk:Structured data/Get involved/Feedback requests/GLAM metadata and ontologies mapping

The following discussion is archived. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.

Thanks to everyone who participated! A summary of the discussion and a list of next steps is available below. Feel free to add additional comments under the Further comments and feedback after May 7, 2018 header. SandraF (WMF) (talk) 15:19, 6 June 2018 (UTC)[reply]

In order to improve the way GLAMs can interact with Wikimedia Commons, we propose to actively work on mapping GLAM metadata schemes and ontologies to Wikidata and to structured data on Commons. This feedback request runs until Friday 4 May. Read more about it here.

This is a proposal to check whether members of the Wikidata and Wikimedia Commons communities, and GLAM staff, think this is a correct and valid issue to work on, and to inventorize who is interested to actively work on this topic. We warmly welcome feedback on this proposal. Some questions to guide your feedback:

Do you think this is a worthy undertaking?
If no: why not?
If yes, do you volunteer to work on this? Which are your main (sub-)interests?
Any suggestions on how to best organize this work?
In which way can the GLAM team of Structured Commons (most specifically Sandra) support this endeavour? What do you need from the team?
Is the proposal itself accurately written and does it address the right issues? Which changes and updates would you propose?

This feedback request runs until Friday 4 May. Your feedback is very helpful in letting us all decide together on next steps.

All the best! SandraF (WMF) (talk) 10:13, 19 April 2018 (UTC)[reply]

Europeana

Hi, what I'm missing at this point (maybe I have overlooked it) is some reflection on how Structured Commons / Wikidata will be hooked up with Europeana (and similar platforms) in the future. Several questions come to mind in this context:

Will a mapping between the Europeana Data Model (EDM) and Structured Commons / Wikidata be provided?
- if there is recognized need to upload content from Europeana institutions to SC/WD (as per next point) then sure, I'd like to work on it if we have resources. Isaacantoine (talk)
Will institutions who make their data available through Europeana and expose high-resolution files of their content on their own repositories still be expected to actively upload content to Wikimedia Commons in the future or will this process be automatized via Europeana?
- For the moment there's no plan I'm aware of. In the longer run, I don't know and the call is not up to me... Isaacantoine(talk)
- In the latter case: Do we aim for the upload of all eligible content to Wikimedia Commons or will there be a function to upload relevant content from the Europeana ecosystem to Commons on demand?
Is anybody aware of how well the Europeana LOD project is doing, e.g. how good the data quality is, how well it could be interlinked with Wikidata?
- The data on our SPARQL API (https://pro.europeana.eu/page/sparql) is not always in sync with our production data, but the service is still live. As well as our data.europeana.eu URIs, even if the redirection for that domain (but only the root) has been broken some weeks ago. Syncronization issues apart, the data quality of our LOD data is the same as the quality of our 'regular' data (because our regular data actually tries to follow the LOD principles)Isaacantoine (talk)
- Our data is already interlinked with Wikidata (I don't know the figure on top of my head, but it's probably thousands of links) and we are soon going to investigate how these links could be uploaded in WD so that the linking can be represented both ways Isaacantoine (talk)

Cheers, --Beat Estermann (talk) 11:06, 19 April 2018 (UTC)[reply]

Hi @Isaacantoine: Thanks for your responses. Regarding the interlinking between Wikidata and Europeana I have a few additional questions:

Who is taking care of the matching of entities? And at what level (Heritage institution - aggregator - Europeana)?
Are there any statistics regarding the level of interlinking – e.g. the percentage of person names, organizations, time periods, events, items, etc. in the Europeana database that are either linked to Wikidata or to GND/VIAF (or both)? (Would it be possible to post SPARQL queries answering this question?)
Do you know what the situation is on the side of the Archives Portal Europe? - Are they presently doing any matching against Wikidata or GND/VIAF?

-- Beat Estermann (talk) 11:42, 25 April 2018 (UTC)[reply]

@Isaacantoine and Beat Estermann: I've done several imports from Europeana to both Wikidata and Wikimedia Commons. The EDM is internally not very consistent and I usually ended up making the mappings based on the collection or going directly to the source data. With the current state of data in Europeana it's impossible to make a generic mapping to Wikidata. Multichill (talk) 09:37, 28 April 2018 (UTC)[reply]

Good to know. Thanks! - Are there any plans at Europeana to improve this situation? --Beat Estermann (talk) 12:23, 1 May 2018 (UTC)[reply]

Observations

Somebody that I think it will be particularly useful to hear from in this area is User:PKM. She's been doing a huge amount of work to match up Wikidata with the AAT and the Europeana Fashion Thesaurus in the areas of clothing, accessories, armour, materials, etc, under the aegis of Wikidata:WikiProject Fashion, so I think would be able to give some particularly strong insight into the state of content at Wikidata as it is at the moment, the scale of opportunities available through work of this kind, and the detailed practicalities involved.

For myself, I have got quite an interest in external sources with hierarchical information, and the extent to which they are reflected in Wikidata (or not). In terms of coverage, here are some reasonably up-to-date pages for

The entries with matched wikidata items are in blue, with links; those without are in black. It's not an area I've done a huge amount of work on recently, because there's another project I have been very keen to make some progress on, but in principle it should be straightforward enough to produce similar lists for other properties. ~~(Some sort of better CSS highlighting to make unmatched items stand out more might be useful, although some of the pages are quite close to the allowed maximum size for a wiki page)~~. Sorted: links now underlined, old school.

I also started to try to make a scratch list of some other useful thesauruses and external vocabularies with hierarchical structure, here; but I'm aware this is just the tiniest tip of the iceberg, and even of our existing external ID properties, many many of them have skos:broader internal links -- for example properties as basic and as heavily in use as Bibliothèque nationale de France ID (P268) or BNCF Thesaurus ID (P508).

Addition -- I should emphasise that this was just a personal scratch page that I was using to keep a few links. I think a more systematic, centralised effort would make sense: to consistently track which properties correspond to vocabularies etc with an external hierarchy, and how thoroughly that hierarchy has been compared with relations on Wikidata. d:Wikidata:WikiProject Ontology tends to look at abstractions and wider questions; so perhaps the narrower d:Wikidata:WikiProject KOS ("Knowledge Organisation Systems") might be a better fit for locating such a survey, and tracking and coordinating any improvement drive.

However, as well as tracking the whole spectrum, it is probably worth trying to also identify whether some vocabularies should be prioritised as particularly well thought-out and well organised, and particularly widely used. I'd guess that AAT is probably a front-runner here. But it would be useful to hear from GLAMs what vocabularies are the most commonly in use. It might also be useful to get any idea whether their coverage is also overwhelmingly loaded towards a few key values -- eg does 1% of the vocabulary account for 99% of the references? But the choice of vocabulary may matter less than just getting on and matching at least one really well: because the more the items we have on Wikidata can be got organised and extended into a well-built structure, the more help that should give to the process of matching-in further vocabulary items (as discussed more extensively below). Jheald (talk) 21:19, 20 April 2018 (UTC)[reply]

Something that I believe may be quite useful is to start to try to represent those external hierarchical relationships internally within Wikidata, using the qualifier broader concept (P4900). This qualifier on an external-id statement points to the nearest upward item in that external hierarchy, that has a match to a Wikidata item. I rolled this out for the EFV vocabulary and the fashion part of the AAT thesaurus -- it means one can extract and analyse all the items from a particular part of the thesaurus, regardless of what primary relations may or may not exist between them on Wikidata. One can also identify items which have hierarchical connections in the external vocabulary, but are not connected through subclass of (P279) on Wikidata, such as the results of these queries tinyurl.com/yaephu2l and tinyurl.com/ya4cotlj for the EFV and AAT respectively. (Though User:PKM may have addressed most of the anomalies since then!) These can reveal cases where eg Wikidata probably ought to have a subclass of (P279) connection between the two but currently doesn't; or where an incorrect item on Wikidata has been matched to the external entry (eg an incorrect homonym); or where Wikidata is simply modelling the hierarchy differently. I think it can be quite a revealing quality control. One caution though, is that broader concept (P4900) is probably best added and kept synchronised using a script, one the external database has been significantly matched. Also, it's easy enough using QuickStatements at the moment to add a new qualifier; but not possible with QS to update the qualifier without re-writing the whole statement, which at the very least needs care to preserve all other qualifiers, references, and statement ranks - a bit of a pain.

Regarding current data extent on Wikidata, my impression is that while we may have done a lot of adding instances of paintings, people, objects etc at the leaf-levels of the tree, in many cases there is a huge amount missing at the trunk and branch level, and even when we do have items they are not necessarily well connected-up. There may be various reasons for this: often, I have found, when Wikipedia does have a page for quite a general concept, it is not an article page, but instead a disambiguation page linking to articles on more specific aspects of the concept. Wikidata makes a distinction between items linked to disambiguation pages, as against properly substantive items; so in many cases it has only inherited a disambiguation item for the concept, not a substantive item, leaving an empty gap in the hierarchy. General concept articles are also notoriously harder to write than articles on narrower concepts or particular concrete objects, so again another reason why there is a gap on Wikidata. Finally, they may have less readily harvested categories or infoboxes, compared to instances of a big class, so even when items exist on Wikidata they may have very limited properties, and very limited connections to other items.

Addition - a further issue is that even when we have items, and they are linked together, there may have been confusion as to whether to use instance of (P31) or subclass of (P279) for the relationship. It is essential that subclass of (P279) is used, if one wants to indicate a subset of a larger class, with a transitive 'subclass' relationship from end to end of the chain. Jheald (talk) 21:26, 20 April 2018 (UTC)[reply]

All of this means that my impression is that in very many areas our existing hierarchy on Wikidata may be regarded as rather weakly populated and weakly linked. This has an implication for matching new items. The power of OpenRefine, for example, can be seriously limited when the hierarchy is weak. OpenRefine can be very good for matching concrete items like people or institutions that are instances of a particular class. If instead it is classes that one is trying to match, then one first needs to turn off matching to particular classes. But to avoid being flooded with everything, it is useful instead to specify trying to match to subclasses of a particular item through a chain of P279s. However that only helps to the extent that those P279s are in place. When they aren't -- as they typically aren't (at least, not yet) -- that really hurts OpenRefine's ability to identify which potential matches are more or less likely.

The other key tool for matching is of course Mix'n'match, and in many ways this is a lot nicer than OpenRefine -- matches get made instantly they are set, without having to export, appropriately format, and then upload with QuickStatements; and the environment for searches is very nice. It is also easy enough to upload any new thesaurus into MnM. But there are some significant weaknesses, perhaps the most significant of which that with MnM it is hard to work on only a part of a thesaurus -- eg only the fashion items in the AAT. To some extent Magnus's latest MnM addition can help -- the ability to run a SPARQL query on Wikidata, and then try to match just the items generated from that SPARQL query, either to the whole of MnM, or to a particular catalogue. However, again, this is greatly weakened if the hierarchical structure on Wikidata is not good -- because without reasonably complete hierarchy, the SPARQL query can't find most of the items that it would be good to match, seriously restricting this mode of using MnM. (Another limitation of MnM is that is the moment the catalogue description there is not good as to what kind of things each catalogue there contains, eg: people, artworks, texts, 'things', concepts... -- so it is not as easy as it should be to filter out irrelevant catalogues, or filter in only the ones that are most relevant. Improving this could be a quick win.)

A bit more still to follow, but I need to leave this here for now. Jheald (talk) 17:38, 20 April 2018 (UTC)[reply]

More:

CIDOC. You mentioned CIDOC as an ontological model on the overview page. A breakdown of the CIDOC classes can be found at d:User:Jheald/cidoc. No blue-links this time, because (as far as I know) we don't currently have a property matching Wikidata items to CIDOC classes. (The British Museum also systematically types items into seven uber-classes listed at d:User:Jheald/bm, and uses a number of subject thesauruses, see d:User:Jheald/bmt.)

But detailed matching of Wikidata items to CIDOC would perhaps be of limited relevance anyway. For in-depth description of what collection items in fact are, CIDOC defers to detailed thesauruses, such as the BM thesaurus, or AAT, or whatever. The CIDOC classes seem to be more a description of how to describe things -- with the proviso that the designs are often very different to the roads Wikidata has gone down. So for example, CIDOC has a distinct class E55_Type for detailed descriptions of what things are (ie the contents of thesauruses), which is separated from the very limited vocabulary of items it defines for broad classes of actual things. Wikidata makes no such distiction: we just hope that somewhere near the top of the subclass tree for each detailed class WD contains one may eventually reach something similar to one of the 88-or-so high-level CIDOC classes. And this may sometimes happen. However often (more often?) if one traces a long way up a subclass of (P279) chain from a Wikidata item, one may either reach something completely inappropriate, or alternatives that are very contradictory, such as finding the item is said to be simultaneously abstract and non-abstract. There is perhaps a lot to be said for the CIDOC approach of directly indicating a near top-level class on each item. But it is not the model Wikidata has chosen to follow.

Similarly, CIDOC's modelling of time (and approximate time) is very different to Wikidata's. CIDOC uses an approach directly based on triples, with a very limited number of connecting predicates (X has a thing; the thing is a birth; the thing has another thing; the other thing is a time; the other thing is AD 1327) -- which is rather different to Wikidata's way of doing things.

(Pinging @Vladimir Alexiev: here, who I think knows a lot more about CIDOC than I do).

I suppose this is what Sandra means about baking-in support into the very APIs of Structured Commons for extracting and re-organising such data into a more Wikidata/CommonsData form. I have a couple of reservations -- firstly, this is a very different thing to thesaurus mapping, and I'm not sure how helpful a thing it is to blur the two topics together. Secondly, I have a distrust of big shiny monolithic do-everything APIs. Better perhaps to think of small modular tools (the old UNIX model) for specific stages and specific jobs, that can be arranged in a pipeline behind a nice front end or machine interface; but which can also be used independently -- eg to reshape the data into a set of Wikidata statements that can be fed into QuickStatements. I would much prefer to have a set of tools that can reshape data from particular ontologies into a set of statements I can examine and assess, rather that a hermetically-sealed monolithic application that I shovel stuff into and hope for the best. (Of course, I don't object to the small tools being arranged into a simple turn-key application; but IMO it is important also to be able to break them out, and use them independently on their own -- there is a difference). Jheald (talk) 22:54, 20 April 2018 (UTC)[reply]

Importance. I doubt that converting metadata into Wikidata / CommonsData statements will ever be completely painless - I suspect there will always be a gap to cross. But the greater the number of standard cases we can identify and model, the easier we may be able to make it. But there is another reason for wanting to improve the quality of the hierarchical relationships of the items on Wikidata, and that is data retrieval. To pick up PKM's example below, one of the prospectuses for Structured Data on Commons is the ability to put 'handwear' into the structured search, and see in return examples of mittens, gloves, and any other sorts of any other relevant thing, plus suggestions of possible refinements to the search, eg to narrow it to 'mittens', 'gloves' or whatever. Both of those are only possible if there is a good hierarchical structure of classes -- which, as noted above, at present is often weak. So this is another reason why building up the coverage and accuracy of those hierarchical relationships on Wikidata is something really quite important.
Commons. There's a lot of focus above in matching Wikidata items to external thesauruses, and comparing their hierarchical structure. But there is another very important set of hierarchically arranged entities that Wikidata could learn a lot from, and which it is important for Wikidata to be able to match back to, and that is Commons categories. Typically, because Commons categories are organising pictures of actual concrete things, they give a more detailed and more organised classification of concrete things than Wikidata may have inherited from Wikipedias, so there Commons categories are a potentially very valuable source for Wikidata to learn about classes of things that ought to have items, and hierarchical relations between them.

But even more important than that, perhaps, is the relevance of the other direction of the matching: to be able to relate Wikidata items to Commons categories.

Uploaders to Commons are expected to properly categorise their uploads. High-volume uploaders who regularly fail to do so can and do get blocked by the community. I see no likelihood of that changing in the future, regardless of the progress (or otherwise) of the structured data experiment. GLAMs who do not properly categorise their uploads can expect to be asked to go away and think how to do so, and in the meantime stop their uploads.

So I would see it as critically important to think how to map metadata supplied as new CommonsData structured statements into appropriate Commons categories. Otherwise, I say again, uploaders will get blocked.

Recent-ish stats for links between Commons categories and Wikidata can be found here. Currently, about 1,740,000 Commons categories can be identified to Wikidata items; about 1,120,000 (in fact now more) are connected by sitelinks - User:Mike Peel has a bot request currently in that should take that most of the way to the 1,740,000. But that leaves about 5 million Commons categories with nowhere to store in any easily accessible form what they represent. For some of those there probably are Wikidata items that they should be matched to. More will be so-called 'intersection' categories, combining concepts that have their own Wikidata items. But at the moment there is nowhere to store that information, and the job of identification is so big that it needs to be done in chunks, with somewhere to accessibly store the information as we go along.

Having Wikidata items matched to Commons categories helps hugely in identifying good Commons categories to add to images. Having a thorough matching of Wikidata items to categories would help even more, by identifying when there is no current Commons category, so a new one should be created. Having structured descriptions of what intersection categories represent helps even more, by identifying how far the image can be trickled down out of primary categories into appropriate intersection categories. Being able to store the information also helps identify which categories are intersection categories, and so by subtraction which are not, and so potentially should be linked to Wikidata.

There are some fundamental challenges in matching Commons categories to Wikidata. One of those is scale: 6,500,000 categories is about 200 times the size of the AAT -- though we might be able to shrink that a lot, if we could identify and tag the intersection categories. But there's something else which also makes the problem harder to process automatically, namely that Commons and Wikidata use two fundamentally different query systems. The hierarchical links between Commons categories are stored in SQL tables. These also record which Commons categories have sitelinks (specifically) with Wikidata, though not any other marking of correspondence eg via property P373. However, from SQL there is no way to get information about what the category may represent, stored in Wikidata statements. In the other direction, it may just about be possible to execute MediaWiki API calls from inside WDQS, to get information about category hierarchy, though I am not sure it would extend to looking at very many such categories, so I am dubious that such an approach would scale at all. All of this makes it much harder to automatically identify categories which probably ought to have Wikidata links, but currently don't.

The key issue here though is what I said before -- the lack of anywhere easily accessible to store structured information about the nature and meaning of a category if it doesn't (and won't) have a corresponding Wikidata item. @SandraF (WMF): You asked how the Structured Data team could best support the matching process, and the improvement of hierarchical structure on Wikidata. My answer would be: somewhere to accessibly store structured information for the categories on Commons that are not going to be linked to Wikidata. Integrated querying with Wikidata would be a distinct plus. A federated wikibase would be ideal... Jheald (talk) 01:46, 21 April 2018 (UTC)[reply]

@Jheald: , thanks for pulling me in. I'm not sure how much I can add. My rough guess is that the matched EFV ids have been matched about 10-15% via Mix'n'Match by various editors and the rest by hand, mostly by me working with a Google spreadsheet of the EFV data over the last year. There are still a handful of items to be matched or added (some around theatrical costuming where we need a hierarchy, and a very few items which I am unsure about).

I don't have Java on my PC, so I have not worked with Open Refine.

I have just started using terms from IATE to assist with multilingual labeling of technical textile terminology ('vat dye', 'mordant') and matching/merging duplicate Wikidata items.

I expect there's more data at Europeana that I would use if I knew about it!

Specifically relating to Commons, I've done a bit of work building out categories that align with Wikidata's hierarchy of clothing items (based on AAT/EFV), but the existing Commons structures sometimes are different (i.e. in Commons 'mittens' are a subcategory of 'gloves', but in WD/EFV/AAT 'mittens' and 'gloves' are both parallel subclasses of 'handwear'), and in those cases I tend to leave things as they are for now. Again, as Jheald says, it's much easier to match at the leaf level than at trunks and branches. - PKM (talk) 22:38, 20 April 2018 (UTC)[reply]

@PKM: Thanks. And of course, in relation to the latter, there are also cases where WD has decided to go a different way on particular items than EFV or AAT. If I remember correctly, eg the Wikidata subclass modelling of helmets is rather different? In general one shouldn't expect an exact equivalence of hierarchy between the different sites -- not between WD and any single external thesaurus, not between WD and Commons. But so long as there's some solid hierarchy, then with luck 'handwear' ought to pull out most of both 'mittens' and 'gloves', regardless of whether 'mittens' are considered to be a subclass of 'gloves' or not. Jheald (talk) 23:05, 20 April 2018 (UTC)[reply]

@Jheald: . +1 to your comments on "findability". Re: AAT and helmets, yes, in at least one case treating the AAT hierarchy as a class tree runs into a circular relationship, and in general (IMHO) AAT's treatment of armor is better suited to the way armor pieces are displayed in museums and less suited to a hierarchy of pieces as they were worn across time (not surprising, really). I've had no luck finding an arms & armor specialist who is interested in Wikidata. - PKM (talk) 20:19, 21 April 2018 (UTC)[reply]

There's also the problem with Commons users categorizing based on words, not concepts. Terrific example: Category:Caning. - PKM (talk) 00:28, 22 April 2018 (UTC)[reply]

Note, I have split Category:Caning and Category:Caning (furniture). - PKM (talk) 21:09, 25 April 2018 (UTC)[reply]

Feedback from Jakob

1. Do you think this is a worthy undertaking? I think it is absolutely relevant to create and manage mappings between data models in Wikidata and main metadata standards, schemes, ontologies and vocabularies. The qu

2. If no: why not? If ontologies and vocabularies are too specialized, complex, and or large mapping should better be avoided unless there is an actual use case.

3. If yes, do you volunteer to work on this? Which are your main (sub-)interests? I already work in mapping vocabularies and ontologies with Wikidata and would like to collaborate! My main interest is how to maintain mappings.

4. Any suggestions on how to best organize this work? Better concentrate on less broad tasks such as technical aspects, selected types of vocabularies, specific use cases...!

5. In which way can the GLAM team of Structured Commons support this endeavour? Sorry, not sure about this. At least documentation would help!

6. Is the proposal itself accurately written and does it address the right issues? Which changes and updates would you propose? The proposal is too broad. Metadata schemes, formats, vocabularies and ontologies require different approaches, tools, and teams. You should better separate work on data models/schemes/ontologies on the one hand and vocabularies, taxonomies, authority files on the other. Work on the latter can be summarized as mapping SKOS vocabularies and Wikidata (see examples being collected in our research project). This must be separated from mapping ontologies!

-- JakobVoss (talk) 12:52, 24 April 2018 (UTC)[reply]

+1 on Jakob's last point. There are indeed two separate things here, and it makes sense to treat them as two different projects. The first is vocabularies etc, and it makes sense to set up an umbrella WikiProject on Wikidata to identify the highest priority vocabularies to match, to track progress, and to document tools, queries, and work methodologies for going about it. It is a specific well-defined task, and it makes sense for it to be the full focus of a self-contained project.

The other aspect is the broader structures of data in various metadata formats. This is much more diffuse and a quite different task. As Jakob says, it makes sense to set up a different umbrella project for work on this.

Yes, some standards (eg MARC) will have both aspects -- their own characteristic structures and their own vocabularies. But I do think it makes sense to split out the work on the vocabularies, and treat that with work on other vocabularies, that are a very similar challenge; separate from the work on mapping the organisation of the data, which is quite different.

I would also add that probably the focus of work at the moment on the structures side should primarily be in documenting the structures used, and how some of them can be mapped to Wikidata properties and qualifiers, with the expectation that there will still be quite a substantial amount of hand-crafted 'data-munging' required for most uploads.

I think it would also be realistic to dial down expectations of how much (or even whether) Structured Data may simplify data upload. In particular, any idea that one may be able to just press a button and everything will be effortlessly handled automatically seems to me to be magical thinking, that should be dismissed along with unicorns and Santa Claus -- as I think anyone who has worked on significant data upload to Wikidata would confirm.

Recently I've been working on creating particular Wikidata items for books and for authors, to support uploads of images extracted from those books. Matching identifiers like VIAFs or OCLCs can help, but doesn't get you the whole way, because there are authors that don't have VIAFs (at all), or have multiple VIAFs, and because OCLCs for works aren't unique, and because even when VIAFs or OCLCs do exist, the items already here may not have them. So, while matching identifiers can help match some entities to items, it doesn't match all -- so there are still many where one may have to create new items, or carefully make sure one would not be replicating old ones. This is not a huge amount different from trying to see whether there are existing Commons categories that already exist, or whether new ones need to be created. The resultant Wikidata/Structured Data records should be more useful, in very many ways; but the work needed to see whether they need to be created and then create them is not negligible -- and in fact creating new structured items (and avoiding duplicates) may even be more work that the corresponding process for file descriptions and categories, because the structured data structures are so much more involved than the essentially pretty simple category names and simple text template field entries.

So, in summary, I would say don't expect structured data to reduce the amount of pre-upload data preparation work needed, even with scripts to assist -- and it may even substantially increase it, for example when new items need further new items to be created to support them. Jheald (talk) 11:50, 7 May 2018 (UTC)[reply]

Peter Patel-Schneider

Is this effort supposed to be only for data about GLAM objects that are famous worldwide, such as the Venus de Milo or (a copy of) the United States Declaration of Independence or the Terracotta Army? Or is this effort supposed to also include data about GLAM objects that are only well known, such as the Nisga'a and Haida Crest Poles of the Royal Ontario Museum or Ötzi the Iceman? Or is it supposed to incorporate data about most or all objects on display in a GLAM institution? Or even about objects in research collections?

Different scopes of the effort will dictate different techniques. If it is only concerned with data about famous objects then the techniques can incorporate lots of manual fiddling to bridge the gap between the source data setup and Wikimedia. If it includes data about research-only objects then techniques that require very little manual fiddling per object are needed.

Similar questions should be asked about what impacts are desired. Is it views on the web? Visits to the institution? Donations? Research? Different impacts may require different techniques.

Nonetheless I think that this is a worthy undertaking. It should be much easier to move such information from where it currently resides to places like Commons and Wikidata or integrate such information with information in Commons or Wikidata.

I am interested in working on respresentation techniques that can cover even objects in research collections and can support the needs of researchers.

I think that one of the main ways that the Wikimedia Foundation can support this endeavour is by setting up mechanisms for two-way flows - both flows of data (i.e., making data in Wikidata better suited for and easier to use by GLAM institutions) and flows of users (i.e., supporting interfaces that let users easily see information from GLAM institutions).

I think that the proposal needs to be more clear on its scope and its goals.

Peter F. Patel-Schneider (talk) 14:04, 25 April 2018 (UTC)[reply]

Johnbod

Do you think this is a worthy undertaking? - Yes in theory, but I question WDs ability to do it adequately
If no: why not? - Looking at WD in the early days, I was struck by the lack of work done establishing vocabularies at the start, and often the the lack of understanding of the need for this. That will now be a huge problem. Like WD in general, I think it is likely to be fatally compromised by the poor quality of data taken from WP and Commons, or other sources. Recently I looked at 2 paintings on WP - the first (Arnolfini Portrait (Q220859)) had several different errors, including the wrong iconoclass, and the 2nd (Castello Roganzuolo Altarpiece (Q27899250)) was identified only as a human and male.
If yes, do you volunteer to work on this? - Certainly not
Which are your main (sub-)interests? - Art, art history
Any suggestions on how to best organize this work? - sorry no.
In which way can the GLAM team of Structured Commons (most specifically Sandra) support this endeavour? What do you need from the team?
Is the proposal itself accurately written and does it address the right issues? Which changes and updates would you propose? - it's not that clear or easy to understand at all. Johnbod (talk) 02:02, 28 April 2018 (UTC)[reply]

Feedback from Jane023

OK since I have been yammering on about SDoC before it even got its own acronym I feel like I have kept up with most discussions, so I will try to just keep to the questions asked:

Do you think this is a worthy undertaking? Answer: Yes.
If no: why not? Answer: Not applicable as I already drank the koolaid. There's no going back.
If yes, do you volunteer to work on this? Answer: Yes.
Which are your main (sub-)interests? Answer: Dutch 17th-century art, certain types of pre-1850 decorative art, Gendergap-related tracking for works, wiki presence, museum collections, etc.
Any suggestions on how to best organize this work? Answer: Sorry, no. I can't even organize my own interests, so I have no idea how to organize the work needed to get this done. That said, I do know a bit about how to setup challenges, and mapping the various corners of Commons to Wikidata for future work could be fun and rewarding, revealing gaps in our knowledge management due to basic structures of working from the age-old "encyclopedia model", which is mostly based on people and places. There is still so much work to be done on basic concepts! Figuring out how to model those in Wikidata is still quite challenging, let alone map them to files or file categories.
In which way can the GLAM team of Structured Commons (most specifically Sandra) support this endeavour? Answer: It would be nice to be able to have some basic periodic reports set up that show the number of mappings from Commons to Wikidata for various areas of interest. So e.g. how many areas of commons have situations where there is one item number and tons of images (like pictures of the Sagrada Familia etc)?
What do you need from the team? Answer: more reports, at more aggregated levels, and more suggestions for areas needing work.
Is the proposal itself accurately written and does it address the right issues? Answer: The proposal as written is written from the typical "Be Bold" perspective of "we want everything to do with everything, and as accurate as possible, so let's start with the way the stuff is modelled in the top institutions". I think that's OK from the English Wikipedian standpoint, but from the Wikimedian standpoint it's not always possible to think in those terms. Many of the examples listed are based on huge aggregates of data and are horribly biased in various ways. Certain types of academic knowledge change slightly across languages and distort the data as well. Therefore no ontology will ever be accurate or complete and no museum ever expects its collections to use 100% of the planned guidelines it is supposed to use. It's fine to want to support those ontologies, metadata standards and vocabularies, but that support is not any more important than supporting the concepts and other connected linked data that miss them. Just as the large aggregators are constantly adding to their databases, we need a clear way to express our inner needs and how to solve structural issues with our data. We need to share ideas with the WLM people in countries with poorly managed lists of heritage. Often museums hold objects that do not fit into any structured vocabulary, but may be matched to some object in the world that can still be seen in situ. Also how do we express art concepts in areas where the GLAM's photo may be the first photograph in the Wikiverse of this type of object? I am thinking here of the "cultures" problem that User:Pharos has expressed since the METification program started. We need a way forward for those institutions to feel welcome to join the conversation, without shoving some predetermined vocabulary or list of rules at them. The idea of using constraints vs suggestions is key here. We can set up constraints on data modelling and include suggestions, but we also need to be able to help with "mapping the unknowns".
Which changes and updates would you propose? Answer: I am not sure, but starting from the top is not a good idea, in my opinion - we want to start from the bottom, namely identifying what we have and are getting on a daily basis, and aggregate it ourselves into something mappable at the aggregator level. This is basically how we do people biographies on Wikipedia - we don't start with VIAF, we start with a person and look for VIAF later. Jane023 (talk) 12:09, 28 April 2018 (UTC)[reply]

The above discussion is preserved as an archive. Please do not modify it. Subsequent comments should be made in a new section.

Further comments and feedback after May 7, 2018

I'm sorry I joined this discussion late; I missed the opportunity to comment to specific contributions above. So I've opened a few sections below, and would be very glad if you add your own comments.

Cataloging Cultural Objects

An important omission in the list of standards mentioned at GLAM_metadata_and_ontologies_mapping is Cataloging Cultural Objects. CCO is the content standard that lays out best practices for describing artworks, and gives excellent examples. Compared to it, MARC, LIDO, CIDOC CRM are technical means whereas CCO describes the desired end.

As stated elsewhere, GLAM holdings are very diverse, and CCO doesn't cover everything (eg Numismatics has its own best practices). But it's crucially important for museums and galleries. So I believe we should start the mapping exercises from CCO:

Going through the CCO Examples and representing them as Wikidata is one step
Checking that some Wikidata artworks (eg from SoAP) are adequately described according to CCO is another

--Vladimir Alexiev (talk) 20:42, 8 May 2018 (UTC)[reply]

GLAM Ontologies

I've been asked by Canada's CHIN to recommend which ontologies they should use for their upcoming LOD-based national aggregation. I was at a bit of a loss, because I believe that right now there is no clear winner. I think these are viable choices:

Schema.org (with extensions similar to bib.schema.org), which has a very tolerant and pragmatic approach and is used by a huge number of websites world-wide
CIDOC CRM, which has a good foundation, and is used by various projects, particularly in Europe
http://linked.art, which is a CRM profile used by the American Art Collaborative
EDM, even though it's missing some important capabilities (eg the ability to describe specific contribution to an artwork)
and Wikidata, which has a fluent and collaborative way of creating properties and describing their application to specific domains

So yes, I believe that working on GLAM metadata mappings to Wikidata & Commons is a worthy goal!

--Vladimir Alexiev (talk) 20:42, 8 May 2018 (UTC)[reply]

Image vs Artwork

@SandraF (WMF): asked at Discussion_for_metadata_that_is_specific_or_important_for_GLAMs "Is this list of (broader types of) GLAM-specific metadata complete? If not, which elements do we miss?" I think the main GLAM-related consideration is to distinguish:

data about images (what is stored in an institutional DAMS) from
data about artworks (what is stored in a Collection management system)

Eg "accession number" is a property of the artwork, so attaching it directly to the image is not quite correct. An artwork often has many images, and these images have different DAMS IDs.

It's also important to distinguish what object an image represents, vs what images are shown on an artwork. This picture from CIDOC CRM may clarify the distinction: [1]. Here's an example:

The painting Mona Lisa (crm:E24_Physical_Man-Made_Thing) crm:P65_shows_visual_item the well-known image of that woman (crm:E38_Image).
The image is a concepual object that in addition to the painting, may be rendered on a photo, a T-shirt, a computer file, etc. It may also be modified in various ways, and still be recognizable as the same image
The image crm:P138_represents Lisa del Giocondo (a person).
As a shortcut, we can say that the painting crm:P62_depicts Lisa del Giocondo

If we take a photo of the painting: the photo (anoher crm:E24_Physical_Man-Made_Thing) crm:P65_shows_visual_item the same image, but it also crm:P62_depicts the painting. You can also say as a transitive shortcut that it crm:P62_depicts Lisa del Giocondo.

It seems to me that elements to relate image to artwork are missing in the original proposal, and are the main thing to discuss for this page "GLAM metadata and ontologies mapping"

--Vladimir Alexiev (talk)

Archives, Hierarchical Links

@Retrent: wrote at Commons_talk:Structured_data/Get_involved/Feedback_requests/Ontology#Discussion_for_metadata_that_is_specific_or_important_for_GLAMs "It would make sense that the (archival) collection as a whole would have a Wikidata entry".

This is not a given. Archives hold huge amounts of material, a lot of them never digitized (nor even cataloged down to the item level). I don't think it's the job of Commons or Wikidata to replicate big archives like NARA or aggregations like APEX (Archives Portal Europe). However, the ability to express hierarchical and lateral relations between materials is very important, eg:

the EAD hierarchy (12 or even more levels, such as class, collection, file, fonds, item, recordgrp, series, subfonds, subgrp, subseries)
images of artworks in a series
prints/images of different stages of an engraving
photos of the same object taken at different times

Some of the higher-level objects would not have associated images, so they'll be represented only at WD or WB@Commons.

--Vladimir Alexiev (talk)

Europeana

"Are there any statistics regarding the level of interlinking"

Eg Count of Europeana entities by type shows 55M artworks ("CHO"), 132M images or other representations, 2.7M Places, 72k persons (Agents)

There are only 76.8k WD objects with Europeana ID. From the other side, I don't believe Europeana contributors track WD IDs. I believe the intersection is at least 0.5-1M, so these 76k coreferenced objects is a low number.

As for "contextual entities" (places, agents), they are collected in the Europeana Entity Base, and the counts above include local (per-object) entities. For example, one of the Agent URLs (per-object) is http://hispana.mcu.es/lod/oai:galiciana-bibliotecadegalicia.xunta.es:10000156710#ent4 and it's declared sameAs a Galiciana authority http://www.galiciana.bibliotecadegalicia.xunta.es/aut/BDGA20140003808 (also VIAF, LOC, FAST). It's not easy to say how many unique persons & places are represented by the number above.

They're also not normalized, eg https://www.europeana.eu/portal/en/record/11629/_HERBAR_BGBM_GERMANY_B_10_0356587.html mentions http://www.geonames.org/maps/google_52.3166656494_13.1833333969.html, which should be resolved to http://sws.geonames.org/2832318/ (the Siethener Elsbruch forest).

And there are places without any notability or importance, eg iid:1666153/SP.1 (12, R, Torenstraat, Overasselt, Gelderland, Nederland), which is used by only one object https://www.europeana.eu/portal/en/record/2020704/DR_20181109.html

The issues of notability and lack of normalization concern the provider data. What is in the "Entity Collection" mentioned above is better. As a matter of fact int he example above (https://www.europeana.eu/portal/en/record/11629/_HERBAR_BGBM_GERMANY_B_10_0356587.html mentions http://www.geonames.org/maps/google_52.3166656494_13.1833333969.html) we do not even recognize a contextual entity, even a provider entity. It's just a URL. And yes we can't do quality control on that scale of detail - providers are very inventive in what they send us. Isaacantoine (talk) 14:40, 5 November 2018 (UTC)[reply]

"EDM is internally not very consistent"

I'd put it this way: different institutions use it in very different ways, and Europeana hasn't done enough to normalize it

--Vladimir Alexiev (talk) 20:42, 8 May 2018 (UTC)[reply]

Well it's a never-ending effort. That said I'd welcome some examples, to see if it's heterogeneity by design (i.e. use of properties that can accommodate different things) or basic inconsistencies (i.e. wrong use of properties). Isaacantoine (talk) 14:36, 5 November 2018 (UTC)[reply]

Summary and next steps

Hi everyone! My warmest thanks to everyone who contributed to this discussion. With some delay, here's an attempt at (very briefly) summarizing what I (Sandra) read in your comments above, adding some of my own thoughts and reflections to the mix. Feel free to comment!

Mapping GLAM metadata schemes and ontologies to structured data on Commons - is this a worthy undertaking?

Several people seem interested in working on this and think that a common and centralized effort makes sense, although the scope for this work needs to be better defined (see below).
It is probably good to start by taking a step back first: what impact do we want to achieve? More GLAM contributions to Commons? Better contributions? Less frustrating upload and contribution processes? ...
Interestingly: during a recent Wikidata workshop day at the Europeana offices, I (Sandra) got very clear feedback from some GLAM participants that they don't think we should put enormous amount of efforts in mapping GLAM metadata schemes in great detail; it would (according to them) be much better instead if we would work towards crystal-clear, well-documented and findable instructions, and towards standardized ways in which GLAMs should model their own data towards Commons. Although coming from a small group, I find this interesting input which I'd like to verify more broadly.

Better focus and prioritization needed

The original proposal was way too broadly and vaguely defined and seems to be very unclear to people not familiar with GLAM metadata and ontologies.
1. It is important to distinguish between ontologies and vocabularies, as these are very different things.
2. Looking at 'vocabularies', it's probably also good to distinguish between
  1. Thematic / topical data (subjects, concepts - example 'oak tree')
  2. Person names (photographers, artists, depicted people)
  3. Organization names (both organizations that contribute files to Commons, such as GLAMs, and other organizations that may be involved in the production of our media files, such as publishers, photo studios, etc.)
3. We also must clearly distinguish between metadata used for the description of artworks (as GLAMs do in their collection management systems, and which in our case will probably mostly be used to describe artworks on Wikidata) and of media files (as GLAMs do in digital asset management systems, and which in our case will be used to describe Commons files).
We need to prioritize our efforts: it is probably most worthwhile to work first on those metadata schemes and vocabularies that are very widely adopted in the GLAM sector.
I read some consensus that working on this will not produce a magic bullet, and converting GLAM metadata to Commons will always be painful. (While this is true, I think it's a worthy undertaking to work towards a process that makes it the least painful as possible.)

On Wikidata, we have already started working on some GLAM ontologies and vocabularies.

Some insights from that perspective:

We are mapping many vocabularies on Wikidata, including thesauri. We might want to include more information on Wikidata about the hierarchical relations in those, and we might want to work on mapping the SKOS format to Wikidata.
The Commons category system is also a hierarchical structure with a wealth of data that we don't want to get lost.
1. I (Sandra) recommend everyone to read the findings about categories in the context of GLAM uploads as part of the GLAM research earlier this year, where participants report having difficulties finding the right categories; from my own experience since 2012 with GLAM uploads - both performing and re-using files from them - I also notice these tend to be under-categorized, often with sub-optimal selection of categories.
Inside GLAM vocabularies and inside our own projects, there are still major knowledge gaps!

The longer term

We need to think about the longer term: maintainability and constant updates to mappings.

Technical integration

We need to think carefully how such mappings (if we work on them) are integrated in technical infrastructure. It's probably not a good idea to statically 'bake' them into APIs - perhaps code libraries make more sense, and we might want to encourage specific tool development in this direction? This also needs further investigation.

Follow up in June 2018 and beyond:

I (Sandra) have the feeling we need more input from GLAMs themselves, and I'm now thinking how to do this: whether this can be done in an informal survey or another type of consultation, and which questions need to be asked. Please let me know if you have ideas or suggestions here.
Categories. The core team working on Structured Data on Commons needs all currently allocated time and funds to give its full focus to the basic functionalities of structured data itself; extra work in technical support for transitioning categories is out of scope within the current timeline and budgets. I myself also can't give extra attention to category conversion from a practical perspective. Conversion of category data to structured data is, like data modelling and conversion itself, a task for the communities.
It would be helpful to make it easy for more people to contribute to the process of mapping GLAM metadata to Wikidata and Commons. More help is certainly welcome, and needed. I (Sandra) can try help support this by creating a set of GLAM info pages, part of the Structured Commons information site, including a better structured set of 'landing pages' on GLAM vocabularies and ontologies. These will point towards existing work on Wikidata's WikiProjects, be extensible by anyone, offer a first attempt at prioritization, and point to (if it exists) documentation. Help is welcome here!
Several members of the Structured Commons team will be present at the Wikimania hackathon, which is a good opportunity to talk to volunteer developers about ideas for technical integration of GLAM metadata mapping. It is probably quite relevant to tools that (will) support GLAM uploads to Commons and Wikidata, for instance Pattypan and OpenRefine 3.0.

Many greetings! SandraF (WMF) (talk) 15:17, 6 June 2018 (UTC)[reply]

Commons talk:Structured data/Get involved/Feedback requests/GLAM metadata and ontologies mapping

Contents

Europeana

Observations

Feedback from Jakob

Peter Patel-Schneider

Johnbod

Feedback from Jane023

Further comments and feedback after May 7, 2018

Cataloging Cultural Objects

GLAM Ontologies

Image vs Artwork

Archives, Hierarchical Links

Europeana

Summary and next steps

Mapping GLAM metadata schemes and ontologies to structured data on Commons - is this a worthy undertaking?

Better focus and prioritization needed

On Wikidata, we have already started working on some GLAM ontologies and vocabularies.

The longer term

Technical integration

Follow up in June 2018 and beyond:

Navigation menu

Commons talk:Structured data/Get involved/Feedback requests/GLAM metadata and ontologies mapping

Europeana

Observations

Feedback from Jakob

Peter Patel-Schneider

Johnbod

Feedback from Jane023

Further comments and feedback after May 7, 2018

Cataloging Cultural Objects

GLAM Ontologies

Image vs Artwork

Archives, Hierarchical Links

Europeana

Summary and next steps

Mapping GLAM metadata schemes and ontologies to structured data on Commons - is this a worthy undertaking?

Better focus and prioritization needed

On Wikidata, we have already started working on some GLAM ontologies and vocabularies.

The longer term

Technical integration

Follow up in June 2018 and beyond:

Navigation menu

Search