Commons talk:Structured data/GLAM/CIDOC CRM


Scope

Hi @SandraF (WMF):, a couple of quick first thoughts on this.

1) The Commons information templates that will ultimately mediate the structured data statements into a more human-friendly form will be expected to present a whole overview of the image, including information about what the image depicts. Yes, there is likely to be a section for information that is specific to the digital object (eg photograph date, photographer, file format etc), but this information is likely to be tucked away at the very bottom of the template. The present mapping appears to focus primarily on information about the digital object. But we also need to consider how to map CIDOC-CRM information there may be about the real object that the digital object is an image of. It would perhaps be useful to do this in a different section, and certainly in more detail. In the world of real objects there are far more CIDOC properties than those so far represented on this page. Jheald (talk) 14:23, 29 September 2018 (UTC)

Added: I have realised that I was reading this table in terms of how information in CIDOC structures could be interpreted for extraction and loading into Wikidata/CommonsData; whereas I now think the author's intention was perhaps more to think how data from proposed CommonsData/Wikidata structures can be interpreted into a CIDOC format or structure. It's a very revealing analysis. But we probably do also want to think how things would work going the other way. Jheald (talk) 16:04, 29 September 2018 (UTC)
There's (much) more immediately below on the distinction between digital objects and physical objects; but I do think that, both in our thinking and in our ultimate information modelling and in the presentation of this page, it would be useful to separate the material on this page relevant to one from the material relevant to the other. Jheald (talk) 17:46, 29 September 2018 (UTC)

Distinguish digital object as digital object

"It is extremely important to make a crisp distinction between the description of the digital object qua digital object, and the various information objects that it encodes/carries/incorporates."
(Note: "incorporates" above is being used in the sense of CIDOC P165 "incorporates". A CIDOC note discusses incorporation in the context of digitisation. Normal CIDOC usage is to say that a digital image (member of CIDOC-CRM class D1) incorporates (P165) the physical image (member of class E38 image) that it may be a scan of. The comment is therefore saying we need to be much clearer in distinguishing these.)

2) We must come up with a clear decision about where information about the real objects is going to live -- and note that this may involve a chain of related items. I am currently working on a Commons upload of images of C18 maps and engravings, with related Wikidata items. Typically (with an eye on FRBR which may involve even a further level), we may distinguish

  • Digital image -> Physical copy -> Recognised state or edition of work -> Work

Where do the items for each of these belong? This is a question that has so far been repeatedly dodged for at least two years, but we must have a decision, if we are to think about what distinct items are going to exist, and what properties are going to connect them.
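As a rough illustration of the chain above, here is a minimal Python sketch in which each level is its own item pointing to the next, more abstract one. The class, field names and labels are all hypothetical; nothing here reflects an actual Wikidata or CommonsData schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Item:
    """One conceptual entity in the chain (hypothetical model, for illustration only)."""
    label: str
    level: str                       # "digital image", "physical copy", "edition" or "work"
    parent: Optional["Item"] = None  # the next, more abstract item in the chain


def chain(item: Optional[Item]) -> List[str]:
    """Walk from an item up to the work, collecting the level names."""
    levels = []
    while item is not None:
        levels.append(item.level)
        item = item.parent
    return levels


# Made-up example data: a scan of one copy of one edition of a map.
work = Item("Hypothetical map", "work")
edition = Item("Hypothetical second state", "edition", work)
physical_copy = Item("Copy held by some library", "physical copy", edition)
scan = Item("Scan.jpg", "digital image", physical_copy)

print(" -> ".join(chain(scan)))
```

Whatever the eventual answer to where each item lives, the sketch shows that four distinct items need three connecting properties, which is why the location question has to be settled before the properties can be.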

Information specific to a particular digital image will presumably belong on CommonsData.

At the moment for my maps and engravings, I am creating 'edition'-level items on Wikidata. This seems appropriate for editions of things which may exist in multiple copies, and is also what the catalogue entries I am working from attempt to identify.

But what about information specific to a single physical copy? Where should this live? Per the statement quoted at the top of this section, it would be cleanest if this lived in its own item, separate from the item for the digital image, and separate from the item for the edition. This would give the cleanest structure, with the cleanest separation of information relating to different notional things, the cleanest background against which to define properties to connect the items.

However, again: where should the items relating to physical objects live? Some may be notable in their own right, with their own Wikipedia articles, and therefore their own Wikidata items. For consistency, one might therefore want every physical object that has been imaged to have its own Wikidata item. But I think the Wikidata community would have a fit at that, as simply pushing the number of items on Wikidata far beyond what the community can cope with. Already, see the self-doubt in the WikiCite community, as to whether the number of items they have created for academic papers is starting to damage Wikidata, overwhelm systems, break search, and generally make Wikidata less usable for general items. Creating a distinct item on Wikidata for every physical thing that a Commons image might digitally represent -- ie every museum piece, every copy of every engraving, every plate in every copy of every book -- would be very clean, but would raise tenfold the issues WikiCite has already encountered. I don't think the community would be minded to allow it (see how hard it can be already to get Wikidata items agreed to correspond to Commons categories); and I see no evidence (at least not so far) that the office has the determination or the backbone to take on the community and require it or mandate it.

But if not on Wikidata, then where would the items for these objects live?

As far as I know, the Structured Data team has ruled out the idea of creating a namespace on CommonsData for generic items -- AFAIK only media-items, locked 1:1 to particular Commons files, are envisaged.

Trying to jam the information about the underlying physical objects into CommonsData media-items is sometimes suggested, but would seem to be a very bad plan. The whole force of the quote at the top of this section (to me anyway) reads like a strong statement not to do that, but instead to clearly separate the different conceptual entities into different items and be very clear what each one represents. The technical design of Wikibase also plays into this -- the choice that qualifiers cannot have qualifiers, that references always apply to a whole statement not to parts of it, these strongly work against any idea of statements within statements, and instead (valuably!) push towards designs where if something has a distinct conceptual identity then it should have its own identity in the triplestore.
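To make the Wikibase constraint concrete, here is a simplified Python sketch of the statement shape described above: qualifiers are a flat list of property-value pairs, and each reference applies to the whole statement. This is an assumption-laden simplification, not the actual Wikibase data model, and every identifier in it (M_SCAN, Q_COPY, P_DIGITAL_SURROGATE_OF, P_CONDITION, P_AS_OF) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snak:
    """A bare property-value pair; it has no slot for qualifiers of its own."""
    prop: str
    value: str


@dataclass
class Statement:
    main: Snak
    # Qualifiers are a flat list: a qualifier cannot itself be qualified.
    qualifiers: List[Snak] = field(default_factory=list)
    # Each reference (a group of snaks) applies to the statement as a whole.
    references: List[List[Snak]] = field(default_factory=list)


# Hypothetical items: detail about the physical copy goes on its own item,
# and the media-item only points at it.
items = {
    "M_SCAN": [Statement(Snak("P_DIGITAL_SURROGATE_OF", "Q_COPY"))],
    "Q_COPY": [
        Statement(
            Snak("P_CONDITION", "hand-coloured"),
            qualifiers=[Snak("P_AS_OF", "2018")],
        )
    ],
}

target = items["M_SCAN"][0].main.value
print(target, "has", len(items[target]), "statement(s) of its own")
```

In this shape there is simply nowhere to hang a qualified statement about the copy inside a statement about the file, which is the push towards separate items.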

So if not on CommonsData; and if Wikidata doesn't see itself as wanting to take on the role of being a union catalogue of every museum item in the world, then where does that leave us? WikiCite, facing many of the same questions, although on a much smaller scale with 'only' 10 million items so far, is considering a separate free-standing wikibase installation. Is something like that where one should put all these items for all these physical objects? But such an option also has its drawbacks, and it is far from certain that WikiCite will go for it.

But this is a question that does need a decision, really before we can start to think about properties: what different items will there be, and where will they live? Jheald (talk) 14:23, 29 September 2018 (UTC)

Hi,
That pinpoints the issues I have raised in previous discussions, and the answers have confused me rather than enlightened me. Regards, Yann (talk) 15:29, 29 September 2018 (UTC)

Date depicted / depicted period

It is very common to have eg an image of "Map of London as it was in 1880", or "View of Castle as it was in 1740", which may have no relation to when the map or view was actually created.

But, per the rubric quoted above, yes these may properly be statements that should live on an item for the map, or for the engraving, rather than the item for any digital image of it. Jheald (talk) 14:34, 29 September 2018 (UTC)

CIDOC, I think, might have an event "London existing in 1880" or "the Castle existing in 1740", to which a date and participants would be attached, and which the image would then depict. Wikidata doesn't bother with this. So where such a CIDOC event exists but Wikidata would not recognise it as a distinct unity of concept, and a separate Wikidata item would therefore not be appropriate, the CIDOC item may need to be decomposed to find underlying things that do have Wikidata items to which the image should be connected.
This is a consequence of the rather more structured data-modelling in CIDOC, as opposed to the tendency towards parsimony of items on Wikidata, a difference that is symptomatic of the two conceptions. There are probably many further patterns of examples to be found, of CIDOC items depicted, that would need to be decomposed to map over to Wikidata/CommonsData. Jheald (talk) 15:28, 29 September 2018 (UTC)

Types and classes

Venn B minus A.svg

Unlike CIDOC, Wikidata doesn't usually distinguish types (CIDOC E55) from classes of eg things, places, events, concepts.

I tend to think of the two models by analogising with a Venn diagram, eg right. The set "A" in the diagram has both contents (which might be part of wider sets), and a boundary (which might be a part of a wider category of boundaries, eg ones then identified by circles, rather than by squares or ovals).

Some ontologies treat the two, contents-of-A and boundary-defining-A, as separate items.

Wikidata typically uses a single Q-number for A, representing both.

On Wikidata, A may therefore simultaneously be subclass of (P279) some wider class (i.e. considering its contents), and instance of (P31) some type-class (i.e. considering its boundary/definition).

Danish Blue (Q165855) may thus be both subclass of (P279) cow's-milk cheese (Q3088299) and instance of (P31) type of cheese (Q3546121) -- the latter item is possibly redundant, but has links to Wikipedia articles. Jheald (talk) 15:10, 29 September 2018 (UTC)
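The Danish Blue example can be written out as plain triples. The following Python sketch uses only the identifiers quoted above; the tiny in-memory "store" and the helper function are illustrative assumptions, not how Wikidata is actually queried.

```python
# The two statements from the example above, as (subject, property, object) triples.
TRIPLES = {
    ("Q165855", "P279", "Q3088299"),  # Danish Blue, subclass of, cow's-milk cheese
    ("Q165855", "P31", "Q3546121"),   # Danish Blue, instance of, type of cheese
}


def objects(subject: str, prop: str) -> list:
    """Return every object linked to `subject` by `prop`, sorted for stable output."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == prop)


# One Q-number plays both roles: its contents are a subclass of a wider class,
# and its definition is an instance of a type-class.
print("subclass of:", objects("Q165855", "P279"))
print("instance of:", objects("Q165855", "P31"))
```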

In contrast, in CIDOC (which is heavily geared towards museum collections), an item may be a thing (class E18), which may be part of (^P46) a particular group of things (class E18 again). But what kind of thing it is, is conveyed by the property has type (P2), which takes a value in class E55, "type". E55 is a completely separate, self-contained hierarchy of concepts, typically corresponding to entries in controlled vocabularies and thesauruses, primarily connected by the properties "broader term" (P127) and "narrower term" (^P127): a walled garden of terms and abstract definitions, quite removed from the world of actual concrete things. This is not really a distinction Wikidata maintains (and certainly doesn't maintain either consistently or reliably). Hence, I think, the concern about whether "file type" should actually indicate a group of real (if intangible) things on the one hand, or some sort of 'type' on the other. This is a very real distinction in CIDOC.
Apart from this, the comment about the degree to which "type of media" / "mime type" / "file format" actually represent distinct properties is well made. One could probably have a single Wikidata property combining all of this, that would point to a particular file format, which by its nature would indicate the "type of media" (eg audio, video, 2D image (vector/raster), representation of 3D object, dataset), and would have associated with it a mime type. Perhaps it does make sense to have "type of media" as an explicit direct property. But, like the author, I'm not so convinced that it makes sense to specify mime type as a separate thing, beyond format.
And (like upload date below), again, this data should all arguably be derived directly from the MediaWiki internals -- to what extent does it really make sense to be discussing any of them as editable CommonsData fields? Jheald (talk) 22:29, 29 September 2018 (UTC)
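The idea that a single file-format value could imply both the type of media and the MIME type can be sketched as a simple lookup. The table and function below are hypothetical illustrations in Python; the format labels are made up for the example, and in practice this information would more plausibly come from MediaWiki internals than from editable statements.

```python
# Hypothetical lookup: file format -> (type of media, MIME type).
# If the format is recorded, the other two values follow from it.
FORMATS = {
    "PNG": ("2D image (raster)", "image/png"),
    "SVG": ("2D image (vector)", "image/svg+xml"),
    "WebM": ("video", "video/webm"),
}


def describe(file_format: str) -> dict:
    """Derive type of media and MIME type from the file format alone."""
    media_type, mime_type = FORMATS[file_format]
    return {"format": file_format, "type of media": media_type, "mime type": mime_type}


print(describe("SVG"))
```

Under this assumption a separate MIME type property adds nothing that the format does not already determine, while "type of media" could still be exposed directly for convenience.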

inception (P571) for date of file creation / scan date

I really dislike this, and would prefer a separate property for date of file creation / scan date.

I think it starts to become horribly confusing when both the scan and the underlying work have a P571, which are not the same. Better to distinguish them with a specific property for file creation / scan date. Jheald (talk) 15:37, 29 September 2018 (UTC)
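A small Python sketch of the confusion described above: if both the work and the scan carry P571, a lookup on P571 returns two unrelated dates, whereas a dedicated property keeps them apart. The property ID P_SCAN_DATE, the item names and the years are all hypothetical; no such property is claimed to exist.

```python
SCAN_DATE = "P_SCAN_DATE"  # hypothetical dedicated property, not a real Wikidata ID

# Made-up statements as (item, property, value).
overloaded = [
    ("work", "P571", "1740"),   # inception of the engraving
    ("file", "P571", "2018"),   # scan date squeezed onto the same property
]
separated = [
    ("work", "P571", "1740"),
    ("file", SCAN_DATE, "2018"),
]


def values(statements, prop):
    """Collect every value recorded under `prop`, regardless of item."""
    return [v for _, p, v in statements if p == prop]


print(values(overloaded, "P571"))  # two dates with different meanings
print(values(separated, "P571"))   # only the inception of the work
```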

Date of upload to Wikimedia Commons

Why is this a property at all? It certainly shouldn't be overloaded onto P571, a completely different concept.

I am not sure I see the use-case for this information being an editable field on CommonsData. Surely it should be accessed directly from MediaWiki?

If there is value in serialising the information, eg for RDF data dumps, it would seem appropriate to make a new specific property to represent it.

Yes, one probably does want to have an easy way to query from within WDQS what revisions there have been of an image, and to be able to access URLs (and IIIF manifests?) for each version of the image, together with associated data such as eg uploader, date of upload, image size, etc.

But the obvious way to provide this in WDQS would be by a SERVICE, drawing the data directly from the relevant MediaWiki SQL tables.

Whatever, overloading this onto P571, but then having a qualifier to say this isn't a 'real' P571, surely cannot be the right approach? Jheald (talk) 15:51, 29 September 2018 (UTC)