Commons talk:Structured data

From Wikimedia Commons, the free media repository


"Wikimedia Commons holds a lot of (meta)data about the media files it hosts" - "(meta)data" is Q180160? --Fractaler (talk) 09:14, 19 February 2018 (UTC)

Tag namespace? or better "Category all images" namespace

I'm not sure my comment is within the scope of this discussion, but I have had this in my head for a long time. I think we should have a kind of "Category all images" namespace; let me try to explain what I have in mind.

This Category all images namespace would be only to display (not to edit) all the images in a category and in its sub-categories. It would be a bit like the FastCCI tool's "All images" option, with the difference that these would not be requests with waiting time. I am thinking of virtual categories, generated automatically when you add or create a category that is part of our category tree: a kind of automatic, invisible over-categorisation. This namespace could be available from a category in a tab next to the discussion tab, for example: Category | Discussion | All images. Of course the maintenance and diffusion categories should be excluded.

Take the same example as above: File:Canis latrans (Yosemite, 2009).jpg is in Category:Canis latrans lestes, itself in Category:Canis latrans, itself in Category:Canis, etc., up to Category:Animalia.

But currently your image is only in Category:Canis latrans lestes. Let's imagine now that when you add Category:Canis latrans lestes to your file, you also add at the same time (automatically and virtually) Category all images:Canis latrans lestes, Category all images:Canis latrans, Category all images:Canis, etc., up to Category all images:Animalia.

The result would be that when you are in Category:Canis latrans, you click on the "All images" tab and get all the images that are virtually categorized with Category all images:Canis latrans; in summary, all the images within Category:Canis latrans and its sub-categories.

Of course I don't know the technical method, but this should be possible if our categories are linked to Wikidata items. Again, I'm sorry if I am off topic, but this has been itching my brain for a long time. Regards, Christian Ferrer (talk) 19:50, 16 February 2018 (UTC)
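The "virtual category" idea above amounts to walking up the category tree from a file's direct categories. A minimal sketch in Python, assuming a toy parent map: the names `PARENTS`, `FILES`, `virtual_categories` and `all_images` are hypothetical, the second file is invented for illustration, and the levels between Canis and Animalia are deliberately left out. This is not how MediaWiki or FastCCI work internally.

```python
from collections import deque

# Hypothetical toy category tree: child category -> parent categories.
# The real chain between "Canis" and "Animalia" has more levels; they are omitted here.
PARENTS = {
    "Canis latrans lestes": ["Canis latrans"],
    "Canis latrans": ["Canis"],
    "Canis": ["Animalia"],
    "Animalia": [],
}

# Hypothetical files with the categories they were directly placed in.
FILES = {
    "Canis latrans (Yosemite, 2009).jpg": ["Canis latrans lestes"],
    "Hypothetical wolf.jpg": ["Canis"],
}


def virtual_categories(direct_categories, parents=PARENTS):
    """Return the direct categories plus every category above them."""
    seen = set()
    queue = deque(direct_categories)
    while queue:
        category = queue.popleft()
        if category in seen:
            continue  # already visited; this also guards against loops
        seen.add(category)
        queue.extend(parents.get(category, []))
    return seen


def all_images(category, files=FILES, parents=PARENTS):
    """List the files whose virtual categories include `category`."""
    return sorted(
        name
        for name, direct in files.items()
        if category in virtual_categories(direct, parents)
    )
```

Calling `all_images("Canis latrans")` on this toy data returns only the Yosemite coyote file, while `all_images("Canis")` returns both files. Excluding maintenance and diffusion categories would be an extra filter on the parent map.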

Categories are the old system. Once data is properly structured you can just query and search in a normal way instead of having to wade through categories. Have you tried ? Just an example query of paintings depicting Carnivora:
SELECT ?item ?image WHERE {
  ?item wdt:P31 wd:Q3305213 .            # instance of: painting
  ?item wdt:P18 ?image .                 # has an image
  ?item wdt:P180/wdt:P171* wd:Q25306 .   # depicts a taxon whose parent-taxon chain reaches Carnivora
}

Of course the new query and search for Commons will probably be much better and easier than this. Multichill (talk) 21:47, 16 February 2018 (UTC)
  • Thank you. It seems that this kind of query has a brighter future than my laborious DIY. Great project. Christian Ferrer (talk) 22:10, 16 February 2018 (UTC)

Wikibase database physically close to mediawiki database and other ideas

Based on the diagram shown, it seems like a new Wikibase server could live very close to the commonswiki data. This makes a lot of sense, both from the user's point of view ("Commons wiki edits for Commons data") and from a technical and performance perspective ("a new image should be edited transactionally, both its data and its metadata"). I know implementation has not yet started, but I would like to leave open the option of having the Wikibase server separate from its client and the MediaWiki data, in case the structured data becomes larger than the wiki data. In particular, it could grow larger than expected, and we should be ready to separate the structured data in case it becomes as successful as Wikidata. I am not saying it should be like that, and it definitely shouldn't start like that, but developing both services as potentially separate will keep us from running into a wall in the future. Just my $0.02.

Regarding Exif: currently the image table has a bottleneck, with some Exif content being 1G per row. It would be nice to remove it from the image table, but I also mean this as a warning of how large the data could grow when stored in JSON format (vs. one row per property).

Please keep privacy concerns in mind when deploying features; I mean those that could be used for harassing "non-notable" (as in non-famous) people or putting them in the spotlight. People have expressed concern in the past about personal details of people being written on Wikidata. While better classification and discoverability is something we all want ("all pictures of X" or "all pictures taken by Y"), we should also be aware of ways technology could either invade privacy or be used for harassing people. I am not giving specific cases, because I do not have specific actionables, but "how can people misuse X functionality" should be at the top of the list of priorities.

--JCrespo (WMF) (talk) 18:37, 17 February 2018 (UTC)

@JCrespo (WMF): thank you for the thoughts, much appreciated! Keegan (WMF) (talk) 20:56, 20 February 2018 (UTC)

Categories, properties, & directed acyclic graphs

I think that in many ways our existing category system is suggestive of what people have found important; on the other hand, (1) the inability to easily intersect categories has led to rather arbitrary decisions as to when "intersection categories" should exist and (2) our folksonomy results in some loops. Presumably Wikibase is better for intersecting on the fly, and would allow us to transform at least a lot of our categorization into orthogonal (truly independent) properties that themselves can then be intersected at whatever level we want. For example, place should be completely independent of date/time and of what is depicted (which is more complicated, not a single property). Categories we have now, like Category:December 2014 in Seattle, could simply be an intersection of a date/time falling in December 2014 and a location falling in Seattle. For each of these independent properties we would want to be as specific as possible. It also probably needs to remain possible to have more than one value for a given property: for example, there are places that lie along the boundary of neighborhoods (or even incorporated communities that straddle counties), but each such hierarchy should still be acyclic. Thus, if someone wanted to find images that fit the successively more specific "Seattle in the 2010s", "2012 in Seattle", "2012 in Pike Place Market", "2012 in the Pike Place Market Main Arcade", that should be achievable without any need to set these up in advance. At the same time, things we think people are likely to search for over and over -- roughly our present "intersection categories" -- could each be implemented by a fixed query, and the hierarchy in multiple dimensions could also be made available (e.g. at the presentation layer, "2012 in Seattle" could still be a subcat of "2012 in Washington (state)", "2012 in the United States by city" (a metacategory: these would require a different, but still perfectly possible, query), etc.).

Does that make sense? If needed, I could certainly take an hour or two to flesh it out a bit, but to be honest my involvement here is not mainly in a technical capacity. That's what I do for work, and I don't want to do large amounts of it on a volunteer basis. - Jmabel ! talk 01:46, 10 March 2018 (UTC)
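The on-the-fly intersection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Media` record, the `BROADER` place hierarchy, the sample files and both function names are hypothetical, and nothing here reflects how Wikibase actually stores or queries statements.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical place hierarchy: place -> broader places (more than one is allowed).
BROADER = {
    "Pike Place Market Main Arcade": ["Pike Place Market"],
    "Pike Place Market": ["Seattle"],
    "Seattle": ["Washington (state)"],
    "Washington (state)": ["United States"],
    "United States": [],
}


@dataclass(frozen=True)
class Media:
    title: str
    taken: date   # date/time property, independent of place
    place: str    # most specific known location


def within(place, target, broader=BROADER):
    """True if `place` is `target` or lies somewhere below it in the hierarchy."""
    stack, seen = [place], set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, []))
    return False


def intersect(items, place, year, month=None):
    """Intersect a location condition with a date condition, with no category set up in advance."""
    return [
        m.title
        for m in items
        if within(m.place, place)
        and m.taken.year == year
        and (month is None or m.taken.month == month)
    ]


# Invented sample files, for illustration only.
MEDIA = [
    Media("Arcade at night.jpg", date(2012, 6, 3), "Pike Place Market Main Arcade"),
    Media("Seattle skyline.jpg", date(2014, 12, 20), "Seattle"),
    Media("Spokane street.jpg", date(2014, 12, 5), "Washington (state)"),
]
```

With this data, `intersect(MEDIA, "Seattle", 2014, 12)` plays the role of Category:December 2014 in Seattle, and `intersect(MEDIA, "Pike Place Market", 2012)` answers "2012 in Pike Place Market" without that category ever having been created. A fixed "intersection category" would simply be a stored set of arguments.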

As long as homonyms are used when modeling the world (using categories, Wikibase, etc.), there will be problems with loops; how to reduce homonymy in modeling the world has been known for a long time. "December 2014 in Seattle" is a homonym: 1) "December 2014 in Seattle (location)", 2) "December 2014 in Seattle (time)". --Fractaler (talk) 18:17, 10 March 2018 (UTC)
Huh? It's an intersection of a location (Seattle) and a time (December 2014), neither of which is ambiguous. A homonym would be something like "models" (people who pose vs. miniatures etc.) - Jmabel ! talk 19:22, 10 March 2018 (UTC)
So, we have two sets: 1) "time", 2) "location". A superset of these sets is "time + location". Are "2014 in Seattle by month", "December 2014 in the United States by city" and "December 2014 in Washington (state)" "location + time"? Is each about "time + location" (both "time" and "location", two variables)? Only about "time" (one variable)? Only about "location" (one variable)? If traced to the root, is transitivity observed? We have two sets: "December 2014" (subsets: 1) 2014-12-01, 2) 2014-12-02, etc.) and "Seattle" (subsets: 1) Transport in Seattle, 2) Visitor attractions in Seattle, 3) December 2014 in Seattle, etc.). Is a superset "December 2014 + Seattle" about "December 2014"? About "Seattle"? About both "December 2014" and "Seattle"? When two sets intersect, are both of them variables for the supersets (when we go to the root), or just one? And if just one, which ("December 2014" or "Seattle")? --Fractaler (talk) 18:06, 11 March 2018 (UTC)
  • A Time + Location category is not a superset of a time category & a location category. It is an intersection.
  • If we are going to use set-theoretical terminology then, to use the example above, there is a set of images (mainly, but not exclusively, photographs) dating from 2014. There is another set of images located in Seattle. Category:2014 in Seattle pertains to the intersection of these sets. The intersection of these sets is not a superset of either of these; it is a subset of both of these. - Jmabel ! talk 23:05, 11 March 2018 (UTC)
  • In our heads, we have a (mental) world model in the form of a taxonomy, a classification where transitivity is observed: the rule subset -> set -> superset, represented by the category tree (Commons, Wikidata, etc.). This model can be implemented (linked to, illustrated) by media objects (image, sound, etc.), for example, as on Commons (knowledgebase model). An intersection (a partial, incomplete inclusion of a set) does not allow us to observe such transitivity or directed acyclic graphs. So, now we have 4 supersets for the set "2014 in Seattle": 1) "2014 in the United States by city", 2) "2014 in Washington (state)", 3) "Seattle, Washington in the 2010s", 4) "Seattle by year". --Fractaler (talk) 08:08, 12 March 2018 (UTC)
  • For 2 & 3, correct. But 1 & 4 are metacategories, strictly used to structure the category hierarchy. 1 & 4 don't describe a set of files/images (only a set of categories), and if we are going to implement this from more orthogonal properties the distinction is important. The supercategories in those directions up the hierarchy are "2014 in the United States" and "[located in] Seattle", respectively. - Jmabel ! talk 16:01, 12 March 2018 (UTC)
  • Does set-theoretical terminology have the term "metacategory"? A set can have a superset. "2014 in Seattle" is a set. --Fractaler (talk) 17:03, 12 March 2018 (UTC)
  • No, metacategory is a Commons term, not a set theory term. In set-theoretic terms, our metacategories are sets of categories, not sets of files/images. Different domain. - Jmabel ! talk 17:09, 12 March 2018 (UTC)
  • If we use only the set-theoretical terminology, is "2014 in Seattle" a set? Does "2014 in Seattle" have a superset? --Fractaler (talk) 18:15, 12 March 2018 (UTC)
  • If we use only set-theoretical terminology, we can't say anything about any concrete, real-world example, because the name of any set falls outside of that terminology.
  • But if we take the reasonable understanding that a category is, above all, a set, then there are several set-theoretic ways to understand "Category:2014 in Seattle". The naive way is that it is the set of all images -- I'm going to keep this to images for the moment to keep the discussion simple, since that is about 98% of our content -- dating from 2014 (or some more specific date within that year) and depicting a location in Seattle. But because our categories typically contain both images and other categories, it's a little trickier than that: our current implementation is really the union of two disjoint sets: (1) more specific "child" categories (which each may be more specific in terms of location, date, or some other property), and (2) (assuming we are strict about COM:OVERCAT and our hierarchy is acyclic) images that match that naive description and are not in the set of images for any of the (recursive) subcategories of "Category:2014 in Seattle". One thing that distinguishes a metacategory is that no images fall directly in the category.
  • I don't necessarily think this way of thinking about it is all that useful to anyone other than someone trying to write/negotiate specs or to implement this. I do think we need to decide how much of our current category structure we care to retain if there is a new property-based approach to categorization. Mostly what I'm trying to do here is to sketch out how we could still give the end user any benefit they get from the current categorization scheme even if we go to a more property-based approach. Above all, what I'm arguing is that if we want to give the user pretty much everything they get from the current category scheme (and hopefully more), the possible values for any given property are going to have to constitute a directed acyclic graph.
    • Strictly speaking, the rule would be a little more complicated than that: the directed acyclic graph for any given property would have a single root; the root itself might or might not be a valid property value; similarly some other values in the graph might not be valid property values, just like a metacategory cannot directly contain images. For example, we could have a property for a person's death date; the graph would look something like:
   Root (not a valid value)
     |                   |                                     |
   Alive        Status unknown (could              Dead (not a valid value)........
                be dead or alive)                  |           |                  | 
                This might have some          Death date   Death date in         Death date known (not a
                descendant statuses (e.g.     unknown      a range (not a        valid value)
                'presumed dead')                           valid value)                  |
                                                               |                     (further hierarchy, however
                                                      (further hierarchy, however    we handle increasingly specific 
                                                      we handle date ranges)         dates)
Haven't thought this through in enormous detail; "Death date in a range" and "Death date known" may or may not be a useful distinction, since it's not clear how precisely we have to know something to call it a single value rather than a range: e.g. if we know a year, but not more specific, is that a range or a date? How about if we know a decade? I'm not trying to solve the full modeling problem here, and we will probably want to follow some combination of how we do this in our existing hierarchies and how Wikidata models it (presumably they've already thought a lot of this through, and presumably many, perhaps most, of their solutions should be acceptable ones). - Jmabel ! talk 23:39, 12 March 2018 (UTC)
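The death-date diagram above can be written down as a small graph in which some nodes are usable as property values and others only structure the hierarchy. The Python below is a rough sketch under assumptions: `VALUE_GRAPH` copies the diagram (the "presumed dead" status and the validity flags not stated in the diagram are guesses), and `roots`, `is_acyclic` and `assignable_below` are hypothetical helper names, not part of any Wikibase API.

```python
# Hypothetical value graph for a "death date" property:
# node -> (is a valid property value?, child nodes).
VALUE_GRAPH = {
    "Root": (False, ["Alive", "Status unknown", "Dead"]),
    "Alive": (True, []),
    "Status unknown": (True, ["Presumed dead"]),
    "Presumed dead": (True, []),
    "Dead": (False, ["Death date unknown", "Death date in a range", "Death date known"]),
    "Death date unknown": (True, []),
    "Death date in a range": (False, []),  # further hierarchy of ranges omitted
    "Death date known": (False, []),       # further hierarchy of dates omitted
}


def roots(graph):
    """Nodes that are nobody's child; the rule above asks for exactly one."""
    children = {child for _, kids in graph.values() for child in kids}
    return [node for node in graph if node not in children]


def is_acyclic(graph):
    """Depth-first search with three colours; reaching a grey node means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for child in graph[node][1]:
            if colour[child] == GREY:
                return False
            if colour[child] == WHITE and not visit(child):
                return False
        colour[node] = BLACK
        return True

    return all(visit(node) for node in graph if colour[node] == WHITE)


def assignable_below(node, graph=VALUE_GRAPH):
    """Valid values at or below `node`: what a query for `node` would match."""
    found, stack, seen = set(), [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        valid, kids = graph[current]
        if valid:
            found.add(current)
        stack.extend(kids)
    return found
```

On this toy graph `roots(VALUE_GRAPH)` gives a single root and `is_acyclic(VALUE_GRAPH)` holds, while a query for "Dead" matches only the valid values beneath it, much as a metacategory matches files only through its subcategories.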
Just a long remark about date. It is a categorizable attribute, in principle, like any other. First, it is possible to have an image illustrating more than one date. One kind of example is tombstones and similar objects having a date inscribed on them – their images illustrate both the objects’ ascribed dates and the date when the image was taken. Second, I vehemently disagree that any image categorized under Seattle and produced in December 2014 is eligible for “December 2014 in Seattle” by purely formal logical inference. It can be, for example, a map of Seattle showing some conditions as of mid-2014 (albeit drawn in December). Incnis Mrsi (talk) 18:17, 12 March 2018 (UTC)
  • Certainly. And I don't think we would currently place a map that was simply produced in December 2014 in Seattle in the Category:December 2014 in Seattle, except perhaps indirectly if we had a subcat like Category:December 2014 in Seattle ''works'', which in a Wikibase approach would probably be related to using the property for Seattle in a different attribute than for images of Seattle. - Jmabel ! talk 23:39, 12 March 2018 (UTC)
Ok, then about the set "Person". For example, version 1.0 (set "2014 in Seattle" may be later) --Fractaler (talk) 11:52, 13 March 2018 (UTC)
By WP version, the set "2014 in Seattle" has: a superset "2014 in Washington (state)", a superset "2010s in Seattle", a superset "2014 in the United States by city". Commons' model: for the set: superset "2014 in the United States by city", superset "2014 in Washington (state)", superset "Seattle, Washington in the 2010s", superset "Seattle by year":
2014 in Seattle(5 C, 1 P)

WD does not have such an item. Does the same world have different models of the world? --Fractaler (talk) 07:57, 15 March 2018 (UTC)