Commons talk:Machine-readable data

From Wikimedia Commons, the free media repository

Very good

Very good addition. Microformats and other tags are sometimes added by people, but as the template changes and pieces are cut and pasted, the machine-readable data can get out of whack. We could use some documentation on how to test the correctness of those tags, and run a regular check on a list of templates that are supposed to have them. We could also use some more info on why it matters and how people can benefit from it. See Template_talk:Artwork/Archiv/2011#Problem_with_microformats_in_artwork_template for a related discussion of tags in other templates. --Jarekt (talk) 13:58, 6 March 2012 (UTC)
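
One possible shape for such a test, as a minimal sketch: take the rendered HTML of a page that uses the template and check that every expected id attribute is still present. The helper name missing_ids and the sample HTML are made up for illustration; the two IDs are ones mentioned elsewhere on this page. Python is used only as an example language.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get("id")
        if value:
            self.ids.add(value)


def missing_ids(html, expected):
    """Return the expected ids that do not occur in the rendered HTML."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return sorted(set(expected) - collector.ids)


# Hypothetical rendered output of a template that lost one of its markers.
RENDERED = """
<table>
<tr><td id="fileinfotpl_aut">Author</td><td>Jane Example</td></tr>
<tr><td>Inscriptions</td><td>some text</td></tr>
</table>
"""

print(missing_ids(RENDERED, ["fileinfotpl_aut", "fileinfotpl_art_inscriptions"]))
```

A regular check could run this over one sample page per template and report any non-empty result.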

{{Book}}

See Template_talk:Book#Machine-readable_data Jean-Fred (talk) 17:54, 8 March 2012 (UTC)

Commons namespace?

Should this perhaps be moved to Commons:Machine-readable data, since the "Help:" namespace is usually (but not always) reserved for more general MediaWiki-related help? - dcljr (talk) 15:33, 9 March 2012 (UTC)

✓ Done, indeed, thanks. Jean-Fred (talk) 22:04, 14 March 2012 (UTC)

Other similar page

I stumbled upon Commons:Machine readability as I was looking for this page. That page is from 2008 and, as far as I can see, contains incorrect and outdated information. Should we just replace it with a redirect to here, or is there something to be kept? /skagedaltalk 21:54, 27 March 2012 (UTC)

Agree those should be merged --Jarekt (talk) 13:04, 6 September 2012 (UTC)
✓ Done redirect. I updated this page as well, since that one referenced some fields this page didn't have; if anything else looks useful to merge, feel free. Rd232 (talk) 13:26, 6 September 2012 (UTC)
I was thinking about adding information about {{Book}} template markup and parallel markup scheme using en:microformats. There is also discussion about adding <td> id tags to {{Creator}}. Once that is done we should document it too. --Jarekt (talk) 14:20, 6 September 2012 (UTC)

Readability from other projects

Just to underline that any "machine-readable data" is available in any wiki project and can be read from the HTML of the local File: page just like other local data (i.e. without any trick to work around the AJAX limitations of the "same origin policy"). I'm going to parse this data from it.wikisource, both to load it when creating new pages and to use it to align the data content of old pages. I opened two threads on wikitech-l and wikisource-l about this. --Alex_brollo Talk|Contrib 15:36, 30 August 2012 (UTC)

Indeed. I added that to the page. Jean-Fred (talk) 19:34, 30 August 2012 (UTC)

Template:Information#Microformat

Template:Information#Microformat section contains information about en:Microformat markup used by {{information}}, {{Creator}} and possibly other templates. I think this information should also be included here. --Jarekt (talk) 13:04, 6 September 2012 (UTC)

Yes. Rd232 (talk) 16:56, 6 September 2012 (UTC)

<td> id attributes

A strange thing about the id attributes added to <td> HTML elements is that they are added to the <td> cell with the field name, not the one with the field content. This creates a problem for the {{Creator}} and {{Institution}} templates, which have some cells with field values but no field names (the "name" parameter) and some cells sharing a single name field (the "Date of birth/death" fields). What should be done with those? If I add the IDs to the value fields, then some templates will have them one way and some the other, which can be quite confusing. --Jarekt (talk) 02:50, 14 September 2012 (UTC)

I noticed that recently. It's very strange, and fixing it is a headache - changing it may break things that expect the ID to be attached to the field name. Really we need a whole new set of ID attributes, eg of the form FIELDNAME_value. Rd232 (talk) 05:46, 14 September 2012 (UTC)
That is a good idea. I will propose adding IDs with FIELDNAME_value names to the Creator and Institution templates and then start discussions about adding those to {{Information}}, {{Artwork}} and {{Book}}. That is going to be a lot of discussions. --Jarekt (talk) 12:37, 17 September 2012 (UTC)
Just to say that I wrote a jQuery "parser" to get the data back from Information and Book templates, even if they contain one or more Creator templates. It returns a JS object: the main key is the Book/Information fileinfotpl_key, and the contents are either a string or a nested object whose keys are fileinfotpl_creator_key. This structure is needed because multiple Creator templates produce homonymous IDs, i.e. the same IDs are produced for author and illustrator. I also added one more key, name, to the Creator data, since, strange to say, this datum doesn't have its own ID. So, if r is the name of the JS object, r["aut"]["name"] gives the name of the author, while r["book-illustrator"]["name"] gives the name of the illustrator. For now the HTML is stored as-is; some more parsing is needed to get "clean" data. You can find the test scripts at it:s:Utente:Alex brollo/ParsingHproduct.js. --193.43.176.15 15:01, 24 September 2012 (UTC)
193.43.176.15 (user:Alex brollo, I presume), you mentioned that the Creator name does not have its own ID. I can add that gender, nationality, occupations, sort-key, and probably other data do not have them either. That is because that data is not stored in a separate <td> cell. Is there some other way to ID them? By the way, the Creator's name has the vCard.fn tag; is that useful to you? --Jarekt (talk) 15:27, 24 September 2012 (UTC)
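
To illustrate the current convention discussed in this thread (ID on the label cell, value in the following cell), here is a minimal sketch of a reader. It is not the jQuery parser mentioned above; the class name, function name and sample rows are hypothetical, and it assumes the value is always the next <td> after the labelled one.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect the text of the <td> that follows each <td id="fileinfotpl_...">."""

    def __init__(self):
        super().__init__()
        self.fields = {}        # field id -> text of the value cell
        self._pending = None    # id seen on the most recent label cell
        self._current = None    # id whose value cell is being read
        self._depth = 0         # nesting depth of <td> inside the value cell
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        if self._current is not None:
            self._depth += 1            # nested table cell inside a value cell
            return
        cell_id = dict(attrs).get("id") or ""
        if cell_id.startswith("fileinfotpl_"):
            self._pending = cell_id     # label cell: the value comes next
        elif self._pending is not None:
            self._current, self._pending = self._pending, None
            self._depth, self._chunks = 1, []

    def handle_endtag(self, tag):
        if tag != "td" or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse whitespace and store the finished value cell.
            self.fields[self._current] = " ".join("".join(self._chunks).split())
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)


def extract_fields(html):
    parser = FieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical rows in the style of the Information template output.
SAMPLE = """
<table>
<tr><td id="fileinfotpl_aut">Author</td><td><a href="#">Jane Example</a></td></tr>
<tr><td id="fileinfotpl_date">Date</td><td>2012-09-24</td></tr>
</table>
"""

print(extract_fields(SAMPLE))
```

Because the result is a flat dictionary, a repeated ID overwrites the earlier value. That is the homonymous-ID problem described above, and it is why a nested structure or separate FIELDNAME_value IDs would be needed for multiple Creator templates.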

Relevant discussion

Please see Commons_talk:EXIF#Commons:Metadata_redirects_here. --Piotr Konieczny aka Prokonsul Piotrus Talk 16:55, 16 September 2012 (UTC)

Classes in description

Hi

Some input on Template_talk:Description#Fetching_description_in_a_given_language would be welcome.

On a related note: it would be nice if the code generated by using {{fr}}/{{en}} and {{Mld|fr|en}} would be the same.

Jean-Fred (talk) 16:40, 17 October 2012 (UTC)

What about creating a template for machine readable data of license templates

…so it looks like this:

{{Machine-readable-data/license
  | template_name= STRING
  | short        = STRING
  | long         = STRING
  | attr_req     = BOOL
  | attr         = STRING
  | link_req     = BOOL
  | link         = STRING
}}

instead of

<span style="display:none" class="licensetpl_STRING">
<span class="licensetpl_short">STRING</span>
<span class="licensetpl_long">STRING</span>
<span class="licensetpl_attr_req">BOOL</span>
<span class="licensetpl_attr">STRING</span>
<span class="licensetpl_link_req">BOOL</span>
<span class="licensetpl_link">STRING</span>
</span>

This would deobfuscate the whole situation, I believe. Thoughts? -- Rillke(q?) 23:15, 11 November 2012 (UTC)
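
For reference, a client reading the span markup shown above might look roughly like the following sketch. It assumes the BOOL fields contain the literal text "true" or "false"; the class and function names and the sample values are invented for illustration.

```python
from html.parser import HTMLParser

PREFIX = "licensetpl_"
FIELDS = ("short", "long", "attr_req", "attr", "link_req", "link")


class LicenseParser(HTMLParser):
    """Read the text of each <span class="licensetpl_FIELD"> into a dict."""

    def __init__(self):
        super().__init__()
        self.data = {}
        self._stack = []    # one entry per open <span>: field name or None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        name = next(
            (c[len(PREFIX):] for c in classes
             if c.startswith(PREFIX) and c[len(PREFIX):] in FIELDS),
            None,
        )
        self._stack.append(name)
        if name is not None:
            self.data.setdefault(name, "")

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text belongs to the innermost open span only.
        if self._stack and self._stack[-1] is not None:
            self.data[self._stack[-1]] += data


def parse_license(html):
    parser = LicenseParser()
    parser.feed(html)
    parser.close()
    result = {}
    for name, text in parser.data.items():
        text = text.strip()
        # Assumption: BOOL fields are written as "true" / "false".
        result[name] = (text == "true") if name.endswith("_req") else text
    return result


# Hypothetical license block; the values are placeholders.
SAMPLE = """
<span style="display:none" class="licensetpl_example">
<span class="licensetpl_short">Example-1.0</span>
<span class="licensetpl_long">Example License 1.0</span>
<span class="licensetpl_attr_req">true</span>
<span class="licensetpl_attr"></span>
<span class="licensetpl_link_req">false</span>
<span class="licensetpl_link">https://example.org/license</span>
</span>
"""

print(parse_license(SAMPLE))
```

A wrapper template as proposed would not change what such a client sees, since it would emit the same spans; it would only make the template source easier to read and keep consistent.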

List of templates that use machine readable markup

Is there a way of automatically constructing a list of templates that use, for example, fileinfotpl_aut? I'm thinking of user created templates such as http://commons.wikimedia.org/wiki/User:Biopics/infong

HYanWong (talk) 12:12, 28 February 2013 (UTC)

I do not know. The lists on this page were created by reading the template source code. --Jarekt (talk) 14:27, 28 February 2013 (UTC)

Extension development

Things are moving: mw:Requests for comment/Image information. Jean-Fred (talk) 14:24, 11 March 2013 (UTC)

Images with text, but no Artwork

{{Artwork}} has an inscriptions parameter, and its table row is marked with the attribute fileinfotpl_art_inscriptions. We have many images with text outside of artworks, such as monuments, plaques and others, whose text is not available to search. So I have created an extension for Template:Information that adds a new row labeled "Inscription" (singular) and then does the same as {{Inscription}}. Is this a good idea? I will later transfer the template from my userspace to Template:Inscription field.

Is it possible that inscriptions marked with this template could easily be transferred to Wikidata d:Property:P438 in the future? Or should we mark them differently, for example by marking the inscription text directly with <span itemprop="inscription">...</span>? --MatthiasDD (talk) 11:59, 10 November 2013 (UTC)

You can accomplish the same with:
{{Information
| Description  = Description
|other_fields_1=
{{Information_field |name={{i18n/inscription}} |value=
 {{inscription |1=monum. Latine |full form=monumenti Latine |position=bottom
   |transliteration=Transliteration |language=la 
   |de=Deutsche Übersetzung |en=English translation |fr=traduction française }}
}}
| Date         = 2013-10-22
| Source       = {{own}}
| Author       = Author
}}
This renders as a table with the following rows:
Description: Description
Inscription: bottom: monum. Latine [monumenti Latine] -Transliteration- [English translation]
Date: 2013-10-22
Source: Own work
Author: Author
So I do not think there is a need for a new template. Also, I do not like templates that you add to the end of the description field and that add a row. We had complaints that they produce invalid HTML, which some browsers handle and some do not. That is why we have the other_fields_1 and other_fields fields, so we do not have to do it. Finally, I do not like copying the content of the {{inscription}} template. You should not cut and paste the content of other templates when a call to the original template would be equally easy. It just complicates maintenance. --Jarekt (talk) 04:28, 11 November 2013 (UTC)

Yes, I knew this version, but I think it is not as easy to use as a wiki should be; this code of nested templates is more like a programming language. My version has the advantage that the inscription is written in the description field and we can change the template later if we need to. I have an idea to change the new template so that the user can write: |other_fields_1={{Inscription field |1=monum. Latine |...}} This template can call {{Information_field |class="inscription" |...}} and {{inscription |...}}. But this can mark the new row only with <td class="fileinfo-paramfield {{{class|}}}"> and not with <td id="fileinfotpl_art_inscriptions"> or other markers for machine-readable data. --MatthiasDD (talk) 22:51, 14 November 2013 (UTC)

I would be fine with writing the template you described, and you can make it add a <td class="fileinfo-paramfield inscriptions"> field. --Jarekt (talk) 04:08, 15 November 2013 (UTC)

I have now changed the template (see my userspace). The class is <td class="fileinfo-paramfield inscription"> (singular). Is that right? --MatthiasDD (talk) 21:28, 22 November 2013 (UTC)

Copyright information of underlying content

{{Copyright information}} is used to add licenses that refer to some underlying work the image was derived from (e.g. copyright of a statue which is visible on the photograph). This messes up the machine-readable data since there is no algorithmic way to tell that the license is not about the file but some parent work. This results in Bugzilla57465 in the extmetadata API.

Internally, {{Copyright information/row}} is used with the underlying=yes parameter; the solution would be to add some machine-readable markup when that parameter is present, so that the license can be ignored or interpreted in a more nuanced way.

As an example, something like this could work:

<div data-mrd-scope="restoration">[...license HTML...]</div>

--Tgr (WMF) (talk) 19:07, 22 July 2014 (UTC)
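
To make the idea concrete, here is a rough sketch of how a client could treat such a wrapper: licenses found inside an element carrying data-mrd-scope are kept apart from the licenses that apply to the file itself. The attribute name comes from the example above; everything else (class name, function name, sample license names) is hypothetical, and this is not how CommonsMetadata is implemented.

```python
from html.parser import HTMLParser


class ScopedLicenseParser(HTMLParser):
    """Separate file-level licenses from those inside a data-mrd-scope element."""

    def __init__(self):
        super().__init__()
        self.file_licenses = []     # short names that apply to the file itself
        self.scoped_licenses = []   # (scope, short name) pairs
        self._scope = None          # scope value while inside a scoped element
        self._scope_tag = None
        self._scope_depth = 0
        self._in_short = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._scope is None and attrs.get("data-mrd-scope"):
            self._scope = attrs["data-mrd-scope"]
            self._scope_tag, self._scope_depth = tag, 1
        elif self._scope is not None and tag == self._scope_tag:
            self._scope_depth += 1      # same tag nested inside the scope
        if "licensetpl_short" in (attrs.get("class") or "").split():
            self._in_short, self._text = True, []

    def handle_endtag(self, tag):
        if self._in_short and tag == "span":
            self._in_short = False
            name = "".join(self._text).strip()
            if self._scope is None:
                self.file_licenses.append(name)
            else:
                self.scoped_licenses.append((self._scope, name))
        if self._scope is not None and tag == self._scope_tag:
            self._scope_depth -= 1
            if self._scope_depth == 0:
                self._scope = None      # left the scoped element

    def handle_data(self, data):
        if self._in_short:
            self._text.append(data)


def split_licenses(html):
    parser = ScopedLicenseParser()
    parser.feed(html)
    parser.close()
    return parser.file_licenses, parser.scoped_licenses


# Hypothetical page fragment: one license for the photo, one for the depicted work.
SAMPLE = """
<span class="licensetpl_short">Example-free-1.0</span>
<div data-mrd-scope="restoration"><div>
<span class="licensetpl_short">Example-statue-license</span>
</div></div>
"""

print(split_licenses(SAMPLE))
```

A client that does not understand scopes at all could simply use the first list and ignore the second.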

Microformat is dead

This system should be changed to utilize microdata and/or RDFa. Microformat has never been a proper standard and is now more or less dead. It is sort of usable as an internal solution, but if we want this to have real importance it should be made in some proper ways, and that is most likely microdata or RDFa. The first one is the one most similar to microformat, while RDFa is probably more flexible. Jeblad (talk) 21:34, 21 August 2014 (UTC)

99% of users (me included) only look at the visible parts of the templates or pages. I am not aware of any tools or processes relying on machine-readable data, so it is hard to tell who (if anybody) might be impacted by such changes. Machine-readable data should be designed by the people who might use it or by other stakeholders, while I and other users maintaining templates will be happy to add any tags or microformats that have the consensus of the stakeholders. --Jarekt (talk) 22:20, 21 August 2014 (UTC)
Microformat was an attempt to make pages machine readable. Pages will be read by machines, that is part of indexing the pages. Use microdata and RDFa, that is the proper way to do it if you want to be part of the semantic web. Jeblad (talk) 05:45, 22 August 2014 (UTC)
Everything on that entire page is dead. Our entire methodology is completely braindead and any improvement should be with the sole purpose of improving it to a degree that we can write software to migrate it to a more sane system. That was always the plan, even when we were working on stock photo a couple of years ago. —TheDJ (Not WMF) (talkcontribs) 12:46, 26 August 2014 (UTC)
Commons:Structured data? --El Grafo (talk) 13:07, 26 August 2014 (UTC)

Template:Art Photo and MediaViewer

Hello, {{Artwork}} suggests using {{Art photo}} for works where different licenses are required for the depicted artwork and the photograph of it. However, the current implementation of MediaViewer produces nonsensical output under some conditions (see more detailed report here), not mentioning the photographer as the copyright holder of the image. Not sure if this will be fixed with the upcoming version of MV, but it doesn't work in the recent design prototype either, so it seems like a more complex problem. --El Grafo (talk) 09:51, 15 September 2014 (UTC)

Many works require you to specify different licenses for different aspects of the artwork. All sculptures require copyright tags for both the sculptor and the photographer. Derivative works could require specifying licenses for both the original and the copy. One can envision scenarios where we are dealing with several authors, each from a different country or century; see Commons:Multi-license copyright tags for some usual approaches to dealing with them. Add to this that each author might require specifying licenses for both the country of origin and the US, or that some recent photographs might be multi-licensed (for example CC and GFDL), and we might end up with a lot of different license templates on an image, while the current MediaViewer can handle only one. I am not sure what to do about it. Bug report? --Jarekt (talk) 16:26, 15 September 2014 (UTC)

Machine-readable data for non-free images

The machine-readable format described here is used on several other wikis (such as en.wikipedia) which allow non-free content; this usually means that any template informing about legal status will be marked up as a license template, and we end up with "licenses" such as fair use. To an extent this is OK (the confusion between license and legal status already exists on Commons, thanks to the PD templates, and legal statuses are usually displayed the same way as licenses), but it can be misleading when using machine-readable data to inform potential reusers.

To avoid this, the markup standard on COM:MRD should be extended to inform clients whether the image can be freely reused or only with limitations. A simplistic method for that could be to add a licensetpl_free field with the same syntax as licensetpl_attr_req. What do you think?

(Pinging @Guillaume as this could be a candidate for inclusion in the m:File metadata cleanup drive.)

--Tgr (WMF) (talk) 11:57, 15 October 2014 (UTC)

Sounds good to me. FYI, I've started m:File metadata cleanup drive/How to fix metadata, which notably includes a section about non-free media. Feedback is welcome (before I reach out to wikis with local uploads). Guillaume (WMF) (talk) 15:00, 15 October 2014 (UTC)

On second thought, maybe call it licensetpl_nonfree? That would make it clearer that the default is free (I imagine the majority of license templates / copyright tags are about free images). --Tgr (WMF) (talk) 20:39, 20 October 2014 (UTC)

✓ Done :) https://meta.wikimedia.org/w/index.php?diff=10281719 . Guillaume (WMF) (talk) 08:56, 22 October 2014 (UTC)
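
As a sketch of how a client could use the new field: treat a file as non-free only when a licensetpl_nonfree element with the value "true" is present, so that the default stays free as suggested above. The literal "true" is an assumption based on the field sharing its syntax with licensetpl_attr_req; the names and sample fragments are illustrative.

```python
from html.parser import HTMLParser


class NonFreeFlag(HTMLParser):
    """Collect the text of every element with class licensetpl_nonfree."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._reading = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "licensetpl_nonfree" in classes:
            self._reading = True
            self.values.append("")

    def handle_endtag(self, tag):
        # The flag is assumed to hold plain text, so any end tag closes it.
        self._reading = False

    def handle_data(self, data):
        if self._reading:
            self.values[-1] += data


def is_nonfree(html):
    """Return True only if some licensetpl_nonfree element says "true"."""
    parser = NonFreeFlag()
    parser.feed(html)
    parser.close()
    return any(value.strip() == "true" for value in parser.values)


# Hypothetical fragments: a fair-use tag and an ordinary free license tag.
FAIR_USE = (
    '<span class="licensetpl_short">Fair use</span>'
    '<span class="licensetpl_nonfree">true</span>'
)
FREE = '<span class="licensetpl_short">Example-1.0</span>'

print(is_nonfree(FAIR_USE), is_nonfree(FREE))
```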

Identifying information-like templates

Currently {{Information}}, {{Artwork}}, {{Photograph}} and (to some extent) {{Book}} all emit the same machine-readable markup, so a client of the COM:MRD standard cannot easily tell them apart. This is problematic because e.g. the author of a photograph and the author of the statue that's visible on the photograph cannot be used interchangeably in most cases. This leads to outright copyright violations by clients in some cases (see Template talk:Art Photo#Issue with MediaViewer for lots of details). So there is a need to label these templates in a machine-readable way.

I propose putting the classes fileinfotpl-type-information, fileinfotpl-type-artwork, fileinfotpl-type-photograph and fileinfotpl-type-book on the top-level <table> elements of these templates. --Tgr (WMF) (talk) 11:19, 21 October 2014 (UTC)

Sounds good to me. Anything we can do to help resolve this particular issue is certainly welcome! Guillaume (WMF) (talk) 08:44, 22 October 2014 (UTC)
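
With the proposed classes in place, a client could tell the templates apart with something like the following sketch. The class prefix is taken from the proposal above; the function name and the sample table are made up, and the extra class in the sample is only there to show that unrelated classes are ignored.

```python
from html.parser import HTMLParser

PREFIX = "fileinfotpl-type-"


class TemplateTypeParser(HTMLParser):
    """Record the fileinfotpl-type-* class of every <table> on the page."""

    def __init__(self):
        super().__init__()
        self.types = []

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls.startswith(PREFIX):
                self.types.append(cls[len(PREFIX):])


def template_types(html):
    parser = TemplateTypeParser()
    parser.feed(html)
    parser.close()
    return parser.types


# Hypothetical output of {{Artwork}} once the class has been added.
SAMPLE = '<table class="some-other-class fileinfotpl-type-artwork"><tr><td>x</td></tr></table>'

print(template_types(SAMPLE))
```

A client could then, for example, avoid presenting the "author" field as the photographer when the type is artwork.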

This properties page now needs xwiki pointers as it has become a Wikimedia default help page

@Guillaume (WMF), Bawolff, TheDJ: With the global change to the categorisation of files, and the adoption of the classes as defined and utilised here, which are now applicable to all WMF wikis, we need to do more to promote the classes used and to assist communities in updating. While I have seen information about the categorisation changes, I have not seen obvious helpful information about how to fix things.

It is unusual for Commons: to be the place for WMF-wide documentation and/or the configuration of a WMF-wide standard, but now it is, presumably by weight of being the "file place". Usually Meta hosts such information, or sometimes we find it at [[mw:|MediaWiki]] if it is broader again.

So we do actually have a Meta page with this information, at m:File metadata cleanup drive; however, by its name it is not a page that entices people to visit for standards, as it is its own project. We also have mw:Extension:CommonsMetadata, the extension that implements these changes, but its page is pretty generic and doesn't assist with compliance.

We need to look to how we express a (new/now) universal Wikimedia standard, and have it widely available, easily readable, and easily findable. I also think that the data as expressed on the general page needs to be known to every wiki, and we should be looking to how that is to be done better now that we have introduced a new baseline. I am wondering what people see as the alternatives to make this happen.  — billinghurst sDrewth 01:56, 22 October 2014 (UTC)

billinghurst: I wrote a page about how to fix the metadata at m:File metadata cleanup drive/How to fix metadata; I haven't advertised it very widely yet because it's being translated. That page could easily be renamed when the cleanup drive is over and made into a more reference-like documentation page. This Commons page contains a lot of information that is Commons-specific and that most wikis don't need (e.g. Artwork-related classes), and the how-to-fix page on Meta focuses on the most common IDs and classes. Does that address your concern? Guillaume (WMF) (talk) 08:39, 22 October 2014 (UTC)

Machine readable data on MIT license template

Hi,

I added machine-readable data to {{MIT}}. Could someone please check whether I did it appropriately, both on the tech side (though I am not really worried: the data is now correctly displayed by TheDJ’s tool) and on the license side? Thanks! Jean-Fred (talk) 14:23, 2 November 2014 (UTC)

Machine-readable markup for languages/language names

There are several ways of internationalizing content on Commons (and even more on other wikis) which output multiple languages and/or automatically put a language name before the text. A machine reading the page has to be able to 1) realize that the given machine-readable field contains the same information in multiple languages, 2) identify which piece of text belongs to which language, 3) identify which piece of text is a language name (which makes sense on the wiki page but should be hidden in some other contexts). Currently there is no standard for this; e.g. {{description}} and {{ls}} produce identical looks but wildly different markup:

{{description|en|foo}}
English: foo
<div class="description mw-content-ltr en" dir="ltr" lang="en" style="" xml:lang="en"><span class="language en" title=""><b>English:</b></span> foo</div>
{{ls|en|foo}}
English: foo
<div class="en lang-en" lang="en" style="margin:0.3em 0;line-height:1.2;direction:ltr;" xml:lang="en"><span class="langlabel-en" lang="en" style="font-weight:bold;" xml:lang="en">English:</span> foo</div>

CommonsMetadata currently understands {{description}} but not {{ls}}. Before fixing that, I would really like to see a standard way of marking up languages so that COM:MRD can be used as a reference when creating such templates/modules and clients don't have to identify and support a dozen competing and potentially unstable alternatives. --Tgr (WMF) (talk) 11:12, 11 November 2014 (UTC)

There are more templates that identify the language (or should identify the language): {{LangSwitch}}, {{Multilingual description}}, {{Translation table}}, etc. I have never heard of {{ls}}, but I think it is related to {{Multilingual description}}, as they both rely on m:Meta:Language select. They all should mark the language in a similar way, except that {{Multilingual description}} and {{ls}} use the <div class="multilingual"> marking, which is designed to hide descriptions in languages you do not know (I never liked that approach, since pages using it did not work correctly in the past and hid the wrong parts of the description; I do not know the current status). Another difference is that {{LangSwitch}} and {{Multilingual description}} do not visibly identify the language the way {{description}} does. But I agree the underlying machine-readable data should be the same. --Jarekt (talk) 13:36, 11 November 2014 (UTC)
@Jarekt: {{ls}} is used by {{Multilingual description}}, yes. It does visibly identify the language though.
How about the following standard (made by merging some non-visual elements of {{description}} and {{ls}}):
  • the language-specific blocks should have a lang attribute (the plain HTML one, not xml:lang)
  • the language names should have class="language"
  • the whole multilingual block can be wrapped in a tag with class="multilingual" if it would not be otherwise clear where it starts/ends
None of those classes are used for styling (on Commons at least), {{description}} already conforms to this and {{ls}} can be made to conform with some trivial changes. --Tgr (WMF) (talk) 02:23, 25 November 2014 (UTC)
I know very little about machine-readable tags, so hopefully some more knowledgeable colleagues will also take part in this discussion. I am fine with whatever changes seem appropriate, as long as they do not change the appearance and do not break existing tools and processes (do we even know who uses machine-readable tags?). About class="multilingual": it looks very much like the <div class="multilingual"> used by m:Meta:Language select to make some parts of the description magically disappear (I do not understand the process). I do not think we want text marked with {{description}} to disappear. You are also saying that the whole multilingual block can be wrapped in a tag with class="multilingual"; this might be advantageous, but with several {{description}} blocks on a page I do not see a way to do it by changing existing templates. We have 17 M pages using {{description}} blocks without any start/end marking, so it would be hard to add it. Do you have any thoughts about {{LangSwitch}}? It shows only one language, but it could have machine-readable tags in many. --Jarekt (talk) 04:15, 25 November 2014 (UTC)
The two tools using COM:MRD that I am aware of are the CommonsMetadata extension (and through that MediaViewer, the mobile media viewer and the OCG service) and the StockPhoto gadget.
You are right about multilingual: something that has a visible function should not be used for metadata. Maybe something like language-list then? Anyway, this would be optional; for fields of the Information template there are other ways to figure out where the language list starts/ends, so I would be fine with just the first two items from the list.
For LangSwitch there is no way to get the full list of the languages (for a machine using the HTML output of the page, anyway).
--Tgr (WMF) (talk) 20:51, 1 December 2014 (UTC)
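
The first two items of the proposed standard (a plain lang attribute on each language block, class="language" on the language name) can be sketched from the client side as follows. The two sample strings are the {{description}} and {{ls}} outputs quoted earlier in this section; the class and function names are hypothetical and this is not the CommonsMetadata code.

```python
from html.parser import HTMLParser


class LanguageParser(HTMLParser):
    """Collect text per lang block, skipping elements with class="language"."""

    def __init__(self):
        super().__init__()
        self.texts = {}             # language code -> text without language name
        self._lang = None           # code of the block being read
        self._block_tag = None
        self._block_depth = 0
        self._label_tag = None      # tag of the language-name element, if open
        self._label_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._lang is None:
            if attrs.get("lang"):   # outermost element with a lang attribute
                self._lang = attrs["lang"]
                self._block_tag, self._block_depth = tag, 1
                self.texts.setdefault(self._lang, "")
            return
        if tag == self._block_tag:
            self._block_depth += 1
        if self._label_tag is None:
            if "language" in classes:
                self._label_tag, self._label_depth = tag, 1
        elif tag == self._label_tag:
            self._label_depth += 1

    def handle_endtag(self, tag):
        if self._lang is None:
            return
        if self._label_tag is not None and tag == self._label_tag:
            self._label_depth -= 1
            if self._label_depth == 0:
                self._label_tag = None
        if tag == self._block_tag:
            self._block_depth -= 1
            if self._block_depth == 0:
                self._lang = None

    def handle_data(self, data):
        if self._lang is not None and self._label_tag is None:
            self.texts[self._lang] += data


def texts_by_language(html):
    parser = LanguageParser()
    parser.feed(html)
    parser.close()
    return {lang: " ".join(text.split()) for lang, text in parser.texts.items()}


DESCRIPTION = ('<div class="description mw-content-ltr en" dir="ltr" lang="en" style="" xml:lang="en">'
               '<span class="language en" title=""><b>English:</b></span> foo</div>')
LS = ('<div class="en lang-en" lang="en" style="margin:0.3em 0;line-height:1.2;direction:ltr;" xml:lang="en">'
      '<span class="langlabel-en" lang="en" style="font-weight:bold;" xml:lang="en">English:</span> foo</div>')

print(texts_by_language(DESCRIPTION))
print(texts_by_language(LS))
```

With the markup as quoted, the language name is stripped from the {{description}} output but stays in the {{ls}} output, because its label uses langlabel-en rather than class="language". That is the kind of trivial change {{ls}} would need in order to conform.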

Queries

We were asked to identify patterns, perhaps making some queries and lists helps? See Commons:Machine-readable data/Queries for a simple example.

When categorising stuff, it's often useful to go through uncategorised media by day categories because all files belonging to a same group upload tend to be together; we can probably identify similar useful divide et impera procedures. --Nemo 22:40, 11 December 2014 (UTC)

For example, some surgical editing of Template:Blason-fr-en can probably fix source information for almost 10k files. --Nemo 22:51, 11 December 2014 (UTC)

Suggestion for {{PermissionOTRS}}: Machine-readability for ticket link

I suggest adding class="otrs-permission-ticket-link" to make OTRS permission information machine-readable in Template:PermissionOTRS. Please give any input on making this a new standard. – Kwj2772 (talk) 12:22, 12 December 2014 (UTC)

It is always easy to add machine-readable tags, but harder to change them later, since someone might be relying on them. So adding a tag should be fine, but others should say if the format is OK. --Jarekt (talk) 03:07, 13 December 2014 (UTC)