Commons talk:Structured data/Modeling/Source


These earlier notes and resources may be inspiring:

Multichill (talk) 18:36, 12 September 2019 (UTC)

Source from Wikimania 2019

Copied here for future reference. Get the data! If we look at 100,000 random images, what is in the source field?
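A first pass at "what is in the source field" could be a rough tally like the one below. This is only a sketch: the bucket names, the regular expressions and the sample strings are illustrative assumptions, not an agreed classification, and a real run would need the actual source-field text from the sampled files.

```python
import re
from collections import Counter

# Hypothetical patterns: these are guesses at common source-field contents.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
OWN_WORK_RE = re.compile(r"\{\{\s*own(\s+work)?\s*\}\}|\bown work\b", re.IGNORECASE)

def classify_source(text: str) -> str:
    """Put one source-field string into a coarse bucket."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if OWN_WORK_RE.search(stripped):
        return "own work"
    if URL_RE.search(stripped):
        return "url"
    return "free text"

def tally(sources):
    """Count how many source fields fall into each bucket."""
    return Counter(classify_source(s) for s in sources)

# Made-up sample values, for illustration only.
sample = [
    "{{own}}",
    "Own work",
    "https://example.org/photo/12345",
    "Scanned from an art book",
    "",
]
counts = tally(sample)
```

The size of the "free text" bucket gives a rough idea of how much is left once the easy cases have been extracted.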

Immediate source of image

  • Own work --> only for photos of people & places ?
    • distinguish photos/scans of their own artwork
    • created digital original drawings/artwork
    • created diagrams -- software used ?
Easy enough to create a Q-item for "original creation by uploader" (now created as original creation by uploader (Q66458942)) as the value for a master "Source of file" property. NB: temporary property P828 is used for this role and will need to be replaced.
A bot might identify cases that look dubious, and mark them with a qualifier
BUT -- if we have this model with a top-level statement we can't have any second level of qualifiers to clarify the nature of statements being made in first-level qualifiers -- eg one might want "applies to part", or "sourcing circumstances", or to distinguish immediate vs ultimate source URL
  • From the internet
    Q-item for "file available on the internet", with qualifiers specifying detailed provenance
    Q-item for "user modification of file available on the internet"
    • Which property will point to this Q item? A new property: Source, taking a value indicating the nature of the source, with qualifiers adding further info. A "source" statement of this kind should become mandatory, with a limited closed vocabulary of possible values. Make upload wizard enforce the making of a choice.


    • Commons best practice: URL for image + URL for description by source --> two qualifiers for this ?
      ADDED: The "description by source" URL might be well handled by described at URL (P973) as a separate main statement.
      ISSUE: We might *only* have the description page URL -- and it might no longer exist. So we might need to specify that an image used to exist at a particular institution (or website), but not be able to say what the URL used to be.
    • In practice might have:
      • Some url
      • Some url with a description
      • Some url with a source site (url + Flickr, or Europeana, Internet Archive Books)
        • Identifier properties are subclass of source url
          • --> Q. Do we want to start minting new properties for identifiers from such sites, or just use URLs as per others. What are pros and cons ? Is this a workaround for not being able to find URLs that start with .... in SPARQL (because the indexing isn't there?)
    • What sorts of free text do we find in the source fields ?
      -- maybe this is the last 20% we should try to capture, after we've got the easiest 80%. But how to assess/record completeness of extraction from the source field ?
  • Sources which are offline, but eg which have been scanned
    • eg images from art books --> full bibliographic
  • See also other version section for derived works
    • Q-item as top-level value to indicate "derived from file or files on Commons" ?
      • Comment: "Other version" is only relevant if we host the work(s) that the file was derived from. But we may not, eg a scan of a page from a book, diagrams based on a diagram in a book (simple enough so no copyright), a photograph of a copyright-expired painting, a photo of a dress based on a Mondrian painting. These cases may call for a "based on" property.
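On the question above about finding URLs that start with a given prefix in SPARQL: a prefix search can be written with the standard STRSTARTS function, as in the sketch below. The helper name is hypothetical, and the query assumes a SPARQL endpoint where the `wdt:` prefix is predefined and where file statements are queryable, which is an assumption here.

```python
def build_prefix_query(property_id: str, url_prefix: str, limit: int = 100) -> str:
    """Build a SPARQL query for values of a URL property starting with a prefix.

    Hypothetical helper: it only builds the query text and does not run it.
    """
    # Escape backslashes and double quotes so the prefix is a valid SPARQL literal.
    escaped = url_prefix.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?file ?url WHERE {\n"
        f"  ?file wdt:{property_id} ?url .\n"
        f'  FILTER(STRSTARTS(STR(?url), "{escaped}"))\n'
        "}\n"
        f"LIMIT {limit}"
    )

query = build_prefix_query("P973", "https://www.flickr.com/")
```

A filter like this has to test every value of the property, so it is expressible but potentially slow, which is one argument for dedicated identifier properties.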

Also: some operations -- rotation, colour modification, cropping, etc may have been undertaken by user prior to upload.

-- so distinguish "scan of image" from "user-modified scan of image" in top-level source statement ?

Source of things shown within the image

(eg : a photo of a 2D collage of objects)

Esp. important because these things may have different copyright status
-- qualifiers below "depicts" statement ?
-- how to indicate things if there is no obvious Q-item for something in the image, but nevertheless one wants to identify it & record information relating to it? Should a "depicts" = "somevalue" statement be created to record information about particular parts of the image ?

Will often be handled by the Q-items for the value(s) of the depicts statements


  • copyright checkers may be closely tied to source: should the statements be similarly related -- or is it enough to put verification info as a qualifier or reference on the copyright status. Will SDC even have/display references ?

Metadata has provenance too

eg {{BL cat credit}}

-- on Wikidata we would indicate this in references, statement by statement. But will Commons have references?

End of copy Multichill (talk) 18:36, 12 September 2019 (UTC)

Simple own work source

@Jheald: you and some others worked on this during Wikimania, right? I would like to focus on a specific case to see if we can solve that: own work uploads, for example the files uploaded as part of Wiki Loves Monuments. Would it be as simple as "new property: Source of file" -> original creation by uploader (Q66458942)? Combined with author and license it would mean we can start converting some data. Multichill (talk) 18:02, 17 September 2019 (UTC)

Hi @Multichill: thanks for pinging me. Yes, exactly. The strong conclusion I got from that workshop was the usefulness of a top-level property "Nature of source of material", taking as values a very small number of different generic types of origin that a Commons file could have. All Commons files would eventually be expected to have a statement of this kind. For material with some kinds of origin (eg "taken from the internet"), one would then expect further statements to give details of where from and when, etc. But the simplest case would be own work, for which I created the value original creation by uploader (Q66458942), as used eg on File:Petra_Al-Kaznah_by_Night.jpg, using a has cause (P828) property as a stop-gap until the new property was proposed and created. Ideally, the statement should also have a reference (eg imported from: file description page, with date) -- statements like this need provenance, I think: we should say where they have come from, if we're doing a full-scale roll-out (other values might be eg "declared by author via Upload Wizard", etc).
Unfortunately I see that d:User:MisterSynergy has since deleted Q66458942, but I've asked him to restore it. Jheald (talk) 15:47, 18 September 2019 (UTC)
@Multichill: One question that might be worth a thought is whether the property should be just "Nature of source of material", or whether it would make sense to also combine in "Nature of material" -- so whether values should just be "original creation by uploader", or whether it would make sense for the value to be eg "original photograph by uploader" / "original drawing by uploader" / "original sculpture made and photographed by uploader" etc. I come and go between which of the two I prefer. On the one hand there is a certain discipline in trying to identify conceptual orthogonality and then represent it with orthogonal properties. On the other hand, the more specific declarations about the nature of the material may bring out more honest statements, and the greater specificity and concreteness may be easier for some people. In practical terms, by making all of the latter classes subclasses of "original creation by uploader", the same "nature of source" information would be easy enough to extract either way, under either approach, whether for templates or querying or whatever. I oscillate as to which of the two approaches would be better to go for. Jheald (talk) 08:51, 19 September 2019 (UTC)
@Jheald: I let this sink in a bit. If we look at the current situation, what we care about on Commons is the immediate source: taken and uploaded yourself, transferred from some other wiki, taken from Flickr, from some museum website, etc.
Right now, that's what I would like to model. I would probably like to call the property "source of file" to keep it generic as we do right now. Once we want to model immediate source and underlying source, we can just use some qualifiers. That way in easy situations we just have a clear statement, but we also keep the ability to model more complex situations. Do you agree? I'm probably just going to propose a new property to complete the basic information properties that are currently mandatory ({{No source}}, {{No author}} & {{No license}}). Multichill (talk) 17:46, 3 October 2019 (UTC)

Property proposal

See d:Wikidata:Property proposal/Source of file. Multichill (talk) 16:29, 13 October 2019 (UTC)

We now have source of file (P7482). Multichill (talk) 09:25, 27 October 2019 (UTC)
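For the simple own work case discussed above, the statement could look roughly like the following. This is a sketch that assumes structured data on Commons serializes statements in the usual Wikibase JSON layout; the helper names are hypothetical, and only source of file (P7482) and original creation by uploader (Q66458942) come from this discussion.

```python
import json

def item_snak(property_id: str, item_id: str) -> dict:
    """Build a snak whose value is an item (Wikibase JSON layout assumed)."""
    return {
        "snaktype": "value",
        "property": property_id,
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {
                "entity-type": "item",
                "numeric-id": int(item_id[1:]),  # "Q66458942" -> 66458942
                "id": item_id,
            },
        },
    }

def own_work_statement() -> dict:
    """source of file (P7482) = original creation by uploader (Q66458942)."""
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": item_snak("P7482", "Q66458942"),
    }

statement = own_work_statement()
serialized = json.dumps(statement, indent=2)
```

Author and license would be separate statements alongside this one; they are left out here because their modelling is not settled on this page.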

Files from the internet

@Jheald: maybe you can describe your proposal on how to model files found on the internet? I proposed:

I think your proposal is to do:

Correct? Multichill (talk) 09:48, 27 October 2019 (UTC)

It was mentioned elsewhere that described at URL (P973) is probably better than URL (P2699) because it's more specific, and the link is usually not a deeplink to the file but to a page containing the file. It was also suggested that Commons compatible image available at URL (P4765) could be added in some cases to deeplink to the file.
I'm not getting any input so I'm just going to go ahead and implement the second proposal. I just created file available on the internet (Q74228490) for this. Multichill (talk) 15:27, 9 November 2019 (UTC)
Ok, test edit. I did the same thing on the other Geograph files in Category:Dornoch Firth. What do you think? (@Jheald:). Multichill (talk) 21:03, 9 November 2019 (UTC)
Looks good, especially P7384. I'm starting to get used to Commons-style "statement groups".
It's just that somevalue/unknown isn't exactly the best supported feature around Wikidata and even more so here. Jura1 (talk) 00:09, 10 November 2019 (UTC)
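The two proposals themselves are not reproduced on this page, so the following is only one reading of the approach discussed in this thread: source of file (P7482) = file available on the internet (Q74228490), with described at URL (P973) as a qualifier. It assumes the usual Wikibase JSON layout; the helper names and the example URL are hypothetical.

```python
def item_snak(property_id: str, item_id: str) -> dict:
    """Snak with an item value (Wikibase JSON layout assumed)."""
    return {
        "snaktype": "value",
        "property": property_id,
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {
                "entity-type": "item",
                "numeric-id": int(item_id[1:]),
                "id": item_id,
            },
        },
    }

def url_snak(property_id: str, url: str) -> dict:
    """Snak with a URL value; URL values are carried as plain strings."""
    return {
        "snaktype": "value",
        "property": property_id,
        "datavalue": {"type": "string", "value": url},
    }

def internet_file_statement(description_url: str) -> dict:
    """source of file = file available on the internet, qualified by P973."""
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": item_snak("P7482", "Q74228490"),
        "qualifiers": {"P973": [url_snak("P973", description_url)]},
        "qualifiers-order": ["P973"],
    }

# Placeholder URL, not a real source page.
statement = internet_file_statement("https://example.org/photo/12345")
```

Where the source site or its URL is not known, the qualifier snak would use "snaktype": "somevalue" with no datavalue, which is the case the last comment is concerned about.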