Commons:Batch uploading/DPLA


DPLA

See Com:DPLA for an overview of the project. The DPLA has metadata for over 2 million records; sadly, only a portion of these are in the public domain (PD). User:Bdcousineau is going through the holdings collection by collection to identify PD materials. See Com:DPLA for the list.

  • Source to upload from:
    • Did you observe a URL pattern?
    • Do you know whether the site has an API?
The DPLA has an API that is available for use; however, it is a metadata repository. The source files are linked from the holding institution's local website. See Commons:Bots/Requests/Smallbot (10) for sample templating, etc. The bot operator retired before the upload began.
    • What else can ease uploading (is the site valid XHTML, WCM they use…)?
    • Did you contact the site owner?
Given the mission of the DPLA, there may be no need. The DPLA has representation on the project page. The project coordinator is happy to contact site owners if needed.
  • Describe the works to be uploaded in detail (audio files, images by …):
JPEG and TIFF files.
  • Which license tag(s) should be applied?
Depending on the collection, either {{PD-US}} or {{PD-USGov}}.
  • Is there a template that could be used on the file description pages? Do you think a special template should be created?
Depending on the collection: {{Artwork}}, {{Photo}}, or {{Book}}. We've also created a preliminary institution tag that will be adjusted to reflect the owning institution.

Bdcousineau (talk) 00:59, 5 August 2013 (UTC)
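
As a rough illustration of the point above that the API serves metadata while the files themselves sit on the holding institution's site, here is a minimal Python sketch. It is not code from this request. The endpoint `https://api.dp.la/v2/items`, the `q`, `page_size`, and `api_key` parameters, and the `docs`, `isShownAt`, and `object` field names are assumptions that should be checked against the current DPLA API documentation; the sample record is made up.

```python
from urllib.parse import urlencode

# Assumed endpoint; check the current DPLA API documentation before use.
DPLA_ITEMS_ENDPOINT = "https://api.dp.la/v2/items"


def build_items_url(query, api_key, page_size=10):
    """Build a search URL for the (assumed) DPLA items endpoint."""
    params = {"q": query, "page_size": page_size, "api_key": api_key}
    return DPLA_ITEMS_ENDPOINT + "?" + urlencode(params)


def extract_source_links(response):
    """Return (record id, landing page URL, preview URL) for each record.

    The field names are assumptions about the response shape; missing
    fields come back as None rather than raising.
    """
    links = []
    for doc in response.get("docs", []):
        links.append((doc.get("id"), doc.get("isShownAt"), doc.get("object")))
    return links


# Hypothetical response, shaped like what the API is assumed to return.
sample_response = {
    "docs": [
        {
            "id": "example-id-1",
            "isShownAt": "https://example.org/items/1",
            "object": "https://example.org/thumbs/1.jpg",
        }
    ]
}

if __name__ == "__main__":
    print(build_items_url("lighthouse", "YOUR_API_KEY"))
    print(extract_source_links(sample_response))
```

The landing page URL is what a bot would still have to visit (or pattern-match) to find the full-resolution JPEG or TIFF, which is why the URL-pattern question above matters.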

Opinions

I'd be happy to assist with the task. However, we need to establish a good way to handle this. Perhaps a specific template should be created that holds all the notes from the linked nom. That would give us more control over the licensing should we want to make slight updates later. Also, was code to retrieve the files ever written? -- とある白い猫 ちぃ? 10:31, 8 September 2013 (UTC)

Hi, thanks! Please know I'm not a techie, so what is "linked nom"? For the initial batch we were working with, we mimicked the templating created by prior uploads (see Commons:Bots/Requests/Smallbot (10); there is a sample of the JSON source on that page as well). I can see, however, that an overarching template will be needed. As far as the licensing goes, all the materials we started to work with are {{PD-USGov}}; the others will be different.
I guess the big question is, where would you like to start? With where we left off, or with a smaller batch? A smaller batch makes the most sense, in that the templating and licensing can be adjusted as you suggest. The Massachusetts Digital Commonwealth has a few smaller collections that are PD and total approximately 1,500 items.
To be clear, disclaimer: I am a NARA employee - this project has nothing to do with my official duties, nor does it reflect official policy, etc etc. Bdcousineau (talk) 12:27, 8 September 2013 (UTC)
OK so perhaps we should do this like a Q&A to avoid mistakes. I meant the Commons:Bots/Requests/Smallbot (10) when I stated "linked nom".
  1. Does this repository have a variety of licenses? If so, is there a list of them? Can we easily distinguish the license of each file?
  2. Were any files copied to Commons with a bot before? Or was that never the case? I'd rather avoid re-engineering code if it already exists. What exactly do you mean by "where we left off"?
  3. Does this repository grow in size? If so, how often do we need to update?
  4. Do you have a link to the API and some sample images?
-- とある白い猫 ちぃ? 13:04, 8 September 2013 (UTC)
Much easier, thanks!
  1. For this project, only PD materials are appropriate. Some will be {{PD-USGov}}, others {{PD-art/1923}}, and others will be {{PD-1923}}. In general, each collection will have the same license for all of its files: for example, the ARTStor files (10K files) will be {{PD-art/1923}}, and the MassDigCommonwealth files will be {{PD-1923}}. Even though the DPLA has a huge number of files, only a small percentage are PD. Yes, easily distinguishable. Also, I can generate any list you'll need, collection by collection.
  2. No, no files copied to Commons yet. "Where we left off": group consensus asked previous uploader to upload small sample batch for further review. Task not completed. Most likely that previous work is useless, and should be ignored.
  3. Yes, DPLA grows in size, both inside each collection and as new service hubs/partners are added. There is no consistent wording for licensing, either; licensing is developed at the donor level and varies wildly. Last time I checked, searching by the licensing field was not an option, so PD mapping is done by hand. Since the project has a DPLA contact (user:SJ), it might be possible to get better information on the rate at which material is added to DPLA. Can this be put off for a moment? Also, since the project does have a DPLA connection, it's reasonable that at some point he needs to be drawn into the consensus process around templating, etc., especially if DPLA-specific templates are developed.
  4. You have to get a key for the API here. Sample images: DPLA is broken, can't get any search results. Will try again later today.
New: The DPLA Dev team and others associated with the DPLA were excited when we contacted them about this (April 2013)... so I am assuming we can get some support from them if needed. Bdcousineau (talk) 14:18, 8 September 2013 (UTC)
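
The per-collection licensing described in answer 1 could be captured as a simple lookup table, so that a tag is applied only to collections that have been reviewed by hand. This is a minimal Python sketch and not an agreed design: the collection keys and the `collection` field on each record are hypothetical, and only the license tags are taken from the discussion above.

```python
# Collection keys are hypothetical labels; the tags are the ones named
# in answer 1 above. Only hand-reviewed PD collections belong here.
COLLECTION_LICENSES = {
    "ARTstor": "{{PD-art/1923}}",
    "Digital Commonwealth": "{{PD-1923}}",
}


def license_tag_for(collection):
    """Return the license tag for a reviewed collection, or None."""
    return COLLECTION_LICENSES.get(collection)


def split_by_review_status(records):
    """Separate records that have a known tag from those needing review.

    Each record is assumed to be a dict with an optional 'collection' key.
    Records from unlisted collections are never tagged automatically.
    """
    ready, needs_review = [], []
    for record in records:
        tag = license_tag_for(record.get("collection"))
        if tag is None:
            needs_review.append(record)
        else:
            ready.append((record, tag))
    return ready, needs_review
```

Keeping the table explicit means a collection has to be added deliberately before anything from it is uploaded, which matches the by-hand PD mapping described in answer 3.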
One possibility is for them to upload to Flickr, and I can use existing code to retrieve the files from there. That way they can also throttle their bandwidth usage to prevent outages, as the bot would be relentless (I don't know their upload limits). They can, for instance, use http://www.flickr.com/tools/ . For the script to work, they must release the files under a free license. If they are willing to take this option, I wouldn't need to write any code. Or they can upload directly to Commons, of course. I'm just curious whether they are unwilling to do either. -- とある白い猫 ちぃ? 21:25, 8 September 2013 (UTC)
Hmmm... that level of support is unlikely; it'll be more like a thumbs-up/pat on the back/yes go for it. IMHO I don't think the DPLA is in the business of pushing the files out once the service hubs sign on; they are strictly a repository. While a great angle, this version of the plan is probably a non-starter. It'll be up to Wikimedians to figure out a way to bring the files to Commons. BTW I really appreciate having this discussion, thanks. Bdcousineau (talk) 23:59, 8 September 2013 (UTC)
Well, I need sample images, URLs, etc. to work with. -- とある白い猫 ちぃ? 20:42, 11 September 2013 (UTC)

Ok, will try by Saturday, surely by Sunday am. Tied up til then. Thanks so much. Bdcousineau (talk) 01:02, 12 September 2013 (UTC)

Please do not hurry, I am rather busy with real world affairs until more or less the end of this month. This is an issue that needs to be handled with time and care anyways. -- とある白い猫 ちぃ? 21:12, 13 September 2013 (UTC)
Great! Here are sample urls of declared PD materials:
All the metadata from the DPLA's API is PD. Let me know if this is useful, and what you needed. Bdcousineau (talk) 22:15, 15 September 2013 (UTC)
Assigned to | Progress | Bot name | Category