Commons:Batch uploading/NYPL Digital Gallery

Images from NYPL Digital Gallery

Assigned to: Dcoetzee
Progress: Stale
Bot name: Dcoetzee

It would be great if we batch uploaded PD images from NYPL Digital Gallery - http://digitalgallery.nypl.org/nypldigital/index.cfm NYPL Digital Gallery provides free and open access to over 685,000 images digitized from The New York Public Library's vast collections, including illuminated manuscripts, historical maps, vintage posters, rare prints, photographs and more. --Butko (talk) 14:45, 14 April 2009 (UTC)

  • This collection turned out to be more promising than I supposed. They use LizardTech ContentServer to serve up their images, whose API is described here. Here's how you extract original TIFFs at full size: first use a "browse" query to obtain some XML including the image dimensions, like this one [1]. The folder name and image name can be obtained from the URL of the zoom view. Then, use a getimage query like this one [2] to get the full size TIFF, specifying the dimensions from the previous query. Tada. Close examination shows no artifacts in the TIFF - these are original scans (internally, they are SID images). The first one I extracted was 3845 × 4947, about 60 MB as a TIFF, and 27 MB as a PNG (which you can preview here). They throttle you at 80 KB/s per transfer, but they do allow simultaneous transfers; any way you look at it, though, it would take a long time to fetch all the images we need. In light of the long download time per image, we're going to want to filter by license before downloading. Dcoetzee (talk) 07:00, 15 April 2009 (UTC)
  • Update: their complete collection of high-resolution images is browsable here. This can be used to easily obtain a list of folder-name pairs. I'll presently begin downloading. Dcoetzee (talk) 06:23, 16 April 2009 (UTC)
  • Update: a better way to download these is to use the "getfile" function to get the raw .sid files, which are highly compressed (as in [3]) and then use LizardTech's command-line decoder to convert to TIFF ([4]). This is a quicker download and doesn't even require the dimensions. Dcoetzee (talk) 22:12, 16 April 2009 (UTC)
  • I'm still in the middle of grabbing these. Enumerating IDs turned out to be trickier than I thought, because the folders are so large the browse interface times out on them. I ended up enumerating them instead using wildcard searches on single letters. Even just looking at the high res images, it's a lot of data. All told we're talking at least 100 GB in PNGs, and I'm pretty sure all of the high-resolution images are public domain works, although that will require further confirmation. It's an excellent source. Dcoetzee (talk) 06:59, 22 April 2009 (UTC)
  • Update: I've enumerated about 65000 high-res images, and am in the process of downloading and converting them to PNGs, slowly enough not to overwhelm their bandwidth. So far I've retrieved about 17250, occupying 323 GB. I'm also in the process of generating image descriptions of them based on NYPL metadata. I've created Category:New York Public Library Digital Gallery and plan to start uploading some of them soon. Dcoetzee (talk) 13:18, 16 May 2009 (UTC)
  • Update: I've had contact from a representative of the NYPL, who has been very helpful in furnishing IDs and sanctioning the sharing of their public domain images. He gave me a list of about 40,000 stereographs which I can begin uploading as soon as I put together a suitable fully-automated upload tool for the task. Dcoetzee (talk) 21:43, 25 June 2009 (UTC)
    • Great work. I think this is good news and I'm very happy that someone over there is nice enough to help out.--Diaa abdelmoneim (talk) 20:46, 27 June 2009 (UTC)
      • I have just begun automated uploading of this collection of 40,000 images, which are being placed along with existing images in Category:Images from the New York Public Library. Each image and its metadata are being downloaded from NYPL on-the-fly. Dcoetzee (talk) 03:11, 28 June 2009 (UTC)
      • Update: I've estimated that at my present rate of upload, the current collection being uploaded (which actually contains 84000 images) will require about 7 weeks to upload, and will occupy about 500 GB. Dcoetzee (talk) 10:38, 28 June 2009 (UTC)
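
To put the throttling figures from the first comment in perspective, here is a rough back-of-the-envelope sketch in Python. The 80 KB/s cap and the 60 MB TIFF size come from the discussion above; the function names, the 1 MB = 1024 KB convention, the batch of 1000 files, and the assumption that each simultaneous transfer gets the full 80 KB/s are illustrative assumptions only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_seconds(size_mb: float, rate_kb_per_s: float = 80.0) -> float:
    """Seconds needed to fetch one file at a fixed per-transfer rate."""
    return size_mb * 1024 / rate_kb_per_s


def batch_days(n_files: int, size_mb: float, parallel: int = 1,
               rate_kb_per_s: float = 80.0) -> float:
    """Days needed for n_files equal-sized files over `parallel` transfers.

    Assumes (hypothetically) that every parallel transfer is throttled
    independently and sustains the full rate.
    """
    total = n_files * transfer_seconds(size_mb, rate_kb_per_s)
    return total / parallel / SECONDS_PER_DAY


if __name__ == "__main__":
    one_tiff = transfer_seconds(60)  # the 60 MB TIFF mentioned above
    print(f"one 60 MB TIFF: {one_tiff:.0f} s ({one_tiff / 60:.1f} min)")
    # Hypothetical batch: 1000 such files over 4 simultaneous transfers.
    print(f"1000 files, 4 at a time: {batch_days(1000, 60, parallel=4):.1f} days")
```

Even under these generous assumptions a full-size TIFF takes over twelve minutes, which is why the smaller raw .sid files fetched with "getfile" are the more practical route.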

Nice upload, but I have a couple of points you should address:

  1. I don't like the two versions (PNG & JPEG). Who cares about thumbnail size? Are you sure you want to upload two versions of every image? And why not upload the original TIFFs for our restoration people?
  2. The files are uncategorized, please tag them with {{subst:unc}} right away.
  3. How are you going to get these files categorized? The images should probably all be in a subcategory of Category:Stereo cards and in one or more topic categories.
  4. Other versions field seems to be broken. (That was an easy fix.) Multichill (talk) 11:30, 28 June 2009 (UTC)

Multichill (talk) 11:20, 28 June 2009 (UTC)

More questions:

  • By 84000 images, do you mean 42000 PNG and 42000 JPEG?
  • Why don't you merge the source template into the source field in the {{NYPL-image-full}} template?
  • Does the bot auto-categorize?
  • What's the license of these images? Why are they PD? I mean, why is the original file before the scan PD?--Diaa abdelmoneim (talk) 12:08, 28 June 2009 (UTC)
    • They're all PD due to age ({{PD-1923}}), according to the NYPL, although some of them don't list a specific date on their page (for many of them, you have to click through to the original source description to verify the age). There was one date field that I was not grabbing, which I am currently modifying the bot to grab. The bot does not do autocategories (I don't have that functionality, and I don't trust autocategories anyway), but I am now automatically marking them as uncategorized. Uploading the TIFFs doesn't make any sense, because they are derived from MrSID files and contain exactly the same data as the PNG files (there is no metadata).
    • I also prefer not to have two versions, but thumbnail size is a very real concern, and unfortunately the software does not support JPEG thumbnails for PNG files. For example, a typical image of width 300 would be about 30 KB in size, which is prohibitive for modem users when many such images are used on a page. When the software adds a proper feature for this, they can all be deleted. Oh, and no, I mean 84000 PNG and 84000 JPEG.
    • Should I be putting these all in the root category Category:Stereo cards? Dcoetzee (talk) 17:25, 28 June 2009 (UTC)


Categorizing

  • I'm currently categorizing to the "Category:Robert N. Dennis collection of stereoscopic views"--Diaa abdelmoneim (talk) 17:22, 28 June 2009 (UTC)
    • I can take care of categorizing by source collection automatically if you wish - please don't go to unnecessary manual effort. :-) Dcoetzee (talk) 17:26, 28 June 2009 (UTC)
      • I started a bot that does this for the first 1600 images. It would be good if you did this with all your upcoming uploads. And you said 84000 images as a first batch. How many more batches are there? If it is possible for me to assist in the upload I would be glad to do so. Multichill also has a university connection or a very high-speed connection; I'm sure if we ask him kindly he would help with the upload. If we work together we can upload this in a week. And please don't add the images to the stereo card root category, just to Category:Robert N. Dennis collection of stereoscopic views.--Diaa abdelmoneim (talk) 17:49, 28 June 2009 (UTC)
        • Unfortunately that may not be an option, depending on how fast the NYPL wants their servers hit. I can inquire about it. I can deal at least with the Robert N. Dennis collection right now, but other subcollections will have to wait until I see how many collections there are and how meaningful they are. Dcoetzee (talk) 17:53, 28 June 2009 (UTC)
          • So should I keep categorizing the first 1600 images of the batch? I don't want there to be a double category or something. How many images do you upload daily? And how big a PD collection do they have?--Diaa abdelmoneim (talk) 18:00, 28 June 2009 (UTC)
            • No, I'll go back for them a bit later this week, don't worry. :-) And I'll check for any existing category so double categories will not occur. I upload roughly one image every 50 seconds or 1728 per day (this includes both the PNG and JPEG). I have no idea how large their complete PD collection is, and I don't think they do yet either. Dcoetzee (talk) 18:08, 28 June 2009 (UTC)
  • Could the bot also categorize to location? Like in File:Camping_out,_from_Robert_N._Dennis_collection_of_stereoscopic_views.jpg the location being Michigan? --Diaa abdelmoneim (talk) 18:31, 28 June 2009 (UTC)
  • The past couple of files have been very low res. Is this a mistake by the bot or are these really low res?--Diaa abdelmoneim (talk) 18:34, 28 June 2009 (UTC)
    • Some files do not have SID files available from the NYPL - for these I upload the highest available resolution, which is about 700px wide. And yes, I may be able to extract the rough location from the Original Source field. For now I must go away but back later. :-) Dcoetzee (talk) 18:44, 28 June 2009 (UTC)
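
The upload figures quoted in this thread (one upload roughly every 50 seconds, 1728 per day, and about 7 weeks for 84000 images) can be cross-checked with a few lines of Python. This is only an arithmetic sketch; the function names are made up for illustration, and it assumes the rate stays constant around the clock.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def uploads_per_day(seconds_per_upload: float) -> float:
    """Uploads completed in 24 hours at a constant pace."""
    return SECONDS_PER_DAY / seconds_per_upload


def weeks_to_upload(n_items: int, seconds_per_upload: float) -> float:
    """Weeks needed to upload n_items at a constant pace."""
    return n_items / uploads_per_day(seconds_per_upload) / 7


if __name__ == "__main__":
    print(f"{uploads_per_day(50):.0f} uploads per day")
    print(f"{weeks_to_upload(84000, 50):.1f} weeks for 84000 items")
```

At 50 seconds per upload the daily count works out to 1728, and 84000 items take just under 7 weeks, which matches the estimates given above.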

Looks like all images are now tagged with Category:Robert N. Dennis collection of stereoscopic views and {{Uncategorized}}. This seems like a good starting point to me, but I'd rather have a dedicated uncategorized template, just like with Barch and Fotothek. Could you please tag the images with {{Uncategorized-NYPL}}? I'll create the remaining structure later this week. This will prevent your uploads from flooding the regular tree, and messages like this one. Multichill (talk) 20:07, 29 June 2009 (UTC)

Ok. The basics are there. If everyone agrees we only need to run a bot to change the old uploads (replace.py -lang:commons -family:commons -transcludes:NYPL-image-full -regex -nocase "\{\{Uncategorized\|" "{{Uncategorized-NYPL|" ). Multichill (talk) 20:21, 29 June 2009 (UTC)
No problem, I'll take care of everything. :-) Dcoetzee (talk) 23:07, 29 June 2009 (UTC)
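
For readers unfamiliar with replace.py, the command above amounts to a case-insensitive regex substitution applied to the wikitext of every page that transcludes NYPL-image-full. A minimal standalone sketch of the same substitution using Python's re module follows; it is not pywikibot code, and the function name and sample wikitext are invented for illustration.

```python
import re

# Same pattern as in the replace.py call above; -nocase maps to re.IGNORECASE.
PATTERN = re.compile(r"\{\{Uncategorized\|", re.IGNORECASE)
REPLACEMENT = "{{Uncategorized-NYPL|"


def retag(wikitext: str) -> str:
    """Swap the generic uncategorized tag for the NYPL-specific one."""
    return PATTERN.sub(REPLACEMENT, wikitext)


if __name__ == "__main__":
    # Hypothetical page text, for illustration only.
    sample = "{{NYPL-image-full}}\n{{uncategorized|year=2009|month=June|day=28}}"
    print(retag(sample))
```

Note that the pattern requires the pipe character, so a bare {{Uncategorized}} with no parameters is left alone, and pages already carrying {{Uncategorized-NYPL|...}} are not matched again.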

Subject categories

Could you or Multichill create a bot that automatically adds a temporary subject category to each file, which would then be checked and, if correct, moved into a permanent category, like what has been done with Fotothek or BArchive? I'm not sure we should wait until the first 80,000 images are up and then start categorizing. BTW the NYPL has started receiving funds again from the city of New York, so they might stop throttling downloads. It would be beneficial if you would inquire about that.--Diaa abdelmoneim (talk) 20:22, 30 June 2009 (UTC)

I'd be happy to do this but haven't seen this type of thing before - is there an example or description of this process somewhere? Many of these can (if nothing else) be automatically categorized into the category for the city where they were taken. Dcoetzee (talk) 22:15, 30 June 2009 (UTC)
Commons:Fotothek has categories assigned to their files based on the description. In "Original source: ", the subject or the place where the photo was taken is usually given at the end. Dividing the images into such categories would make further categorization easier. So for example File:Camping_out,_from_Robert_N._Dennis_collection_of_stereoscopic_views.jpg has "Original source: Robert N. Dennis collection of stereoscopic views. / United States. / States / Michigan / Stereoscopic views of Lake Superior Scenery." You could grab "Stereoscopic views of Lake Superior Scenery" from there, because it comes after a slash and before a bracket. The temp category would be "NYPL_Stereoscopic views of Lake Superior Scenery", which would later be reviewed and approved by a user. These would serve as preliminary categories.--Diaa abdelmoneim (talk) 22:23, 30 June 2009 (UTC)
That makes sense - incidentally, is there an easy way to merge a category into a different existing category? Will CommonsDelinker do this? For many of these the corresponding existing category is obvious, and automated merging would be desirable. Dcoetzee (talk) 22:40, 30 June 2009 (UTC)
I'm currently automatically subcategorizing the images and placing the categories in Category:Temporary categories for images from the New York Public Library. I'm also updating the uncategorized tags and Robert N. Dennis category on my initial uploads. Dcoetzee (talk) 01:55, 1 July 2009 (UTC)
See User:CommonsDelinker/commands/documentation#Categorize uncategorized images. Multichill (talk) 19:37, 1 July 2009 (UTC)
Is it possible to have a template like the one found on http://commons.wikimedia.org/wiki/Category:Images_from_the_Deutsche_Fotothek,_location_Dresden so that categorizing becomes easier?--Diaa abdelmoneim (talk) 09:37, 2 July 2009 (UTC)
That sounds like a good idea. However, I'd want to be sure first that CommonsDelinker recognizes the new Uncategorized-NYPL... Dcoetzee (talk) 10:55, 2 July 2009 (UTC)
Dcoetzee, you should probably only add Uncategorized-NYPL if you don't have a proper temp category. This way we can just use the normal category move bots to move images from a temp cat to a proper topic category. Multichill (talk) 11:01, 2 July 2009 (UTC)
Dcoetzee, can we delete a temp category once it's cleaned out or do you expect more images to go into these categories? Multichill (talk) 17:03, 2 July 2009 (UTC)
On the first point, already done, on the second - I have no idea. But it'll get recreated as necessary anyway. Dcoetzee (talk) 18:23, 2 July 2009 (UTC)
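
The temporary-category idea discussed above (take the last slash-separated segment of the "Original source" field and prefix it with "NYPL_") can be sketched in Python as follows. This is an illustrative guess at the rule, not the bot's actual code: the function name is hypothetical, and treating "before a bracket" as "cut at the first opening parenthesis or square bracket" is an assumption.

```python
import re
from typing import Optional


def temp_category(original_source: str, prefix: str = "NYPL_") -> Optional[str]:
    """Derive a temporary category name from an "Original source" string.

    Returns None when no usable segment is found.
    """
    text = original_source.strip()
    label = "original source:"
    if text.lower().startswith(label):
        text = text[len(label):]
    # The most specific subject comes after the last slash.
    last = text.split("/")[-1]
    # Assumption: drop anything from the first opening bracket onward.
    last = re.split(r"[(\[]", last, maxsplit=1)[0]
    last = last.strip().rstrip(".").strip()
    return prefix + last if last else None


if __name__ == "__main__":
    source = ("Original source: Robert N. Dennis collection of stereoscopic "
              "views. / United States. / States / Michigan / "
              "Stereoscopic views of Lake Superior Scenery.")
    print(temp_category(source))
```

For the Camping out example this yields "NYPL_Stereoscopic views of Lake Superior Scenery", the temporary category name proposed in the thread. The same split could also pick out an earlier segment such as "Michigan" for location-based categories.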

NYPL and PD-Scan

Dcoetzee, I'm a little unhappy with the way our images are tagged as PD-Scan only. Many of the images don't have their original publication date, and someone who looks at the picture can't be sure it's PD, as there is no clear sign of it. For example File:Arch_on_St._George_Avenue,_from_Robert_N._Dennis_collection_of_stereoscopic_views.png has only "Digital item published 5-5-2005; updated 2-12-2009." which doesn't assert PD-old. There is an NYPL page about the collection which may hold clues about why the collection is PD. I think that once we have cleared up why the collection is PD, we should create a template stating why it is PD, to go alongside PD-Scan. --Diaa abdelmoneim (talk) 10:58, 4 July 2009 (UTC)

  • I agree, the NYPL image metadata does not generally contain sufficient information to clearly establish their copyright status. I have only the word of the NYPL that these are public domain, and they may not be as conservative in evaluating copyright status as we are. I don't really want to filter them before upload though, because I'm fairly confident most of these actually are PD and are just missing the metadata to prove it. There are two things I can do here: I can fetch the "Imprint" date from the collection, and I can tag any images that do not have a clear indicator of copyright status for human review with Category:PD files for review. This could prove to be rather difficult though, because dates are specified in a variety of strange formats that are difficult to parse. Dcoetzee (talk) 22:15, 4 July 2009 (UTC)
    • Or just an OTRS confirmation, or a rights information page on their site saying "no known restrictions". Don't tag anything please. I'm sure all images are PD but only need a legal confirmation.--Diaa abdelmoneim (talk) 22:19, 4 July 2009 (UTC)
      • As far as I know OTRS is inappropriate for public domain images - that's for the copyright holder confirming that they've released a work, and NYPL is not the copyright holder. Their copyright status will need to be confirmed based on the available information, and PD review has already agreed to help me with this kind of thing in the past. As for "no known restrictions", every one of these image description pages says that in its HTML metadata - their evaluation can't be trusted. Dcoetzee (talk) 23:30, 4 July 2009 (UTC)
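
One conservative way to handle the irregular "Imprint" dates mentioned above is to pull out every four-digit year and send anything without a clear pre-1923 year to human review. The Python sketch below illustrates that idea only; the function names and most sample strings are hypothetical, it is not the bot's actual logic, and it is no substitute for a real copyright assessment.

```python
import re
from typing import Optional

# Four-digit years from 1500 to 2099, not embedded in a longer digit run.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)")


def latest_year(imprint: str) -> Optional[int]:
    """Latest year mentioned in a free-form date string, or None."""
    years = [int(y) for y in YEAR.findall(imprint)]
    return max(years) if years else None


def needs_review(imprint: str, cutoff: int = 1923) -> bool:
    """True when the string gives no clear evidence of a pre-cutoff date."""
    year = latest_year(imprint)
    return year is None or year >= cutoff


if __name__ == "__main__":
    samples = [
        "[ca. 1865-1880]",  # hypothetical imprint format
        "c1901.",           # hypothetical imprint format
        "18--?",            # hypothetical: partial year, cannot be parsed
        "Digital item published 5-5-2005; updated 2-12-2009.",
    ]
    for s in samples:
        print(f"{s!r}: review needed = {needs_review(s)}")
```

Taking the latest year errs on the side of caution: a range such as 1865-1880 is judged by 1880, and a page that carries only the digitization date, like the Arch on St. George Avenue example, is flagged for review rather than assumed to be old.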

Status

What's the status of this upload? Multichill (talk) 12:29, 17 September 2009 (UTC)

Sorry for the delay. I'm working on getting a Toolserver account so I can continue the upload with my existing tools and Mono, or with a rewrite of the tools. It should be able to pick up right where I left off. I don't have enough bandwidth at home to do the upload. Dcoetzee (talk) 08:48, 25 September 2009 (UTC)