Commons:Batch uploading/NYPL Digital Gallery

From Wikimedia Commons, the free media repository

Images from NYPL Digital Gallery

Assigned to: Dcoetzee
Progress: Uploading
Bot name: Dcoetzee

It would be great if we batch uploaded PD images from NYPL Digital Gallery (http://digitalgallery.nypl.org/nypldigital/index.cfm). NYPL Digital Gallery provides free and open access to over 685,000 images digitized from The New York Public Library's vast collections, including illuminated manuscripts, historical maps, vintage posters, rare prints, photographs and more. --Butko (talk) 14:45, 14 April 2009 (UTC)

  • This collection turned out to be more promising than I supposed. They use LizardTech ContentServer to serve up their images, whose API is described here. Here's how you extract original TIFFs at full size: first use a "browse" query to obtain some XML including the image dimensions, like this one [1]. The folder name and image name can be obtained from the URL of the zoom view. Then, use a getimage query like this one [2] to get the full size TIFF, specifying the dimensions from the previous query. Tada. Close examination shows no artifacts in the TIFF - these are original scans (internally, they are SID images). The first one I extracted was 3845 × 4947, about 60 MB as a TIFF, and 27 MB as a PNG (which you can preview here). They throttle you at 80 KB/s per transfer, but they do allow simultaneous transfers; any way you look at it, though, it would take a long time to fetch all the images we need. In light of the long download time per image, we're going to want to filter by license before downloading. Dcoetzee (talk) 07:00, 15 April 2009 (UTC)
  • Update: their complete collection of high-resolution images is browsable here. This can be used to easily obtain a list of folder-name pairs. I'll presently begin downloading. Dcoetzee (talk) 06:23, 16 April 2009 (UTC)
  • Update: a better way to download these is to use the "getfile" function to get the raw .sid files, which are highly compressed (as in [3]) and then use LizardTech's command-line decoder to convert to TIFF ([4]). This is a quicker download and doesn't even require the dimensions. Dcoetzee (talk) 22:12, 16 April 2009 (UTC)
  • I'm still in the middle of grabbing these. Enumerating IDs turned out to be trickier than I thought, because the folders are so large the browse interface times out on them. I ended up enumerating them instead using wildcard searches on single letters. Even just looking at the high res images, it's a lot of data. All told we're talking at least 100 GB in PNGs, and I'm pretty sure all of the high-resolution images are public domain works, although that will require further confirmation. It's an excellent source. Dcoetzee (talk) 06:59, 22 April 2009 (UTC)
  • Update: I've enumerated about 65000 high-res images, and am in the process of downloading and converting them to PNGs, slow enough to not overwhelm their bandwidth. So far I've retrieved about 17250, occupying 323 GB. I'm also in the process of generating image descriptions of them based on NYPL metadata. I've created Category:New York Public Library Digital Gallery and plan to start uploading some of them soon. Dcoetzee (talk) 13:18, 16 May 2009 (UTC)
  • Update: I've had contact from a representative of the NYPL, who has been very helpful in furnishing IDs and sanctioning the sharing of their public domain images. He gave me a list of about 40,000 stereographs which I can begin uploading as soon as I put together a suitable fully-automated upload tool for the task. Dcoetzee (talk) 21:43, 25 June 2009 (UTC)
    • Great work. I think this is good news and I'm very happy that someone over there is nice enough to help out.--Diaa abdelmoneim (talk) 20:46, 27 June 2009 (UTC)
      • I have just begun automated uploading of this collection of 40,000 images, which are being placed along with existing images in Category:Images from the New York Public Library. Each image and its metadata is being downloaded from NYPL on-the-fly. Dcoetzee (talk) 03:11, 28 June 2009 (UTC)
      • Update: I've estimated that at my present rate of upload, the current collection being uploaded (which actually contains 84000 images) will require about 7 weeks to upload, and will occupy about 500 GB. Dcoetzee (talk) 10:38, 28 June 2009 (UTC)
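To put the throttling figure quoted above in perspective, here is a rough back-of-the-envelope sketch. It uses only numbers from this thread (80 KB/s per transfer, a roughly 60 MB full-size TIFF). The function names, the 1 MB = 1024 KB convention, and the batch of 1000 files fetched 4 at a time are illustrative assumptions, not measurements.

```python
def transfer_seconds(size_mb: float, rate_kb_per_s: float = 80.0) -> float:
    """Seconds to fetch one file at a fixed per-transfer rate (assumes 1 MB = 1024 KB)."""
    return size_mb * 1024 / rate_kb_per_s


def batch_days(n_images: int, size_mb: float, parallel: int = 1) -> float:
    """Days to fetch n_images files of equal size with `parallel` simultaneous transfers.

    Assumes every transfer runs at the full throttled rate, which is optimistic.
    """
    total_seconds = n_images * transfer_seconds(size_mb) / parallel
    return total_seconds / 86400


# The ~60 MB TIFF mentioned above, at the 80 KB/s throttle.
print(f"One 60 MB TIFF: about {transfer_seconds(60) / 60:.1f} minutes")

# Hypothetical batch: 1000 files of that size, 4 transfers at a time.
print(f"1000 such TIFFs, 4 at a time: about {batch_days(1000, 60, parallel=4):.1f} days")
```

Real files vary a lot in size, and the compressed .sid files are much smaller than the TIFFs, so this is only an upper-bound style estimate for the TIFF route.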

Nice upload, but I have a couple of points you should address:

  1. I don't like the two versions (PNG and JPG). Who cares about thumbnail size? Are you sure you want to upload two versions of every image? And why not upload the original TIFFs for our restoration people?
  2. The files are uncategorized, please tag them with {{subst:unc}} right away.
  3. How are you going to get these files categorized? The images should probably all be in a subcategory of Category:Stereo cards and in one or more topic categories.
  4. The "Other versions" field seems to be broken; that was an easy fix. Multichill (talk) 11:30, 28 June 2009 (UTC)

Multichill (talk) 11:20, 28 June 2009 (UTC)

More to question:

  • Do u mean by 84000 images, 42000 png and 42000 jpg?
  • Why don't u merge the source template into the source field in the {{NYPL-image-full}} template?
  • Does the bot auto categorize?
  • What's the license of these images? why are they pd? I mean why is the original file before the scan pd?--Diaa abdelmoneim (talk) 12:08, 28 June 2009 (UTC)
    • They're all PD due to age ({{PD-1923}}), according to the NYPL, although some of them don't list a specific date on their page (for many of them, you have to click through to the original source description to verify the age). There was one date field that I was not grabbing, and I am currently modifying the bot to grab it. The bot does not do autocategories (I don't have that functionality, and I don't trust autocategories anyway), but I am now automatically marking them as uncategorized. Uploading the TIFFs doesn't make any sense, because they are derived from MrSID files and contain exactly the same data as the PNG files (there is no metadata).
    • I also prefer not to have two versions, but thumbnail size is a very real concern, and unfortunately the software does not support JPEG thumbnails for PNG files. For example, a typical image of width 300 would be about 30 KB in size, which is prohibitive for modem users when many such images are used on a page. When the software adds a proper feature for this, they can all be deleted. Oh, and no, I mean 84000 PNG and 84000 JPEG.
    • Should I be putting these all in the root category Category:Stereo cards? Dcoetzee (talk) 17:25, 28 June 2009 (UTC)


Categorizing

  • I'm currently categorizing to the "Category:Robert N. Dennis collection of stereoscopic views"--Diaa abdelmoneim (talk) 17:22, 28 June 2009 (UTC)
    • I can take care of categorizing by source collection automatically if you wish - please don't go to unnecessary manual effort. :-) Dcoetzee (talk) 17:26, 28 June 2009 (UTC)
      • I started a bot that does this for the first 1600 images. It would be good if u do this with all your upcoming uploads. And you said 84000 images as a first batch. How many more batches are there? If it is possible for me to assist in the upload I would be glad to do so. Multichill also has a university connection or a very high speed connection; I'm sure if we ask him kindly he would help in the upload. If we work together we can upload this in a week. And please don't add the images in the stereo card root category. Just in the Category:Robert N. Dennis collection of stereoscopic views.--Diaa abdelmoneim (talk) 17:49, 28 June 2009 (UTC)
        • Unfortunately that may not be an option, depending on how fast the NYPL wants their servers hit. I can inquire about it. I can deal at least with the Robert N. Dennis collection right now, but other subcollections will have to wait until I see how many collections there are and how meaningful they are. Dcoetzee (talk) 17:53, 28 June 2009 (UTC)
          • So should I keep categorizing the first 1600 images of the batch? I don't want there to be a double category or something. How many images do u upload daily? And how big of a PD collection do they have?--Diaa abdelmoneim (talk) 18:00, 28 June 2009 (UTC)
            • No, I'll go back for them a bit later this week, don't worry. :-) And I'll check for any existing category so double categories will not occur. I upload roughly one image every 50 seconds or 1728 per day (this includes both the PNG and JPEG). I have no idea how large their complete PD collection is, and I don't think they do yet either. Dcoetzee (talk) 18:08, 28 June 2009 (UTC)
  • Could the bot also categorize to location? Like in File:Camping_out,_from_Robert_N._Dennis_collection_of_stereoscopic_views.jpg the location being Michigan? --Diaa abdelmoneim (talk) 18:31, 28 June 2009 (UTC)
  • The past couple of files have been very low res. Is this a mistake by the bot or are these really low res?--Diaa abdelmoneim (talk) 18:34, 28 June 2009 (UTC)
    • Some files do not have SID files available from the NYPL - for these I upload the highest available resolution, which is about 700px wide. And yes, I may be able to extract the rough location from the Original Source field. For now I must go away but back later. :-) Dcoetzee (talk) 18:44, 28 June 2009 (UTC)
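The upload-rate figures quoted in this thread can be cross-checked with a little arithmetic. The sketch below uses the numbers given above (one upload every 50 seconds, a batch of 84000). The function names are illustrative, and treating each of the 84000 items as a single upload is an assumption, since the thread counts PNG and JPEG versions in different ways.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def uploads_per_day(seconds_per_upload: float) -> float:
    """Uploads completed in 24 hours of continuous running."""
    return SECONDS_PER_DAY / seconds_per_upload


def weeks_to_finish(total_uploads: int, seconds_per_upload: float) -> float:
    """Weeks needed for total_uploads at a constant rate with no downtime."""
    return total_uploads / uploads_per_day(seconds_per_upload) / 7


print(f"Uploads per day at one every 50 s: {uploads_per_day(50):.0f}")
print(f"Weeks for 84000 uploads: {weeks_to_finish(84000, 50):.1f}")
```

Under these assumptions the daily rate matches the 1728 per day quoted above, and the batch comes out at just under 7 weeks, in line with the earlier estimate.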

Looks like all images are now tagged with Category:Robert N. Dennis collection of stereoscopic views and {{Uncategorized}}. This seems like a good starting point to me, but I'd rather have a dedicated uncategorized template just like with Barch and Fotothek. Could you please tag the images with {{Uncategorized-NYPL}}? I'll create the remaining structure later this week. This will prevent your uploads from flooding the regular tree and messages like this one. Multichill (talk) 20:07, 29 June 2009 (UTC)

Ok. The basics are there. If everyone agrees we only need to run a bot to change the old uploads (replace.py -lang:commons -family:commons -transcludes:NYPL-image-full -regex -nocase "\{\{Uncategorized\|" "{{Uncategorized-NYPL|" ). Multichill (talk) 20:21, 29 June 2009 (UTC)
No problem, I'll take care of everything. :-) Dcoetzee (talk) 23:07, 29 June 2009 (UTC)
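For anyone who wants to see what the replace.py substitution above does to a page before running a bot, here is a minimal standalone sketch using Python's `re` module with the same pattern and replacement. It does not call pywikibot at all. The sample wikitext and its date parameters are made up for illustration.

```python
import re

# Same pattern as in the replace.py command: a literal "{{Uncategorized|",
# matched case-insensitively (the -nocase flag).
PATTERN = re.compile(r"\{\{Uncategorized\|", re.IGNORECASE)
REPLACEMENT = "{{Uncategorized-NYPL|"


def retag(wikitext: str) -> str:
    """Swap the generic uncategorized tag for the NYPL-specific one."""
    return PATTERN.sub(REPLACEMENT, wikitext)


# Hypothetical page text; the parameters are illustrative only.
sample = "{{NYPL-image-full}}\n{{uncategorized|year=2009|month=June|day=29}}"
print(retag(sample))
```

Note that the pattern requires the pipe character, so a bare {{Uncategorized}} with no parameters is left alone, and pages already carrying {{Uncategorized-NYPL| are not changed a second time.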

Subject categories

Could u or Multichill create a bot that automatically adds a temporary subject category to each file that would be checked and if correct be moved into a permanent category like what has been done with Fotothek or BArchive? I'm not sure we should wait till the first 80,000 images are up and then start cating. BTW the NYPL has started receiving funds again from the city of New York so they might stop throttling downloads. It would be beneficial if u would inquire about that.--Diaa abdelmoneim (talk) 20:22, 30 June 2009 (UTC)

I'd be happy to do this but haven't seen this type of thing before - is there an example or description of this process somewhere? Many of these can (if nothing else) be automatically categorized into the category for the city where they were taken. Dcoetzee (talk) 22:15, 30 June 2009 (UTC)
Commons:Fotothek has categories assigned to their files based on the description. In "Original source: " it is mostly written at the end what the subject is or where the photo was taken. Dividing the images into such categories would make further categorization easier. So for example File:Camping_out,_from_Robert_N._Dennis_collection_of_stereoscopic_views.jpg has "Original source: Robert N. Dennis collection of stereoscopic views. / United States. / States / Michigan / Stereoscopic views of Lake Superior Scenery." You could grab from there "Stereoscopic views of Lake Superior Scenery" cause it's after a slash and before a bracket. The category would later be reviewed and approved by a user. The temp category would be "NYPL_Stereoscopic views of Lake Superior Scenery". This would serve as a preliminary category.--Diaa abdelmoneim (talk) 22:23, 30 June 2009 (UTC)
That makes sense - incidentally, is there an easy way to merge a category into a different existing category? Will CommonsDelinker do this? For many of these the corresponding existing category is obvious, and automated merging would be desirable. Dcoetzee (talk) 22:40, 30 June 2009 (UTC)
I'm currently automatically subcategorizing the images and placing the categories in Category:Temporary categories for images from the New York Public Library. I'm also updating the uncategorized tags and Robert N. Dennis category on my initial uploads. Dcoetzee (talk) 01:55, 1 July 2009 (UTC)
See User:CommonsDelinker/commands/documentation#Categorize uncategorized images. Multichill (talk) 19:37, 1 July 2009 (UTC)
Is it possible to have a template like the one found on http://commons.wikimedia.org/wiki/Category:Images_from_the_Deutsche_Fotothek,_location_Dresden ? so that it makes categorizing easier?--Diaa abdelmoneim (talk) 09:37, 2 July 2009 (UTC)
That sounds like a good idea. However, I'd want to be sure first that CommonsDelinker recognizes the new Uncategorized-NYPL... Dcoetzee (talk) 10:55, 2 July 2009 (UTC)
Dcoetzee, you should probably only add Uncategorized-NYPL if you don't have a proper temp category. This way we can just use the normal category move bots to move images from a temp cat to a proper topic category. Multichill (talk) 11:01, 2 July 2009 (UTC)
Dcoetzee, can we delete a temp category once it's cleaned out or do you expect more images to go into these categories? Multichill (talk) 17:03, 2 July 2009 (UTC)
On the first point, already done, on the second - I have no idea. But it'll get recreated as necessary anyway. Dcoetzee (talk) 18:23, 2 July 2009 (UTC)
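The temporary-category rule suggested above (take the last slash-separated part of "Original source", stopping before any bracket) can be sketched as follows. This is only an illustration of the idea, not the bot's actual code. The function name, the handling of trailing periods, and the second test string are assumptions. The "NYPL_" prefix and the Lake Superior example come from the discussion.

```python
import re


def temp_category(original_source: str, prefix: str = "NYPL_") -> str:
    """Build a temporary category name from an 'Original source' string."""
    # Keep only the text after the last slash.
    last_segment = original_source.split("/")[-1]
    # Drop anything from the first opening bracket onward, if there is one.
    last_segment = re.split(r"[\(\[]", last_segment, maxsplit=1)[0]
    # Trim whitespace and a trailing period.
    return prefix + last_segment.strip().rstrip(".").strip()


source = (
    "Robert N. Dennis collection of stereoscopic views. / United States. / "
    "States / Michigan / Stereoscopic views of Lake Superior Scenery."
)
print(temp_category(source))
```

Categories produced this way would still need the human review step described above before being merged into a permanent topic category.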
  • For stereoscopic view #9466, I made a gallery with all 80 versions, i.e. 10 (files) * 2 (file types) * 2 (stereoscopic) * 2 (it's "Mirror Lake"). I'm wondering if I should also put them into a specific category with 9466 in its name. -- User:Docu at 11:12, 25 April 2010 (UTC)
    • Ideally we would also have at least one (non-stereoscopic) image selected from them. -- User:Docu at 11:18, 25 April 2010 (UTC)
  • To make it possible to sort the files into topical categories without overwhelming them, I set the sortkey in the template. They now appear after other images, e.g. in Category:Mirror Lake (California). -- User:Docu at 11:12, 25 April 2010 (UTC)


NYPL and PD-Scan

Dcoetzee I'm a little unhappy with the way our images are tagged as PD-Scan only. Many of the images don't have their original publish date and someone who looks on the picture can't be sure if it's PD as there is no clear sign of it. For example File:Arch_on_St._George_Avenue,_from_Robert_N._Dennis_collection_of_stereoscopic_views.png has only "Digital item published 5-5-2005; updated 2-12-2009." which doesn't assert PD-old. There is an NYPL page about the collection which may hold clues about why the collection is PD. I think after we clear why the collection is PD we should create a template stating why it is PD, which goes along the PD scan. --Diaa abdelmoneim (talk) 10:58, 4 July 2009 (UTC)

  • I agree, the NYPL image metadata does not generally contain sufficient information to clearly establish their copyright status. I have only the word of the NYPL that these are public domain, and they may not be as conservative in evaluating copyright status as we are. I don't really want to filter them before upload though, because I'm fairly confident most of these actually are PD and are just missing the metadata to prove it. There are two things I can do here: I can fetch the "Imprint" date from the collection, and I can tag any images that do not have a clear indicator of copyright status for human review with Category:PD files for review. This could prove to be rather difficult though, because dates are specified in a variety of strange formats that are difficult to parse. Dcoetzee (talk) 22:15, 4 July 2009 (UTC)
    • Or just an OTRS confirmation, or a rights information page on their site saying "no known restrictions". Don't tag anything please. I'm sure all images are PD but only need a legal confirmation.--Diaa abdelmoneim (talk) 22:19, 4 July 2009 (UTC)
      • As far as I know OTRS is inappropriate for public domain images - that's for the copyright holder confirming that they've released a work, and NYPL is not the copyright holder. Their copyright status will need to be confirmed based on the available information, and PD review has already agreed to help me with this kind of thing in the past. As for "no known restrictions", every one of these image description pages says that in its HTML metadata - their evaluation can't be trusted. Dcoetzee (talk) 23:30, 4 July 2009 (UTC)
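One way to approach the date problem described above is to pull every plausible four-digit year out of the date text and flag the file for human review unless all of them fall before the {{PD-1923}} cutoff. The sketch below is a simplified illustration of that idea, not the tool actually used. The function name, the accepted year range, and the sample strings (other than the "Digital item published" line quoted earlier) are assumptions.

```python
import re

# Four-digit years from 1500 to 2099 that are not part of a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def needs_pd_review(date_text: str, cutoff: int = 1923) -> bool:
    """Return True if the text does not clearly show a date before the cutoff.

    Files with no recognizable year, or with any year at or after the
    cutoff, are flagged for human review instead of being guessed at.
    """
    years = [int(y) for y in YEAR.findall(date_text)]
    if not years:
        return True
    return max(years) >= cutoff


# Hypothetical date strings in a few of the awkward formats one might meet.
for text in ["[ca. 1870]", "1865-1880", "18--?", "n.d.",
             "Digital item published 5-5-2005; updated 2-12-2009."]:
    print(f"{text!r}: review needed = {needs_pd_review(text)}")
```

This deliberately errs on the side of flagging: anything it cannot read as an unambiguous pre-1923 date goes to Category:PD files for review.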
Status

What's the status of this upload? Multichill (talk) 12:29, 17 September 2009 (UTC)

Sorry for the delay. I'm working on getting a Toolserver account so I can continue the upload with my existing tools and Mono, or with a rewrite of the tools. It should be able to pick up right where I left off. I don't have enough bandwidth at home to do the upload. Dcoetzee (talk) 08:48, 25 September 2009 (UTC)
  • Any update ?--Diaa abdelmoneim (talk) 12:42, 11 December 2009 (UTC)
    • The NYPL upgraded their software and it's no longer possible with the new default settings to download the images in the same manner in which I originally did, so I've been forced to suspend progress on this. I asked Josh from NYPL about this, and on Jan 15 he said: "No progress on that front, but I might actually be able to open another door within the next month or so (might just be able to get you direct access to a batch of jpg full-res derivatives)...will follow up with details soon..." Dcoetzee (talk) 12:32, 31 January 2010 (UTC)
  • I just checked and at some point in the last few months the NYPL listened and re-enabled the SID interface, allowing this upload to continue, so I'm starting it back up. Dcoetzee (talk) 00:02, 16 April 2010 (UTC)
    • Finally!!! Please change the status to uploading when u do. =) Congrats.--Diaa abdelmoneim (talk) 07:54, 16 April 2010 (UTC)
      • Done :-) I can only upload at a fast rate when I'm at school since my upload bandwidth at home sucks - but I'm there pretty often and my updated tool uploads at a rate of about 5-6 image pairs per minute there. Dcoetzee (talk) 10:12, 16 April 2010 (UTC)
      • Another small update on this - it turns out I've only been uploading the fronts of these cards, and not the back. This is probably a good thing, since the backs are usually just blank with a bit of writing, and not nearly as useful for educational purposes. Because of this, there are actually only 42,000 images, not 84,000, in the stereographic collection. Dcoetzee (talk) 00:42, 17 April 2010 (UTC)
  • Still working on this upload. I'm bandwidth-limited at the moment so it's taking quite a while. It's probably more than half done. Dcoetzee (talk) 02:07, 9 November 2010 (UTC)
    • I've now finished all of the stereographic views from the New York Public Library that were supplied by Josh Greenberg. I will contact Josh to see what other images he has to offer. I'm open at this point to feedback about how I can improve the process (besides obviously uploading images more quickly - I think this is a good time to port the tool to Toolserver). I'm also considering uploading only high-quality JPEGs, instead of both a JPEG and a PNG version. Let me know what you think. Dcoetzee (talk) 07:19, 14 November 2010 (UTC)

Hello. What is the status? InverseHypercube (talk) 05:51, 2 April 2011 (UTC)

This project has probably seen its better days, but I found this through searching for NYPL images. I tried all the parameters given in the LizardTech Express 8 manual with this image http://digitalgallery.nypl.org/nypldigital/id?1527362, but I simply am not able to download it. Even doing what the manual says on downloading the file itself (getitem?cat=*&item=*.sid) or with the parameters of width and height, I only get "Invalid dimensions".
I'm especially interested in all the New York real estate maps, including the famous Sanborn Maps, a collection with unprecedented detail of buildings throughout the years.
The system NYPL uses is almost misanthropic. If the data is free and open, then it shouldn't be behind artificially restrictive systems. And then they put a fee on their own file-acquisition service.
I'd be glad to help, but unfortunately I have no idea how to code a bot, and I have very few coding skills. As to file formats, I'm partial to retaining the highest possible quality. Barring TIFFs, a lossless PNG would be the next choice. ~ Nelg (talk) 23:25, 31 March 2013 (UTC)