Commons:Batch uploading/Images from LIFE

Images from LIFE

Google announced putting online 10 million images from LIFE. There are 1000s of very valuable public domain images hosted by Google. We can upload everything prior to 1889 when the photographer is not known, plus a lot of specific cases. See Commons:Village_pump#LIFE_Photo_Archive for the discussion. Yann (talk) 21:52, 8 April 2009 (UTC)

Please tag the image with {{LIFE}} if you upload images manually. Images will be in Category:Images from LIFE Photo Archive.

Opinions:

There are many problems here:
  1. Google doesn't like scraping (mostly because scraping their content is usually illegal, but also because many people do it irresponsibly) and is actively trying to prevent it.
  2. For any query on the LIFE repository they only return a maximum of 200 images.
  3. The high-res images contain huge ugly TIME watermarking in the bottom-right corner.
My thoughts:
  • Enumerating all search results is not an issue.
  • The LIFE logo is translucent, and together with the low-res version, I believe we may be able to systematically eliminate it with image processing. I can look at this.
  • They ain't all that high res, maybe 1 megapixel, but still acceptable.
  • For the US works, everything up to 1922 is clear {{PD-1923}}. The author's death date is only a concern for works 1923-1938, or non-US works. (update: this statement only applies to the published works, not the unpublished works, which have to meet {{PD-US-unpublished}})
  • Google seems to know what's going on with the licenses - they've marked some "personal noncommercial use only" and others do not carry that restriction.
Dcoetzee (talk) 02:14, 9 April 2009 (UTC)
I've now created a short C# program that can remove the LIFE logo automatically (e-mail me if interested). It works on both color and black and white images. I'm going to get to retrieving these. I also see now that enumerating them may turn out to be difficult - I have a plan though. Dcoetzee (talk) 07:03, 9 April 2009 (UTC)
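For readers who cannot run the C# tool, the general idea of removing a translucent logo can be sketched as inverting alpha blending. This is not Dcoetzee's actual program, whose method is not described here. The sketch assumes the logo's pixel values and per-pixel opacity (`logo` and `alpha`, both hypothetical inputs) are already known, and it works on a grayscale image stored as nested lists.

```python
def remove_translucent_logo(image, logo, alpha, top, left):
    """Invert alpha blending: observed = alpha*logo + (1 - alpha)*original.

    image: 2D list of grayscale values (0-255) containing the watermark.
    logo:  2D list of the logo's grayscale values (assumed known).
    alpha: 2D list of per-pixel logo opacity in [0, 1) (assumed known).
    top, left: position of the logo's top-left corner inside the image.
    Returns a new image; the input is not modified.
    """
    result = [row[:] for row in image]
    for y, (logo_row, alpha_row) in enumerate(zip(logo, alpha)):
        for x, (logo_value, a) in enumerate(zip(logo_row, alpha_row)):
            if a >= 1.0:
                continue  # fully opaque pixel: the original value is lost
            observed = image[top + y][left + x]
            original = (observed - a * logo_value) / (1.0 - a)
            # Clamp to the valid grayscale range after rounding.
            result[top + y][left + x] = min(255, max(0, round(original)))
    return result


# Toy example: a white (255) logo pixel at 50% opacity over the corner pixel.
watermarked = [[100, 50], [200, 143]]
cleaned = remove_translucent_logo(watermarked, [[255]], [[0.5]], top=1, left=1)
```

Because JPEG compression perturbs the observed values, an inversion of this kind cannot be exact on real files and would be expected to leave some residual noise where the logo was.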
As it is said on the VP, many of these copyright claims are copyfraud, i.e. bogus. We need to determine the copyright status by ourselves. I am on Linux, so I can't use your program. I am currently collecting photos of Gandhi. One megapixel is already much better than what we usually have. Yann (talk) 07:57, 9 April 2009 (UTC)
I have a plan to collect all of them - it basically involves doing a breadth-first search of the label graph. I can reasonably assume the graph has enough edges that it has a single connected component. The tags are available via JSON POST lookups. Working on it now. Dcoetzee (talk) 08:01, 9 April 2009 (UTC)
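The breadth-first enumeration described above could look roughly like the following. The `fetch_label` callable is a hypothetical stand-in for the JSON POST lookup, whose real request format is not documented here; it is assumed to return the image IDs carrying a label together with the labels related to it.

```python
from collections import deque


def enumerate_image_ids(seed_label, fetch_label):
    """Breadth-first search over the label graph, collecting image IDs.

    fetch_label(label) is a hypothetical callable returning a pair
    (image_ids, related_labels) for one label.
    """
    seen_labels = {seed_label}
    queue = deque([seed_label])
    image_ids = set()
    while queue:
        label = queue.popleft()
        ids, related = fetch_label(label)
        image_ids.update(ids)
        for neighbour in related:
            if neighbour not in seen_labels:
                seen_labels.add(neighbour)
                queue.append(neighbour)
    return image_ids


# Toy in-memory graph standing in for the remote service.
toy_graph = {
    "gandhi": (["a1", "a2"], ["india", "1944"]),
    "india": (["a2", "b7"], ["gandhi"]),
    "1944": (["c3"], ["gandhi", "india"]),
    "isolated": (["z9"], []),
}
all_ids = enumerate_image_ids("gandhi", toy_graph.__getitem__)
```

An image reachable only from a label outside the seed's connected component (like `z9` in the toy graph) is never visited, which is why the single-component assumption matters.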
Could you remove the watermark of this one File:Gandhi seated Juhu 1944.jpg please? Yann (talk) 08:02, 9 April 2009 (UTC)
No problem. :-) FYI, in the future please upload the original version before uploading cleaned up or cropped versions - my tool only works on the original version and we can't trust Google to keep them at a stable URL. I went ahead and grabbed the original and cleaned it and re-cropped it. My tool leaves a little "noise cloud" due to JPEG artifacts around the original watermark, but it should hopefully be acceptable. I'm now running my script to gather the IDs of all the images. :-) Dcoetzee (talk) 08:04, 9 April 2009 (UTC)

I'm still collecting IDs at this point with my label graph crawler - so far I have 8800. There's gonna be a lot of these, even if only a portion of them are public domain. Here are the steps I will follow:

  1. Get complete list of IDs.
  2. Retrieve full size versions of all images.
  3. Retrieve pages containing metadata for each image.
  4. Extract metadata and use to roughly filter PD from non-PD and to generate image descriptions.
  5. Do mass upload of initial unmodified versions.
  6. Run watermark removal tool on all images.
  7. Do mass upload of updated images.
  8. Manual categorization and placement.

Dcoetzee (talk) 11:39, 9 April 2009 (UTC)

It looks good. ;o) Just so that I don't duplicate your work, what are your criteria for public domain images? How long do you think it will take (I am impatient ;oD)? Yann (talk) 15:26, 9 April 2009 (UTC)
Whoa, this collection is much larger than I thought - the ID crawler has picked up over 54000 distinct images now and still running. I generally don't begin uploading until I've at least downloaded all images, to avoid setting off alarms at the source. The full-size image files are small, so hopefully it won't take too long, but we're looking at at least a week or so. As for license, that will be complex - everything before 1889 is in, but as far as I can tell they don't clearly mark publication dates. They do generally indicate location, so I will need to look into a lot of country-specific PD templates, especially for anonymous works. Dcoetzee (talk) 19:08, 9 April 2009 (UTC)
I am sure that this collection is huge. I can help for copyright rules. For European anonymous works, everything before 1938 is public domain. For Indian works, everything before 1948 is public domain. Yann (talk) 19:44, 9 April 2009 (UTC)
I don't know if this is true. Works published 1938 or earlier in the EU should be okay, but not merely created as far as I know. In fact, the works you already uploaded may not be valid public domain images (notice that your template refers to publication, not creation). We need more input on this, so I'm asking at Commons talk:Licensing. Please hold off on any more uploads from the LIFE photo archive until this is resolved. Dcoetzee (talk) 21:58, 9 April 2009 (UTC)
I have seen the images I uploaded in low resolution elsewhere, and they were not copies from LIFE. So they were obviously published. LIFE confirms what I suspected but I was not sure: that the photographer of these images is unknown. Yann (talk) 22:41, 9 April 2009 (UTC)
Unfortunately it's not really enough to know that they were published, you also need to know when they were published. The work may have been first published many years after it was created. This is a difficult situation and hopefully we can come up with an acceptable policy. Dcoetzee (talk) 22:47, 9 April 2009 (UTC)

And the news gets worse - sources say this collection is much, much larger than I imagined - on the order of 10 million images. At the current rate, it will require at least 2 months to enumerate them all. Dcoetzee (talk) 22:37, 9 April 2009 (UTC)

First of all, thanks a lot for your tremendous efforts. When I created the {{LIFE}} template and launched the discussion on the Village Pump, I never imagined people would react so quickly and so effectively!! In the absence of more detailed information from Google, we can only upload images based on their date of creation and/or the death date of their author. Here is a brief guide for determining PD status:
  • For US photos (marked as Location: US), any photograph hosted on Google and taken before 1889 is PD, unless the author is known to have died after 1939, which is extremely unlikely. It is safe to upload all of these using {{PD-US-unpublished}}.
  • For UK photos (marked as Location: United Kingdom), any anonymous photograph taken before 1939 is PD. Therefore, such photos should use {{PD-UK-unknown}} in addition to {{PD-US-unpublished}}. Remember that any work hosted on Commons should be PD in the US and not only in its source country, so {{PD-US-unpublished}} should be used for almost all images uploaded from Google.
  • For Canadian photos (marked as Location: Canada), all photographs taken before 1949 are PD, regardless of any other considerations such as publication date or author's death date (I love Canadian copyright law!). Therefore, such photos should use {{PD-Canada}} in addition to {{PD-US-unpublished}}, as explained above.
That's all for now. I'll do a bit of research regarding other countries in the next few days. Please also note that it is possible to upload author-based batches, not just date-based ones. This needs a little bit more research. We should try and come up with a list of photographers who died before 1939. Any photograph taken by such an author should be OK for the bot to upload. Regards. --BomBom (talk) 22:36, 9 April 2009 (UTC)
This seems a bit too optimistic to me. First, we have no information about which images were published or when. Consequently, some of the pre-1889 images may have been first published between 1939 and 2003 by LIFE, which means they're still in copyright. Additionally, I don't consider it unlikely for an author to live from 1889 to 1939, that's only 50 years (I generally use 1860 as a cutoff point for works of known authorship). Dcoetzee (talk) 22:42, 9 April 2009 (UTC)

I think we should not try to upload all images at once. It would be better to upload say 1000 images, so we can start categorization and review while the process continues. Yann (talk) 22:54, 9 April 2009 (UTC)

Although I understand your impatience, I consider this too risky. If Google discovers that their images are being disseminated, they may engage in technical measures to prevent me from acquiring the remaining images. There is no time limit. One possibility is that we may start filling in image description pages before uploading the images. The Flickr web tools do this, for example. I may also be able to speed up image enumeration if I hit them faster - right now I'm including pauses. Dcoetzee (talk) 23:07, 9 April 2009 (UTC)
That's an unnecessary worry. Google has much more important work than checking whether LIFE images are disseminated. Yann (talk) 12:34, 10 April 2009 (UTC)
Well in any case I can't upload anything until the license situation is sorted out. Please see Commons_talk:Licensing#Google_LIFE_images. Dcoetzee (talk) 00:01, 12 April 2009 (UTC)
I answered there, but I think it would be better to keep the discussion in one place only. Yann (talk) 16:13, 12 April 2009 (UTC)

So far I've enumerated 163000 of the image IDs. I've started downloading the images, retrieving the description pages, and removing watermarks in parallel. I'm not yet doing any license filtering. Dcoetzee (talk) 09:14, 10 April 2009 (UTC)

Update: Google is letting me blast down the images and IDs at full speed, so I've got about 12000 images so far, and about 234000 IDs. Dcoetzee (talk) 13:54, 10 April 2009 (UTC)
Let's hope their traffic analyzer doesn't report you before you have finished this :D TheDJ (talk) 14:53, 10 April 2009 (UTC)
Update: I've begun gathering the metadata for the images from the HTML pages into a database, so that we can quickly isolate classes of PD images and do things like generate author lists. Dcoetzee (talk) 03:37, 12 April 2009 (UTC)
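A metadata database of this kind might be organised along the following lines. This is only a sketch: the table layout, the column names and the sample rows are all made up for illustration, since the actual schema is not described here.

```python
import sqlite3

# Hypothetical schema; the real page fields may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE images (
        id           TEXT PRIMARY KEY,
        year_taken   INTEGER,  -- NULL when the page gives no date
        location     TEXT,
        photographer TEXT      -- NULL when no photographer is credited
    )
    """
)

# Made-up placeholder rows, only to make the queries runnable.
rows = [
    ("img1", 1885, "US", None),
    ("img2", 1910, "Canada", "Photographer A"),
    ("img3", 1941, "US", "Photographer A"),
    ("img4", 1930, "India", "Photographer B"),
]
conn.executemany("INSERT INTO images VALUES (?, ?, ?, ?)", rows)

# One class of images: everything taken before 1889.
pre_1889 = [r[0] for r in conn.execute(
    "SELECT id FROM images WHERE year_taken < 1889 ORDER BY id")]

# Author list with the number of images credited to each photographer.
author_counts = conn.execute(
    "SELECT photographer, COUNT(*) FROM images "
    "WHERE photographer IS NOT NULL "
    "GROUP BY photographer ORDER BY photographer").fetchall()
```

Keeping missing dates and missing photographers as NULL rather than as placeholder strings makes it straightforward to separate "no photographer credited" from a named photographer in later queries.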

Statistics on the first 370,000 images:

Property            Count    Percent
Taken Before 1889   1882     0.51%
Taken 1889-1922     3385     0.91%
Taken 1923-1938     3925     1.06%
Anonymous           105118   28.4%
United Kingdom      4607     1.24%
India               1246     0.34%
Canada              878      0.24%
Germany             2665     0.72%
France              3884     1.05%

Country | anonymous: PD before | published: PD before | unpublished: PD before
Canada 1949 1949 1949
Germany see photographer
France 1939 see photographer
India 1949 1949 see photographer
South Africa 1959 1959 1959
United Kingdom 1939 see photographer
USA 1923 1889


There are 1224 distinct identified photographers. I can immediately rule out any who took any picture in 1939 or later. This leaves only 194 photographers with a total of 4192 images. Here's the complete list: Photographers.

Most of these did not in fact die before 1939, and many have very little information about them. They will require research.
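The rule-out step described above (dropping every photographer with at least one picture from 1939 or later) can be sketched as follows. The record format is an assumption, and so is the choice to keep photographers who have no dated pictures at all, on the grounds that they need manual research anyway.

```python
def candidate_photographers(records, cutoff_year=1939):
    """Keep photographers with no dated picture from cutoff_year or later.

    records: iterable of (photographer, year_taken) pairs; either may be None.
    Returns {photographer: image_count} for the photographers that remain.
    """
    latest = {}   # latest known year per photographer
    counts = {}   # number of images per photographer
    for photographer, year in records:
        if photographer is None:
            continue  # uncredited image, not part of the photographer list
        counts[photographer] = counts.get(photographer, 0) + 1
        if year is not None:
            latest[photographer] = max(year, latest.get(photographer, year))
    return {
        p: n for p, n in counts.items()
        if p not in latest or latest[p] < cutoff_year
    }


# Placeholder records, not real data.
sample = [
    ("A", 1900), ("A", 1950),   # ruled out: active after the cutoff
    ("B", 1920), ("B", 1938),   # kept
    (None, 1880),               # uncredited
    ("C", None),                # kept: no dated pictures (assumption)
]
remaining = candidate_photographers(sample)
```

Passing this filter says nothing about when a photographer died; it only removes those who were demonstrably still working in 1939 or later.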

Before enquiring about the photographer, it is better to check the location. In several places (Canada, South Africa, India, etc.), an image is in the public domain if it is published or taken before a certain date, regardless of when the photographer died. Yann (talk) 08:56, 13 April 2009 (UTC)
I'm disconcerted at the prospect of uploading new images that are still copyrighted in the United States. This has long been counter to policy and I refuse to do so without specific prior consensus. Dcoetzee (talk) 18:02, 13 April 2009 (UTC)

Update on licensing

Per discussion at Commons talk:Licensing, I will only be uploading images published before 1923, and I will only be uploading images taken in the US, Canada, India, South Africa, or other countries where all works published before 1923 are in the public domain regardless of author. The only exception is works by authors known to have died before 1939. This is a necessarily conservative strategy. If in the future you identify any other images that are public domain (in the source country and the US - no exceptions), just leave me a note describing them and I will upload them with an appropriate justification. Dcoetzee (talk) 06:45, 14 April 2009 (UTC)
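As a rough illustration, the conservative strategy stated above could be expressed as a filter like the one below. The function name, the country strings and the use of a single `year` argument are assumptions made for this sketch; in particular, it does not model the "other countries" clause, which would need the country set extended case by case.

```python
# Countries named in the stated strategy; the exact strings are assumed.
ELIGIBLE_COUNTRIES = {"US", "Canada", "India", "South Africa"}


def is_upload_candidate(year_published, country, author_death_year=None):
    """Conservative filter following the stated upload strategy.

    year_published: year of first publication, or None if unknown.
    country: location string from the image metadata.
    author_death_year: year the author died, or None if unknown.
    """
    # Exception: works by authors known to have died before 1939.
    if author_death_year is not None and author_death_year < 1939:
        return True
    if year_published is None:
        return False  # no date, no basis for a pre-1923 claim
    return year_published < 1923 and country in ELIGIBLE_COUNTRIES


checks = [
    is_upload_candidate(1915, "US"),
    is_upload_candidate(1915, "France"),
    is_upload_candidate(1930, "Canada"),
    is_upload_candidate(1930, "France", author_death_year=1935),
    is_upload_candidate(None, "US"),
]
```

A filter like this can only produce candidates for review; the publication year it relies on is exactly the piece of information the archive often does not provide.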

Perhaps it would be an idea to put the metadata database online in some form after the initial upload? It might be possible for humans to identify works in the PD which are otherwise not easily identifiable. Just thinking out loud here. TheDJ (talk) 09:47, 14 April 2009 (UTC)
Yes, it would be very useful to have access to the metadata. Yann (talk) 09:58, 14 April 2009 (UTC)
I'm extremely reluctant to do so because there are people here eager to upload files that are not public domain in the United States, and I would be assisting them in their policy violations by making my data public. Dcoetzee (talk) 10:14, 14 April 2009 (UTC)
That's why I said metadata. I'd never ask you to put the photos themselves online as well. They would still have to ask you specifically for the photo. TheDJ (talk) 10:22, 14 April 2009 (UTC)
I considered that. Unfortunately, if they have access to the metadata, they can locate and download the images easily via the search interface. They're free to locate and upload images by themselves but I can't be a party to it. Dcoetzee (talk) 10:29, 14 April 2009 (UTC)
I would like to point out that the copyright status in the source country should not be given undue importance. The source country of a work is the country where it was first published or the country of residence of its author, not simply the country where it was created. Just because a photograph was taken in France, Azerbaijan or Nicaragua does not in any way mean that it is subject to French, Azerbaijani or Nicaraguan copyright laws respectively. Since in the overwhelming majority of cases here we lack the author's name and/or a publication date, I think we should not give much consideration to the source country issue, and simply concern ourselves solely with US copyright law. Regards. --BomBom (talk) 13:03, 14 April 2009 (UTC)
Good remark. I think that we should assume that images were published in the photographer's country of residence, unless known otherwise. Then remains the issue of anonymous images from outside USA. For them, it is quite logical to assume that they were first published in the country they were taken. Yann (talk) 15:14, 14 April 2009 (UTC)
I'm really frustrated by all the license fuzziness with this collection. With an upload like the NPG collection, authors were available for almost all images, and death dates for almost all of the authors. With LIFE, even authors are unavailable for 1/3 of the images, and most of these authors can't be tracked down at all. It's a good indication of the serious problems with copyright law worldwide that in the absence of proper recordkeeping, it becomes infeasible to determine whether or not works are public domain. Sure it was taken in France in 1915, but was it first published in the US in 1950? Or was the photographer from Sweden and it was first published in 2008? Who the hell knows. The only reasonable thing I can do is presume that pre-1923 works were published in the US and the country where they were taken prior to 1923; and that absence of author information does not indicate an anonymous work. Others might choose other assumptions, but any way you look at it, you might be wrong. Dcoetzee (talk) 18:49, 14 April 2009 (UTC)
I think that this reasoning is much too restrictive, but I am not going to argue any more. I will wait for pre-1923 images to get uploaded, and we will see after that. Yann (talk) 20:27, 14 April 2009 (UTC)

On reconsideration I will publish my database as soon as it's complete, in CSV format. The potential for positive uses in this case outweighs those for negative use. Dcoetzee (talk) 20:41, 14 April 2009 (UTC)

On third thought, I will not be publishing the database because of potential copyright concerns. Dcoetzee (talk) 20:07, 17 April 2009 (UTC)
Indeed, OTRS literally just got an email about this from Time-Life (ticket # 2009041710051354 for those with access). Thanks for holding off.--Chaser (talk) 20:10, 17 April 2009 (UTC)
TIME-LIFE has also been in contact with me and I'm hoping to reach an agreement about the set of images we can disseminate. I will take no further action until that discussion is resolved. Dcoetzee (talk) 20:34, 17 April 2009 (UTC)

Wow, that is the second batch that is running into this issue. The National Gallery, London also complained this past week. I wonder if someone (from wikipediareview?) informed them. Both collections were discussed in the IRC channel.... TheDJ (talk) 09:45, 18 April 2009 (UTC)

I think they probably monitor how their images are downloaded, and if they see a mass download by script, they react. They don't need to be informed by any external entity. I think it is much better that the legal issue is discussed with them now than two years later, when these images would be widely used across all projects. In addition, the outcome, whatever it is, will put our copyright policy on much firmer ground. Yann (talk) 10:15, 18 April 2009 (UTC)
NPG was probably informed by a user we recently encountered sympathetic to their claims. As for LIFE, they probably just turned up this page in a Google search. I am sending a list of candidate files to them now. Dcoetzee (talk) 13:13, 19 April 2009 (UTC)
Dcoetzee, can you please tell us what is going on with the batch upload of LIFE images? There is no update for the last 6 weeks. Thanks, Yann (talk) 14:59, 7 June 2009 (UTC)
LIFE has stopped responding to my mails despite multiple reminders and so I can't close the matter. The only images they've approved are the ones from the National Archives, which comprise a tiny fraction (no more than 100 images out of the 500,000) and which are available in higher resolution from elsewhere. For maybe 5% of the images they proved that they published them after 1923 - the remaining 95% they labelled as being of copyright status because they have no evidence of the date of first publication, even though they can't prove that they first published them after 1923. In short, at the present time, I've got nothing. Dcoetzee (talk) 22:45, 7 June 2009 (UTC)

Assigned to: Dcoetzee
Progress: All 831799 images downloaded; preparing for upload
Bot name: Dcoetzee