Commons:Batch uploading/Great Images in NASA

Great Images in NASA

I was thinking that we should do an import of all the Great Images in NASA (GRIN), http://grin.hq.nasa.gov/, for the following reasons:

They are historic.
They are often restored from the originals and of much higher quality than the images in the other NASA galleries.
They are easy to scrape.

User:Multichill has offered to assist in the actual uploading, and I'll write the HTML scraper to prepare the descriptions.

Example page: http://grin.hq.nasa.gov/ABSTRACTS/GPN-2000-000038.html

Scripts used

All my scripts are released under the GPL. The upload itself is done by BotMultichillT.

First I created a scraper tool for the Apache file listing here: http://grin.hq.nasa.gov/ABSTRACTS/

User:TheDJ/getfiles.pl

This scraper created a list of all the ID numbers used on the GRIN website.

User:TheDJ/gpn.txt
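
To illustrate the idea, here is a minimal sketch of pulling IDs out of a directory listing. It is not the actual getfiles.pl: the helper name extract_gpn_ids, the sample listing fragment, and the ID pattern (inferred from the example page GPN-2000-000038.html) are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the GRIN IDs linked from the HTML of a
# directory listing. The pattern GPN-YYYY-NNNNNN is an assumption inferred
# from the example page name; adjust it if other ID shapes exist.
sub extract_gpn_ids {
    my ($html) = @_;
    my %seen;
    my @ids;
    while ( $html =~ m/href="(GPN-\d{4}-\d{6})\.html"/gi ) {
        # Keep the first occurrence of each ID, in listing order.
        push @ids, $1 unless $seen{$1}++;
    }
    return @ids;
}

# Usage with a small made-up listing fragment.
my $listing = <<'HTML';
<a href="GPN-2000-000038.html">GPN-2000-000038.html</a>
<a href="GPN-2000-000039.html">GPN-2000-000039.html</a>
<a href="GPN-2000-000038.html">GPN-2000-000038.html</a>
HTML

print "$_\n" for extract_gpn_ids($listing);
```

Writing one ID per line gives a plain text list that a second script can read back in.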

Then I wrote a scraper for the ABSTRACT pages of the GRIN website. It was loosely inspired by a script of prolineserver, but in the end I basically wrote an entirely new script.

User:TheDJ/grin_importer.pl

It uses the following categories:

Category:Great Images in NASA - All images
Category:GRIN images detection errors - images that saw the detection of one of the fields fail
Category:GRIN images requiring copyright evaluation - images that are possibly not PD-USGov-NASA (usually these will be another government agency)
Category:GRIN possible dupes - these images are likely duplicates of already uploaded files.

It creates:

User:TheDJ/bot.sh - this is the primary output of grin_importer.pl. It is a bash script that User:Multichill runs in order to upload all the images and their descriptions.
errorlog.txt - contains any errors encountered by grin_importer.pl.

A lot of work needs to be done after this. Some licensing issues have to be verified, and a lot of category work is needed, of course. There will also be a lot of dupes.

User:TheDJ/grinfiles is a list of all files uploaded.

Based on this list, a check is run for exact duplicates, which will be deleted. The other duplicates (for various reasons) will be judged and handled by hand.
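
The page does not say how the check for exact duplicates is implemented. One common approach is to compare content digests, so the following is only a sketch under that assumption: it works on local copies of the files, uses SHA-1 from the core Digest::SHA module, and the helper names group_by_sha1 and exact_duplicates are made up for illustration.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper: group file paths by the SHA-1 digest of their contents.
sub group_by_sha1 {
    my (@paths) = @_;
    my %by_digest;
    for my $path (@paths) {
        my $sha = Digest::SHA->new(1);    # 1 selects SHA-1
        $sha->addfile( $path, 'b' );      # 'b' reads the file in binary mode
        push @{ $by_digest{ $sha->hexdigest } }, $path;
    }
    return \%by_digest;
}

# Return one array reference per group of byte-identical files.
sub exact_duplicates {
    my (@paths) = @_;
    my $groups = group_by_sha1(@paths);
    return grep { @$_ > 1 } values %$groups;
}

# Usage: perl find_dupes.pl image1.jpg image2.jpg ...
for my $group ( exact_duplicates(@ARGV) ) {
    print join( ' == ', sort @$group ), "\n";
}
```

Files that differ in even one byte (a re-encoded or resized copy, for example) get different digests, which is why the remaining duplicates still have to be judged by hand.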

Lessons learned

This was my first scrape. I had little Perl experience but a lot of C experience, and it was fairly easy to do. I was also blessed with a fairly "simple" case that I wanted to scrape.

My Perl script uses a trim() function to remove whitespace from the start and end of the scraped metadata strings.
Remember that your description string can contain ", which may break the bash script. Escape it with $description =~ s/\"/\\\"/g;
Filenames need to avoid the following characters: # : / ` ".
The " characters in my filenames were accepted, because they were interpreted as a sort of concatenation.

However, ` broke the script.

# and : were automatically replaced with -.

/ caused the file to be uploaded under the part of the filename AFTER the /.
When you scrape data, you might encounter patterns that span multiple lines. Use match (m//) with a pattern that can cross line breaks: m/Creator\/Photographer:<\/B>([\s\S]*?)<LI>/mi. Note that the multiline modifier (/m) only changes where ^ and $ match; it is the [\s\S]*? part that lets the match run across several lines.
When scraping data, remember that . matches any character BUT '\n' (unless you add the /s modifier), so you might have to use ([\s\S]*?) to capture the data.
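
The string-handling lessons above can be sketched as a few small helpers. This is an illustration, not the code from grin_importer.pl: the names shell_quote, sanitize_filename and extract_creator are hypothetical, the sample HTML is made up, and shell_quote deliberately swaps in a different technique (single-quoting for bash) instead of escaping only the " character, because single quotes also neutralise ` and $.

```perl
use strict;
use warnings;

# Strip leading and trailing whitespace from a scraped string.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# Alternative to escaping only ": wrap the value in single quotes for bash,
# so that ", `, $ and \ are all taken literally. An embedded ' is written
# as '\'' (close the quote, add an escaped quote, reopen the quote).
sub shell_quote {
    my ($s) = @_;
    $s =~ s/'/'\\''/g;
    return "'$s'";
}

# Replace the characters that caused trouble in filenames with "-".
# Using "-" for all five is an assumption, modelled on what happened
# to # and : during the upload.
sub sanitize_filename {
    my ($name) = @_;
    $name =~ s{[#:/`"]}{-}g;
    return $name;
}

# Capture a field that may span several lines. [\s\S]*? crosses newlines;
# (.*?) together with the /s modifier would do the same.
sub extract_creator {
    my ($html) = @_;
    return $html =~ m/Creator\/Photographer:<\/B>([\s\S]*?)<LI>/i
        ? trim($1)
        : undef;
}

# Usage with a made-up fragment whose field value spans two lines.
my $html = "<LI><B>Creator/Photographer:</B> NASA\n  John \"Doe\"\n<LI><B>Date:</B>";
my $creator = extract_creator($html);
print shell_quote($creator), "\n";
print sanitize_filename('Apollo 11: "Eagle" #1/2.jpg'), "\n";
```

Quoting at the point where the bash script is generated keeps the scraped strings themselves unchanged, so the same values can still be used for the wiki description text.
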
Because this was a fairly "small" set of images (1400) that I knew contained quite a few images that might already be on Commons, I added a fairly simple detection routine for possible dupes.

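# Check for possible dupes.
# $browser (an HTTP client object used like LWP::UserAgent) and $gpnid
# (the GRIN ID of the current image) come from earlier in the script.
  # Full-text search of the File: namespace (6) for the GRIN ID.
  my $searchquery = "http://commons.wikimedia.org/w/api.php?action=query&list=search&srwhat=text&srnamespace=6&format=xml&srsearch=".$gpnid;
  my $searchresult = $browser->get( $searchquery );
  my $duperesult = "";
  if( $searchresult->is_success ) {
    $searchresult = $searchresult->content();
    # An empty result set appears as <search />; anything else means a hit.
    if( $searchresult !~ m/<search\ \/>/i ) {
      print "possible DUPE\n";
      # Appended to the description so the upload lands in the review category.
      $duperesult = "\n[[Category:GRIN possible dupes]]";
    }
  }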

It works by searching Commons for the GPN ID and then checking whether the result of the query is "empty". If it's not empty, it adds a category that I intend to empty after adding other_versions and {{Duplicate}} tags to any real dupes. TheDJ (talk) 21:59, 8 April 2009 (UTC)

Note that Commons now detects exact duplicate images on upload using a hash comparison (as long as you don't have "ignore warnings" on). This can be a handy way of eliminating dupes at upload time. Dcoetzee (talk) 02:16, 9 April 2009 (UTC)

The uploads have been completed. Now the various categories of errors and duplicates need to be checked (I'll likely do most of this myself). Everyone is welcome to participate in categorizing though. :D TheDJ (talk) 16:54, 11 April 2009 (UTC)

Today I finally finished processing and filtering out all potential duplicates. This manual work was a drag, and I spread it out over several months. I finished the last 120 images this past week. TheDJ (talk) 15:12, 24 August 2010 (UTC)

Assigned to	Progress	Bot name
User:TheDJ	Completed	BotMultichillT
