The batch upload script has been converted to run automatically once a day, to upload the latest available ESA images to Commons. See petscan report for images uploaded in the last week.
European Space Agency
Based on a suggestion at the Village Pump, a specific upload script has been designed to take information from ESA catalogue pages, initially for Copernicus Sentinel Data images. The script is in Python using Pywikibot core and BeautifulSoup. In its initial form it checks for images and will skip video.
Each file's copyright statement is checked to be "CC BY-SA 3.0 IGO", and this licence appears to apply to all catalogue text as well. The HTML of the catalogue pages includes a non-visible copyright meta tag reading "All rights reserved." The presumption is that this is a default value and is not intended to overrule the visible copyright statement. A general statement about ESA copyright is available at http://www.esa.int/spaceinimages/ESA_Multimedia/Copyright_Notice_Images.
The sample url is http://www.esa.int/spaceinimages/content/search/(offset)/0?SearchText=Copernicus+Sentinel+Data&sortBy=published where the offset number (0) is the number of the first image in the search. Grid pages are 16 images per page.
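The offset scheme described above can be sketched as follows. This is only an illustration built from the sample URL; the function names `search_url` and `page_offsets` are hypothetical and not taken from the actual script.

```python
from urllib.parse import quote_plus

# Base of the sample search URL quoted above.
SEARCH_BASE = "http://www.esa.int/spaceinimages/content/search/(offset)/"
PAGE_SIZE = 16  # grid pages show 16 images each


def search_url(offset, text="Copernicus Sentinel Data"):
    """Build the search URL whose first result is image number `offset`."""
    return f"{SEARCH_BASE}{offset}?SearchText={quote_plus(text)}&sortBy=published"


def page_offsets(total_images, page_size=PAGE_SIZE):
    """Offsets of every grid page needed to cover `total_images` results."""
    return list(range(0, total_images, page_size))
```

For the 228 images of the original search, `page_offsets(228)` yields the offsets 0, 16, 32 and so on, one per grid page.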
Critical data for the import is:
- title, taken from og:title
- date, taken from release date
- Id, taken from Id in catalogue
- description, from the description in the catalogue
- copyright, from the copyright statement in the catalogue
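As a rough sketch of the first item in this list, the title can be read from the og:title meta tag. The real script uses BeautifulSoup; the standard-library `html.parser` is swapped in here only so the example has no third-party dependencies. The class and function names and the sample markup are hypothetical, and the other fields are not shown because their position in the catalogue HTML is not described here.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags, keyed by their property or name attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        if key and attributes.get("content") is not None:
            self.meta[key] = attributes["content"]


def page_title(html):
    """Return the og:title value of a catalogue page, or None if absent."""
    parser = MetaCollector()
    parser.feed(html)
    return parser.meta.get("og:title")


# Hypothetical page fragment, for illustration only.
SAMPLE = '<html><head><meta property="og:title" content="Drought to overflow"></head></html>'
```

Because every meta tag is collected, the same dictionary could also be used to inspect the non-visible copyright meta tag mentioned earlier, whatever its attribute name turns out to be.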
For Commons the filename will be
"File:<title> ESA<id>.<ext>" where ext is expected to be jpg, gif, PNG or TIFF. The priority for selecting a format will be TIFF > PNG > gif > jpg. Because TIFF thumbnails are known to display poorly when reused on Wikipedia, where both a TIFF and an ESA-supplied jpg exist, both will be uploaded.
There are a number of tags, such as mission and keywords, which are added as additional values in the information template. As the initial run is for fewer than 300 images, the metadata import is being kept simple and categorization is limited to a bucket category, to be manually sorted later if desirable.
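The filename pattern and the format priority described above could be expressed roughly as below. This is a sketch, not the script's own code: the function names are hypothetical, and lower-case extensions (including "tiff" rather than "tif") are an assumption.

```python
# Preferred formats, best first: TIFF > PNG > gif > jpg.
PRIORITY = ["tiff", "png", "gif", "jpg"]


def formats_to_upload(available):
    """Pick the best available format; add the jpg when the best is a TIFF."""
    present = {ext.lower() for ext in available}
    best = next((ext for ext in PRIORITY if ext in present), None)
    if best is None:
        return []
    chosen = [best]
    # TIFF thumbnails may display poorly, so the ESA jpg is uploaded as well.
    if best == "tiff" and "jpg" in present:
        chosen.append("jpg")
    return chosen


def commons_filename(title, esa_id, ext):
    """Build the Commons filename 'File:<title> ESA<id>.<ext>'."""
    return f"File:{title} ESA{esa_id}.{ext}"
```

For example, `commons_filename("Drought to overflow", 373404, "gif")` gives the name of one of the uploads listed further down this page.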
- JPEG links are local, on the same domain as the search, but TIFFs are hosted on a different, external domain.
- The TIFFs may be large, so local SHA values will not be used to check for existing duplicates on Commons before attempting an API upload, because computing them requires downloading the file locally anyway. If the IDs are unique, then searching Commons for any "ESA<id>.<ext>" matches should be sufficient as a weaker and quicker test.
- Direct uploading by URL requires the relevant ESA image-hosting domains to be whitelisted; this has been done via Phab:T164643.
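The weaker ID-based test mentioned above might look something like this. It assumes the standard MediaWiki search API (`list=search`) with the `intitle:` search keyword; the function names are hypothetical, and the request itself is left to the caller so that the sketch stays offline.

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"


def search_request_url(esa_id):
    """URL of a Commons file-namespace search for titles containing ESA<id>."""
    params = {
        "action": "query",
        "list": "search",
        "srnamespace": 6,  # File: namespace
        "srsearch": f'intitle:"ESA{esa_id}"',
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def matching_titles(response, esa_id):
    """Filter a decoded search response down to titles ending in ' ESA<id>.<ext>'."""
    hits = response.get("query", {}).get("search", [])
    marker = f" ESA{esa_id}."
    return [hit["title"] for hit in hits if marker in hit["title"]]
```

The second filter is there because a full-text search can return near matches, for example a longer ID that merely starts with the same digits.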
Tested: 22:54, 5 May 2017 (UTC)
Out of the 228 images in the original suggested search, the following do not (currently) contain a CC BY-SA 3.0 IGO statement:
GIFs can frequently fail post-upload verification, for example:
- http://www.esa.int/spaceinimages/Images/2017/02/Drought_to_overflow (source)
- Successfully uploaded on a re-run File:Drought to overflow ESA373404.gif
- Successfully uploaded on a re-run File:Larsen crack ESA372459.gif
- Successfully uploaded on re-run File:Fire-scarred Madeira ESA365160.gif
- Successfully uploaded on re-run, see thumbnail on right.
The standard warning is "WARNING: API error stashfailed: This file did not pass file verification." The failure may be caused by the rendering limits on animated GIFs. However, repeating the upload attempt seems to bypass the problem, so it may have been down to transient WMF operational issues.
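Since a re-run usually succeeds, the retry could be automated along these lines. This is a generic sketch: `VerificationFailed` and the `attempt_upload` callable are hypothetical stand-ins, and how the stashfailed warning actually surfaces in Pywikibot is not assumed here.

```python
import time


class VerificationFailed(Exception):
    """Hypothetical error raised when an upload fails file verification."""


def upload_with_retries(attempt_upload, max_attempts=3, delay_seconds=0.0):
    """Call `attempt_upload` until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt_upload()
        except VerificationFailed:
            if attempt == max_attempts:
                raise  # give up and let the caller log the failure
            time.sleep(delay_seconds)
```

Only the verification failure is retried; any other exception propagates immediately, so genuine errors such as a copyright mismatch are not masked.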
Some digitally identical duplicates have been unavoidably created. These may occur when:
- A previously uploaded image makes no reference to the ESA image ID and is created under a different file name. The duplicate merge should retain the ESA image ID.
- An ESA image exists on the database in different languages, each with a unique image ID. Merging the duplicates should add the benefit of multiple language descriptions, though all ESA image IDs should be added to the description.
- Example: File:Αεροφωτογραφία της ESA ESA375724.jpg (Greek) and File:Aerial view of ESA’s technical centre ESA375662.jpg (English).
An SHA1 check using the Commons API can show these duplicates, but creating the SHA1 value means downloading a local copy of the image to create the checksum, which considering the size of some images would be too 'expensive' in bandwidth and time. The ESA catalogue does not separately publish an SHA1 value.
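For completeness, the SHA1 check that is being avoided here would look roughly like this. The checksum is computed in chunks so a large TIFF need not be held in memory, although it still has to be downloaded first. The query uses the MediaWiki `list=allimages` module with its `aisha1` parameter; the function names are hypothetical.

```python
import hashlib
import io
from urllib.parse import urlencode


def sha1_of_stream(stream, chunk_size=1024 * 1024):
    """Hex SHA1 of a binary stream, read one chunk at a time."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def duplicate_query_url(sha1_hex):
    """Commons API URL listing any existing files with this SHA1."""
    params = {
        "action": "query",
        "list": "allimages",
        "aisha1": sha1_hex,
        "format": "json",
    }
    return "https://commons.wikimedia.org/w/api.php?" + urlencode(params)


# Small in-memory example standing in for a downloaded file.
example_hash = sha1_of_stream(io.BytesIO(b"abc"))
```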
The API does flag uploads as duplicates during the upload-from-URL process. However, these errors are being suppressed so that the "exists-normalized" error can be skipped, i.e. so that both JPEG and TIFF can be uploaded under the same base filename. At the time of writing, the potential solution of using Pywikibot's error-trapping feature, which takes an array in site.upload, does not work and instead always causes fatal HttpConnection timeout failures.
Uploads with no descriptive text can be found via https://petscan.wmflabs.org/?psid=1047668. As of 24 May 2017, only one catalogue entry had been found with no descriptive text; by 25 May this had risen to 26 images. These can be fixed manually by adding a description.
TIFF MIME and tiffinfo errors
Several TIFFs have been rejected by the Commons API with verification-error: Files of the MIME type "text/html" are not allowed to be uploaded. Others fail with verification-error: The uploaded file contains errors: tiffinfo command failed. The following are examples; the list may not be complete, as it is an ad-hoc cut and paste from a terminal:
Downloading a sample locally and attempting to open it in image software appears to demonstrate file corruption, which suggests that the problem is at source rather than something created during the upload process. Failures from tiffinfo may be a question of some TIFFs using more obscure compression or an odd use of the format as a wrapper. There are previous WMF Phabricator tickets relating to tiffinfo (search), and these may be worth re-examining, potentially with a view to upgrading tiffinfo if that is the root cause of the upload failures, or to requesting changes to the error descriptions if TIFF transcoding is needed to avoid, for example, unacceptable file formats embedded within the TIFF wrapper.
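One cheap pre-upload check, offered here only as a suggestion, is to look at the first few bytes the source server returns. A classic TIFF starts with the byte-order mark "II" or "MM" followed by the number 42, so a payload that starts with markup instead would explain the text/html rejections. The function name is hypothetical, BigTIFF headers are not covered, and a valid header says nothing about corruption further into the file or about tiffinfo failures.

```python
def sniff_payload(first_bytes):
    """Classify a download by its leading bytes: 'tiff', 'markup' or 'unknown'."""
    head = first_bytes[:4]
    # Little-endian ("II", 42) and big-endian ("MM", 42) classic TIFF headers.
    if head in (b"II*\x00", b"MM\x00*"):
        return "tiff"
    # An HTML or XML error page starts with '<' once leading whitespace is dropped.
    if first_bytes.lstrip()[:1] == b"<":
        return "markup"
    return "unknown"
```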
Wrongly identified images
Some images may have been incorrectly identified, or the wrong image appears to have been added to the ESA database record.
22:54, 5 May 2017 (UTC) Dry run
- After adapting to PNG and GIF formats, only copyright failures have to be skipped.
07:32, 6 May 2017 (UTC) Full run populating Category:Copernicus Sentinel Satellite Imagery
- The domains are not whitelisted, so the upload run relies on client-side uploads. This means it's slow!
15:16, 6 May 2017 (UTC)
- Upload run adjusted so a parallel run handled jpg and png versions. All GIFs appear to be getting rejected due to verification errors. The TIFF images may take a long time; the run may not even complete before the domains are whitelisted for URL uploads.
- TIFFs much over 130 MB appear to fail using client-side uploading. This may be due to time-outs or other connection problems, as the uploads take over 20 minutes before failing. This should not be an issue once the upload sites are whitelisted, as server-side uploads will be an order of magnitude faster.
09:30, 8 May 2017 (UTC)
- A full re-run does not appear to be uploading missing files. These are the larger TIFFs, which fail with errors like
APIMWException: internal_api_error_UploadChunkFileException. The errors are probably due to the fairly low upload bandwidth being unable to sustain uploads of 300 MB to 600 MB files, and are likely to be resolved by the use of server-side uploads. UK domestic broadband deliberately throttles upload speeds, especially since the UK government became interested in forcing all ISPs to do more on anti-piracy.
08:36, 24 May 2017 (UTC)
- With the ESA sites now whitelisted, the code was re-run. This first uploaded the remaining files of 230 MB and over that had previously been rejected, then used a wider search for all CC-BY-SA-IGO licensed images. These less specific images are added to ESA images (review needed) for manual checks and categorization.
- A separate 'housekeeping' routine checks through the category for jpegs and TIFFs with the same filename and cross-links them by adding thumbnails to a gallery in the information table.
- During this upload the TIFF at http://www.esa.int/spaceinimages/Images/2012/03/Indonesian_islands was rejected by the API as a text/html MIME type. Later instances occurred, see Known Errors above.
- JPEGs are downloaded locally, then uploaded to Commons. This leaves them vulnerable to connection time-outs, and the urlretrieve call had to be wrapped in an error trap after time-outs forced two restarts. A second attempt on the problem files appears to invariably succeed, and there is no obvious pattern, such as size, that might be a cause apart from internet connection issues. Example: File:Features in the elongated crater ESA215938.jpg.
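The pairing step of the 'housekeeping' routine mentioned in this log entry could be sketched as below: group the category's filenames by base name and keep those that exist as both a JPEG and a TIFF. The function name is hypothetical, the TIFF filename in the usage example is invented for illustration, and writing the gallery into the information table is not shown.

```python
import os
from collections import defaultdict


def cross_link_pairs(filenames):
    """Return (jpeg, tiff) pairs of files that share the same base filename."""
    by_base = defaultdict(dict)
    for name in filenames:
        base, ext = os.path.splitext(name)
        by_base[base][ext.lower().lstrip(".")] = name
    pairs = []
    for base in sorted(by_base):
        versions = by_base[base]
        jpeg = versions.get("jpg") or versions.get("jpeg")
        tiff = versions.get("tiff") or versions.get("tif")
        if jpeg and tiff:
            pairs.append((jpeg, tiff))
    return pairs


# Usage example; the .tiff entry is a hypothetical counterpart.
pairs = cross_link_pairs([
    "File:Features in the elongated crater ESA215938.jpg",
    "File:Features in the elongated crater ESA215938.tiff",
    "File:Drought to overflow ESA373404.gif",
])
```

Each pair can then be cross-linked by adding the other file as a thumbnail on each file page.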