Scraping Zoomify objects can sometimes lead to problems. This page runs through the most common ones.
How are Zoomify objects arranged?
Zoomify objects present on one page (let's say www.site.org/gallery/zoomify_page_1.html) draw on resources in another part of the site to construct the image. These resources are:
- The Zoomify tiles, little square images that are pieced together to make the image you see in the Flash applet. They are divided into folders (TileGroups) of 256 tiles.
- The ImageProperties.xml file, which holds vital information used in constructing the image (e.g. width and height)
The image folders and the XML file are contained in a "base directory". This is an entirely separate web directory, and could even be on a different website! Let's say our example base directory is www.site.org/images/zoomify/1/. The XML file is then located at www.site.org/images/zoomify/1/ImageProperties.xml, and image tiles are at www.site.org/images/zoomify/1/TileGroup0/0-0-0.jpg and so on.
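The layout described above can be sketched in a few lines of Python. The function names here are made up for illustration, and reading the tile name as zoom-column-row is an assumption about the usual Zoomify naming rather than a description of Dezoomify's own code.

```python
def properties_url(base_dir):
    """URL of ImageProperties.xml for a given base directory."""
    return base_dir.rstrip("/") + "/ImageProperties.xml"


def tile_url(base_dir, tile_group, zoom, col, row):
    """URL of one tile; the file name is assumed to be zoom-column-row.jpg."""
    return "%s/TileGroup%d/%d-%d-%d.jpg" % (
        base_dir.rstrip("/"), tile_group, zoom, col, row)


# The example base directory from the text above.
base = "http://www.site.org/images/zoomify/1/"
print(properties_url(base))
print(tile_url(base, 0, 0, 0, 0))
```

Everything is derived from the base directory alone, which is why the display page is not needed once the base directory is known.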
So, the base directory is the important location. Dezoomify gets all the data and images from there; the display page is just a gateway that Dezoomify uses to find the base directory, which saves you from having to look it up yourself.
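To give an idea of the data involved, here is a minimal sketch of reading the image dimensions out of an ImageProperties.xml file. It assumes the usual Zoomify attribute names (WIDTH, HEIGHT, TILESIZE) on a single root element. The sample text is made up and trimmed down, and read_properties is a hypothetical helper, not part of Dezoomify.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed-down sample; a real file lives at
# <base directory>/ImageProperties.xml and may carry more attributes.
SAMPLE = '<IMAGE_PROPERTIES WIDTH="2000" HEIGHT="1500" TILESIZE="256" />'


def read_properties(xml_text):
    """Return width, height and tile size as integers (assumed attribute names)."""
    root = ET.fromstring(xml_text)
    return {
        "width": int(root.attrib["WIDTH"]),
        "height": int(root.attrib["HEIGHT"]),
        "tile_size": int(root.attrib["TILESIZE"]),
    }


props = read_properties(SAMPLE)
print(props["width"], props["height"], props["tile_size"])
```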
The base directory cannot be determined from the display page
The normal method of using Dezoomify is:
python dezoomify.py -i http://www.site.org/gallery/zoomify_page_1.html -o C:\file.jpg
This searches the HTML of that page for something reading "zoomifyImagePath=/url/goes/here", which tells Dezoomify where the base directory is.
However, not all usages of Zoomify are that simple, and some have a complex URL that doesn't get parsed properly by Dezoomify. Others use a modified Zoomify Flash applet that superimposes markers or other information. Either way, Dezoomify cannot find the base directory correctly.
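The kind of search involved can be sketched as follows. This is not Dezoomify's actual code: the pattern, the function name and the sample page source are all assumptions made for illustration, and the sketch targets Python 3.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: take everything after "zoomifyImagePath=" up to a
# quote, ampersand, angle bracket or whitespace.
PATH_PATTERN = re.compile(r"""zoomifyImagePath=([^"'&<>\s]+)""")


def find_base_directory(page_url, html):
    """Return the absolute base directory URL, or None if no path is found."""
    match = PATH_PATTERN.search(html)
    if match is None:
        return None
    # The path may be relative to the display page, so resolve it.
    return urljoin(page_url, match.group(1))


# Made-up snippet of page source; "other=1" is a placeholder parameter.
sample_html = '<param name="FlashVars" value="zoomifyImagePath=/images/zoomify/1/&other=1">'
page = "http://www.site.org/gallery/zoomify_page_1.html"
print(find_base_directory(page, sample_html))
```

A simple pattern like this is exactly what breaks on unusual pages: if the path is built up by a script, or has extra text attached to it, the match will be missing or wrong.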
In these cases, the best solution is to determine the base directory manually.
You don't have enough RAM
Dezoomify constructs the image in your RAM as a big bitmap. This is very memory intensive, and it will fail if you don't have enough RAM in your computer. There is no fix for this at full size other than upgrading your computer, asking a friend with a RAMful computer to do it for you, or rewriting the code to do this in place on the hard drive (which I'm not going to attempt, though patches are welcome).
You can also use a lower zoom level to get a smaller image that will fit in your RAM.
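To judge which zoom level will fit, a rough estimate is enough: an uncompressed RGB bitmap takes about width x height x 3 bytes. The sketch below also assumes that each lower zoom level is half the size of the one above it, down to a single tile, which is the usual Zoomify arrangement. The exact rounding and the function names are assumptions, so treat the numbers as approximate.

```python
import math


def bitmap_bytes(width, height, bytes_per_pixel=3):
    """Approximate size of an uncompressed bitmap (3 bytes per pixel for RGB)."""
    return width * height * bytes_per_pixel


def level_sizes(width, height, tile_size=256):
    """List (width, height) per zoom level, smallest first, halving each step."""
    sizes = [(width, height)]
    # Keep halving until the smallest level fits in a single tile.
    while sizes[0][0] > tile_size or sizes[0][1] > tile_size:
        w, h = sizes[0]
        sizes.insert(0, (math.ceil(w / 2), math.ceil(h / 2)))
    return sizes


# Made-up image dimensions for illustration.
for level, (w, h) in enumerate(level_sizes(20000, 15000)):
    print(level, w, h, "%.1f MB" % (bitmap_bytes(w, h) / 1e6))
```

Dropping one zoom level cuts the memory needed to roughly a quarter, so even one step down is often enough.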
Determine the base directory manually
Easy, fallible way
If you need to determine the base directory manually, the first thing to do is check the page source and search for "zoomifyImagePath". This might reveal a relatively simple base directory location that Dezoomify couldn't work out.
For example, the page http://www.hampel-auctions.com/en/A84/78010029/onlinecatalog-zoom/ has the following code:
However, the actual Zoomify base directory is http://www.hampel-auctions.com/img/auktionen/A84/z/78010029, and Dezoomify wouldn't be able to guess that the "xml" suffix doesn't belong.
Harder, works-every-time way
The harder but reliable way is to check the HTTP requests for the files that the Zoomify applet is demanding from the server. This sounds very techy, but it isn't that hard.
You need the Firefox browser and the Firebug plugin. Open Firebug on the Zoomify page (click the little insect logo in the bottom-right of your screen), and navigate to the "Net" panel. Reloading the page will cause the panel to fill with the requested files, one of which is the XML info file (it will say "GET ImageProperties.xml", and hovering on it will pop up the full URL). From that you can get the base directory easily (it's the XML URL without the /ImageProperties.xml on the end).
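If you would rather not trim the URL by hand, the last step is a one-liner in Python. The function name is made up for illustration, and the sketch assumes the URL ends in exactly /ImageProperties.xml with no query string after it.

```python
def base_directory(xml_url):
    """Strip the trailing /ImageProperties.xml from the XML file's URL."""
    suffix = "/ImageProperties.xml"
    if not xml_url.endswith(suffix):
        raise ValueError("not an ImageProperties.xml URL: " + xml_url)
    return xml_url[:-len(suffix)]


# The example XML location used earlier on this page.
print(base_directory("http://www.site.org/images/zoomify/1/ImageProperties.xml"))
```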
Contributing to Dezoomify
I welcome any patches or improvements to Dezoomify. The things I am really looking for are:
- Ways to make the base directory deduction more reliable. If you have an example of a page where it fails, please leave a note on my talk page, and I'll try to incorporate that page's structure into later versions.
- Ways to construct large images on-disk rather than in-RAM. Drop me a note with the suggested improvements, and I'll be happy to include them!
- A GUI!
Any other improvements you can think of are also welcome.