In this chapter, I’ll describe a webbot that identifies and downloads all of the images on a web page. This webbot also stores images in a directory structure similar to the directory structure on the target website. This project will show how a seemingly simple webbot can be made more complex by addressing these common problems:
Finding the page base, the address against which all relative addresses on the page are resolved
Dealing with changes to the page base, caused by page redirection
Converting relative addresses into fully resolved URLs
Replicating complex directory structures
Properly downloading image files with binary formats
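To make these problems concrete, here is a minimal sketch of the address handling involved. It is written in Python purely for illustration and is not the chapter's own code. The function names (`page_base`, `resolve`, `local_path`, `save_image`), the `images` root directory, and the `www.example.com` addresses are all hypothetical. The sketch assumes you already know the final URL of the page after any redirection, and the value of a `<base href>` tag if the page has one.

```python
import os
import posixpath
from urllib.parse import urljoin, urlsplit


def page_base(final_url, base_href=None):
    """Return the address that relative links are resolved against.

    final_url is the page address AFTER any redirects (assumed known);
    base_href is the value of a <base href="..."> tag, if the page has one.
    """
    return urljoin(final_url, base_href) if base_href else final_url


def resolve(base, src):
    """Convert a relative image address into a fully resolved URL."""
    return urljoin(base, src)


def local_path(image_url, root="images"):
    """Map an image URL to a local path that mirrors the remote directories."""
    parts = urlsplit(image_url)
    # Normalize the path and drop the leading slash so it nests under root.
    remote_path = posixpath.normpath(parts.path).lstrip("/")
    return posixpath.join(root, parts.netloc, remote_path)


def save_image(data, path):
    """Write raw image bytes to path, creating parent directories first."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # Binary mode ("wb") prevents newline translation from corrupting the file.
    with open(path, "wb") as handle:
        handle.write(data)


if __name__ == "__main__":
    base = page_base("http://www.example.com/products/index.html")
    url = resolve(base, "../img/logo.gif")
    print(url)
    print(local_path(url))
```

In this sketch, the page base is taken from the post-redirect address rather than the address originally requested, since a redirect can change the directory that relative links point into. Opening the output file in binary mode is what keeps image data intact on platforms that translate line endings in text mode.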
In Chapter 17, you’ll expand on these concepts to develop a spider that downloads images from an entire website, not just one page.