Smoke and Mirrors

Recursive Mirroring

  1. mirror document
  2. is it HTML?
  3. extract links
  4. recurse on links

The Trick Is Not to Recurse Too Far
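
The four steps, and the guard against recursing too far, can be sketched without any network access. This is only an illustration, not the script itself: the %SITE table, its paths, and the $TOP prefix are hypothetical stand-ins for real pages and their links, and a plain string-prefix test stands in for the URL comparison a real mirror would need.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory "site": each path maps to the links on that page.
my %SITE = (
    '/docs/index.html' => ['/docs/a.html', '/docs/sub/b.html', '/index.html'],
    '/docs/a.html'     => ['/docs/index.html', '/docs/logo.gif'],
    '/docs/sub/b.html' => ['/docs/a.html', '/other/c.html'],
    '/docs/logo.gif'   => [],
    '/index.html'      => ['/docs/index.html'],
    '/other/c.html'    => [],
);

my $TOP = '/docs/';   # assumption: only paths under this prefix are followed
my %DONE;             # pages already visited
my @ORDER;            # order in which pages were "mirrored"

sub visit {
    my $path = shift;
    return if $DONE{$path}++;                  # never fetch a page twice
    push @ORDER, $path;                        # 1. mirror the document
    return unless $path =~ /\.html?$/i;        # 2. is it HTML?
    for my $link (@{ $SITE{$path} || [] }) {   # 3. extract links
        next unless index($link, $TOP) == 0;   # stay inside the tree
        visit($link);                          # 4. recurse on links
    }
}

visit('/docs/index.html');
print "$_\n" for @ORDER;
```

Two checks keep the recursion bounded: the %DONE hash stops loops between pages that link to each other, and the prefix test keeps the walk from climbing out of the starting directory.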

Script I.3.2: Mirroring a Document Tree

Recursively mirror an entire document tree, copying every page at the same level as the starting URL or below it.

 #!/usr/local/bin/perl
 
 # File: mirrorTree.pl
 
 use LWP::UserAgent;
 use HTML::LinkExtor;
 use URI::URL;
 use File::Path;
 use File::Basename;
 %DONE    = ();
 
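 my $URL = shift or die "Usage: mirrorTree.pl URL\n";
 
 $UA     = LWP::UserAgent->new;
 $PARSER = HTML::LinkExtor->new();
 
 # a HEAD request tells us the document's base URL without fetching its body
 $TOP    = $UA->request(HTTP::Request->new(HEAD => $URL));
 die "$URL: ", $TOP->message, "\n" if $TOP->is_error;
 $BASE   = $TOP->base;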
 
 mirror(URI::URL->new($TOP->request->url));
 
 sub mirror {
     my $url = shift;
 
     # get rid of query string "?" and fragments "#"
     my $path = $url->path;
     my $fixed_url = URI::URL->new ($url->scheme . '://' . $url->netloc . $path);
 
     # make the URL relative
     my $rel = $fixed_url->rel($BASE);
     $rel .= 'index.html' if $rel=~m!/$! || length($rel) == 0;
 
     # skip it if we've already done it
     return if $DONE{$rel}++;
 
     # create the directory if it doesn't exist already
     my $dir = dirname($rel);
     mkpath([$dir]) unless -d $dir;
 
     # mirror the document
     my $doc = $UA->mirror($fixed_url,$rel);
     print STDERR "$rel: ",$doc->message,"\n";
     return if $doc->is_error;
 
     # Follow HTML documents
     return unless $rel=~/\.html?$/i;
     my $base = $doc->base;
     
     # pull out the links and call us recursively
     my @links = $PARSER->parse_file("$rel")->links;
     my @hrefs = map { url($_->[2],$base)->abs } @links;
 
     foreach (@hrefs) {
         next unless is_child($BASE,$_);
         mirror($_);
     }
 }
 
 sub is_child {
     my ($base,$url) = @_;
     my $rel = $url->rel($base);
     return ($rel ne $url) && ($rel !~ m!^[/.]!);
 }
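
To see what is_child() accepts and rejects, it can be run on a few sample links. The sketch below copies the subroutine from the script and assumes the URI::URL module is installed (the script itself needs it); the example.com URLs are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI::URL;   # same module the script uses; exports url()

# Same test as is_child() in the script above.
sub is_child {
    my ($base, $url) = @_;
    my $rel = $url->rel($base);
    return ($rel ne $url) && ($rel !~ m!^[/.]!);
}

# Hypothetical base URL and candidate links, for illustration only.
my $base = url('http://www.example.com/docs/index.html');
my %verdict;
for my $link (qw(
    http://www.example.com/docs/sub/page.html
    http://www.example.com/other/page.html
    http://elsewhere.example.com/docs/page.html
)) {
    my $u = url($link);
    $verdict{$link} = is_child($base, $u) ? 'follow' : 'skip';
    # show the link, its form relative to the base, and the decision
    printf "%-45s %-20s %s\n", $link, $u->rel($base), $verdict{$link};
}
```

Only the first link should be followed. The second becomes a relative URL that starts with "../", so it lies above the base directory, and the third is on a different host, so rel() returns it unchanged and the ($rel ne $url) test fails.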

What it looks like

% ../mirrorTree.pl http://prego/ImageMagick/index.html
index.html: OK
display.html: OK
formats.html: OK
miff.html: OK
ImageMagick.html: Not Found
README.html: OK
convert.html: OK
quantize.html: OK
montage.html: OK
identify.html: OK
animate.html: OK
import.html: OK
mogrify.html: OK
combine.html: OK
Magick.html: OK
xtp.html: OK

Lincoln D. Stein, lstein@cshl.org
Cold Spring Harbor Laboratory
Last modified: Mon Aug 17 10:44:23 EDT 1998