WWW::Robot

Summary
WWW::Robot - configurable web traversal engine (for web robots & agents)

Package variables
Privates (from "my" definitions)
%urls_list;
%ATTRIBUTE_DEFAULT = (
    'REQUEST_DELAY' => 1,
    'TRAVERSAL'     => 'depth',
    'VERBOSE'       => 0,
    'IGNORE_TEXT'   => 1,
)
@LINK_ELEMENTS = qw(a img body form input link frame applet area)
%ATTRIBUTES = (
    'NAME'          => 'Name of the Robot',
    'VERSION'       => 'Version of the Robot, N.NNN',
    'EMAIL'         => 'Contact email address for Robot owner',
    'REQUEST_DELAY' => 'Delay between requests to the same server',
    'TRAVERSAL'     => 'traversal order - depth or breadth',
    'VERBOSE'       => 'boolean flag for verbose reporting',
    'IGNORE_TEXT'   => 'should we ignore text content of HTML?',
    'SITEROOTROOT'  => 'the string to check if the site is inside the desired web domain',
    'CACHEFILE'     => 'file which contains the cached results from previous run',
)
%SUPPORTED_HOOKS = (
    'restore-state'          => 'opportunity for client to restore state',
    'invoke-on-all-url'      => 'invoked on all URLs, even those not visited',
    'follow-url-test'        => 'return true if robot should visit the URL',
    'invoke-on-followed-url' => 'invoked on only those URLs which are visited',
    'invoke-on-get-error'    => 'invoked when an HTTP request results in error',
    'invoke-on-contents'     => 'invoked on the contents of each visited URL',
    'invoke-on-link'         => 'invoked on all links seen on a page',
    'continue-test'          => 'return true if robot should continue iterating',
    'save-state'             => 'opportunity for client to save state after a run',
    'generate-report'        => 'report for the run just finished',
    'modified-since'         => 'returns a modified-since time for URL passed',
    'invoke-after-get'       => 'invoked right after every GET request',
)
@TMPDIR_OPTIONS = ("$ENV{PERLTREE_HOME}/tmp")
%seen_url;

Included modules
Clair::Config
English
HTML::Parse
HTTP::Request
HTTP::Status
IO::File
LWP::RobotUA
SDBM_File
Time::Local
URI::Escape
URI::URL

Synopsis
   use WWW::Robot;
   
   $robot = new WWW::Robot('NAME'     => 'MyRobot',
			   'VERSION'  => '1.000',
			   'EMAIL'    => 'fred@foobar.com');
   
   # ... configure the robot's operation ...
   
   $robot->run('http://www.foobar.com/');

Description
This module implements a configurable web traversal engine,
for a robot or other web agent.
Given an initial web page (URL),
the Robot will get the contents of that page,
and extract all links on the page, adding them to a list of URLs to visit.
Features of the Robot module include:

    * Follows the Robot Exclusion Protocol.
    * Supports the META element proposed extensions to the Protocol.
    * Implements many of the Guidelines for Robot Writers.
    * Configurable.
    * Builds on standard Perl 5 modules for WWW, HTTP, HTML, etc.

A particular application (robot instance) has to configure
the engine using hooks, which are perl functions invoked by the Robot
engine at specific points in the control loop.
The robot engine obeys the Robot Exclusion protocol,
as well as a proposed addition.
See the SEE ALSO section for references to
documents describing the Robot Exclusion protocol and web robots.
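
The META extension mentioned above comes down to reading the CONTENT value of a META element named ROBOTS and turning it into two flags. The following is a minimal standalone sketch of that step, not the module's own code: the function name robots_meta_flags is hypothetical, and only the directive names (noindex, nofollow, none) are taken from the check_protocol listing in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the CONTENT value of <META NAME="ROBOTS">
# into (noindex, nofollow) flags.
sub robots_meta_flags {
    my ($content) = @_;
    my ($noindex, $nofollow) = (0, 0);
    return ($noindex, $nofollow) unless defined $content;

    foreach my $directive (split(/,/, lc($content))) {
        $directive =~ s/^\s+|\s+$//g;    # tolerate "noindex, nofollow"
        $noindex  = 1 if $directive eq 'noindex'  || $directive eq 'none';
        $nofollow = 1 if $directive eq 'nofollow' || $directive eq 'none';
    }
    return ($noindex, $nofollow);
}

my ($noindex, $nofollow) = robots_meta_flags('NOINDEX, follow');
print "noindex=$noindex nofollow=$nofollow\n";
```

A page flagged noindex would not have its contents passed to the application, and a page flagged nofollow would not have its links queued.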

Methods
DESTROY                   No description
addHook                   Description
addUrl                    Description
check_protocol            No description
create_agent              No description
env_proxy                 No description
extract_links             No description
getAttribute              Description
get_url                   No description
initialise                No description
invoke_hook_functions     No description
invoke_hook_procedures    No description
new                       No description
next_url                  No description
no_proxy                  No description
pick_tmpdir               No description
pre_run_check             No description
proxy                     Description
retrieve_cached_urls      No description
run                       Description
setAttribute              Description
set_attribute             No description
verbose                   No description
warn                      No description

Methods description


addHook
  $robot->addHook($hook_name, \&hook_function);
  
  sub hook_function { ... }
Register a hook function which should be invoked by the robot at
a specific point in the control flow. There are a number of
hook points in the robot, which are identified by a string.
For a list of hook points, see the SUPPORTED HOOKS section below.
If you provide more than one function for a particular hook,
then the hook functions will be invoked in the order they were added.
I.e. the first hook function called will be the first hook function
you added.
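
The ordering rule can be illustrated with a small standalone sketch that mimics the hook table rather than using the module itself. The names add_hook and %hooks are hypothetical stand-ins. The OR-combination of return values follows the invoke_hook_functions listing in this document, and, unlike the real module, the robot object is not passed as the first argument.

```perl
use strict;
use warnings;

# Stand-in for the robot's hook table: hook name => list of code refs.
my %hooks;

# Append a function to the named hook, in the spirit of addHook().
sub add_hook {
    my ($name, $fn) = @_;
    return undef unless ref($fn) eq 'CODE';
    push @{ $hooks{$name} }, $fn;
    return 1;
}

# Call every function registered for a hook, in registration order,
# and OR the results together.
sub invoke_hook_functions {
    my ($name, @args) = @_;
    my $result = 0;
    foreach my $fn (@{ $hooks{$name} || [] }) {
        $result |= $fn->($name, @args);
    }
    return $result;
}

my @calls;
add_hook('follow-url-test', sub { push @calls, 'first';  return 0 });
add_hook('follow-url-test', sub { push @calls, 'second'; return 1 });

my $follow = invoke_hook_functions('follow-url-test', 'http://www.foobar.com/');
print "order: @calls, follow: $follow\n";
```

Because the results are OR-ed, one hook returning true is enough for a test hook to pass, even if an earlier hook returned false.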

addUrl
  $robot->addUrl( $url1, ..., $urlN );
Used to add one or more URLs to the queue for the robot.
Each URL can be passed as a simple string,
or as a URI::URL object.
Returns True (non-zero) if all URLs were successfully added,
False (zero) if at least one of the URLs could not be added.

getAttribute
  $value = $robot->getAttribute('attribute-name');
Queries a Robot for the value of an attribute.
For example, to query the version number of your robot,
you would get the VERSION attribute:
   $version = $robot->getAttribute('VERSION');
The supported attributes for the Robot module are listed below,
in the ROBOT ATTRIBUTES section.

proxy, no_proxy, env_proxy
These are convenience functions for setting proxy information on the
user agent being used to make the requests.
    $robot->proxy( protocol, proxy );
Used to specify a proxy for the given scheme.
The protocol argument can be a reference to a list of protocols.
    $robot->no_proxy(domain1, ... domainN);
Specifies that proxies should not be used for the specified
domains or hosts.
    $robot->env_proxy();
Load proxy settings from protocol_proxy environment variables:
ftp_proxy, http_proxy, no_proxy, etc.

run
    $robot->run( LIST );
Invokes the robot, initially traversing the root URLs provided in LIST,
and any which have been provided with the addUrl() method before
invoking run().
If you have not correctly configured the robot, the method will
return undef.
The initial set of URLs can either be passed as arguments to the
run() method, or with the addUrl() method before you
invoke run().
Each URL can be specified either as a string,
or as a URI::URL object.
Before invoking this method, you should have provided at least some of
the hook functions.
See the example given in the EXAMPLES section below.
By default the run() method will iterate until there are no more
URLs in the queue.
You can override this behavior by providing a continue-test hook
function, which checks for the termination conditions.
This particular hook function, and use of hook functions in general,
are described below.
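
The interaction between the URL queue, the follow-url-test hook and the continue-test hook can be sketched without any network access. Everything below is a hypothetical stand-in: the %links table plays the role of the web, and a plain array is used as the queue, so this sketch visits pages breadth-first, whereas the listing in this document draws URLs from a hash in no particular order.

```perl
use strict;
use warnings;

# Hypothetical in-memory "web": page => links found on that page.
my %links = (
    '/'  => ['/a', '/b'],
    '/a' => ['/b', '/c'],
    '/b' => ['/'],
    '/c' => [],
);

my (@queue, %seen, @visited);

# Queue a URL unless it has been seen before.
sub add_url {
    my ($url) = @_;
    return if $seen{$url}++;
    push @queue, $url;
}

# Stand-ins for the follow-url-test and continue-test hooks.
my $follow_url_test = sub { my ($url) = @_; return $url ne '/c' };
my $continue_test   = sub { return @visited < 3 };

add_url('/');
while (defined(my $url = shift @queue)) {
    next unless $follow_url_test->($url);
    push @visited, $url;
    add_url($_) for @{ $links{$url} || [] };
}
continue {
    # Without a continue-test the loop would run until the queue is empty.
    last unless $continue_test->();
}

print "visited: @visited\n";
```

Here the continue-test stops the run after three pages, even though '/c' is still queued.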

setAttribute
  $robot->setAttribute( ... attribute-value-pairs ... );
Change the value of one or more robot attributes.
Attributes are identified using a string, and take scalar values.
For example, to specify the name of your robot,
you set the NAME attribute:
   $robot->setAttribute('NAME' => 'WebStud');
The supported attributes for the Robot module are listed below,
in the ROBOT ATTRIBUTES section.
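
As a rough illustration of how attribute-value pairs are consumed, the standalone sketch below applies pairs to a plain hash and collects anything it cannot use. The function set_attributes and the COLOUR attribute are hypothetical. The real setAttribute() prints a warning for unknown names and for an odd number of arguments instead of returning them.

```perl
use strict;
use warnings;

# A few attribute names and defaults taken from the listings in this document.
my %known    = map { $_ => 1 } qw(NAME VERSION EMAIL VERBOSE TRAVERSAL REQUEST_DELAY);
my %defaults = (VERBOSE => 0, TRAVERSAL => 'depth', REQUEST_DELAY => 1);

# Apply attribute-value pairs; returns the names that were rejected.
sub set_attributes {
    my ($robot, @pairs) = @_;
    my @rejected;
    while (@pairs > 1) {
        my ($name, $value) = splice(@pairs, 0, 2);
        if (!$known{$name}) { push @rejected, $name; next; }
        $robot->{$name} = $value;
    }
    push @rejected, @pairs;    # a trailing name without a value
    return @rejected;
}

my %robot    = %defaults;
my @rejected = set_attributes(\%robot, NAME => 'WebStud', COLOUR => 'red', 'VERSION');
print "name=$robot{NAME} traversal=$robot{TRAVERSAL} rejected=@rejected\n";
```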

Methods code


DESTROY
sub DESTROY {
    my $self = shift;


    unlink $self->{'WORKFILE'} if defined $self->{'WORKFILE'};
}

addHook
sub addHook {
    my $self       = shift;
    my $hook_name  = shift;
    my $hook_fn    = shift;


    if (!exists $SUPPORTED_HOOKS{$hook_name})
    {
	$self->warn("unknown hook name passed to addHook(). Ignoring it!",
                    "Hook Name: $hook_name");
	return undef;
    }

    if (ref($hook_fn) ne 'CODE')
    {
	$self->warn("not function reference passed to addHook(). Ignoring.",
                    "Hook Name: $hook_name");
	return undef;
    }

    if (exists $self->{'HOOKS'}->{$hook_name})
    {
	push(@{ $self->{'HOOKS'}->{$hook_name} }, $hook_fn);
    }
    else
    {
	$self->{'HOOKS'}->{$hook_name} = [$hook_fn];
    }

    return 1;
}

addUrl
sub addUrl {
    my $self       = shift;
    my @list       = @ARG;

    my $status     = 1;
    my $url;
    my $urlObject;

    foreach my $newurl (@list)
    {
        $url = $newurl;

        # check if we already visited this URL
        if (exists $seen_url{$url}) { next; }

        # check if the URL belongs to the domain
        if (!($url =~ /$self->{'SITEROOTROOT'}/))
        {
            print "external link, so skipped $url\n";
            next;
        }

        # check if the URL extension is OK
        if ($url =~ /https?:\/\/.*\/[^\/]*\.([^\/\.]+)\/*$/)
        {
            my $extension = $1;
            if ($extension =~ /^(?:jpg|zip|gz|gif|pdf|ps|ppt|png|doc|pps|tar|tgz|mov|avi|mpg|mp3|mpeg|wmv|xls)$/i)
            {
                print "extension is $extension, so skipped $url\n";
                next;
            }
        }

        # check if the URL is cgi-scripted or anything like that
        # does not work perfectly. add the unwanted urls into the
        # forbidden_urls file if something goes wrong
        my $base;
        if ($url =~ /^(.*?\?).+=/) { $base = $1; }
        if ($base) { print "cgi: $url\n"; }

        # check if we visited 100 different copies of the scripted page before
        if ($base && (exists $seen_url{$base}) && $seen_url{$base} >= 100)
        {
            print "cgi limit reached for $base\n";
            next;
        }

        #---------------------------------------------------------------
        # Mark the URL as having been seen by the robot, then add it
        # to the list of URLs for the robot to visit. Doing it this way
        # means we won't get duplicate URLs on the list.
        #---------------------------------------------------------------
        print "adding: $url\n";

        # increment the count of the "base" of the scripted page
        if ($base) { $seen_url{$base}++; }
        $seen_url{$url} = 1;
        $urls_list{$url} = 1 if defined $url;
    }

    return $status;
}
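
The filtering rules in addUrl (already seen, unwanted extension, too many copies of one scripted page) can be exercised in isolation. The sketch below is a simplified stand-in, not the module's code: skip_reason, %skip_ext and $cgi_limit are hypothetical names, the extension list is a subset of the one above, the limit is lowered from 100 to 2 for the demonstration, and the site-root check is left out.

```perl
use strict;
use warnings;

my %skip_ext  = map { $_ => 1 } qw(jpg gif png pdf zip gz mp3);  # subset of addUrl's list
my %seen;
my $cgi_limit = 2;    # the listing uses 100; lowered for the demo

# Return a short reason for skipping $url, or '' to accept it.
sub skip_reason {
    my ($url) = @_;
    return 'seen' if exists $seen{$url};

    if ($url =~ m{^https?://.*/[^/]*\.([^/.]+)/*$} && $skip_ext{lc($1)}) {
        return 'extension';
    }

    # "script?" prefix of a URL that carries query arguments
    my ($base) = $url =~ /^(.*?\?).+=/;
    return 'cgi limit' if defined $base && ($seen{$base} || 0) >= $cgi_limit;

    $seen{$base}++ if defined $base;
    $seen{$url} = 1;
    return '';
}

my @reasons = map { skip_reason($_) } (
    'http://www.foobar.com/logo.GIF',
    'http://www.foobar.com/list.cgi?page=1',
    'http://www.foobar.com/list.cgi?page=2',
    'http://www.foobar.com/list.cgi?page=3',
    'http://www.foobar.com/list.cgi?page=1',
);
print join('|', @reasons), "\n";
```

Counting by the part of the URL up to the question mark is what keeps a single script with endlessly varying arguments from filling the queue.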

check_protocol
sub check_protocol {
    my $self       = shift;
    my $structure  = shift;
    my $url        = shift;

    my $noindex    = 0;
    my $nofollow   = 0;

    #-------------------------------------------------------------------
    # recursively traverse the page elements, looking for META with
    # NAME=ROBOTS, then look for directives in the CONTENT attribute.
    #-------------------------------------------------------------------
    $structure->traverse(sub {
        my $node       = shift;
        my $start_flag = shift;
        my $depth      = shift;

        my $directive;
        my $name;
        my $content;

        return 1 unless $start_flag;
        return 1 if $node->tag() ne 'meta';
        $name = $node->attr('name');
        return 1 unless defined $name;
        return 1 unless lc($name) eq 'robots';

        $content = $node->attr('content');
        return 0 unless defined $content;

        # allow whitespace around the commas, e.g. "noindex, nofollow"
        foreach $directive (split(/\s*,\s*/, lc($content)))
        {
            $directive =~ s/^\s+|\s+$//g;
            $nofollow = 1 if ($directive eq 'nofollow' || $directive eq 'none');
            $noindex  = 1 if ($directive eq 'noindex'  || $directive eq 'none');
        }
        return 0;
    }, 1);

    $self->verbose(" ROBOT EXCLUSION -- IGNORING LINKS\n") if $nofollow;
    $self->verbose(" ROBOT EXCLUSION -- IGNORING CONTENT\n") if $noindex;

    return ($noindex, $nofollow);
}

create_agent
sub create_agent {
    my $self = shift;

    foreach my $i (@INC) {
	print STDERR "$i\n";
    }
    print STDERR "**\n";

    eval { $self->{'AGENT'} = new LWP::RobotUA('Poacher', $EMAIL) };
    if (!$self->{'AGENT'})
    {
	$self->warn("failed to create User Agent object.",
                    "Error: $EVAL_ERROR\n");
	return undef;
    }

    return 1;
}

env_proxy
sub env_proxy {
    my $self  = shift;


    return $self->{'AGENT'}->env_proxy();
}

extract_links
sub extract_links {
    my $self        = shift;
    my $url         = shift;
    my $response    = shift;
    my $structure   = shift;
    my $filename    = shift;

    my $link;
    my @link;
    my $element;                # of type HTML::Element
    my $link_url;
    my $tuple;
    my %seenLinkTo;
    my $ismap;
    my $usemap;

    foreach $tuple (@{ $structure->extract_links(@LINK_ELEMENTS) })
    {
        ($link, $element) = @$tuple;

        #---------------------------------------------------------------
        # If the element is an Anchor (<A HREF="...">...</A>), then we
        # check to see if the content is an image, with ISMAP or USEMAP,
        # i.e. an image-map. If so, then we ignore the Anchor element.
        #---------------------------------------------------------------
        if ($element->tag() eq 'a')
        {
            $usemap = $ismap = undef;
            $element->traverse(sub {
                my $node       = shift;
                my $start_flag = shift;
                my $depth      = shift;

                return 1 unless $start_flag;
                if ($node->tag() eq 'img')
                {
                    $ismap  = $node->attr('ismap');
                    $usemap = $node->attr('usemap');
                    return 0;
                }
                return 1;
            }, 1);
            next if defined $ismap && defined $usemap;
        }

        #---------------------------------------------------------------
        # ignore any links to within the same page
        #---------------------------------------------------------------
        next if $link =~ m!^#!;

        #---------------------------------------------------------------
        # strip off any markers to within a page
        #---------------------------------------------------------------
        $link =~ s!#.*$!!;

        #---------------------------------------------------------------
        # should we strip off anything after "?" - i.e. args?
        #---------------------------------------------------------------
        # $link =~ s!\?.*$!!;

        #---------------------------------------------------------------
        # check that we haven't already traversed this page
        # if we haven't, then mark it as seen, and continue
        #---------------------------------------------------------------
        next if exists $seenLinkTo{$link};
        $seenLinkTo{$link} = 1;

        $link_url = eval { new URI::URL($link, $url) };
        if ($EVAL_ERROR)
        {
            $self->warn("unable to create URL object for link.",
                        "LINK: $link",
                        "Error: $EVAL_ERROR\n");
            next;
        }

        push(@link, $link_url->abs());
    }

    return @link;
}

getAttribute
sub getAttribute {
    my $self       = shift;
    my $attribute  = shift;


    if (!exists $ATTRIBUTES{$attribute})
    {
	$self->warn("unknown attribute passed to getAttribute()",
                    "Attribute: $attribute",
                    "Returning: undef");
	return undef;
    }

    return $self->{$attribute};
}

get_url
sub get_url {
    my $self       = shift;
    my $url        = shift;

    my $request = HTTP::Request->new('GET', $url);
    if (not defined $request) { print "REQUEST NOT DEFINED\n"; }

    my $filename   = $self->{'WORKFILE'};
    my $response;
    my $fh;
    my $structure;

    #---------------------------------------------------------------------
    # Is there a modified-since hook?
    #---------------------------------------------------------------------
    if (exists $self->{'HOOKS'}->{'modified-since'})
    {
        my $time = $self->invoke_hook_functions('modified-since', $url);
        if (defined $time && $time > 0)
        {
            $request->if_modified_since(int($time));
        }
    }

    #---------------------------------------------------------------------
    # make the request
    #---------------------------------------------------------------------
    $response = $self->{'AGENT'}->request($request);
    if (not defined $response) { print "RESPONSE NOT DEFINED\n"; }
    $request = undef;

    #-------------------------------------------------------------------
    # If the request failed, or we get a 304 (not modified), then we
    # can stop at this point.
    #-------------------------------------------------------------------
    if ($response->is_error)
    {
        return ($response, undef, undef,
                "RESPONSE Error: " . $response->code . " " . $response->message);
    }
    elsif ($response->code == RC_NOT_MODIFIED)
    {
        return ($response, undef, undef, "RC_NOT_MODIFIED");
    }

    #---------------------------------------------------------------------
    # create a local copy of the URL's contents
    #---------------------------------------------------------------------
    ##open (TEMPFILE,">$filename");
    ## unless (defined $fh)
    # $fh = new IO::File(">$filename");
    # if (!defined $fh)
    # {
    # $self->warn("failed to open work file for local copy of URL",
    # "URL: $url",
    # "Error: $OS_ERROR");
    # return ($response, undef, undef, "FILE ERROR $OS_ERROR");
    # }
    #temporarily:
    # else {
    # $self->warn("failed to open work file for local copy of URL",
    # "URL: $url",
    # "Error: $OS_ERROR");
    # return ($response, undef, undef);
    # }
    # print $fh $response->content;
    # $fh->close;
    ##my $real_response = uri_escape($response->content);
    ##print TEMPFILE $real_response;
    ##close TEMPFILE;

    #---------------------------------------------------------------------
    # Parse the HTML into a structure which we can traverse, etc.
    #---------------------------------------------------------------------
    if ($response->content_type eq 'text/html')
    {
        $structure = parse_html($response->content);
    }

    return ($response, $structure, $filename);
}

initialise
sub initialise {
    my $self     = shift;
    my $options  = shift;

    my $attribute;

    $self->create_agent() || return undef;

    #---------------------------------------------------------------------
    # set attributes which are passed as arguments
    #---------------------------------------------------------------------
    foreach $attribute (keys %$options)
    {
        $self->setAttribute($attribute, $options->{$attribute});
    }

    #---------------------------------------------------------------------
    # set those attributes which have a default value,
    # and which weren't set on creation.
    #---------------------------------------------------------------------
    foreach $attribute (keys %ATTRIBUTE_DEFAULT)
    {
        if (!exists $self->{$attribute})
        {
            $self->{$attribute} = $ATTRIBUTE_DEFAULT{$attribute};
        }
    }

    #---------------------------------------------------------------------
    # TMPDIR is the directory to create any temporary files in.
    # WORKFILE is where we put our local copy of URLs.
    #---------------------------------------------------------------------
    if (!exists $self->{'TMPDIR'})
    {
        $self->{'TMPDIR'} = &pick_tmpdir($self, @TMPDIR_OPTIONS);
    }
    $self->{'WORKFILE'} = $self->{'TMPDIR'}.'/'.$self->{'NAME'}.$$;

    return $self;
}

invoke_hook_functions
sub invoke_hook_functions {
    my $self       = shift;
    my $hook_name  = shift;
    my @argv       = @ARG;

    my $result     = 0;
    my $hookfn;


    return $result unless exists $self->{'HOOKS'}->{$hook_name};

    foreach $hookfn (@{ $self->{'HOOKS'}->{$hook_name} })
    {
	$result |= &$hookfn($self, $hook_name, @argv);
    }

    return $result;
}

invoke_hook_procedures
sub invoke_hook_procedures {
    my $self       = shift;
    my $hook_name  = shift;
    my @argv       = @ARG;

    my $hookfn;


    return unless exists $self->{'HOOKS'}->{$hook_name};

    foreach $hookfn (@{ $self->{'HOOKS'}->{$hook_name} })
    {
	&$hookfn($self, $hook_name, @argv);
    }

    return;
}

new
sub new {
    my $class    = shift;
    my %options  = @ARG;

    my $object;

    #-------------------------------------------------------------------
    # The two argument version of bless() enables correct subclassing.
    # See the "perlbot" and "perlmod" documentation in perl distribution.
    #-------------------------------------------------------------------
    $object = bless {}, $class;

    print "Robot2 - starting\n";

    return $object->initialise(\%options);
}

next_url
sub next_url {
    my $self    = shift;

    my $urlObject;

    #-------------------------------------------------------------------
    # We return 'undef' to signify no URLs on the list
    #
    # for now we're using a hash, so the search will be neither depth-first
    # nor breadth-first. just whatever hash element comes up.
    # hash keys are simple URL strings. But we return a URL object here
    #-------------------------------------------------------------------
    while (keys %urls_list)
    {
        my ($url, $value) = each %urls_list;
        delete $urls_list{$url};

        $urlObject = eval { new URI::URL($url) };
        if ($EVAL_ERROR)
        {
            $self->warn("next_url() unable to create URI::URL object",
                        "URL: $url",
                        "Error: $EVAL_ERROR");
            next;       # skip the bad entry and try another one
        }
        return undef unless defined $urlObject;

        return $urlObject;
    }

    return undef;
}

no_proxy
sub no_proxy {
    my $self  = shift;
    my @argv  = @ARG;


    return $self->{'AGENT'}->no_proxy(@argv);
}

pick_tmpdir
sub pick_tmpdir {
    my $self     = shift;
    my @options  = @ARG;

    my $tmpdir;
    my $ROBOT    = $self->{'NAME'};


    unshift(@options, $ENV{'TMPDIR'}) if exists $ENV{'TMPDIR'};
    foreach $tmpdir (@options)
    {
	return $tmpdir if (-d $tmpdir && -w $tmpdir);
    }

    $self->warn("unable to find a temporary directory.",
                "I tried: ".join(' ', @options));
    return undef;
}

pre_run_check
sub pre_run_check {
    my $self = shift;

    #-------------------------------------------------------------------
    # Check that mandatory attributes have been set
    #-------------------------------------------------------------------
    if (!exists $self->{'NAME'}
        || !exists $self->{'VERSION'}
        || !exists $self->{'EMAIL'})
    {
        $self->warn("You haven't set all of the required robot attributes.",
                    "They are: NAME, VERSION and EMAIL attributes");
        return undef;
    }

    #-------------------------------------------------------------------
    # The robot application must provide a follow-url-test hook
    #-------------------------------------------------------------------
    if (!exists $self->{'HOOKS'}->{'follow-url-test'})
    {
        $self->warn("You must provide a `follow-url-test' hook.");
        return undef;
    }

    #-------------------------------------------------------------------
    # You must provide at least one of the following hook functions
    #-------------------------------------------------------------------
    if (not (   exists $self->{'HOOKS'}->{'invoke-on-all-url'}
             || exists $self->{'HOOKS'}->{'invoke-on-followed-url'}
             || exists $self->{'HOOKS'}->{'invoke-on-contents'}
             || exists $self->{'HOOKS'}->{'invoke-on-link'}))
    {
        $self->warn("You must provide at least one invoke-on-* hook.",
                    "Please see the documentation.\n");
        return undef;
    }

    return 1;
}

proxy
sub proxy {
    my $self  = shift;
    my @argv  = @ARG;


    return $self->{'AGENT'}->proxy(@argv);
}

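retrieve_cached_urls
sub retrieve_cached_urls {
  my $self = shift;

  my $filename = shift;

  my $processed;

  # give up early if the cache file cannot be read
  open (INPUT, '<', $filename) or do {
	$self->warn("unable to open cache file $filename: $OS_ERROR");
	return undef;
  };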

  while (<INPUT>) {
	chomp;

	if (/^http/) {
		print "$_\n";
		delete $urls_list{$_};
	}

	elsif (/^adding: (.*)\s*$/) {
		$self->addUrl($1);
	}

	elsif (/^processing: (.*)\s*$/) {
		if ($processed) {
		    delete $urls_list{$processed};
		}
		$processed = $1;
	}
  }

  close INPUT;
}

run
sub run {
    my $self      = shift;
    my @url_list  = @ARG;       # optional list of URLs

    my $url;
    my $filename;
    my $response;
    my @page_urls;
    my $link_url;
    my $structure;
    my $noindex;
    my $nofollow;

    # Added 01/14/2007 jgerrish
    dbmopen %urls_list, "urls_list", 0666 or die "Couldn't open urls_list dbm";
    dbmopen %seen_url, "seen_url", 0666 or die "Couldn't open seen_url dbm";

    $self->pre_run_check() || return undef;

    if ($self->{'CACHEFILE'})
    {
        $self->retrieve_cached_urls ($self->{'CACHEFILE'});
    }
    else
    {
        $self->addUrl(@url_list);
    }

    $self->invoke_hook_procedures('restore-state');

    my $count=0;

    #-------------------------------------------------------------------
    # MAIN LOOP of the robot. Of course this is all obvious, so we won't
    # go into it. Comment above describes the basic architecture.
    #-------------------------------------------------------------------
    while (my $nextUrl = $self->next_url())
    {
        $url = $nextUrl;
        print "processing: $url\n";

        local $@;
        eval {
            $self->verbose($url, "\n");
            $self->invoke_hook_procedures('invoke-on-all-url', $url);
        };
        if ($@) { print "Exception caught in block A with url $url\n$@"; next; }

        if (not($self->invoke_hook_functions('follow-url-test', $url)))
        {
            print "test failed: $url\n";
            next;
        }

        eval {
            $self->invoke_hook_procedures('invoke-on-followed-url', $url);
        };
        if ($@) { print "Exception caught in block B with url $url.\n$@"; next; }

        my $message;
        eval {
            ($response, $structure, $filename, $message) = $self->get_url($url);
        };
        if ($@) { print "Exception caught in block C with url $url.\n$@"; next; }
        if ($message) { print "$message: $url\n"; next; }

        #---------------------------------------------------------------
        # This hook function is for people who want to see the result
        # of every GET, so they can deal with odd cases, or whatever
        #---------------------------------------------------------------
        eval {
            $self->invoke_hook_procedures('invoke-after-get', $url, $response);
        };
        if ($@) { print "Exception caught in block D with url $url.\n$@"; next; }

        eval {
            if ($response->is_error)
            {
                $self->invoke_hook_procedures('invoke-on-get-error', $url, $response);
                next;
            }
        };
        if ($@) { print "Exception caught in block D.5 with url $url.\n$@"; next; }

        if ($response->code == RC_NOT_MODIFIED)
        {
            print "response code RC_NOT_MODIFIED: $url\n";
            next;
        }

        #---------------------------------------------------------------
        # The response says we should use something else as the BASE
        # from which to resolve any relative URLs. This might be from
        # a BASE element in the HEAD, or just "foo" which should be "foo/"
        #---------------------------------------------------------------
        eval {
            if ($response->base ne $url)
            {
                $url = new URI::URL($response->base);
            }
        };
        if ($@) { print "Exception caught in block E with url $url.\n$@"; next; }

        #---------------------------------------------------------------
        # Check page for page specific robot exclusion commands
        #---------------------------------------------------------------
        eval {
            if ($response->content_type eq 'text/html')
            {
                ($noindex, $nofollow) = $self->check_protocol($structure, $url);
                if ($nofollow == 0)
                {
                    @page_urls = $self->extract_links($url, $response,
                                                      $structure, $filename);
                }
            }
        };
        if ($@) { print "Exception caught in block F with url $url.\n$@"; next; }

        eval {
            if ($noindex == 0)
            {
                # we invoke with oldurl, so robot app sees it
                $self->invoke_hook_procedures('invoke-on-contents', $url,
                                              $response, $structure, $filename);
            }
        };
        if ($@) { print "Exception caught in block G with url $url.\n$@"; next; }

        eval {
            $structure->delete() if defined $structure;
        };
        if ($@) { print "Exception caught in block H with url $url.\n$@"; next; }

        if (not $response->content_type eq 'text/html')
        {
            print "content not html: $url\n";
            next;
        }

        eval {
            foreach $link_url (@page_urls)
            {
                $self->invoke_hook_procedures('invoke-on-link', $url, $link_url);
                $self->addUrl($link_url);
            }
        };
        if ($@) { print "Exception caught in block I with url $url.\n$@"; next; }
    }
    continue
    {
        #------------------------------------------------------------------
        # If there is no continue-test hook, then we will continue until
        # there are no more URLs.
        #------------------------------------------------------------------
        last if (exists $self->{'HOOKS'}->{'continue-test'}
                 && not $self->invoke_hook_functions('continue-test'));
    }

    $self->invoke_hook_procedures('save-state');
    $self->invoke_hook_procedures('generate-report');

    # Close the dbm files
    dbmclose %seen_url;
    dbmclose %urls_list;

    return 1;
}

setAttribute
sub setAttribute {
    my $self   = shift;
    my @attrs  = @ARG;

    my $attribute;
    my $value;


    while (@attrs > 1)
    {
        $attribute = shift @attrs;
        $value     = shift @attrs;

	if (!exists $ATTRIBUTES{$attribute})
	{
	    $self->warn("unknown attribute in setAttribute() - ignoring it.",
                        "Attribute: $attribute");
	    next;
	}
	$self->set_attribute($attribute, $value);
    }
    $self->warn("odd number of arguments to setAttribute!") if @attrs > 0;
}

set_attribute
sub set_attribute {
    my $self       = shift;
    my $attribute  = shift;
    my $new_value  = shift;

    $self->{$attribute} = $new_value;

    if ($attribute eq 'IGNORE_TEXT')
    {
        #---------------------------------------------------------------
        # when building structure of HTML, do we include content?
        #---------------------------------------------------------------
        $HTML::Parse::IGNORE_TEXT = $new_value;
    }
    elsif ($attribute eq 'TRAVERSAL')
    {
        #---------------------------------------------------------------
        # check that TRAVERSAL is set to a legal value
        #---------------------------------------------------------------
        if ($new_value ne 'depth' && $new_value ne 'breadth')
        {
            $self->warn("ignoring unknown traversal method, using `depth'.",
                        "Value: $new_value");
            $self->{'TRAVERSAL'} = 'depth';
        }
    }
    elsif ($attribute eq 'EMAIL')
    {
        $self->{'AGENT'}->from($new_value);
    }
    elsif (($attribute eq 'NAME' || $attribute eq 'VERSION')
           && defined $self->{'NAME'} && defined $self->{'VERSION'})
    {
        $self->{'AGENT'}->agent($self->{'NAME'}.'/'.$self->{'VERSION'});
    }
    elsif ($attribute eq 'REQUEST_DELAY')
    {
        $self->{'AGENT'}->delay($new_value);
    }

    $self->{'AGENT'}->use_sleep(0);
    $self->{'AGENT'}->timeout(10);
}

verbose
sub verbose {
    my $self   = shift;
    my @lines  = @ARG;


    print STDERR @lines if $self->{'VERBOSE'};
}

warn
sub warn {
    my $self  = shift;
    my @lines = @ARG;       # all remaining arguments, one per output line

    my $me    = ref $self;
    my $line;


    print STDERR "$me: ", shift @lines, "\n";
    foreach $line (@lines)
    {
        print STDERR ' ' x (length($me) +2), $line, "\n";
    }
}

General documentation


QUESTIONS
This section contains a number of questions. I'm interested in hearing
what people think, and what you've done faced with similar questions.

    * What style of API is preferable for setting attributes? Maybe
      something like the following:
          $robot->verbose(1);
          $traversal = $robot->traversal();
      I.e. a method for setting and getting each attribute,
      depending on whether you passed an argument?

    * Should the robot module support a standard logging mechanism?
      For example, a LOGFILE attribute, which is set to either a filename,
      or a filehandle reference.
      This would need a useful file format.

    * Should the AGENT be an attribute, so you can set this to whatever
      UserAgent object you want to use?
      Then if the attribute is not set by the first time the run()
      method is invoked, we'd fall back on the default.

    * Should TMPDIR and WORKFILE be attributes? I don't see any big reason
      why they should, but someone else's application might benefit?

    * Should the module also support an ERRLOG attribute, with all warnings
      and error messages sent there?

    * At the moment the robot will print warnings and error messages to stderr,
      as well as returning error status. Should this behaviour be configurable?
      I.e. the ability to turn off warnings.

The basic architecture of the Robot is as follows:
    Hook: restore-state
    Get Next URL
        Hook: invoke-on-all-url
        Hook: follow-url-test
        Hook: invoke-on-followed-url
        Get contents of URL
        Hook: invoke-on-contents
        Skip if not HTML
        Foreach link on page:
            Hook: invoke-on-link
            Add link to robot's queue
    Continue? Hook: continue-test
    Hook: save-state
    Hook: generate-report
Each of the hook procedures and functions is described below.
A robot must provide a follow-url-test hook,
and at least one of the following:

    * invoke-on-all-url
    * invoke-on-followed-url
    * invoke-on-contents
    * invoke-on-link

CONSTRUCTOR
   $robot = new WWW::Robot(  );
Create a new robot engine instance.
If the constructor fails for any reason, a warning message will be printed,
and undef will be returned.
Having created a new robot, it should be configured using the methods
described below.
Certain attributes of the Robot can be set during creation;
they can be (re)set after creation, using the setAttribute() method.
The attributes of the Robot are described below,
in the Robot Attributes section.

ROBOT ATTRIBUTES
This section lists the attributes used to configure a Robot object.
Attributes are set using the setAttribute() method,
and queried using the getAttribute() method.
Some of the attributes must be set before you start the Robot
(with the run() method).
These are marked as mandatory in the list below.

    NAME

    The name of the Robot.
This should be a sequence of alphanumeric characters,
and is used to identify your Robot.
This is used to set the User-Agent field of HTTP requests,
and so will appear in server logs.
mandatory

    VERSION

    The version number of your Robot.
This should be a floating point number,
in the format N.NNN.
mandatory

    EMAIL

    A valid email address which can be used to contact the Robot's owner,
for example by someone who wishes to complain about the behavior of
your robot.
mandatory

    VERBOSE

    A boolean flag which specifies whether the Robot should display verbose
status information as it runs.
    Default: 0 (false)

    TRAVERSAL

    Specifies what traversal style should be adopted by the Robot.
Valid values are depth and breadth.
    Default: depth

    REQUEST_DELAY

    Specifies the delay (in minutes) between successive GETs
from the same server.
    Default: 1

    IGNORE_TEXT

    Specifies whether the HTML structure passed to the invoke-on-contents
hook function should include the textual content of the page,
or just the HTML elements.
    Default: 1 (true)

SUPPORTED HOOKS
This section lists the hooks which are supported by the WWW::Robot module.
The first two arguments passed to a hook function are always the Robot
object followed by the name of the hook being invoked. I.e. the start of
a hook function should look something like: