| Summary | Top |
| WWW::Robot - configurable web traversal engine (for web robots & agents) |
| Package variables | Top |
| |
| %urls_list; | |
| %ATTRIBUTE_DEFAULT = ( 'REQUEST_DELAY', 1, 'TRAVERSAL', 'depth', 'VERBOSE', 0, 'IGNORE_TEXT', 1) | |
| @LINK_ELEMENTS = qw(a img body form input link frame applet area) | |
| %ATTRIBUTES = ( 'NAME', 'Name of the Robot', 'VERSION', 'Version of the Robot, N.NNN', 'EMAIL', 'Contact email address for Robot owner', 'REQUEST_DELAY', 'Delay between requests to the same server', 'TRAVERSAL', 'traversal order - depth or breadth', 'VERBOSE', 'boolean flag for verbose reporting', 'IGNORE_TEXT', 'should we ignore text content of HTML?', 'SITEROOTROOT', 'the string to check if the site is inside the desired web domain', 'CACHEFILE', 'file which contains the cached results from previous run') | |
| %SUPPORTED_HOOKS = ( 'restore-state', 'opportunity for client to restore state', 'invoke-on-all-url', 'invoked on all URLs, even those not visited', 'follow-url-test', 'return true if robot should visit the URL', 'invoke-on-followed-url', 'invoked on only those URLs which are visited', 'invoke-on-get-error', 'invoked when an HTTP request results in error', 'invoke-on-contents', 'invoked on the contents of each visited URL', 'invoke-on-link', 'invoked on all links seen on a page', 'continue-test', 'return true if robot should continue iterating', 'save-state', 'opportunity for client to save state after a run', 'generate-report', 'report for the run just finished', 'modified-since', 'returns a modified-since time for URL passed', 'invoke-after-get', 'invoked right after every GET request',) | |
| @TMPDIR_OPTIONS = ("$ENV{PERLTREE_HOME}/tmp") | |
| %seen_url; |
| Included modules | Top |
| Clair::Config |
| English |
| HTML::Parse |
| HTTP::Request |
| HTTP::Status |
| IO::File |
| LWP::RobotUA |
| SDBM_File |
| Time::Local |
| URI::Escape |
| URI::URL |
| Synopsis | Top |
use WWW::Robot;
$robot = new WWW::Robot('NAME' => 'MyRobot',
'VERSION' => '1.000',
'EMAIL' => 'fred@foobar.com');
# ... configure the robot's operation ...
$robot->run('http://www.foobar.com/'); |
| Description | Top |
| This module implements a configurable web traversal engine, for a robot or other web agent. Given an initial web page (URL), the Robot will get the contents of that page, and extract all links on the page, adding them to a list of URLs to visit.
Features of the Robot module include:
*(1) Follows the Robot Exclusion Protocol.
*(2) Supports the META element proposed extensions to the Protocol.
*(3) Implements many of the Guidelines for Robot Writers.
*(4) Configurable.
*(5) Builds on standard Perl 5 modules for WWW, HTTP, HTML, etc.
A particular application (robot instance) has to configure the engine using hooks, which are perl functions invoked by the Robot engine at specific points in the control loop.
The robot engine obeys the Robot Exclusion protocol, as well as a proposed addition. See SEE ALSO for references to documents describing the Robot Exclusion protocol and web robots. |
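The control loop described above can be sketched in a few lines of plain Perl. This is only an illustration, not the module's implementation: the in-memory `%web` link table, the `traverse()` function, and its `$follow` callback are hypothetical stand-ins for real HTTP fetching, link extraction, and a hook such as `follow-url-test`.

```perl
use strict;
use warnings;

# Hypothetical in-memory "web": page URL => links found on that page.
my %web = (
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/c', 'http://other.org/'],
    'http://example.com/b' => ['http://example.com/c'],
    'http://example.com/c' => [],
);

# Visit pages starting at $root. $order is 'depth' or 'breadth';
# $follow is a callback deciding whether a URL should be visited.
sub traverse {
    my ($root, $order, $follow) = @_;
    my @queue = ($root);
    my %seen  = ($root => 1);
    my @visited;

    while (@queue) {
        my $url = shift @queue;
        push @visited, $url;

        # "Extract links" from the page and keep the new, allowed ones.
        my @new = grep { !$seen{$_}++ && $follow->($_) } @{ $web{$url} || [] };

        if ($order eq 'depth') { unshift @queue, @new }   # newest links first
        else                   { push    @queue, @new }   # oldest links first
    }
    return @visited;
}

# Stay on one site, and print the visiting order for both traversal styles.
my $same_site = sub { $_[0] =~ m{^http://example\.com/} };
print join("\n", traverse('http://example.com/', $_, $same_site)), "\n\n"
    for qw(depth breadth);
```

With `depth`, newly found links go to the front of the queue, so the sketch descends into a page's links before returning to its siblings; with `breadth`, they go to the back.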
| Methods | Top |
| DESTROY | No description | Code |
| addHook | Description | Code |
| addUrl | Description | Code |
| check_protocol | No description | Code |
| create_agent | No description | Code |
| env_proxy | No description | Code |
| extract_links | No description | Code |
| getAttribute | Description | Code |
| get_url | No description | Code |
| initialise | No description | Code |
| invoke_hook_functions | No description | Code |
| invoke_hook_procedures | No description | Code |
| new | No description | Code |
| next_url | No description | Code |
| no_proxy | No description | Code |
| pick_tmpdir | No description | Code |
| pre_run_check | No description | Code |
| proxy | Description | Code |
| retrieve_cached_urls | No description | Code |
| run | Description | Code |
| setAttribute | Description | Code |
| set_attribute | No description | Code |
| verbose | No description | Code |
| warn | No description | Code |
| addHook | code | next | Top |
$robot->addHook($hook_name, \&hook_function);
sub hook_function { ... }
|
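A hedged usage sketch for registering hooks. The hook names come from `%SUPPORTED_HOOKS`, and the first two arguments (robot object, hook name) are as documented for hook functions. Treating the third argument as the URL, and the `www.example.com` site filter, are assumptions made for illustration. Registration is guarded so the snippet also runs where the module is not installed.

```perl
use strict;
use warnings;

# Hook for 'follow-url-test': return true if the robot should visit the URL.
# Assumption: the candidate URL is passed as the third argument.
sub follow_url_test {
    my ($robot, $hook_name, $url) = @_;
    return 0 unless defined $url;
    return "$url" =~ m{^http://www\.example\.com/} ? 1 : 0;
}

# Hook for 'invoke-on-followed-url': just log what was visited.
sub log_followed_url {
    my ($robot, $hook_name, $url) = @_;
    print STDERR "visited: $url\n";
    return;
}

# Register the hooks only if the module is installed.
if (eval { require WWW::Robot; 1 }) {
    my $robot = WWW::Robot->new(
        'NAME'    => 'MyRobot',
        'VERSION' => '1.000',
        'EMAIL'   => 'fred@foobar.com',
    );
    if ($robot) {
        # addHook() returns 1 on success, undef for an unknown hook name
        # or a non-CODE reference (see the addHook source listing).
        $robot->addHook('follow-url-test',        \&follow_url_test)
            or die "could not register follow-url-test\n";
        $robot->addHook('invoke-on-followed-url', \&log_followed_url);
    }
}
```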
| addUrl | code | prev | next | Top |
$robot->addUrl( $url1, ..., $urlN );
Used to add one or more URLs to the queue for the robot. |
| getAttribute | code | prev | next | Top |
$value = $robot->getAttribute('attribute-name');
|
| proxy, no_proxy, env_proxy | code | prev | next | Top |
| These are convenience functions for setting proxy information on the User Agent being used to make the requests.
$robot->proxy( protocol, proxy );
Used to specify a proxy for the given scheme. |
| run | code | prev | next | Top |
$robot->run( LIST );
Invokes the robot, initially traversing the root URLs provided in LIST, |
| setAttribute | code | prev | next | Top |
$robot->setAttribute( ... attribute-value-pairs ... );
Change the value of one or more robot attributes. |
| DESTROY | description | prev | next | Top |
sub DESTROY
{
my $self = shift;
unlink $self->{'WORKFILE'} if defined $self->{'WORKFILE'};
} |
| addHook | description | prev | next | Top |
sub addHook
{
my $self = shift;
my $hook_name = shift;
my $hook_fn = shift;
if (!exists $SUPPORTED_HOOKS{$hook_name})
{
$self->warn("unknown hook name passed to addHook(). Ignoring it!",
"Hook Name: $hook_name");
return undef;
}
if (ref($hook_fn) ne 'CODE')
{
$self->warn("not function reference passed to addHook(). Ignoring.",
"Hook Name: $hook_name");
return undef;
}
if (exists $self->{'HOOKS'}->{$hook_name})
{
push(@{ $self->{'HOOKS'}->{$hook_name} }, $hook_fn);
}
else
{
$self->{'HOOKS'}->{$hook_name} = [$hook_fn];
}
return 1;
} |
| addUrl | description | prev | next | Top |
sub addUrl
{
my $self = shift;
my @list = @ARG;
my $status = 1;
my $url;
my $urlObject;
foreach my $newurl (@list)
{
$url = $newurl;
# check if we already visited this URL} |
| check_protocol | description | prev | next | Top |
sub check_protocol
{
my $self = shift;
my $structure = shift;
my $url = shift;
my $noindex = 0;
my $nofollow = 0;
#-------------------------------------------------------------------} |
| create_agent | description | prev | next | Top |
sub create_agent
{
my $self = shift;
foreach my $i (@INC) {
print STDERR "$i\n";
}
print STDERR "**\n";
eval { $self->{'AGENT'} = new LWP::RobotUA('Poacher', $EMAIL) };
if (!$self->{'AGENT'})
{
$self->warn("failed to create User Agent object.",
"Error: $EVAL_ERROR\n");
return undef;
}
return 1;
} |
| env_proxy | description | prev | next | Top |
sub env_proxy
{
my $self = shift;
return $self->{'AGENT'}->env_proxy();
} |
| extract_links | description | prev | next | Top |
sub extract_links
{
my $self = shift;
my $url = shift;
my $response = shift;
my $structure = shift;
my $filename = shift;
my $link;
my @link;
my $element; # of type HTML::Element} |
| getAttribute | description | prev | next | Top |
sub getAttribute
{
my $self = shift;
my $attribute = shift;
if (!exists $ATTRIBUTES{$attribute})
{
$self->warn("unknown attribute passed to getAttribute()",
"Attribute: $attribute",
"Returning: undef");
return undef;
}
return $self->{$attribute};
} |
| get_url | description | prev | next | Top |
sub get_url
{
my $self = shift;
my $url = shift;
my $request = HTTP::Request->new('GET',$url);
if (not defined $request) { print "REQUEST NOT DEFINED\n"; }
my $filename = $self->{'WORKFILE'};
my $response;
my $fh;
my $structure;
#---------------------------------------------------------------------} |
| initialise | description | prev | next | Top |
sub initialise
{
my $self = shift;
my $options = shift;
my $attribute;
$self->create_agent() || return undef;
#---------------------------------------------------------------------} |
| invoke_hook_functions | description | prev | next | Top |
sub invoke_hook_functions
{
my $self = shift;
my $hook_name = shift;
my @argv = @ARG;
my $result = 0;
my $hookfn;
return $result unless exists $self->{'HOOKS'}->{$hook_name};
foreach $hookfn (@{ $self->{'HOOKS'}->{$hook_name} })
{
$result |= &$hookfn($self, $hook_name, @argv);
}
return $result;
} |
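Because the results are combined with `|=`, a test hook succeeds as soon as any one of the registered functions returns a true (non-zero) value. The following standalone sketch mirrors that loop with a hypothetical `combine_hooks()` helper (not part of the module) and two made-up predicates.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the loop above: call every function
# and OR the results together, starting from 0.
sub combine_hooks {
    my ($hooks, @args) = @_;
    my $result = 0;
    foreach my $fn (@$hooks) {
        $result |= $fn->(@args);
    }
    return $result;
}

# Two illustrative predicates; each returns an explicit 1 or 0.
my $is_html  = sub { $_[0] =~ /\.html$/ ? 1 : 0 };
my $is_index = sub { $_[0] =~ m{/$}     ? 1 : 0 };

printf "%-28s => %d\n", $_, combine_hooks([$is_html, $is_index], $_)
    for 'http://example.com/', 'http://example.com/a.html', 'http://example.com/logo.png';
```

Since `|=` is a bitwise OR with a numeric left-hand side, hook functions are best written to return an explicit `1` or `0`; an `undef` or empty-string return value would trigger a warning under `-w`.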
| invoke_hook_procedures | description | prev | next | Top |
sub invoke_hook_procedures
{
my $self = shift;
my $hook_name = shift;
my @argv = @ARG;
my $hookfn;
return unless exists $self->{'HOOKS'}->{$hook_name};
foreach $hookfn (@{ $self->{'HOOKS'}->{$hook_name} })
{
&$hookfn($self, $hook_name, @argv);
}
return;
} |
| new | description | prev | next | Top |
sub new
{
my $class = shift;
my %options = @ARG;
my $object;
#-------------------------------------------------------------------} |
| next_url | description | prev | next | Top |
sub next_url
{
my $self = shift;
#-------------------------------------------------------------------} |
| no_proxy | description | prev | next | Top |
sub no_proxy
{
my $self = shift;
my @argv = @ARG;
return $self->{'AGENT'}->no_proxy(@argv);
} |
| pick_tmpdir | description | prev | next | Top |
sub pick_tmpdir
{
my $self = shift;
my @options = @ARG;
my $tmpdir;
my $ROBOT = $self->{'NAME'};
unshift(@options, $ENV{'TMPDIR'}) if exists $ENV{'TMPDIR'};
foreach $tmpdir (@options)
{
return $tmpdir if (-d $tmpdir && -w $tmpdir);
}
$self->warn("unable to find a temporary directory.",
"I tried: ".join(' ', @options));
return undef;
} |
| pre_run_check | description | prev | next | Top |
sub pre_run_check
{
my $self = shift;
#-------------------------------------------------------------------} |
| proxy | description | prev | next | Top |
sub proxy
{
my $self = shift;
my @argv = @ARG;
return $self->{'AGENT'}->proxy(@argv);
} |
| retrieve_cached_urls | description | prev | next | Top |
sub retrieve_cached_urls
{ my $self = shift;
my $filename = shift;
my $processed;
open (INPUT, "<$filename") or return; # nothing cached (file missing or unreadable)
while (<INPUT>) {
chomp;
if (/^http/) {
print "$_\n";
delete $urls_list{$_};
}
elsif (/^adding: (.*)\s*$/) {
$self->addUrl($1);
}
elsif (/^processing: (.*)\s*$/) {
if ($processed) {
delete $urls_list{$processed};
}
$processed = $1;
}
}
close INPUT;
} |
| run | description | prev | next | Top |
sub run
{
my $self = shift;
my @url_list = @ARG; # optional list of URLs} |
| setAttribute | description | prev | next | Top |
sub setAttribute
{
my $self = shift;
my @attrs = @ARG;
my $attribute;
my $value;
while (@attrs > 1)
{
$attribute = shift @attrs;
$value = shift @attrs;
if (!exists $ATTRIBUTES{$attribute})
{
$self->warn("unknown attribute in setAttribute() - ignoring it.",
"Attribute: $attribute");
next;
}
$self->set_attribute($attribute, $value);
}
$self->warn("odd number of arguments to setAttribute!") if @attrs > 0;
} |
| set_attribute | description | prev | next | Top |
sub set_attribute
{
my $self = shift;
my $attribute = shift;
my $new_value = shift;
$self->{$attribute} = $new_value;
if ($attribute eq 'IGNORE_TEXT')
{
#---------------------------------------------------------------} |
| verbose | description | prev | next | Top |
sub verbose
{
my $self = shift;
my @lines = @ARG;
print STDERR @lines if $self->{'VERBOSE'};
} |
| warn | description | prev | next | Top |
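sub warn
{
my $self = shift;
my @lines = @ARG; # all message lines, not just the first
my $me = ref $self;
my $line;
# first line is prefixed with the class name
print STDERR "$me: ", shift @lines, "\n";
# remaining lines are indented to line up under the first
foreach $line (@lines)
{
print STDERR ' ' x (length($me) +2), $line, "\n";
}
} |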
| QUESTIONS | Top |
| This section contains a number of questions. I'm interested in hearing what people think, and what you've done when faced with similar questions.
*(1) What style of API is preferable for setting attributes? Maybe something like the following:
$robot->verbose(1);
$traversal = $robot->traversal();
|
| CONSTRUCTOR | Top |
$robot = new WWW::Robot( |
| ROBOT ATTRIBUTES | Top |
| This section lists the attributes used to configure a Robot object. Attributes are set using the setAttribute() method, and queried using the getAttribute() method.
Some of the attributes must be set before you start the Robot (with the run() method). These are marked as mandatory in the list below.
NAME
The name of the Robot. This should be a sequence of alphanumeric characters, and is used to identify your Robot. This is used to set the User-Agent field of HTTP requests, and so will appear in server logs. mandatory
VERSION
The version number of your Robot. This should be a floating point number, in the format N.NNN. mandatory
EMAIL
A contact email address for the Robot owner, which can be used for example by someone who wishes to complain about the behavior of your robot. mandatory
VERBOSE
A boolean flag which specifies whether the Robot should display verbose status information as it runs. Default: 0 (false)
TRAVERSAL
Specifies what traversal style should be adopted by the Robot. Valid values are depth and breadth. Default: depth
REQUEST_DELAY
Specifies the delay (in minutes) between successive GETs from the same server. Default: 1
IGNORE_TEXT
Specifies whether the HTML structure passed to the invoke-on-contents hook function should include the textual content of the page, or just the HTML elements. Default: 1 (true) |
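A short configuration sketch using the attributes listed above. The values are illustrative, and `missing_mandatory()` is a hypothetical helper, not part of WWW::Robot; it only checks that the three mandatory attributes have values before the robot is created. The calls into the module are guarded so the snippet still runs where WWW::Robot is not installed.

```perl
use strict;
use warnings;

# Values are illustrative; the names and defaults come from the list above.
my %config = (
    'NAME'          => 'MyRobot',
    'VERSION'       => '1.000',
    'EMAIL'         => 'fred@foobar.com',
    'TRAVERSAL'     => 'breadth',   # default: depth
    'REQUEST_DELAY' => 2,           # minutes; default: 1
    'VERBOSE'       => 1,           # default: 0
);

# Hypothetical helper (not part of WWW::Robot): names of the mandatory
# attributes that have not been given a value.
sub missing_mandatory {
    my %attrs = @_;
    return grep { !defined $attrs{$_} } qw(NAME VERSION EMAIL);
}

my @missing = missing_mandatory(%config);
die "missing mandatory attributes: @missing\n" if @missing;

# Apply the settings only if the module is installed.
if (eval { require WWW::Robot; 1 }) {
    my $robot = WWW::Robot->new(map { $_ => $config{$_} } qw(NAME VERSION EMAIL));
    if ($robot) {
        # setAttribute() takes attribute-value pairs; unknown names are ignored.
        $robot->setAttribute(map { $_ => $config{$_} } qw(TRAVERSAL REQUEST_DELAY VERBOSE));
        print "traversal: ", $robot->getAttribute('TRAVERSAL'), "\n";
    }
}
```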
| SUPPORTED HOOKS | Top |
| This section lists the hooks which are supported by the WWW::Robot module. The first two arguments passed to a hook function are always the Robot object followed by the name of the hook being invoked. I.e. the start of a hook function should look something like: |