Clair::Utils

ALE



Summary
ALE - The Automatic Link Extrapolator

Package variables
No package variables defined.

Inherit
Exporter

Synopsis
ALE is a collection of tools and Perl libraries providing easy database access for indexing information about the links in HTML documents and retrieving information from those indices.
The basic process used is to give a series of documents to the ALE indexer, then ask questions with the command-line search tool or the Perl modules.

Description
To use the ALE classes in your program, you'll need to first tell Perl
where they are, with a line like this:
    use lib '/clair4/projects/crawl/wget/prog/ale';
After that, you can just use them like any other modules.
The only module you should use directly is Clair::ALE::Search.
That module will return Clair::ALE::Conn objects, which contain
one or more Clair::ALE::Link objects, which contain two
Clair::ALE::URL objects.
Internal modules you might be interested in if you are extending ALE
are Clair::ALE::Stemmer and Clair::ALE::_SQL.
The easiest way to begin using ALE is to pull in the environment variables from /clair4/projects/crawl/profile using a Bourne-like shell (sh, ksh, bash, zsh, etc.). You can do that with a command like:
    . /clair4/projects/crawl/profile
That will add the ALE tools to your path, and set other environment variables necessary to use ALE.
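After sourcing the profile, it can be worth confirming that the tools actually landed on your path. The following is a minimal sketch: missing_tools is a hypothetical helper name, not part of ALE, and by default it checks only for the three commands named in this document (aleget, alext, and ale).

```shell
# missing_tools [NAME...]: print each command that is not found on PATH.
# Hypothetical helper, not part of ALE. With no arguments it checks the
# three ALE commands described in this document.
missing_tools() {
    if [ "$#" -eq 0 ]; then
        set -- aleget alext ale
    fi
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running missing_tools with no arguments right after sourcing the profile should print nothing if all three commands were found.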
All ALE programs and libraries recognize a few environment variables which tell them where to store and look for their data. These can be set directly in the environment or by importing %ALE::ALE_ENV and setting them there, with the exception of MYSQL_UNIX_PORT.

    ALESPACE

    is the subdirectory where all data should be stored, and a prefix for
all directory names. If you are working with data independent of other
projects, you should try to set ALESPACE to something unique, perhaps
starting with your username. It defaults to ``default''.

    ALECACHEBASE

    determines the root of the location where ALE can find the documents
it is working with, in wget format. It defaults to $ALEBASE/cache.

In addition, ALE is built on a MySQL backend. Several MySQL
environment variables can further influence ALE's behavior.

    MYSQL_UNIX_PORT

    gives the path to the UNIX socket of the MySQL database that ALE
should use.
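The defaults described above can be made visible from the shell before running any ALE tool. This sketch only mirrors the documented defaults; ALE applies its own defaults internally. The name ale_show_env is hypothetical, and because this document does not say how ALEBASE is set, the function simply reports when it is missing.

```shell
# ale_show_env: print the values the ALE tools would be expected to see.
# Hypothetical helper that mirrors the documented defaults:
#   ALESPACE     -> "default"
#   ALECACHEBASE -> $ALEBASE/cache
ale_show_env() {
    if [ -n "${ALECACHEBASE:-}" ]; then
        cachebase=$ALECACHEBASE
    elif [ -n "${ALEBASE:-}" ]; then
        cachebase=$ALEBASE/cache
    else
        cachebase='(unknown: ALEBASE is not set)'
    fi
    printf 'ALESPACE=%s\n' "${ALESPACE:-default}"
    printf 'ALECACHEBASE=%s\n' "$cachebase"
    printf 'MYSQL_UNIX_PORT=%s\n' "${MYSQL_UNIX_PORT:-(unset)}"
}
```

Calling ale_show_env before and after setting ALESPACE is a quick way to check that you are working in your own space rather than in ``default''.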
aleget is a tool for fetching files to index from the Web. It is a
thin front-end to wget, which instructs wget to store files in the
place you specified in your environment variables. It gives some
default command-line options to wget, and you can also use any other
switches documented in wget.
alext is the ALE indexer. It takes one or more HTML files to index
on its command line, extracts the links from them, and puts them into
its index.
It expects all files to be in the $ALESPACE subdirectory of the
$ALECACHEBASE directory. A filename that starts with ``./'' is
treated as a path relative to that directory; any other filename is
treated as an absolute path, which should also point into that
directory. If you fetched your files with aleget, you won't have to
worry much about this.
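If you did not fetch your files with aleget, a small pre-flight check can catch filenames that alext would probably not find. The sketch below is one reading of the rule above, not alext's actual logic: ale_path_ok is a hypothetical helper, and it assumes ALECACHEBASE is set in the environment.

```shell
# ale_path_ok FILE: succeed if FILE matches the naming rule described above.
# Hypothetical helper; this is an interpretation of the documented rule,
# not code taken from alext.
ale_path_ok() {
    cache_dir="${ALECACHEBASE:?ALECACHEBASE must be set}/${ALESPACE:-default}"
    case "$1" in
        ./*)            return 0 ;;  # relative: assumed to be under the cache directory
        "$cache_dir"/*) return 0 ;;  # absolute and inside the cache directory
        *)              return 1 ;;  # anything else is outside where alext looks
    esac
}
```

For example, with ALECACHEBASE=/tmp/cache and ALESPACE=me, both ./site/index.html and /tmp/cache/me/site/index.html pass, while /etc/passwd does not.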
You can use alext -z to ``zap'' the tables in $ALESPACE,
removing all data stored there.
You will usually use alext in conjunction with find and xargs,
to easily pass it a large number of files to index. If you are using
GNU xargs, you can use the -P option to run multiple copies of
alext in parallel. For more information on using these commands, see
find and xargs.
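The find and xargs combination can be wrapped in a small function so the same pipeline is reusable across directories. This is a sketch under assumptions: index_tree is a hypothetical name, the optional second argument exists only so the pipeline can be tried with a harmless stand-in such as echo, and the batch size (20) and parallelism (5) are illustrative values. The -print0, -0, and -P options are extensions found in GNU find and xargs.

```shell
# index_tree DIR [INDEXER]: pass every regular file under DIR to the indexer.
# Hypothetical wrapper. INDEXER defaults to alext; pass "echo" for a dry run.
index_tree() {
    dir=$1
    indexer=${2:-alext}
    # Run from inside DIR so that find emits names starting with "./".
    # -print0 / -0 keep filenames with spaces intact; -n 20 sends 20 files
    # per invocation; -P 5 runs up to five invocations in parallel.
    ( cd "$dir" && find . -type f -print0 | xargs -0 -n 20 -P 5 "$indexer" )
}
```

A dry run such as `index_tree "$ALECACHEBASE/$ALESPACE" echo` prints the batches of filenames without touching the index.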
alext recognizes the standard environment variables; for more information, see Environment Variables.
Searching from the command-line
ale is the command-line searching tool. It takes many command-line
parameters; you can get a list of all of them by running ale
--help. Some of the more useful ones are:

    --source_url

    Only show links with this source URL. Also --no_source_url.

    --dest_url

    Only show links with this destination URL. Also --no_dest_url.

    --link_word

    Only show links with this word as part of the text that creates the link.

    --source1_url, --dest2_url, --link3_word, etc.

    Requests multi-link paths, with the first link having the specified
source URL, the second link having the specified destination URL, the
third link being created by the specified word, and so forth. These
queries have to look at a lot of links, and so can be much slower than
other queries.

    --limit

    Return at most the given number of results. Defaults to 10; use the
string ``none'' to retrieve all links.
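As an illustration of the numbered options, the sketch below asks for two-link paths whose first link starts at one URL and whose second link ends at another. The function name two_hop_links and the ALE_CMD override are hypothetical (ALE_CMD is not an ALE setting; it exists only so the call can be tried without a database), and passing a number to --limit with ``='' is assumed to work the same way as the ``none'' form.

```shell
# two_hop_links SRC DEST: request two-link paths from SRC to DEST.
# Hypothetical wrapper. Only options named in this document are used:
# --source1_url, --dest2_url, and --limit.
two_hop_links() {
    "${ALE_CMD:-ale}" "--source1_url=$1" "--dest2_url=$2" --limit=5
}
```

Because multi-link queries look at many links, keeping a small --limit while experimenting is a reasonable precaution.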
ale recognizes the standard environment variables.
The Perl modules do the same searches as the command-line tool ale,
but return the data in a native Perl format instead of as text. In
fact, the command-line tool is built on top of the Perl modules.
The Perl modules are well-documented. A good starting place to learn
more about them is ALE.

Methods description


None available.

Methods code


No methods available.

General documentation


EXAMPLES
Here's an example of indexing the links on the CLAIR Web site, and
asking a few questions about the links.
First, we log on to tangra and start up a Bourne-like shell (if you're
using bash, you don't have to do anything special).
Once we're logged on, we set up the ALE environment:
    . /clair4/projects/crawl/profile
and set up an ALESPACE environment variable so we are working in
our own private space
    ALESPACE=gifford_clair
    export ALESPACE
Now let's get the CLAIR Web site:
    aleget -r http://tangra.si.umich.edu/clair/index.html \
        -X /clair/nsir -D tangra.si.umich.edu
(as is generally true when using wget to crawl the Web, some
experimentation will be required to figure out what needs to be
excluded). This downloads about 20MB and takes 2.5 minutes.
With the Web pages in our local cache, we can now build an ALE index from them:
    cd /clair4/projects/crawl/var/alecache/gifford_clair
    find . -type f -print0 |
        xargs -P 5 -n 20 -0 nofail alext >/tmp/alext.out 2>&1
This takes about 5 minutes.
Now, we can ask questions using the command-line tool:
Search for all links containing the word ``mead'':
    ale --link1_word='mead' --limit=none
Search for all links that contain the word ``Jahna'', display up to 10:
    ale --link1_word='jahna'
Search for all links to www.aclweb.org, display up to 10:
    ale --dest_url 'http://www.aclweb.org'
Display all links from the Projects page:
    ale --source_url http://tangra.si.umich.edu/clair/home/projects.htm \
        --limit=none

SEE ALSO
You may also want to look at ALE, wget, find, xargs, and
mysql.

AUTHORS
ALE was written primarily by Scott Gifford, with input and assistance
from Dragomir Radev, Adam Winkel, and other members of the CLAIR group
at the University of Michigan School of Information.