To use the ALE classes in your program, you'll need to first tell Perl where they are, with a line like this:
use lib '/clair4/projects/crawl/wget/prog/ale';
After that, you can just use them like any other modules. The only module you should use directly is Clair::ALE::Search. That module will return Clair::ALE::Conn objects, which contain one or more Clair::ALE::Link objects, which contain two Clair::ALE::URL objects. Internal modules you might be interested in if you are extending ALE are Clair::ALE::Stemmer and Clair::ALE::_SQL.
The easiest way to begin using ALE is to pull in the environment variables from /clair4/projects/crawl/profile using a Bourne-like shell (sh, ksh, bash, zsh, etc.). You can do that with a command like:
. /clair4/projects/crawl/profile
That will add the ALE tools to your path, and set other environment variables necessary to use ALE.
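As a quick sanity check after sourcing the profile, a sketch like the following reports whether the three ALE command-line tools can be found. The function name check_ale_tools is made up for this illustration; only the profile path and the tool names come from this document.

```shell
#!/bin/sh
# Sketch only: check_ale_tools is a hypothetical helper, not part of ALE.

PROFILE=/clair4/projects/crawl/profile

# Source the profile only if it is readable on this machine.
if [ -r "$PROFILE" ]; then
    . "$PROFILE"
else
    echo "profile not readable: $PROFILE" >&2
fi

# Report, one line per tool, whether it can be found on PATH.
check_ale_tools() {
    for tool in aleget alext ale; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: not on PATH"
        fi
    done
}

check_ale_tools
```

If any tool is reported as not on PATH, the profile was probably not sourced in the current shell.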
All ALE programs and libraries recognize a few environment variables which tell them where to store and look for their data. These can be set directly in the environment or by importing %ALE::ALE_ENV and setting them there, with the exception of MYSQL_UNIX_PORT.
ALESPACE is the subdirectory where all data should be stored, and a prefix for all directory names. If you are working with data independent of other projects, you should try to set ALESPACE to something unique, perhaps starting with your username. It defaults to ``default''.
ALECACHEBASE determines the root of the location where ALE can find the documents it's working with, in wget format. It defaults to $ALEBASE/cache.
In addition, ALE is built on a MySQL backend. Several MySQL environment variables can further influence ALE's behavior.
MYSQL_UNIX_PORT gives the path to the UNIX socket on which the MySQL server that ALE should use is listening.
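Setting these variables directly in the environment might look like the sketch below. The variable names are the ones described above; the namespace suffix, the cache location, and the socket path are placeholder values chosen for illustration, not defaults shipped with ALE.

```shell
#!/bin/sh
# Sketch only: every value below is a hypothetical placeholder.

# A namespace unique to you, starting with your username.
ALESPACE="$(id -un)_linktest"

# Root of the wget-format document cache (hypothetical location).
ALECACHEBASE="$HOME/ale-cache"

# UNIX socket of the MySQL server ALE should use (hypothetical path).
MYSQL_UNIX_PORT="/tmp/mysql-ale.sock"

export ALESPACE ALECACHEBASE MYSQL_UNIX_PORT

# Files for this namespace live in the $ALESPACE subdirectory of
# $ALECACHEBASE.
echo "ALE data directory: $ALECACHEBASE/$ALESPACE"
```

Run these assignments in the same shell session, or place them in a file you source, before calling any ALE tool.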
aleget is a tool for fetching files to index from the Web. It is a thin front-end to wget that instructs wget to store files in the place you specified in your environment variables. It supplies some default command-line options to wget, and you can also use any other switches documented in wget.
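A fetch might be invoked roughly as in the sketch below. It assumes that aleget hands extra switches to wget unchanged, and it uses http://www.example.com/ as a stand-in for the site you want to crawl. The run helper and the DRY_RUN variable are hypothetical conveniences that print the command instead of executing it, so the sketch is safe to try on a machine without ALE installed.

```shell
#!/bin/sh
# Sketch only: run and DRY_RUN are hypothetical helpers, not part of ALE.

DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to really execute the command

# Print a command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# --recursive, --level and --no-parent are standard wget switches;
# aleget may already set some of them by default.
run aleget --recursive --level=2 --no-parent http://www.example.com/
```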
alext is the ALE indexer. It takes one or more HTML files to index on its command line, extracts the links from them, and puts them into its index. It expects all files to be in the $ALESPACE subdirectory of the $ALECACHEBASE directory. If a filename starts with ``./'' it is assumed to be a relative path and located in the proper directory, and otherwise it is assumed to be an absolute path which should be located in the proper directory. If you fetched your files with aleget, you won't have to worry much about this.
You can use alext -z to ``zap'' the tables in $ALESPACE, removing all data stored there.
You usually will use alext in conjunction with find and xargs, to easily pass it a large number of files to index. If you are using GNU xargs, you can use the -P option to run multiple copies of alext in parallel. For more information on using these commands, see find and xargs.
alext recognizes the standard environment variables; for more information, see Environment Variables.
Searching from the command-line
ale is the command-line searching tool. It takes many command-line parameters; you can get a list of all of them by running ale --help. Some of the more useful ones are:
--source_url Only show links with this source URL. Also --no_source_url.
--dest_url Only show links with this destination URL. Also --no_dest_url.
--link_word Only show links with this word as part of the text that creates the link.
--source1_url, --dest2_url, --link3_word, etc. Requests multi-link paths, with the first link having the specified source URL, the second link having the specified destination URL, the third link being created by the specified word, and so forth. These queries have to look at a lot of links, and so can be much slower than other queries.
--limit Return at most the given number of results. Defaults to 10; use the string ``none'' to retrieve all links.
ale recognizes the standard environment variables.
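The indexing and searching steps can be combined along the lines of the sketch below. It rests on several assumptions that this document does not confirm: that ale options take their value as the next argument, that the fetched pages end in .html, and that the example.com and example.org URLs are placeholders. The run and index_all helpers and the DRY_RUN variable are hypothetical; in dry-run mode (the default here) commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch only: run, index_all and DRY_RUN are hypothetical helpers.

DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to really execute the tools

# Print a command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Index every HTML file below directory $1, up to four alext processes
# at a time. -P is the GNU xargs option for parallel runs; -print0 and
# -0 keep unusual filenames intact.
index_all() {
    dir=$1
    if [ "$DRY_RUN" = 1 ]; then set -- echo alext; else set -- alext; fi
    ( cd "$dir" && find . -type f -name '*.html' -print0 |
        xargs -0 -n 50 -P 4 "$@" )
}

# Only index when the cache directory for this namespace exists.
data_dir="${ALECACHEBASE:-}/${ALESPACE:-default}"
if [ -n "${ALECACHEBASE:-}" ] && [ -d "$data_dir" ]; then
    index_all "$data_dir"
fi

# Links leaving one page, at most five of them.
run ale --source_url http://www.example.com/ --limit 5

# Every link whose link text contains the word "perl".
run ale --link_word perl --limit none

# Two-link paths: the first link starts at one URL, the second ends at another.
run ale --source1_url http://www.example.com/ --dest2_url http://www.example.org/
```

Because find runs from inside the data directory, every filename handed to alext starts with ``./'', so alext should treat it as a relative path.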
The Perl modules do the same searches as the command-line tool ale, but return the data in a native Perl format instead of as text. In fact, the command-line tool is built on top of the Perl modules. The Perl modules are well documented. A good starting place to learn more about them is ALE.