| Summary | Top |
| package Clair::Index: creates various indexes from supplied Clair::GenericDoc objects. |
| Package variables | Top |
| No package variables defined. |
| Included modules | Top |
| Clair::Config |
| Clair::Debug |
| Clair::GenericDoc |
| Data::Dumper |
| File::Path |
| Synopsis | Top |
| This module builds a positional inverted index for documents. The inverted index uses the terms in a document as the "key" for looking up the documents that contain them. Once the index is built, it can be used for various IR purposes. Building the index requires the following calls:
    use Clair::Index;
    my $idx = new Clair::Index(DEBUG => $DEBUG, stop_word_list => $stop_word_list);
    my $gdoc = new Clair::GenericDoc(DEBUG => $DEBUG, content => "/some/doc");
    $idx->insert($gdoc);
    ... insert more ...
    $idx->build();
By default, it chooses "mldbm" to store the constructed index hashes. |
| Description | Top |
| This package also uses runtime-loaded sub-modules to implement index writing and reading. Index writing takes the Perl hash structure and lays out its contents on the file system in a module-specific way. Similarly, index reading hides from the API user how the index content is read from the file system. See ./Index/mldbm.pm for an example.
This Index.pm module also supports a list of stop words when constructing the index. If a file containing a list of stop words is supplied, every word that appears in that list is excluded from index construction.
The inverted index is created from content extracted from Clair::GenericDoc objects. The document_id is assigned from an auto-increment counter, and the positions of each stemmed word are registered for each document in the index. Thus, this module implements the construction of a full positional inverted index. |
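The idea of a positional inverted index with stop-word filtering can be sketched in a few lines of plain Perl. This is a minimal illustration, not the module's own code: the helper name `build_positional_index`, the `term => { doc_id => [positions] }` layout, the whitespace-free `\w+` tokenizer, and the choice to count stop words when numbering positions are all assumptions, and stemming is left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clair::Index): build a positional
# inverted index from an array of plain strings.
sub build_positional_index {
    my ($docs, $stop_words) = @_;
    my %stop = map { lc($_) => 1 } @$stop_words;
    my %index;          # assumed layout: term => { doc_id => [ positions ] }
    my $doc_id = 0;     # auto-increment document counter

    for my $text (@$docs) {
        $doc_id++;
        my @tokens = map { lc($_) } ($text =~ /(\w+)/g);
        my $pos = 0;
        for my $term (@tokens) {
            $pos++;                      # assumption: stop words still consume a position
            next if $stop{$term};        # stop words never become index keys
            push @{ $index{$term}{$doc_id} }, $pos;
        }
    }
    return \%index;
}

my $index = build_positional_index(
    [ "the cat sat on the mat", "the dog sat" ],
    [ "the", "on" ],
);

# Look up which documents contain "sat" and where.
for my $id (sort { $a <=> $b } keys %{ $index->{sat} }) {
    print "sat: doc $id at positions @{ $index->{sat}{$id} }\n";
}
```

Keeping positions per document is what makes phrase and proximity queries possible later, at the cost of a larger index than a plain term-to-document map.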
| Methods | Top |
| _add_to_index | Description | Code |
| _load_rw_module | Description | Code |
| build | Description | Code |
| clean | Description | Code |
| index_read | Description | Code |
| index_write | Description | Code |
| init | Description | Code |
| insert | Description | Code |
| new | Description | Code |
| sync | Description | Code |
| _add_to_index | code | next | Top |
This subroutine is where the actual index construction happens. For each subdocument returned by the extract function of a Clair::GenericDoc object, it takes the contents and builds the internal hash structure. The internal hash structures are: |
| _load_rw_module | code | prev | next | Top |
A private function that loads the necessary index R/W modules at runtime. |
| build | code | prev | next | Top |
This subroutine loops through the $self->{documents} array, and for each |
| clean | code | prev | next | Top |
Cleans out the index directory specified under $self->{index_root}. |
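The directory removal can be tried in isolation with the legacy `rmtree` interface from File::Path, whose second and third arguments are the verbose and safe flags. The sketch below is an assumption-laden stand-in: `clean_dir` is a hypothetical function, and the temporary directory only plays the role of an index root.

```perl
use strict;
use warnings;
use File::Path qw(mkpath rmtree);
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical stand-in for the clean method: remove a directory tree
# and report whether it is gone afterwards.
sub clean_dir {
    my ($rootdir) = @_;
    return 0 unless defined $rootdir && -d $rootdir;   # nothing to clean
    rmtree($rootdir, 0, 1);      # legacy interface: verbose off, safe mode on
    return -d $rootdir ? 0 : 1;
}

# Create a throwaway directory tree to act as the index root.
my $base       = tempdir(CLEANUP => 1);
my $index_root = File::Spec->catdir($base, "index", "sub");
mkpath($index_root);

my $first  = clean_dir($index_root);   # the tree exists and is removed
my $second = clean_dir($index_root);   # already gone, so nothing happens
print "first=$first second=$second\n";
```

In safe mode `rmtree` skips entries the caller lacks permission to delete, so a caller that needs a guaranteed empty directory should check the result as the sketch does.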
| index_read | code | prev | next | Top |
A wrapper function that loads a submodule at runtime and reads the necessary index files. The return value is a hash. A third parameter acts as a boolean flag that tells the submodule whether a meta index or a regular inverted index is being read. |
| index_write | code | prev | next | Top |
A wrapper function that loads a submodule at runtime and passes the $self object to the underlying submodule routine that implements the actual writing to disk. |
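In the code listed for index_read and index_write, the value returned by _load_rw_module is the module name itself, so the subsequent call is a class-method call on a string. The sketch below shows that dispatch pattern with an in-memory backend. Everything in it is hypothetical: the package name `My::Backend::memory`, the `%store` hash, and the shape of the stored data are assumptions for illustration and do not describe what ./Index/mldbm.pm actually does.

```perl
use strict;
use warnings;

# Hypothetical backend standing in for a real R/W submodule.
package My::Backend::memory;

my %store;   # in-memory stand-in for files on disk

sub index_write {
    my ($class, $idx) = @_;
    $store{inverted_index} = $idx->{inverted_index};
    return 1;
}

sub index_read {
    my ($class, $token, $is_meta, $idx) = @_;
    # Assumption: a meta read returns an empty hash in this sketch.
    return $is_meta ? {} : ($store{inverted_index}{$token} || {});
}

package main;

# The backend is addressed by name, so "->" performs a class-method call.
my $backend = "My::Backend::memory";
my $idx     = { inverted_index => { cat => { 1 => [2] } } };

$backend->index_write($idx);
my $postings = $backend->index_read("cat", 0, $idx);
print "cat appears in doc 1 at position $postings->{1}[0]\n";
```

Because the wrapper only needs a name and two agreed method names, a new storage layout can be added without touching the calling code.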
| init | code | prev | next | Top |
Initializes a number of indexes by means of the sub-module's index_read call. The specified index file is fetched from disk and mapped into an internal hash structure. This lets you read the on-disk contents into memory to speed up later queries. |
| insert | code | prev | next | Top |
Takes an instantiated Clair::GenericDoc object and stores it in the internal array. It checks that the object passed in is blessed into the Clair::GenericDoc class. The internal array of Clair::GenericDoc objects is later used to construct the various index hashes. |
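The listed code compares `ref $doc_obj` with the string "Clair::GenericDoc", which accepts only objects blessed into exactly that class. The sketch below contrasts that test with an `isa`-based check that would also accept subclasses. It is an illustration of the difference, not a description of the module's intended behavior: the stand-in packages and the helper `is_generic_doc` are hypothetical and exist only so the example runs without the Clair distribution.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Minimal stand-ins so the sketch runs on its own.
{
    package Clair::GenericDoc;
    sub new { my $class = shift; return bless {}, $class; }
}
{
    package My::SubDoc;                    # hypothetical subclass
    our @ISA = ('Clair::GenericDoc');
}

# Hypothetical helper: true for Clair::GenericDoc objects and subclasses.
sub is_generic_doc {
    my ($obj) = @_;
    return (blessed($obj) && $obj->isa('Clair::GenericDoc')) ? 1 : 0;
}

my $doc = My::SubDoc->new;

# Exact-name comparison, as in the listed insert code: a subclass fails it.
my $exact = (ref($doc) eq 'Clair::GenericDoc') ? 1 : 0;

# Inheritance-aware check: the same object passes.
my $isa = is_generic_doc($doc);

print "exact=$exact isa=$isa\n";
```

Whether subclasses should be accepted is a design decision; the exact-name test is stricter and may be deliberate.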
| new | code | prev | next | Top |
The constructor understands the following significant hash key-values: |
| sync | code | prev | next | Top |
A simple wrapper around index_write, which in turn calls the submodule's implementation of index writing. After the index has been written, it saves the current_doc_id in order to support incremental index writing. |
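Persisting the last document id is what lets a later run continue numbering where the previous one stopped. The sketch below shows one way this could work, assuming the id is stored as a single line of text in a file named "last_doc_id" (the name matches `$self->{last_doc_id_filename}` in the constructor; the file format and the helpers `save_last_doc_id` and `load_last_doc_id` are hypothetical).

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: write the last assigned doc_id to the index directory.
sub save_last_doc_id {
    my ($dir, $doc_id) = @_;
    my $path = File::Spec->catfile($dir, "last_doc_id");
    open(my $fh, '>', $path) or die "cannot write $path: $!";
    print {$fh} "$doc_id\n";
    close($fh) or die "cannot close $path: $!";
    return $path;
}

# Hypothetical helper: read it back, or return 0 for a fresh index.
sub load_last_doc_id {
    my ($dir) = @_;
    my $path = File::Spec->catfile($dir, "last_doc_id");
    return 0 unless -e $path;
    open(my $fh, '<', $path) or die "cannot read $path: $!";
    my $doc_id = <$fh>;
    close($fh);
    chomp $doc_id;
    return $doc_id;
}

my $dir = tempdir(CLEANUP => 1);

my $before = load_last_doc_id($dir);         # no file yet
save_last_doc_id($dir, 42);                  # after a first sync
my $next_id = load_last_doc_id($dir) + 1;    # an incremental build resumes here
print "before=$before next_id=$next_id\n";
```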
| _add_to_index | description | prev | next | Top |
sub _add_to_index
{
my ($self, $subdocs) = @_;
# Inverted Index looks like:
} |
| _load_rw_module | description | prev | next | Top |
sub _load_rw_module
{
my ($self, $modname) = @_;
unless($self->{loaded_modules}->{$modname})
{
my $modfile = "$self->{rw_modules_root}/$modname.pm";
$self->debugmsg("loading $modfile for r/w operation", 1);
eval { require $modfile; };
$self->errmsg("failed to load $modfile: $@", 1) if $@;
$self->{loaded_modules}->{$modname} = $modname;
}
return $modname;} |
| build | description | prev | next | Top |
sub build
{
my ($self) = @_;
$self->{current_doc_id} = 1;
# if we are adding onto the same index, we need the last doc_id.} |
| clean | description | prev | next | Top |
sub clean
{
my ($self, $rootdir) = @_;
$rootdir = $self->{index_root} unless($rootdir);
return unless(-d $rootdir);
# legacy File::Path interface: verbose = 0, safe = 1
rmtree($rootdir, 0, 1);} |
| index_read | description | prev | next | Top |
sub index_read
{
my ($self, $modname, $token, $is_meta) = @_;
$modname = $self->{index_file_format} unless($modname);
my $modobj = $self->_load_rw_module($modname);
return $modobj->index_read($token, $is_meta, $self);} |
| index_write | description | prev | next | Top |
sub index_write
{
my ($self, $modname) = @_;
$modname = $self->{index_file_format} unless($modname);
my $modobj = $self->_load_rw_module($modname);
$modobj->index_write($self); # $self contains all the info we need} |
| init | description | prev | next | Top |
sub init
{
my ($self, %indexlist) = @_;
# initializing the index specified in the %indexlist hash.
} |
| insert | description | prev | next | Top |
sub insert
{
my ($self, $doc_obj) = @_;
my $refname = ref $doc_obj;
unless($refname eq "Clair::GenericDoc")
{
$self->errmsg("passed in object is not Clair::GenericDoc: $refname", 1);
}
if($DEBUG)
{
my $src = $doc_obj->{content};
# my $length = length $doc_obj->{content};} |
| new | description | prev | next | Top |
sub new
{
my ($proto, %args) = @_;
my $class = ref $proto || $proto;
my $self = bless {}, $class;
$DEBUG = $args{DEBUG} || $ENV{MYDEBUG};
$self->{stem_docs} = 1;
$self->{documents} = [];
$self->{last_doc_id_filename} = "last_doc_id";
# indexes} |
| sync | description | prev | next | Top |
sub sync
{
my ($self) = @_;
my $mod = $self->{index_file_format};
unless(scalar keys %{ $self->{inverted_index} })
{
$self->errmsg("nothing to sync to disk - no inverted index found", 1);
}
$self->index_write($mod);
# save the last doc_id} |
| AUTHOR | Top |
| JB Kim
jbremnant@gmail.com 20070407 |
| TODO | Top |
Write more submodules to output different index file layouts. |