| Summary | Top |
| package Clair::Info::Query A module that implements different types of queries. |
| Package variables | Top |
| No package variables defined. |
| Included modules | Top |
| Clair::Debug |
| Clair::GenericDoc |
| Clair::Index |
| Clair::StringManip |
| Data::Dumper |
| Synopsis | Top |
| This module contains the bulk of the query retrieval algorithm built on the inverted index. At its core, it initializes the necessary indexes from the Index.pm object and keeps the Perl hash data structures in memory for query processing. If an instantiated Index.pm object is not supplied to the constructor, the constructor attempts to instantiate one. The constructor initializes these indexes by default (this can be overridden): document_index, document_meta_index. The significant flags for the constructor are: required_indexes - an array reference containing the names of the indexes to initialize; default_query_logic - the name of the query logic subroutine, which defaults to fuzzy_or_merge. Once the Clair::Info::Query object is instantiated, various queries are possible. For example: use Clair::Info::Query; |
| Description | Top |
| While some query functions could be moved out of this module and placed elsewhere, this module implements the standard query retrieval functions for the inverted index. One highlight is its support for N-gram (phrase) search. |
| Methods | Top |
| _document_info | Description | Code |
| _load_index_for_word | Description | Code |
| _match_word_positions | Description | Code |
| _return_doc_for_ngram | Description | Code |
| _return_doc_for_token | Description | Code |
| document_content | Description | Code |
| document_frequency | Description | Code |
| document_title | Description | Code |
| fuzzy_or_merge | Description | Code |
| new | Description | Code |
| normalize_input | Description | Code |
| process_query | Description | Code |
| result_logic | Description | Code |
| term_frequency | Description | Code |
| words_frequency | Description | Code |
| _document_info | code | next | Top |
A private subroutine that looks up a document's metadata in the document_meta_index.
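The lookup can be illustrated with a minimal, standalone sketch. It reuses the token pattern and the two-element return value (status flag, then payload or error message) from the _document_info listing. The layout of %document_meta_index, the helper name lookup_document_info, and the guard for a missing document ID are assumptions made for illustration; the real index layout is not shown in this documentation.

```perl
use strict;
use warnings;

# Hypothetical metadata index: doc_id => { title => ..., path => ... }.
# The real layout of document_meta_index is an assumption here.
my %document_meta_index = (
    4562 => { title => 'Rats and mazes', path => 'docs/4562.txt' },
);

# Returns (1, $info) on success or (0, $error_message) on failure,
# mirroring the return convention of the _document_info listing.
sub lookup_document_info {
    my ($meta_index, $input) = @_;

    # Same token pattern as the listing: bare words or double-quoted
    # phrases, each optionally prefixed with "!".
    my @tokens = $input =~ m/(!{0,1}\w+|!{0,1}"[\w\s]+")/gs;
    s/["']//g      for @tokens;    # drop quote characters
    s/^\s+|\s+$//g for @tokens;    # trim surrounding whitespace

    my $doc_id = shift @tokens;
    # Added guard (not in the listing): empty input yields no doc id.
    return (0, "no document id given") unless defined $doc_id;
    return (0, "document with id '$doc_id' does not exist")
        unless exists $meta_index->{$doc_id};
    return (1, $meta_index->{$doc_id});
}

my ($ok,  $info) = lookup_document_info(\%document_meta_index, '4562');
my ($bad, $err)  = lookup_document_info(\%document_meta_index, '9999');
print $ok ? "title: $info->{title}\n" : "error: $info\n";
print "error: $err\n" unless $bad;
```

Callers check the first element before using the second, which is how document_title and document_content treat the return value in their listings.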
| _load_index_for_word | code | prev | next | Top |
This function is used to load a chunk of the entire inverted index on
| _match_word_positions | code | prev | next | Top |
This subroutine has the heart of the n-gram match algorithm. The input parameters |
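Since the body of _match_word_positions is not shown here, the following is only a sketch of one common way to do positional phrase matching, not the module's actual algorithm. It assumes a hypothetical positional index of the form word => { doc_id => [ token positions ] } and a hypothetical helper named match_phrase. A phrase matches when each successive word appears exactly one position after the previous one.

```perl
use strict;
use warnings;

# Hypothetical positional index: word => { doc_id => [ token positions ] }.
my %positions = (
    new  => { d1 => [ 0, 7 ], d2 => [ 3 ] },
    york => { d1 => [ 1 ],    d2 => [ 9 ] },
    city => { d1 => [ 2 ],    d2 => [ 10 ] },
);

# Returns { doc_id => number of phrase occurrences } for the candidate docs.
sub match_phrase {
    my ($index, $words, $doc_ids) = @_;
    my %hits;
    for my $doc (@$doc_ids) {
        # Skip documents that lack any of the words.
        next if grep { !( $index->{$_} && exists $index->{$_}{$doc} ) } @$words;

        # Try every position of the first word as a phrase start.
        START: for my $start (@{ $index->{ $words->[0] }{$doc} }) {
            for my $i (1 .. $#$words) {
                # Word $i must occur exactly $i positions after the start.
                my %at = map { $_ => 1 } @{ $index->{ $words->[$i] }{$doc} };
                next START unless $at{ $start + $i };
            }
            $hits{$doc}++;
        }
    }
    return \%hits;
}

my $hits = match_phrase(\%positions, [qw(new york city)], [qw(d1 d2)]);
print "$_: $hits->{$_}\n" for sort keys %$hits;
```

In this sample data only d1 contains the three words at consecutive positions; d2 contains all three words but not adjacent to each other, so it is not reported.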
| _return_doc_for_ngram | code | prev | next | Top |
This private routine does several things to narrow down and speed up our |
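The listing for this routine mentions getting the document list from the least frequent word. A minimal sketch of that idea follows, assuming a hypothetical postings structure of the form word => { doc_id => count } and a hypothetical helper named candidate_docs. Starting from the shortest postings list keeps the candidate set small before any positional check runs.

```perl
use strict;
use warnings;

# Hypothetical postings: word => { doc_id => occurrence count }.
my %postings = (
    the   => { d1 => 4, d2 => 6, d3 => 2, d4 => 1 },
    quick => { d1 => 1, d3 => 1 },
    fox   => { d3 => 2 },
);

# Returns the sorted doc ids that contain every word, starting from the
# word that occurs in the fewest documents.
sub candidate_docs {
    my ($index, $words) = @_;
    return [] if grep { !$index->{$_} } @$words;    # unknown word: no match

    my ($rarest, @others) = sort {
        scalar(keys %{ $index->{$a} }) <=> scalar(keys %{ $index->{$b} })
    } @$words;

    my @candidates = keys %{ $index->{$rarest} };
    for my $word (@others) {
        # Keep only candidates that also contain this word.
        @candidates = grep { exists $index->{$word}{$_} } @candidates;
    }
    return [ sort @candidates ];
}

my $docs = candidate_docs(\%postings, [qw(the quick fox)]);
print "candidates: @$docs\n";
```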
| _return_doc_for_token | code | prev | next | Top |
This subroutine, by default, handles combinations of single word queries. If it |
| document_content | code | prev | next | Top |
Given a document ID, returns the document content, either stemmed or unstemmed. |
| document_frequency | code | prev | next | Top |
Given user input, either a single term or a phrase, returns the number of documents |
| document_title | code | prev | next | Top |
Given a document ID, returns the document title. |
| fuzzy_or_merge | code | prev | next | Top |
Implements "fuzzy or" logic by returning all documents pertaining to query |
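The score merge visible in the fuzzy_or_merge listing can be tried out in isolation. The sketch below keeps the collection shape that the listing reads (token => { results => { doc_id => score } }) and sums scores per document. The sample scores, the helper name merge_scores, and the final ranking step are assumptions; the listing is cut off before it shows what happens to the merged scores.

```perl
use strict;
use warnings;

# Per-token results in the shape the listing reads:
# token => { results => { doc_id => score } }. The scores are made up.
my %collection = (
    rat  => { results => { d1 => 2, d2 => 1 } },
    maze => { results => { d2 => 3, d3 => 1 } },
);

sub merge_scores {
    my ($collection) = @_;
    my %scored;
    for my $tok (keys %$collection) {
        my $docs = $collection->{$tok}{results};
        $scored{$_} += $docs->{$_} for keys %$docs;    # sum scores per doc
    }
    return \%scored;
}

my $scored = merge_scores(\%collection);

# Ranking by descending score (ties broken by doc id) is an assumption.
my @ranked = sort { $scored->{$b} <=> $scored->{$a} || $a cmp $b } keys %$scored;
print "$_ => $scored->{$_}\n" for @ranked;
```

A document matching any token is kept, and documents matching several tokens accumulate a higher score, which is what makes the "or" fuzzy rather than strict.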
| new | code | prev | next | Top |
The constructor. It instantiates the Clair::Index.pm object by default and initializes |
| normalize_input | code | prev | next | Top |
A wrapper around the real subroutine implemented in the Clair::StringManip package. It also drops stop words when a stemmed stop-word list has been loaded.
| process_query | code | prev | next | Top |
|
| result_logic | code | prev | next | Top |
Again, a wrapper subroutine that runs one of the underlying subroutines |
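Because default_query_logic holds the name of a subroutine, one plausible way for such a wrapper to work is to dispatch on that name. The sketch below shows the general technique with a hypothetical class, My::QuerySketch, and a stand-in fuzzy_or_merge; it is an assumption about the approach, not the module's actual body, which is not shown in the listing.

```perl
use strict;
use warnings;

package My::QuerySketch;    # hypothetical class, not part of Clair

sub new {
    my ($class, %args) = @_;
    my $self = { default_query_logic => $args{default_query_logic} || 'fuzzy_or_merge' };
    return bless $self, $class;
}

# Stand-in merge strategy: returns doc ids in sorted order.
sub fuzzy_or_merge {
    my ($self, $collection) = @_;
    return [ sort keys %$collection ];
}

# Looks the strategy up by name and calls it; falls back to the default
# when no name is given, and fails loudly for unknown names.
sub result_logic {
    my ($self, $method, $collection) = @_;
    $method ||= $self->{default_query_logic};
    my $code = $self->can($method)
        or die "unknown query logic '$method'\n";
    return $self->$code($collection);
}

package main;

my $q      = My::QuerySketch->new;
my $result = $q->result_logic(undef, { d2 => 1, d1 => 1 });
print "@$result\n";

my $failed = !eval { $q->result_logic('no_such_logic', {}); 1 };
print "unknown name rejected\n" if $failed;
```

Resolving the name with can() avoids symbolic references and lets a subclass add new merge strategies without touching the dispatcher.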
| term_frequency | code | prev | next | Top |
Given user input, returns the number of times a particular term occurs in a document.
| words_frequency | code | prev | next | Top |
Given user input, either a single term or a phrase, determines the number of times the queried string |
| _document_info | description | prev | next | Top |
sub _document_info
{
my ($self, $input) = @_;
my @tokens = $input =~ m/(!{0,1}\w+|!{0,1}"[\w\s]+")/gs;
$_ =~ s/["']//g for @tokens;
$_ =~ s/^\s*|\s*$//g for @tokens;
my $doc_id = shift @tokens;
unless(exists $self->{document_meta_index}->{$doc_id})
{
return (0, "document with id '$doc_id' does not exist");
}
my $document_info = $self->{document_meta_index}->{$doc_id};
$self->debugmsg("document info for doc_id '$doc_id':", 1);
$self->debugmsg($document_info, 1);
return (1, $document_info);} |
| _load_index_for_word | description | prev | next | Top |
sub _load_index_for_word
{
my ($self, $word) = @_;
if(exists $self->{inverted_index}->{$word})
{
$self->debugmsg("already loaded index for word '$word'", 1);
return { $word => $self->{inverted_index}->{$word} };
}
my $index_chunk = $self->{index_object}->index_read($self->{index_object}->{index_file_format}, $word);
# add onto our index.} |
| _match_word_positions | description | prev | next | Top |
sub _match_word_positions
{
my ($self, $words, $docs) = @_;
my $word_count = scalar @$words;
my $last_index = scalar @$words - 1;
my %pos_matrix;
# the main algorithm for n-gram positional matching} |
| _return_doc_for_ngram | description | prev | next | Top |
sub _return_doc_for_ngram
{
my ($self, $words) = @_;
# get the document list from the least frequent word} |
| _return_doc_for_token | description | prev | next | Top |
sub _return_doc_for_token
{
my ($self, $token, $negation) = @_;
$self->debugmsg("searching '$token' in the index:", 1);
my %docs = ();
# my @words = split /\s+/, $token;} |
| document_content | description | prev | next | Top |
sub document_content
{
my ($self, $input, $strip_and_stem) = @_;
my ($errcode, $di) = $self->_document_info($input);
return $di unless($errcode); # $di in case of error is the errormsg;} |
| document_frequency | description | prev | next | Top |
sub document_frequency
{
my ($self, $input) = @_;
my $tokens = $self->normalize_input($input);
$input = $tokens->[0];
my $char = substr $input, 0, 1;
my $index = $self->_load_index_for_word($input);
my $docs = $self->_return_doc_for_token($input, $index);
$self->debugmsg($docs, 2);
return [ "document frequency: " . scalar keys %$docs ];} |
| document_title | description | prev | next | Top |
sub document_title
{
my ($self, $input) = @_;
my ($errcode, $di) = $self->_document_info($input);
return $di unless($errcode); # $di in case of error is the errormsg;} |
| fuzzy_or_merge | description | prev | next | Top |
sub fuzzy_or_merge
{
my ($self, $collection) = @_;
my %scored;
for my $tok (keys %$collection)
{
my $docs = $collection->{$tok}->{results};
$scored{$_} += $docs->{$_} for (keys %$docs); # merge scores} |
| new | description | prev | next | Top |
sub new
{
my ($proto, %args) = @_;
my $class = ref $proto || $proto;
my $self = bless {}, $class;
$DEBUG = $args{DEBUG} || $ENV{MYDEBUG};
$self->{index_obj} = "";
$self->{required_indexes} = [ qw/document_index document_meta_index/ ];
# word_index now deprecated} |
| normalize_input | description | prev | next | Top |
sub normalize_input
{
my ($self, $input, $no_stem) = @_;
my $strmanip = new Clair::StringManip(DEBUG => $DEBUG);
my $tokens = $strmanip->normalize_input($input, $no_stem);
if(UNIVERSAL::isa($self->{stop_word_list_stemmed}, "HASH"))
{
my @tmp = grep { ! $self->{stop_word_list_stemmed}->{$_} } @$tokens;
$tokens = \@tmp;
}
return $tokens;} |
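The stop-word step in the listing above can be tried on its own. In this sketch the tokenizer is a simple stand-in (lowercasing and splitting on non-word characters), because the real normalization in Clair::StringManip is not shown here; the stop-word set and the helper names simple_tokens and filter_stop_words are likewise hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stop-word set, keyed by word.
my %stop_words = map { $_ => 1 } qw(the a of);

# Stand-in tokenizer: lowercases and splits on non-word characters.
# The real normalization (for example, stemming) is not reproduced.
sub simple_tokens {
    my ($input) = @_;
    return [ grep { length } split /\W+/, lc $input ];
}

# Drops tokens found in the stop-word hash; passes tokens through
# unchanged when no hash is available, as the listing does.
sub filter_stop_words {
    my ($tokens, $stop) = @_;
    return $tokens unless ref($stop) eq 'HASH';
    return [ grep { !$stop->{$_} } @$tokens ];
}

my $tokens = filter_stop_words(simple_tokens('The heart of a rat'), \%stop_words);
print "@$tokens\n";
```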
| process_query | description | prev | next | Top |
sub process_query
{
my ($self, $input, $return_hash) = @_;
my $tokens = $self->normalize_input($input);
my %collection;
# for every token} |
| result_logic | description | prev | next | Top |
sub result_logic
{
my ($self, $method, $collection) = @_;
# implements different ways to score and prioritize query result} |
| term_frequency | description | prev | next | Top |
sub term_frequency
{
my ($self, $input) = @_;
my ($doc, $term) = split /\s+/, $input;
my $tokens = $self->normalize_input($term);
$term = shift @$tokens;
return [ "provide doc_id and term (eg: tf 4562 rat)" ] if(! $doc || ! $term);
my $d_index = $self->{document_index};
# print Dumper($d_index);} |
| words_frequency | description | prev | next | Top |
sub words_frequency
{
my ($self, $input) = @_;
my $tokens = $self->normalize_input($input);
$input = $tokens->[0];
my $char = substr $input, 0, 1;
my $index = $self->_load_index_for_word($input);
my $docs = $self->_return_doc_for_token($input, $index);
$self->debugmsg($docs, 2);
my $freq_count = 0;
for my $d (keys %$docs)
{
$freq_count += $docs->{$d};
}
return [ "frequency of token '$input': $freq_count" ];} |
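The difference between document_frequency and words_frequency comes down to counting keys versus summing values in the same lookup result. A minimal sketch, assuming the lookup returns a hash of the form doc_id => occurrences (the sample data is made up):

```perl
use strict;
use warnings;

# Hypothetical result of a token lookup: doc_id => occurrences in that doc.
my %docs = ( d1 => 3, d2 => 1, d5 => 2 );

# Document frequency: how many documents contain the token.
my $document_frequency = scalar keys %docs;

# Total frequency: occurrences summed over those documents.
my $total_frequency = 0;
$total_frequency += $_ for values %docs;

print "document frequency: $document_frequency\n";
print "total frequency: $total_frequency\n";
```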
| AUTHOR | Top |
| JB Kim jbremnant@gmail.com 20070407 |