Clair

StringManip



Summary
package Clair::StringManip
Most of the string manipulation routines required by other packages
are implemented here.

Package variables
No package variables defined.

Included modules
Clair::Debug
Data::Dumper
Lingua::Stem

Synopsis
String manipulations such as stripping meta characters and word stemming
are implemented here. To see how it works, pass in an arbitrary string:
	use Clair::StringManip;

	my $strmanip = Clair::StringManip->new();
	my $return = $strmanip->stem("operational operations operator");
	print $return . "\n";

Description
Other string-related functions will be implemented here. Each subroutine should
accept either a SCALAR or an ARRAY-ref as its input parameter, and its return
value should likewise be either a SCALAR or an ARRAY-ref.
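
The scalar-or-arrayref convention described above can be sketched as follows. This is a minimal standalone illustration, not code from the module: the name squeeze_words is hypothetical, and it assumes the same rule the methods below follow, where a true last parameter selects the array-ref return form.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clair::StringManip): accepts a string or
# an array reference and returns a string or an array reference.
sub squeeze_words {
	my ($items, $return_array) = @_;

	# Normalize the input to a list of non-empty words.
	my @words = ref($items) eq 'ARRAY' ? @$items : split(/\s+/, $items);
	@words = grep { length } @words;

	# A true last parameter selects the array-ref return form.
	return $return_array ? \@words : join(' ', @words);
}

my $as_string = squeeze_words("alpha  beta gamma");
my $as_array  = squeeze_words([qw(alpha beta)], 1);

print "$as_string\n";             # alpha beta gamma
print scalar(@$as_array), "\n";   # 2
```

Callers that need a list pass a true second argument; everyone else gets a plain string back.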

Methods
lowercase
new
normalize_input
stem
strip
tokenize

Methods description


lowercase
Lowercases the string.

new
The constructor. As with other modules, specify the DEBUG flag for
standardized debug printing:
	my $obj = Clair::StringManip->new(DEBUG => $DEBUG);

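normalize_input
Used for user query string processing. It parses the query string into
tokens: single words and double-quoted phrases, each optionally negated with
a leading '!'. The tokens are stemmed unless the second parameter is true.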
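
The tokenizing step can be tried in isolation. The sketch below uses patterns equivalent to those in the normalize_input code and leaves out the stemming pass; the name split_query and the sample query are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the tokenizing step (no stemming).
sub split_query {
	my ($input) = @_;

	# A token is a word or a double-quoted phrase, optionally prefixed by '!'.
	my @tokens = $input =~ m/(!?\w+|!?"[\w\s]+")/g;

	for (@tokens) {
		s/["']//g;        # drop the quotes around phrases
		s/^\s+|\s+$//g;   # trim surrounding whitespace
	}
	return \@tokens;
}

my $tokens = split_query('perl !python "string manipulation"');
print join('|', @$tokens), "\n";   # perl|!python|string manipulation
```

A quoted phrase stays together as one token, and the '!' prefix is kept on negated terms.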

stem
Takes either a string or an arrayref and stems the tokens (words)
using the Lingua::Stem module. Returns an arrayref if the last parameter
is true, and a space-joined string otherwise.

strip
Strips meta characters (anything other than word characters and whitespace)
from the string.
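
As a quick illustration, the same substitution used by strip can be run on its own. The name strip_meta and the sample string are hypothetical. Note that the underscore counts as a word character and is therefore kept.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the strip logic (no object needed).
sub strip_meta {
	my ($string) = @_;

	# Remove everything that is neither a word character nor whitespace.
	$string =~ s/[^\w\s]//g;
	return $string;
}

print strip_meta("Hello, world! (v2.0) foo_bar"), "\n";   # Hello world v20 foo_bar
```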

tokenize
Tokenizes the words by collapsing each run of whitespace into a single space.
Returns an arrayref of tokens if the last parameter is true, and the
normalized string otherwise.
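
A minimal sketch of both return forms, assuming the same substitution and split as the tokenize code; the name collapse_spaces and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the tokenize logic (no object needed).
sub collapse_spaces {
	my ($string, $return_array) = @_;

	# Collapse every run of whitespace (spaces, tabs, newlines) into one space.
	$string =~ s/\s+/ /g;

	# A true second argument selects the array-ref form.
	return $return_array ? [ split /\s+/, $string ] : $string;
}

my $text = "the   quick\tbrown\n fox";
print collapse_spaces($text), "\n";                  # the quick brown fox
print scalar @{ collapse_spaces($text, 1) }, "\n";   # 4
```

If the input starts with whitespace, the string form keeps one leading space and the array form starts with an empty token, because split with a regex pattern preserves a leading empty field.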

Methods code


lowercase
sub lowercase {
	my ($self, $string) = @_;
	
	return lc $string;
}

new
sub new {
	my ($proto, %args) = @_;
	my $class = ref $proto || $proto;

	my $self = bless {}, $class;
	$DEBUG = $args{DEBUG} || $ENV{MYDEBUG};

	$self->{lowercase} = 1;
	$self->{tokenize} = 1;
	$self->{stem} = 1;

	# overrides
	while ( my ($k, $v) = each %args ) {
		$self->{$k} = $v if defined $v;
	}

	return $self;
}

normalize_input
sub normalize_input {
	my ($self, $input, $no_stem) = @_;

	my @tokens = $input =~ m/(!{0,1}\w+|!{0,1}"[\w\s]+")/gs;
	$_ =~ s/["']//g for @tokens;
	$_ =~ s/^\s*|\s*$//g for @tokens;

	# parse the query and then stem
	unless ($no_stem) {
		my @prepend = ();
		my @tokens_no_neg = ();

		for my $t (@tokens) {
			my $first = substr $t, 0, 1;
			my $rest = substr $t, 1;
			# my $prepend = ($first eq '!') ? '!' : '';
			push @prepend, ($first eq '!') ? '!' : '';
			push @tokens_no_neg, ($first eq '!') ? $rest : $t;
		}

		@tokens_no_neg = @{ $self->stem(\@tokens_no_neg, 1) };
		for my $i (0..$#tokens_no_neg) {
			$tokens[$i] = $prepend[$i] . $tokens_no_neg[$i];
		}

		$self->debugmsg("normalized query input after stemming:", 1);
		$self->debugmsg(\@tokens, 1);
	}

	return \@tokens;
}

stem
sub stem {
	my ($self, $items, $return_array) = @_;
    
	# stem the words
	my $stemmer = Lingua::Stem->new(-locale => 'EN-US');
	$stemmer->stem_caching({ -level => 2 });

	my @words;
	if (UNIVERSAL::isa($items, "ARRAY")) {
		@words = @$items;
	} else {
		@words = split /\s+/, $items;
	}

	my @stemmed = @{$stemmer->stem(@words)};
	undef @words; # conserve memory
	@stemmed = grep { ! /^\s*$/ } @stemmed;

	return ($return_array) ? \@stemmed : join " ", @stemmed;
}

strip
sub strip {
	my ($self, $string) = @_;

	# strip all special chars - anything other than alpha-numeric or spaces
	$string =~ s/[^\w\s]//gs;

	return $string;
}

tokenize
sub tokenize {
	my ($self, $string, $return_array) = @_;

	# tokenize all the words - split by empty spaces
	$string =~ s/\s+/ /gs;

	if ($return_array) {
		my @tokens = split /\s+/, $string;
		return \@tokens;
	} else {
		return $string;
	}
}

General documentation


AUTHOR
JB Kim
jbremnant@gmail.com
20070407

TODOS

    Migrate the input normalizing function from Info::Query into this module.