| Summary | Package variables | Synopsis | Description | General documentation | Methods |
| Summary | Top |
package Clair::Features |
| Package variables | Top |
| No package variables defined. |
| Included modules | Top |
| Clair::Debug |
| Clair::GenericDoc |
| Data::Dumper |
| File::Path |
| Synopsis | Top |
We want to receive a collection of Clair::GenericDoc objects and convert their parsed and stemmed words into feature vectors. In addition, the module should carry out feature selection using the chi-squared algorithm.
use Clair::Features;
my $fea = Clair::Features->new(DEBUG => $DEBUG);
my $gdoc = Clair::GenericDoc->new(DEBUG => $DEBUG, content => "/some/doc");
$fea->register($gdoc);
# ... insert more ...
$fea->select(); |
| Description | Top |
This module should also provide the ability to output a feature_file containing the chi-squared scores of all the words. One caveat about generating a feature list with associated weights is that a unique ID must be constructed for each feature, and these IDs must be retained across both the training data and the test data. In other words, a given feature ID in the test data must refer to the same feature as it does in the training set. |
| Methods | Top |
| _chi_squared_binary | Description | Code |
| _input_svm_light_format | Description | Code |
| _output_svm_light_format | Description | Code |
| chi_squared | Description | Code |
| input | Description | Code |
| new | Description | Code |
| output | Description | Code |
| register | Description | Code |
| save_features | Description | Code |
| select | Description | Code |
| _chi_squared_binary | code | next | Top |
An implementation of the chi-squared computation that assumes binary classification. This subroutine is called by the public chi_squared subroutine. Another private routine of this type can be implemented for multivariate chi-squared feature weight calculation. |
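As an illustration of the binary computation described above, here is a minimal standalone sketch. The name chi_squared_2x2 is hypothetical, and returning 0 when the denominator is zero is an assumption; this page does not show how the module handles that case.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's own code: chi-squared score for one
# term from a 2x2 contingency table indexed as $k->{class}{term_present}.
sub chi_squared_2x2 {
    my ($k, $n) = @_;
    my $numerator = $n * ($k->{1}{1} * $k->{0}{0} - $k->{1}{0} * $k->{0}{1}) ** 2;
    my $denominator = ($k->{1}{1} + $k->{1}{0}) * ($k->{0}{1} + $k->{0}{0})
                    * ($k->{1}{1} + $k->{0}{1}) * ($k->{1}{0} + $k->{0}{0});
    # Assumption: a zero marginal total carries no information, so score it 0.
    return 0 if $denominator == 0;
    return $numerator / $denominator;
}

# 100 documents: the term appears in 40 of 50 class-1 docs and 10 of 50 class-0 docs.
my %k = (0 => { 0 => 40, 1 => 10 }, 1 => { 0 => 10, 1 => 40 });
my $score = chi_squared_2x2(\%k, 100);
printf "%.2f\n", $score;    # prints 36.00
```

A higher score means the term's presence is more strongly associated with one of the two classes.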
| _input_svm_light_format | code | prev | next | Top |
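The inverse of the output method. It reads in svm_light data and constructs a Perl data structure: an array reference of hashes, each holding the class, the trailing comment, and a feature id => score hash for one line. |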
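The parsing step can be sketched for a single line as follows. The helper name parse_svm_light_line is hypothetical, and unlike the module's routine this sketch drops the leading "#" from the comment.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one "class id:score ... # comment" line into
# a hash with the same three keys the module's reader produces.
sub parse_svm_light_line {
    my ($line) = @_;
    # Split off an optional trailing comment introduced by '#'.
    my ($dataline, $comment) = split /\s*#\s*/, $line, 2;
    my ($class_id, @pairs) = split ' ', $dataline;
    my %features = map {
        my ($id, $score) = split /:/, $_, 2;
        ($id => $score);
    } @pairs;
    return { class => $class_id, comment => $comment // '', features => \%features };
}

my $rec = parse_svm_light_line("+1 3:0.5 7:1.25 # doc42");
print "$rec->{class} $rec->{features}{7} $rec->{comment}\n";    # prints "+1 1.25 doc42"
```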
| _output_svm_light_format | code | prev | next | Top |
Prints the lines in this format: |
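The format itself is not reproduced on this page. As a hedged sketch, the layout below is the common SVM light line layout (class label, then id:value pairs in ascending id order, then an optional "# comment"), which is also what _input_svm_light_format expects to read. The helper name format_svm_light_line is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: build one SVM light line from a class label,
# a feature id => value hash, and an optional comment.
sub format_svm_light_line {
    my ($class_id, $features, $comment) = @_;
    # SVM light expects feature ids in increasing numeric order.
    my @pairs = map { "$_:$features->{$_}" } sort { $a <=> $b } keys %$features;
    my $line = join ' ', $class_id, @pairs;
    $line .= " # $comment" if defined $comment && length $comment;
    return $line;
}

my $line = format_svm_light_line(1, { 7 => 1.25, 3 => 0.5 }, "doc42");
print "$line\n";    # prints "1 3:0.5 7:1.25 # doc42"
```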
| chi_squared | code | prev | next | Top |
Implements the chi-squared feature selection algorithm. Here are the definitions
for the values in the contingency table:
k00 = number of docs in class 0 not containing term t
k01 = number of docs in class 0 containing term t
k10 = number of docs in class 1 not containing term t
k11 = number of docs in class 1 containing term t
The contingency table per feature (word), with the class C as rows and the term indicator I_t as columns:
          I_t
      |  0    1
  ---------------
C  0  | k00  k01
   1  | k10  k11
The following routine loops through the nested hashes in $self->{features_global}
and constructs the variables mentioned above. |
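For reference, the binary chi-squared statistic computed from this table can be written as below. This restates the arithmetic in _chi_squared_binary, under the assumption that N is the total number of registered documents, N = k00 + k01 + k10 + k11; the listing receives it as $n without showing how it is derived.

```latex
\chi^2(t) =
  \frac{N \,\bigl(k_{11}k_{00} - k_{10}k_{01}\bigr)^2}
       {(k_{11} + k_{10})\,(k_{01} + k_{00})\,(k_{11} + k_{01})\,(k_{10} + k_{00})}
```

The four factors in the denominator are the marginal totals: documents in class 1, documents in class 0, documents containing t, and documents not containing t.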
| input | code | prev | next | Top |
Reads in a previously generated document feature vector file. By default it uses the SVM light format reader (_input_svm_light_format). |
| new | code | prev | next | Top |
The constructor. Initializes several container hashes for later use.
In case of $self->{mode} eq "test", it will attempt to read in the
features file and create a mapping between the feature id and the
actual word associated with it. |
| output | code | prev | next | Top |
This subroutine writes the feature vectors to the specified text file. The default is the SVM light format. For a test dataset, it uses the feature name => id mapping produced from the training data so that the feature IDs stay consistent. |
| register | code | prev | next | Top |
Takes instantiated GenericDoc objects and stores the extracted features in
internal data structures. It verifies that the object passed in is blessed into
Clair::GenericDoc and reports an error otherwise.
If the $self->{document_limit} variable is set, the subroutine will simply return
without adding the content to the internal hashes when the document registration
limit is reached. |
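The feature extraction step amounts to counting whitespace-separated tokens in the parsed content. A minimal sketch, with the hypothetical helper name count_terms:

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each whitespace-separated token
# occurs in an already parsed and stemmed string.
sub count_terms {
    my ($text) = @_;
    my %tf;
    # grep drops the empty leading field that split yields for leading whitespace.
    $tf{$_}++ for grep { length } split /\s+/, $text;
    return \%tf;
}

my $tf = count_terms(" cat dog cat bird ");
print join(", ", map { "$_=$tf->{$_}" } sort keys %$tf), "\n";    # prints "bird=1, cat=2, dog=1"
```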
| save_features | code | prev | next | Top |
In training mode, you need to save the features to a file so that the mapping of features to numeric IDs can be retained for the test data. This subroutine writes that file for later use, one feature per line. The line number of each feature is its ID, starting at 1. |
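The round trip between the training and test phases can be sketched as follows. The helper names write_features and read_feature_map are hypothetical and stand in for what save_features and the test-mode constructor are described as doing; the temporary file is only for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helpers, not part of Clair::Features.

# Write one feature per line; the line number becomes the feature ID.
sub write_features {
    my ($features, $path) = @_;
    open my $fh, '>', $path or die "cannot open '$path' for writing: $!";
    print {$fh} "$_\n" for @$features;
    close $fh or die "cannot close '$path': $!";
}

# Rebuild the feature => ID map from the file, with IDs starting at 1.
sub read_feature_map {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open '$path': $!";
    my %map;
    while (my $word = <$fh>) {
        chomp $word;
        $map{$word} = $.;    # $. is the current line number of $fh
    }
    close $fh;
    return \%map;
}

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;
write_features([qw(apple banana cherry)], $path);
my $map = read_feature_map($path);
print "banana => $map->{banana}\n";    # prints "banana => 2"
```

Because the ID is implied by line position, the file must not be reordered between training and testing.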
| select | code | prev | next | Top |
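Takes the internal data structures and extracts the desired features using the default (chi-squared) feature selection algorithm. Features are ranked by score in descending order and, if a limit is given, only the top-scoring features up to that limit are kept. |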
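The ranking and truncation can be sketched on a plain score hash. The helper name top_features is hypothetical, and the alphabetical tie-break is an assumption added to make the order deterministic.

```perl
use strict;
use warnings;

# Hypothetical helper: return feature names ordered by descending score,
# truncated to $limit entries when a limit is given.
sub top_features {
    my ($scores, $limit) = @_;
    my @ordered = sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b } keys %$scores;
    # Guard against a limit past the end of the list, which splice would warn about.
    splice @ordered, $limit if $limit && $limit < @ordered;
    return \@ordered;
}

my $top = top_features({ apple => 1.5, banana => 36, cherry => 7.2 }, 2);
print "@$top\n";    # prints "banana cherry"
```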
| _chi_squared_binary | description | prev | next | Top |
sub _chi_squared_binary
{
    my ($self, $k_ref, $n, $feature) = @_;
    unless(UNIVERSAL::isa($k_ref, "HASH"))
    {
        $self->errmsg("the first param has to be a hash ref containing values of the contingency table", 1);
    }
    my %k = %{$k_ref};
    my $numerator = $n * ( $k{1}{1} * $k{0}{0} - $k{1}{0} * $k{0}{1} ) ** 2;
    my $denominator = ($k{1}{1} + $k{1}{0}) * ($k{0}{1} + $k{0}{0}) *
                      ($k{1}{1} + $k{0}{1}) * ($k{1}{0} + $k{0}{0});
    # this means ($k{1}{0} + $k{0}{0}) == 0. In other words, all documents of both classes have this word.
} |
| _input_svm_light_format | description | prev | next | Top |
sub _input_svm_light_format
{
    my ($self, $file) = @_;
    # three-argument open with a lexical filehandle
    open my $inf, "<", $file or $self->errmsg("cannot open '$file': $!", 1);
    my @vectors = <$inf>;
    close $inf;
    chomp @vectors;
    my @data = ();
    for my $v (@vectors)
    {
        # split each line into the data part and an optional trailing "# comment"
        if($v =~ /^([^#]+)\s*(#{0,1}.*)$/)
        {
            my ($dataline, $comment) = ($1, $2);
            my ($class_id, @feature_value) = split /\s+/, $dataline;
            my %hash;
            for my $fv (@feature_value)
            {
                my ($feature_id, $score) = split ":", $fv;
                $hash{$feature_id} = $score;
            }
            push @data, { class => $class_id, comment => $comment, features => \%hash };
        }
    }
    $self->debugmsg(\@data, 1);
    return \@data;
} |
| _output_svm_light_format | description | prev | next | Top |
sub _output_svm_light_format
{
    my ($self, $file, $features_map) = @_;
    # print "entering output with $file\n";
} |
| chi_squared | description | prev | next | Top |
sub chi_squared
{
    my ($self, $limit) = @_;
    unless($self->{features_global})
    {
        $self->errmsg("necessary \$self->{features_global} struct does not exist. Please 'register()' Clair::GenericDoc objects", 1);
    }
    my @classes = sort keys %{ $self->{features_global} };
    my %counts = ();
    my %k_val = (); # will contain values of the contingency table for all features
} |
| input | description | prev | next | Top |
sub input
{
    my ($self, $file, $algo) = @_;
    $algo = "_input_svm_light_format" unless($algo);
    unless($self->can($algo))
    {
        $self->errmsg("necessary func '$algo()' does not exist in this module", 1);
    }
    $self->$algo($file);
} |
| new | description | prev | next | Top |
sub new
{
    my ($proto, %args) = @_;
    my $class = ref $proto || $proto;
    my $self = bless {}, $class;
    $DEBUG = $args{DEBUG} || $ENV{MYDEBUG};
    # necessary data struct
} |
| output | description | prev | next | Top |
sub output
{
    my ($self, $file, $features_map, $algo) = @_;
    $algo = "_output_svm_light_format" unless($algo);
    unless($self->can($algo))
    {
        $self->errmsg("necessary func '$algo()' does not exist in this module", 1);
    }
    $self->$algo($file, $features_map);
} |
| register | description | prev | next | Top |
sub register
{
    my ($self, $doc_obj, $n) = @_;
    my $refname = ref $doc_obj;
    unless($refname eq "Clair::GenericDoc")
    {
        $self->errmsg("passed in object is not Clair::GenericDoc: $refname", 1);
    }
    $self->debugmsg("extracting content for Clair::GenericDoc object", 2);
    my $h = $doc_obj->extract()->[0];
    my @words = split /\s+/, $h->{parsed_content};
    my %features = ();
    $features{$_}++ for @words;
    my $group = $h->{GROUP};
    my $source = $h->{content_source};
    # skip if we are over the limit - inefficient since we have to parse the data anyway
} |
| save_features | description | prev | next | Top |
sub save_features
{
    my ($self, $features, $features_file) = @_;
    unless($features)
    {
        $self->errmsg("requires arrayref of features", 1);
    }
    $features_file = $self->{features_file} unless($features_file);
    open my $fh, ">", $features_file or $self->errmsg("cannot open '$features_file' for writing: $!", 1);
    print {$fh} "$_\n" for (@$features);
    close $fh;
    my $i = 1;
    $self->{features_map} = { map { $_ => $i++ } @$features };
    return $self->{features_map};
} |
| select | description | prev | next | Top |
sub select
{
    my ($self, $limit, $algo) = @_;
    $algo = "chi_squared" unless($algo);
    unless($self->can($algo))
    {
        $self->errmsg("necessary func '$algo()' does not exist in this module", 1);
    }
    $self->{feature_scores} = $self->$algo();
    my @ordered_features = reverse
        sort { $self->{feature_scores}->{$a} <=> $self->{feature_scores}->{$b} }
        keys %{$self->{feature_scores}};
    if($limit)
    {
        splice @ordered_features, $limit;
    }
    # in case of train mode, you will need to save the features.
} |
| AUTHOR | Top |
JB Kim jbremnant@gmail.com |