| Summary | Top |
| Ngram - extract and prune N-grams from documents |
| Package variables | Top |
| No package variables defined. |
| Included modules | Top |
| Carp |
| Clair::Utils::TFIDFUtils qw ( split_words lc_words ) |
| Storable qw ( nstore retrieve ) |
| Inherit | Top |
| Exporter |
| Synopsis | Top |
use Clair::Cluster;
| Description | Top |
| The Ngram package provides functionality for the extraction of N-grams from text and HTML documents. The resulting N-gram dictionary can optionally be pruned of low-frequency N-grams before being written to a human-readable text file and/or serialized to a network-ordered Storable file. |
| Methods | Top |
| delete_belowthresholds | No description | Code |
| dump_ngramdict | No description | Code |
| enforce_count_thresholds | No description | Code |
| extract_ngrams | No description | Code |
| load_ngramdict | No description | Code |
| sort_ngrams | No description | Code |
| traverse_ngramdict | No description | Code |
| write_ngram_counts | No description | Code |
| write_ngrams_fromdict | No description | Code |
| delete_belowthresholds | description | prev | next | Top |
sub delete_belowthresholds
{ my $r_ngramdict = shift;
my $r_ngram = shift;
my $r_args = shift;
my ($mincount, $topcount, $r_counts) = @$r_args;
foreach my $term (keys %$r_ngramdict) {
if (defined $mincount and $mincount > 1 and $r_ngramdict->{$term} < $mincount) {
delete $r_ngramdict->{$term};
} elsif (defined $topcount) {
push @$r_counts, $r_ngramdict->{$term}
}
}} |
| dump_ngramdict | description | prev | next | Top |
sub dump_ngramdict
{
my %params = @_;
my $N = $params{N};
my $r_ngramdict = $params{ngramdict};
my $file = $params{outfile};
nstore([$N, $r_ngramdict], $file) or croak "\nUnable to serialize n-gram dictionary to file $file";} |
| enforce_count_thresholds | description | prev | next | Top |
sub enforce_count_thresholds
{ my %params = @_;
my $N = $params{N};
my $r_ngramdict = $params{ngramdict};
my $mincount = $params{mincount};
my $topcount = $params{topcount};
my @counts;
# Prune away N-grams below minimum count threshold} |
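The listing above stops before the pruning logic, so the following is only a minimal sketch of how a minimum-count threshold and a top-count threshold can be combined. It assumes a flat hash of N-gram strings to counts rather than the module's nested dictionary, and it assumes that N-grams tied with the cutoff count are kept. The helper name `prune_counts` is hypothetical and not part of Clair::LM::Ngram.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clair::LM::Ngram): prune a flat
# { "n-gram string" => count } hash in place.
sub prune_counts {
    my ($r_counts, %opt) = @_;
    my $threshold = defined $opt{mincount} ? $opt{mincount} : 1;

    if (defined $opt{topcount} and $opt{topcount} > 0) {
        # Count held by the topcount-th most frequent n-gram.
        # Assumed tie rule: every n-gram tied with that count is kept.
        my @sorted = sort { $b <=> $a } values %$r_counts;
        if (@sorted > $opt{topcount}) {
            my $cutoff = $sorted[ $opt{topcount} - 1 ];
            $threshold = $cutoff if $cutoff > $threshold;
        }
    }

    # keys() returns a full list up front, so deleting inside the loop is safe.
    for my $ngram (keys %$r_counts) {
        delete $r_counts->{$ngram} if $r_counts->{$ngram} < $threshold;
    }
    return $r_counts;
}

my %counts = ('the cat' => 5, 'cat sat' => 3, 'sat on' => 3, 'on the' => 1);
prune_counts(\%counts, mincount => 2, topcount => 1);
print join(", ", sort keys %counts), "\n";
```

Because both constraints reduce to a single lower bound on the count, taking the larger of the two bounds gives the same result whichever constraint is considered first.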
| extract_ngrams | description | prev | next | Top |
sub extract_ngrams
{
my %params = @_;
my $r_cluster = $params{cluster};
my $r_ngramdict = $params{ngramdict};
my $N = $params{N};
my $format = $params{format};
my $stem = $params{stem};
my $segment = $params{segment};
my $verbose = $params{verbose};
print "Stripping html markup ...\n" if ($verbose and defined $format and $format eq "html");
$r_cluster->strip_all_documents() if (defined $format and $format eq "html");
print "Stemming ...\n" if ($verbose and $stem);
$r_cluster->stem_all_documents() if $stem;
print $r_cluster->count_elements . " documents in cluster\n";
my $cnt = 0;
foreach my $r_doc (values %{$r_cluster->{documents}}) {
#print "Extracting $N-grams from ", $r_doc->get_id(), " ...\n" if ($verbose);} |
| load_ngramdict | description | prev | next | Top |
sub load_ngramdict
{
my %params = @_;
my $file = $params{infile};
my ($N, $r_ngramdict) = @{retrieve($file)} or croak "\nUnable to restore serialized n-gram dictionary from file $file";
return ($N, $r_ngramdict);} |
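As an illustration of the serialization format used by dump_ngramdict and load_ngramdict, here is a small self-contained round trip through Storable. The toy dictionary and the temporary file are assumptions made for the example; only `nstore`, `retrieve`, and the `[$N, $r_ngramdict]` layout come from the listings above.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

# Toy 2-level dictionary: first term => { second term => count }.
my $N = 2;
my %ngramdict = (
    the => { cat => 2, dog => 1 },
    cat => { sat => 1 },
);

# Temporary file for the example; removed automatically at exit.
my ($fh, $file) = tempfile(UNLINK => 1);
close $fh;

# Same [N, dictionary] layout that dump_ngramdict passes to nstore.
nstore([$N, \%ngramdict], $file)
    or die "Unable to serialize n-gram dictionary to $file";

# Same unpacking that load_ngramdict applies to the result of retrieve.
my ($n_restored, $r_restored) = @{ retrieve($file) };

print "N = $n_restored, count(the cat) = $r_restored->{the}{cat}\n";
```

Storing N alongside the dictionary means a reader of the file does not have to infer the depth of the nested hash.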
| sort_ngrams | description | prev | next | Top |
sub sort_ngrams
{
my $r_ngramdict = shift;
my $r_ngram = shift;
my $r_args = shift;
my ($r_ngrams) = @$r_args;
foreach my $term (keys %$r_ngramdict) {
push @$r_ngrams, [join(" ", (@$r_ngram, $term)), $r_ngramdict->{$term}];
}} |
| traverse_ngramdict | description | prev | next | Top |
sub traverse_ngramdict
{
my $N = shift;
my $r_ngramdict = shift;
my $r_ngram = shift;
my $r_hook = shift;
my $r_args = shift;
# At inner levels ...} |
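The body of traverse_ngramdict is cut off above, so the following is a hedged sketch of one way such a traversal could work, not the module's actual code. It assumes the dictionary is nested N levels deep with counts at the innermost level, and that the hook receives the innermost hash, the (N - 1)-term prefix, and the argument list, which matches the parameters sort_ngrams reads. The names `traverse_sketch` and `collect_ngrams` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for traverse_ngramdict.
# Descends N-1 levels, then hands the innermost hash (term => count) to the hook.
sub traverse_sketch {
    my ($N, $r_ngramdict, $r_ngram, $r_hook, $r_args) = @_;
    if (@$r_ngram < $N - 1) {
        # Inner level: extend the prefix with each term and recurse.
        for my $term (sort keys %$r_ngramdict) {
            traverse_sketch($N, $r_ngramdict->{$term}, [@$r_ngram, $term],
                            $r_hook, $r_args);
        }
    } else {
        # Innermost level: call the hook with (dict, prefix, args).
        $r_hook->($r_ngramdict, $r_ngram, $r_args);
    }
}

# Hook with the same shape as sort_ngrams: collect ["w1 ... wN", count] pairs.
sub collect_ngrams {
    my ($r_ngramdict, $r_ngram, $r_args) = @_;
    my ($r_ngrams) = @$r_args;
    for my $term (sort keys %$r_ngramdict) {
        push @$r_ngrams, [join(" ", @$r_ngram, $term), $r_ngramdict->{$term}];
    }
}

my %dict = (the => { cat => 2, dog => 1 }, cat => { sat => 1 });
my @ngrams;
traverse_sketch(2, \%dict, [], \&collect_ngrams, [\@ngrams]);

# Print in decreasing order of count, breaking ties alphabetically.
print "$_->[0]\t$_->[1]\n"
    for sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] } @ngrams;
```

Passing the hook as a code reference lets the same traversal serve sorting, pruning, and writing, which appears to be why the hooks in this module share one parameter list.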
| write_ngram_counts | description | prev | next | Top |
sub write_ngram_counts
{
my %params = @_;
my $r_ngramdict = $params{ngramdict};
my $N = $params{N};
my $file = $params{outfile};
my $sort = $params{sort};
open(local *fh, '>', $file) or croak "\nUnable to open $file for writing.";
my @ngram;
if ($sort) {
my @ngrams;
# Get list of N-gram counts} |
| write_ngrams_fromdict | description | prev | next | Top |
sub write_ngrams_fromdict
{ my $r_ngramdict = shift;
my $r_ngram = shift;
my $r_args = shift;
my ($r_fh) = @$r_args;
my $buffer = "";
foreach my $term (keys %$r_ngramdict) {
$buffer .= (join(" ", (@$r_ngram, $term, $r_ngramdict->{$term})) . $/);
}
print $r_fh $buffer;} |
| VERSION | Top |
| This documentation refers to Clair::LM::Ngram version 1.0. |
| FUNCTIONS | Top |
extract_ngrams(cluster => CLUSTERREF, N => INTEGER, ngramdict => HASHREF, format => SCALAR, stem => BOOL, segment => BOOL, verbose => BOOL)
Extracts N-grams from the cluster of documents referenced by CLUSTERREF, storing them in an N-level-deep hash referenced by HASHREF. The documents' format can be HTML ('html'), in which case the documents are stripped of HTML markup, or text (the default). Setting stem to 1 turns stemming on; setting segment to 1 turns sentence segmentation on. With sentence segmentation on, the text of each document is split into sentences before each word is lowercased and (optionally) stemmed. Sentence boundaries are then represented by the term <s>, which appears in every N-gram that straddles a boundary. The first N-gram in a document consists of N - 1 sentence boundary terms followed by the first term of the document; the last N-gram consists of the last term of the document followed by N - 1 sentence boundary terms. These padded N-grams are counted so that, from a generative standpoint, the probabilities of occurrence of all possible documents generated from the extracted N-gram language model sum to 1.

write_ngram_counts(N => INTEGER, ngramdict => NGRAMDICTREF, outfile => SCALAR, sort => BOOL)
Writes the N-gram dictionary referenced by NGRAMDICTREF to file SCALAR. If BOOL is true, the N-grams are written in decreasing order of number of occurrences.

(INTEGER, HASHREF) = load_ngramdict(infile => SCALAR)
Restores the N-gram dictionary from the (network-ordered) Storable file SCALAR. Returns N as INTEGER and a reference to the restored dictionary as HASHREF.

dump_ngramdict(N => INTEGER, ngramdict => NGRAMDICTREF, outfile => SCALAR)
Serializes the N-gram dictionary referenced by NGRAMDICTREF, together with the value of N, to the (network-ordered) Storable file SCALAR.

enforce_count_thresholds(N => INTEGER, ngramdict => NGRAMDICTREF, mincount => INTEGER_1, topcount => INTEGER_2)
Prunes the N-gram dictionary referenced by NGRAMDICTREF of all N-grams that are not among the top INTEGER_2 in occurrences or that have fewer than INTEGER_1 occurrences. The order in which these two constraints are applied is immaterial. |
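The boundary padding described for extract_ngrams can be illustrated with a short sketch. It covers only the document-level padding (N - 1 boundary terms before the first term and after the last) for a single token list; how the module marks boundaries between sentences inside a document is not shown in this page and is left out. The helper name `padded_ngrams` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pad a token list with N-1 boundary markers on each
# side and return every window of N consecutive terms as a string.
sub padded_ngrams {
    my ($N, @tokens) = @_;
    my @pad    = ('<s>') x ($N - 1);
    my @padded = (@pad, @tokens, @pad);
    my @ngrams;
    for my $i (0 .. $#padded - $N + 1) {
        push @ngrams, join(" ", @padded[$i .. $i + $N - 1]);
    }
    return @ngrams;
}

my @ngrams = padded_ngrams(3, qw(the cat sat));
print "$_\n" for @ngrams;
```

For N = 3 the first window is two boundary terms followed by "the", and the last is "sat" followed by two boundary terms, matching the description of the first and last N-grams in a document.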
| DEPENDENCIES | Top |
| Clair::Cluster, Carp, Exporter, Storable |
| BUGS AND LIMITATIONS | Top |
| There are no known bugs in this module. Please report problems to Dragomir Radev <radev at umich.edu>. Patches are welcome. |
| AUTHOR | Top |
| Jonathan DePeri <jmd2118 at columbia.edu> |
| LICENSE AND COPYRIGHT | Top |
| This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Copyright 2007 the Clair group, all rights reserved. |