| Summary | Top |
package Clair::GenericDoc |
| Package variables | Top |
| No package variables defined. |
| Included modules | Top |
| Clair::Config |
| Clair::Debug |
| Clair::StringManip |
| Data::Dumper |
| File::Basename |
| File::Path |
| File::Type |
| XML::Simple |
| Synopsis | Top |
This module is designed to take in any text-oriented document and parse it based on its MIME type. Parsing is kept modular through sub-modules that are loaded dynamically at runtime. The document is converted into a Perl hash representation and can be dumped to disk in XML format. Once you instantiate the object, a single subroutine call performs the parsing:

    use Clair::GenericDoc;
    my $gdoc = new Clair::GenericDoc(content => "/path/to/your/file", stem => 1);
    my $hash = $gdoc->extract();

This module is an alternate interface to Clair::Document. Whereas Clair::Document focuses on extracting information out of documents, this interface focuses on parsing and its modularity. |
| Description | Top |
The module will try to do the "smart thing" and determine the file type for you. You can also force a specific parsing sub-module. Pass its name without the .pm extension, since the extension is appended when the module path is built:

    my $gdoc = new Clair::GenericDoc(
        content => "/path/to/your/file",
        stem => 1,
        use_parser_module => "shakespear",
    );

This assumes that "shakespear.pm" exists under the ./GenericDoc sub-directory. |
| Methods | Top |
| _determine_module | Description | Code |
| _validate_extracted_hash_content | Description | Code |
| document_type | Description | Code |
| extract | Description | Code |
| from_xml | Description | Code |
| load_parser | Description | Code |
| makestr | Description | Code |
| morph | Description | Code |
| new | Description | Code |
| newcast | Description | Code |
| save_xml | Description | Code |
| to_xml | Description | Code |
| _determine_module | code | next | Top |
This subroutine takes the MIME type string and tries to match it to an
appropriate sub-module in the $self->{module_root} directory. It does
so by listing the available modules under that directory and then matching
a substring of the $type parameter against the name of
each sub-module.
When creating a parser sub-module, be deliberate about the
name you pick for it, since that name is what gets matched. If $self->{use_parser_module}
is set, the subroutine skips the matching and returns that module to be loaded later. |
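The matching step described above could be sketched roughly as follows. This is a minimal illustration and not the module's actual code: `pick_module` and the example module names are hypothetical, and the case-insensitive substring test is an assumption about how the match is performed.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the first module name that occurs as a
# substring of the MIME subtype (e.g. "html" in "text/html").
sub pick_module {
    my ($type, @names) = @_;
    my $subtype = lc((split m{/}, $type)[-1]);
    for my $name (@names) {
        return $name if index($subtype, lc $name) >= 0;
    }
    return;    # no sub-module matched
}

my $module = pick_module("text/html", qw(plain html xml));
print defined $module ? "$module\n" : "no match\n";
```

Because the first matching name wins, a short or generic sub-module name can shadow a more specific one, which is why the naming advice above matters.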
| _validate_extracted_hash_content | code | prev | next | Top |
NOTE: not yet implemented. Should validate the data structure returned by
the sub-module. |
| document_type | code | prev | next | Top |
Determines the mime content type from a file or a string, and returns the content type token. |
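When the system `file -i` command is used, its output typically looks like `notes.txt: text/plain; charset=us-ascii`. The sketch below shows one possible way to pull the content type token out of such a line. The helper name `mime_from_file_output` is hypothetical, and the regular expression is an assumption that does not handle file names that themselves contain a colon followed by whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: pull "type/subtype" out of a line such as
#   notes.txt: text/plain; charset=us-ascii
sub mime_from_file_output {
    my ($line) = @_;
    return $1 if $line =~ m{:\s+([\w.+-]+/[\w.+-]+)};
    return "";    # nothing that looks like a MIME type
}

print mime_from_file_output("notes.txt: text/plain; charset=us-ascii"), "\n";
```

Matching on the pattern rather than splitting on whitespace keeps the trailing semicolon out of the returned token.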
| extract | code | prev | next | Top |
This is the wrapper for the other crucial routines, which determine the content type and load the necessary parser sub-module at runtime. Once the sub-module has been loaded successfully, it runs the extract() function defined within it (overloading). |
| from_xml | code | prev | next | Top |
Takes an xml string or file and converts it back to a perl hash. |
| load_parser | code | prev | next | Top |
After the content/document type is determined, this subroutine tries to load the appropriate sub-module. If no sub-module is available to handle the content, this subroutine exits gracefully after printing the reason via $self->errmsg(). |
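Runtime loading of a parser sub-module can be sketched with `eval` and `require`. This is an illustration under stated assumptions, not the module's own implementation: `try_load_parser` is a hypothetical helper, and returning a reason string stands in for the call to `$self->errmsg()`.

```perl
use strict;
use warnings;

# Hypothetical helper: load "$module_root/$name.pm" at runtime.
# Returns (1, undef) on success or (0, $reason) on failure.
sub try_load_parser {
    my ($module_root, $name) = @_;
    my $path = "$module_root/$name.pm";
    return (0, "parser module $path doesn't exist") unless -f $path;
    my $ok = eval { require $path; 1 };
    return (0, "cannot load $path: $@") unless $ok;
    return (1, undef);
}

my ($ok, $reason) = try_load_parser("./GenericDoc", "shakespear");
print $ok ? "loaded\n" : "not loaded: $reason\n";
```

Wrapping `require` in `eval` keeps a broken or missing sub-module from killing the caller, so the failure can be reported instead.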
| makestr | code | prev | next | Top |
If the supplied "content" is a file, it slurps in the content and converts it
into a string.
TODO: make this portion more modular to operate on urls and other content types
such as gzip-ed/tar-ed files. |
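The file-or-string behaviour could look something like the sketch below. `content_to_string` is a hypothetical name, and the newline guard is an added assumption to avoid running a file test on multi-line text.

```perl
use strict;
use warnings;

# Hypothetical helper: return the file's contents if $content names an
# existing file, otherwise return $content itself unchanged.
sub content_to_string {
    my ($content) = @_;
    # a string with newlines cannot be a sensible path; skip the file test
    return $content if $content =~ /\n/ || !-f $content;
    open my $fh, '<', $content or die "cannot open $content: $!";
    local $/;    # slurp mode: read the whole file in one go
    my $str = <$fh>;
    close $fh;
    return $str;
}

print content_to_string("just a plain string"), "\n";
```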
| morph | code | prev | next | Top |
Morphs the existing object into a Clair::Document object. This subroutine serves both as a convenience and as a compatibility function. It works after you've instantiated the Clair::GenericDoc object and all the proper constructor parameters have been initialized. The extract() function is invoked to parse the content, and the Clair::Document object is then constructed with the necessary fields pre-populated. |
| new | code | prev | next | Top |
The constructor. Most of the internal flags can be overridden. The significant ones are: |
| newcast | code | prev | next | Top |
This function understands how to create Clair::Document from arguments passed in via this constructor. |
| save_xml | code | prev | next | Top |
Simply dumps the xml string into a file. It makes sure that the subdirectory
specified in $self->{xml_outputdir} is created before the file is written
to disk. |
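The create-directory-then-write sequence can be sketched with the core File::Basename and File::Path modules. `write_xml_file` is a hypothetical helper. It derives the directory from the file name only, and leaves out the `$self->{xml_outputdir}` fallback.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(mkpath);

# Hypothetical helper: write $xml to $filename, creating the parent
# directory first if it does not exist yet.
sub write_xml_file {
    my ($xml, $filename) = @_;
    my $dir = dirname($filename);
    mkpath($dir) unless -d $dir;
    open my $fh, '>', $filename or die "cannot write $filename: $!";
    print {$fh} $xml;
    close $fh or die "cannot close $filename: $!";
    return $filename;
}
```

A call such as `write_xml_file($xml, "out/docs/hamlet.xml")` would create `out/docs` if needed before writing the file.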
| to_xml | code | prev | next | Top |
Takes a hash and converts it into an XML string. |
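A round trip through XML::Simple shows how to_xml and from_xml relate to each other. This is a sketch that assumes XML::Simple is installed. The example hash, the `RootName` value and the `NoAttr` option are illustrative choices, not settings taken from the module.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin XMLout);

# Example data only; the real hash comes from a parser sub-module.
my $hash = { title => "Hamlet", parsed_content => "To be, or not to be" };

# hash -> XML string, with an XML declaration and nested elements
my $xml = XMLout($hash, RootName => "document", NoAttr => 1, XMLDecl => 1);

# XML string -> hash
my $back = XMLin($xml);

print $xml;
```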
| _determine_module | description | prev | next | Top |
sub _determine_module
{
    my ($self, $type) = @_;
    my $modulename;
    my $modpath;
    if($self->{use_parser_module})
    {
        # an explicitly requested parser module bypasses MIME type matching
        $modpath = "$self->{module_root}/$self->{use_parser_module}.pm";
        $modulename = $self->{use_parser_module};
        $self->errmsg("parser module: $modpath doesn't exist", 1) unless(-f $modpath);
    }
    else
    {
        # list the candidate sub-modules (plain files, no dotfiles) under module_root
        opendir D, $self->{module_root};
        my @files = grep { ! /^\./ && -f "$self->{module_root}/$_" } readdir D;
        closedir D;
        chomp @files;
        # strip the trailing .pm extension to get the bare module names
        my @names = map { s/\.pm$//; $_; } @files;
        # keep only the subtype, e.g. "html" from "text/html"
        my @type_tok = split '/', $type;
        my $type_name = pop @type_tok;
        # my $type_name = $type;} |
| _validate_extracted_hash_content | description | prev | next | Top |
sub _validate_extracted_hash_content
{my ($self, $hash) = @_; return;} |
| document_type | description | prev | next | Top |
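sub document_type
{
    my ($self, $content) = @_;
    $self->debugmsg("determining the type of document", 3);
    my $type = "";
    if($self->{use_system_file_cmd} && -f $content)
    {
        # ask the system file command for the MIME type; the second
        # whitespace-separated token of its output is taken as the type
        my $str = `file -i $content`;
        chomp $str;
        my @a = split /\s+/, $str;
        $type = $a[1];
    }
    else
    {
        # otherwise fall back to File::Type
        my $ft = File::Type->new();
        # $type = (-f $content) ?} |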
| extract | description | prev | next | Top |
sub extract
{
    my ($self, $content, $args) = @_;
    $content = $self->{content} unless($content);
    # after load_parser is run, the $content should be registered in $self->{content}} |
| from_xml | description | prev | next | Top |
sub from_xml
{
    my ($self, $xml) = @_;
    require XML::Simple;
    my $xs = XML::Simple->new;
    my $ref = $xs->XMLin($xml);
    $self->debugmsg($ref, 3);
    return $ref;
} |
| load_parser | description | prev | next | Top |
sub load_parser
{my ($self, $content) = @_; # determine the content type first in order to load the appropriate module} |
| makestr | description | prev | next | Top |
sub makestr
{# used to register the document} |
| morph | description | prev | next | Top |
sub morph
{
    my ($self, $content) = @_;
    $self->{stem} = 1;
    $self->{lowercase} = 1;
    my $aref = $self->extract($content);
    return undef unless scalar @$aref;
    eval { require "$Clair::Config::CLAIRLIB_HOME/lib/Clair/Document.pm"; };
    $self->errmsg("cannot load Clair::Document $@", 1) if($@);
    if(scalar @$aref == 1)
    {
        my $cd = $self->newcast();
        $cd->{stem} = $aref->[0]->{parsed_content};
        return $cd;
    }
    else # we return arrays of Clair::Document objects} |
| new | description | prev | next | Top |
sub new
{
    my ($proto, %args) = @_;
    my $class = ref $proto || $proto;
    my $self = bless {}, $class;
    $DEBUG = $args{DEBUG} || $ENV{MYDEBUG};
    # $self->{module_root} = (-d "$FindBin::Bin/../lib/Clair/GenericDoc") ? "$FindBin::Bin/../lib/Clair/GenericDoc" : "$FindBin::Bin/lib/Clair/GenericDoc";} |
| newcast | description | prev | next | Top |
sub newcast
{
    my ($self) = @_;
    eval { require "$Clair::Config::CLAIRLIB_HOME/lib/Clair/Document.pm"; };
    $self->errmsg("cannot load Clair::Document $@", 1) if($@);
    my $content_class = (-f $self->{content}) ? "file" : "string";
    my $document_type = $self->document_type($self->{content});
    # very loose and potentially buggy logic here - Clair::Document has hardcoded types it supports} |
| save_xml | description | prev | next | Top |
sub save_xml
{
    my ($self, $xml, $filename) = @_;
    $self->errmsg("provide the xml str", 1) unless($xml);
    $self->errmsg("provide the filename", 1) unless($filename);
    my $dir = dirname($filename);
    $dir = $self->{xml_outputdir} unless($dir);
    mkpath($dir, 0, 0777) unless(-d $dir);
    # my $xml_file = "$self->{xml_outputdir}/$filename";} |
| to_xml | description | prev | next | Top |
sub to_xml
{
    my ($self, $hash) = @_;
    require XML::Simple;
    my $xs = XML::Simple->new(XMLDecl => 1);
    # my $ref = $xs->XMLin([<xml file or string>] [, <options>]);} |
| AUTHOR | Top |
JB Kim jbremnant@gmail.com |
| TODOS | Top |
Make the subroutine makestr more modular. Right now, it only does file-to-string conversion. It should auto-magically do URL-download-to-string conversion as well.
Make the MIME type determination a bit more robust. Sometimes MIME types don't come back as expected. Search for other ways to determine the file types, and make the association with sub-modules more bulletproof. |