Clair

GenericDoc



Summary
 package Clair::GenericDoc
A class that standardizes documents and creates a generic representation of them.

Package variables
No package variables defined.

Included modules
Clair::Config
Clair::Debug
Clair::StringManip
Data::Dumper
File::Basename
File::Path
File::Type
XML::Simple

Synopsis
 This module is designed to take in any text-oriented document and parse
 it based on its MIME type. The parsing is made modular by the use of
 sub-modules, which are loaded dynamically at runtime.

 Furthermore, the document is converted into a Perl hash representation
 and can be dumped to disk in XML format.

 Once you have instantiated the object, a single subroutine call performs
 the parsing:
  
  use Clair::GenericDoc;

  my $gdoc = new Clair::GenericDoc(content => "/path/to/your/file", stem => 1);
  my $aref = $gdoc->extract();   # reference to an array of hashes

 This module is an alternate interface to Clair::Document. Whereas 
 Clair::Document focuses on extracting information out of documents, this
 interface focuses on parsing and its modularity.

Description
 The module will try to do the "smart thing" and determine the file type
 for you. You can also force a specific parsing sub-module:

  my $gdoc = new Clair::GenericDoc(
   content => "/path/to/your/file",
   stem => 1,
   use_parser_module => "shakespear",
  );

 This assumes that "shakespear.pm" exists under the ./GenericDoc sub-directory;
 the ".pm" extension is appended automatically.

 There are other features of this module, which are covered in the method
 specifications.

Methods
_determine_module
_validate_extracted_hash_content
document_type
extract
from_xml
load_parser
makestr
morph
new
newcast
save_xml
to_xml

Methods description


_determine_module
 This subroutine takes the MIME type string and tries to match it to an
 appropriate sub-module in the $self->{module_root} directory. It lists
 the modules available under that directory and then matches each module
 name, case-insensitively, against the subtype portion of the $type
 parameter (the part after the "/", with "-" replaced by "_" and ";"
 removed). The first module whose name occurs in that string is selected.

 When creating a parser sub-module, choose its name carefully, because
 the name is what gets matched against the MIME type. If
 $self->{use_parser_module} is set, the subroutine skips the matching and
 returns that module to be loaded later, checking only that the module
 file exists.
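
 The matching rule can be illustrated with a small standalone sketch. The
 helper name pick_module and the list of module names are assumptions made for
 this example; the real subroutine reads the names from $self->{module_root}
 and does not quote the name before matching, whereas this sketch uses \Q...\E.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first module name that occurs,
# case-insensitively, inside the MIME subtype. Not part of Clair::GenericDoc.
sub pick_module {
    my ($type, @names) = @_;

    # Keep only the part after the last "/", then normalize it:
    # "text/html; charset=utf-8" becomes "html charset=utf_8".
    my $type_name = (split m{/}, $type)[-1];
    $type_name =~ s/-/_/g;
    $type_name =~ s/;//g;

    for my $name (@names) {
        return $name if $type_name =~ /\Q$name\E/i;
    }
    return;    # no module name matched
}

my $picked = pick_module('text/html; charset=utf-8', qw(plain html xml));
print "$picked\n";
```

 Because the first match wins, a subtype such as "xhtml+xml" can match either
 an "html" or an "xml" module depending on listing order, which is why the
 module name deserves some thought.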

_validate_extracted_hash_content
 NOTE: not yet implemented. It is intended to validate the data structure
       returned by the sub-module.

document_type
 Determines the MIME content type of a file or a string, and
 returns the content type token.

extract
 This is the wrapper around the routines that determine the content type and
 load the necessary parser sub-module at runtime. Once the sub-module has been
 loaded successfully, it calls the extract() function defined within that
 sub-module (the subroutine name is deliberately the same). The parsing logic
 lives entirely in the extract() subroutine of the loaded sub-module.

 The content returned should be a reference to an array of hashes. Thus, each
 document/content provided to GenericDoc can yield multiple, subdivided
 documents. The returned content is then lowercased, stripped of
 metacharacters, tokenized, and stemmed, depending on the constructor flags.
 Finally, the required hash keys within the returned data structure are:

    $hash->{parsed_content}

    $hash->{title}

    $hash->{path}

 More on the convention used for sub-modules later.
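
 As a rough illustration of that contract, a parser sub-module might look like
 the sketch below. The package name plain_sketch and the paragraph-splitting
 rule are assumptions made for this example and are not part of the Clair
 distribution; only the argument order and the three required keys come from
 the code on this page.

```perl
use strict;
use warnings;

# Hypothetical parser sub-module, for illustration only.
package plain_sketch;

# GenericDoc calls: $modulename->extract($content, $content_source, $args, $gdoc)
sub extract {
    my ($class, $content, $source, $args, $gdoc) = @_;

    my @docs;
    my $n = 0;
    # Treat each blank-line-separated paragraph as its own sub-document.
    for my $para (split /\n\s*\n/, $content) {
        next unless $para =~ /\S/;
        $n++;
        push @docs, {
            parsed_content => $para,
            title          => "section $n",
            path           => $source || '',
        };
    }
    return \@docs;    # reference to an array of hashes
}

package main;

my $docs = plain_sketch->extract("First part.\n\nSecond part.\n", "/tmp/example.txt");
printf "%d sub-documents\n", scalar @$docs;
```

 Returning an array reference even for a single document keeps the caller's
 loop over the results uniform.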

from_xml
 Takes an XML string or file and converts it back to a Perl hash.

load_parser
 After the content/document type is determined, this subroutine tries
 to load the appropriate sub-module. If the sub-module that handles
 the content is not available, this subroutine exits gracefully after
 reporting the reason via $self->errmsg().
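
 The load-once-then-cache behaviour can be sketched in isolation as follows.
 The names demo_parser and load_by_path are made up for this example, and the
 sketch dies on failure instead of calling errmsg(); it is an approximation of
 the idea, not the module's own code.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Write a throwaway module to disk so the sketch is self-contained.
my $dir  = tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($dir, 'demo_parser.pm');
open my $fh, '>', $path or die "can't write $path: $!";
print {$fh} "package demo_parser;\nsub extract { return [] }\n1;\n";
close $fh;

my %loaded;    # modules already loaded, keyed by module name

sub load_by_path {
    my ($name, $file) = @_;
    return $loaded{$name} if exists $loaded{$name};

    # require with a path string loads that exact file at runtime;
    # eval traps a failure so that it can be reported.
    my $ok = eval { require $file; 1 };
    die "couldn't load module $file: $@" unless $ok;

    return $loaded{$name} = $name;
}

my $module = load_by_path('demo_parser', $path);
my $result = $module->extract();    # class-method call on the loaded package
```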

makestr
 If the supplied "content" is a file, it slurps in the content and converts it
 into a string. 

 TODO: make this portion more modular to operate on urls and other content types
       such as gzip-ed/tar-ed files.
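
 A minimal sketch of the file-or-string behaviour is shown below. The helper
 name to_string is hypothetical, and it returns the source file name alongside
 the string rather than storing it on an object as makestr does.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns (string, source), where source is the file
# name, or '' when the argument was already a plain string.
sub to_string {
    my ($content) = @_;
    return ($content, '') unless -f $content;

    open my $fh, '<', $content or die "can't open $content: $!";
    local $/;                  # slurp mode: read the whole file at once
    my $str = <$fh>;
    close $fh;
    return ($str, $content);
}

# Create a small temporary file to read back.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "line one\nline two\n";
close $tmp_fh;

my ($from_file, $source) = to_string($tmp_name);
my ($from_string)        = to_string("just a string");
```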

morph
 Morphs the existing object into a Clair::Document object. This subroutine serves
 both convenience and compatibility purposes. It works after you have
 instantiated the Clair::GenericDoc object and all the proper constructor parameters
 have been initialized. The extract() function is invoked to parse the content, and
 then the Clair::Document object is constructed with the necessary fields
 pre-populated.

new
 The constructor. Most of the internal flags can be overridden.
 The significant ones are:

cast - a boolean flag that will "cast" the object to a Clair::Document object.

content - either the path to a file, or the actual string.

module_root - the directory that holds the sub-modules.

xml_outputdir - the directory in which to dump the hash as an XML file.

use_parser_module - hardcode the parser module, which bypasses automatic file type detection.

stem - do stemming.

strip - strip metacharacters.
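
 The defaults-then-overrides pattern used by the constructor can be sketched
 as follows. OptionSketch is a hypothetical stand-in class; the default values
 shown are the ones set in the constructor code on this page, and undefined
 arguments are ignored just as they are there.

```perl
use strict;
use warnings;

# Hypothetical mini-class, for illustration only.
package OptionSketch;

sub new {
    my ($class, %args) = @_;
    my $self = bless {
        strip             => 1,
        lowercase         => 1,
        tokenize          => 1,
        stem              => 1,
        cast              => 0,
        use_parser_module => '',
    }, $class;

    # Any defined argument replaces the default of the same name.
    while (my ($k, $v) = each %args) {
        $self->{$k} = $v if defined $v;
    }
    die "the 'content' argument is required\n" unless $self->{content};
    return $self;
}

package main;

my $obj = OptionSketch->new(content => "some text", stem => 0, strip => undef);
print "stem=$obj->{stem} strip=$obj->{strip}\n";
```

 Passing stem => 0 turns stemming off, while strip => undef leaves the default
 of 1 in place.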


newcast
 This function knows how to create a Clair::Document object from the arguments
 passed in via the GenericDoc constructor.

save_xml
 Simply dumps the XML string into a file. It makes sure that the subdirectory
 specified in $self->{xml_outputdir} is created before the file is written
 to disk.
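
 The create-directory-then-write step can be sketched with core modules only.
 The helper name write_string and the sample XML string are assumptions for
 this example; it uses make_path from File::Path, whereas the module's own
 code calls the older mkpath interface.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: create the parent directory if needed, then write.
sub write_string {
    my ($str, $filename) = @_;
    my $dir = dirname($filename);
    make_path($dir) unless -d $dir;

    open my $fh, '>', $filename or die "cannot open $filename for writing: $!";
    print {$fh} $str;
    close $fh or die "cannot close $filename: $!";
    return $filename;
}

# Write into a sub-directory that does not exist yet.
my $base = tempdir(CLEANUP => 1);
my $out  = File::Spec->catfile($base, '.xmloutput', 'doc.xml');
write_string("<opt title=\"example\" />\n", $out);
```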

to_xml
 Takes a hash and converts it into an XML string.

Methods code


_determine_module
sub _determine_module {
	my ($self, $type) = @_;

	my $modulename;
	my $modpath;
	if($self->{use_parser_module})
	{
		$modpath = "$self->{module_root}/$self->{use_parser_module}.pm";
		$modulename = $self->{use_parser_module};

		$self->errmsg("parser module: $modpath doesn't exist", 1) unless(-f $modpath);
	}
	else
	{
		opendir D, $self->{module_root};
		my @files = grep { ! /^\./ && -f "$self->{module_root}/$_" } readdir D;
		closedir D;
		chomp @files;
		my @names = map { s/\.pm//; $_; } @files;

		my @type_tok = split '/', $type;
		my $type_name = pop @type_tok;
		# my $type_name = $type;
		$type_name =~ s/-/_/g;
		$type_name =~ s/;//g;

		my $target_name = "";
		for my $n (@names)
		{
			$self->debugmsg("matching: $type_name ~ /$n/", 2);
			if($type_name =~ /$n/i)
			{
				$target_name = $n;
				last;
			}
		}

		$self->errmsg("can't find appropriate module for type: $type", 1) unless($target_name);

		$modpath = "$self->{module_root}/$target_name.pm";
		$modulename = $target_name;
	}

	return ($modulename, $modpath);
}

_validate_extracted_hash_content
sub _validate_extracted_hash_content {
	my ($self, $hash) = @_;


	return;
}

document_type
sub document_type {
	my ($self, $content) = @_;
	
	$self->debugmsg("determining the type of document", 3);

	my $type = "";
	if($self->{use_system_file_cmd} && -f $content)
	{
		my $str = `file -i $content`;
		chomp $str;
		my @a = split /\s+/, $str;
		$type = $a[1];
	}
	else
	{
		my $ft = File::Type->new();

		# $type = (-f $content) ?
		#   $ft->checktype_filename($content) :
		#   $ft->checktype_contents($content);
		$type = $ft->mime_type($content);
	}

	$self->debugmsg("document type is '$type'",2);

	return $type;
}

extract
sub extract {
	my ($self, $content, $args) = @_;

	$content = $self->{content} unless($content);
	# after load_parser has run, the $content should be registered in $self->{content}
	my $modulename = $self->load_parser($content);

	# returns arrays of hashes containing sections of docs that are divided up
	my $aref_hash = $modulename->extract($self->{content}, $self->{content_source}, $args, $self);

	# string manipulation routines in this module.
	my $strmanip = new Clair::StringManip(DEBUG => $DEBUG);
	for my $hash (@$aref_hash)
	{
		$self->_validate_extracted_hash_content($hash);

		$hash->{parsed_content} = $strmanip->lowercase($hash->{parsed_content}) if($self->{lowercase});
		$hash->{parsed_content} = $strmanip->strip($hash->{parsed_content}) if($self->{strip});
		$hash->{parsed_content} = $strmanip->tokenize($hash->{parsed_content}) if($self->{tokenize});
		$hash->{parsed_content} = $strmanip->stem($hash->{parsed_content}, $args->{return_array}) if($self->{stem});
		# $hash->{parsed_content} = $strmanip->stem($hash->{parsed_content}) if($self->{stem});
	}

	return $aref_hash;
}

from_xml
sub from_xml {
	my ($self, $xml) = @_;

	require XML::Simple;
	my $xs = new XML::Simple;

	my $ref = $xs->XMLin($xml);
	$self->debugmsg($ref, 3);
	return $ref;
}

load_parser
sub load_parser {
	my ($self, $content) = @_;

	# determine the content type first in order to load the appropriate module
	my $type = $self->document_type($content);

	# convert to string
	$self->{content} = $self->makestr($content || $self->{content});

	my ($modulename, $modpath) = $self->_determine_module($type);

	if(exists $self->{loaded_modules}->{$modulename})
	{
		$self->debugmsg("module '$modpath' already loaded",2);
		return $self->{loaded_modules}->{$modulename};
	}

	# runtime loading hotness!
	$self->debugmsg("loading module '$modpath'",2);
	if(-f $modpath)
	{
		eval { require $modpath; };
	}
	else
	{
		eval { require $modulename; };
	}

	if($@)
	{
		$self->errmsg("couldn't load module $modpath: $@", 1);
	}

	$self->{loaded_modules}->{$modulename} = $modulename;

	return $self->{loaded_modules}->{$modulename};
}

makestr
sub makestr {
	# used to register the document
	my ($self, $content) = @_;

	if(-f $content)
	{
		$self->{content_source} = $content; # save the filename
		$self->debugmsg("converting $content to string", 2);
		open F, "< $content" or $self->errmsg("can't open: $!", 1);
		my @lines = <F>;
		close F;
		my $content_str = join "", @lines;
		return $content_str;
	}
	elsif(! ref $content && $content =~ /^http:/i)
	{
		# do url extraction here..
	}

	return $content;
}

morph
sub morph {
	my ($self, $content) = @_;

	$self->{stem} = 1;
	$self->{lowercase} = 1;
	my $aref = $self->extract($content);		
	
	return undef unless scalar @$aref;

	eval { require "$Clair::Config::CLAIRLIB_HOME/lib/Clair/Document.pm"; };
	$self->errmsg("cannot load Clair::Document $@", 1) if($@);

	if(scalar @$aref == 1)
	{
		my $cd = $self->newcast();
		$cd->{stem} = $aref->[0]->{parsed_content};
		return $cd;
	}
	else # we return arrays of Clair::Document objects
	{
		my @return;
		for my $h (@$aref)
		{
			my $cd = $self->newcast();
			$cd->{stem} = $h->{parsed_content}; # use the current section, not always the first one
			push @return, $cd;
		}
		return \@return;
	}
}

new
sub new {
	my ($proto, %args) = @_;
	my $class = ref $proto || $proto;	

	my $self = bless {}, $class;
	$DEBUG = $args{DEBUG} || $ENV{MYDEBUG};
	
	# $self->{module_root} = (-d "$FindBin::Bin/../lib/Clair/GenericDoc") ? "$FindBin::Bin/../lib/Clair/GenericDoc" : "$FindBin::Bin/lib/Clair/GenericDoc";
	$self->{module_root} = "$Clair::Config::CLAIRLIB_HOME/lib/Clair/GenericDoc";
	$self->{xml_outputdir} = "$FindBin::Bin/.xmloutput";
	$self->{use_parser_module} = "";
	$self->{strip} = 1;
	$self->{lowercase} = 1;
	$self->{tokenize} = 1;
	$self->{stem} = 1;

	# use system's file command: defaults to false
	$self->{use_system_file_cmd} = 0;
	$self->{cast} = 0;

	# overrides
	while ( my($k, $v) = each %args )
	{
		$self->{$k} = $v if(defined $v);
	}

	unless(-d $self->{module_root})
	{
		$self->errmsg("the submodule directory for document parsing need to be properly specified",1);
	}

	# content is arbitrated between file and string, but the name of file is saved
	$self->{content_source} = "";
	unless($self->{content})
	{
		$self->errmsg("the 'content' constructor argument is required (either file or string)", 1);
	}

	# load up Clair::Document dynamically and just return that obj - this is one way street.
	if($self->{cast})
	{
		$self->debugmsg("instantiating Clair::Document object and returning that object", 1);
		return $self->newcast();
	}

	return $self;
}

newcast
sub newcast {
	my ($self) = @_;

	eval { require "$Clair::Config::CLAIRLIB_HOME/lib/Clair/Document.pm"; };
	$self->errmsg("cannot load Clair::Document $@", 1) if($@);

	my $content_class = (-f $self->{content}) ? "file" : "string";
	my $document_type = $self->document_type($self->{content});

	# very loose and potentially buggy logic here - Clair::Document has hardcoded types it supports
	my $type = "text";
	$type = "html" if($document_type =~ /html/i);
	$type = "xml" if($document_type =~ /xml/i);

	my $clair_document_object = Clair::Document->new(
		$content_class => $self->{content},
		type => $type,
	);

	return $clair_document_object;
}

save_xml
sub save_xml {
	my ($self, $xml, $filename) = @_;

	$self->errmsg("provide the xml str", 1) unless($xml);
	$self->errmsg("provide the filename", 1) unless($filename);

	my $dir = dirname($filename);
	$dir = $self->{xml_outputdir} unless($dir);
	mkpath($dir, 0, 0777) unless(-d $dir);

	# my $xml_file = "$self->{xml_outputdir}/$filename";
	open XF, "> $filename" or $self->errmsg("cannot open file for writing: $!", 1);
	print XF $xml;
	close XF;
}

to_xml
sub to_xml {
	my ($self, $hash) = @_;

	require XML::Simple;
	my $xs = new XML::Simple(XMLDecl => 1);

	# my $ref = $xs->XMLin([<xml file or string>] [, <options>]);
	my $xml = $xs->XMLout($hash);
	$self->debugmsg("XML output:\n\n$xml", 3);
	return $xml;
}

General documentation


AUTHOR
 JB Kim
 jbremnant@gmail.com
 20070407

TODOS

    Make the subroutine makestr more modular

    Right now, it only does file-to-string conversion. It should also
    convert a URL download to a string automatically.

    Make the MIME type determination a bit more robust

    Sometimes MIME types don't come back as expected. Look for other ways to
    make the determination of file types and their associated sub-modules more
    bulletproof.