ulif.openoffice.processors – Document Processors

Processors for processing documents.

The processors defined here do the heavy lifting when converting documents. All processors accept some input, normally a file to be converted or otherwise handled, and create some output.

While all processors are equal (concerning the basic workflow), one processor is more equal than others: the MetaProcessor. It collects and coordinates the other processors and builds the processing pipeline. Therefore, to start processing a document, callers only need an instance of MetaProcessor.

Every processor has a prefix, which is needed to pass parameters to it during processing. For example, oocp.out_fmt means setting or reading the out_fmt parameter of the oocp processor or, more accurately, of the processor with prefix oocp (which happens to be the OOConvProcessor, see below).
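The prefix convention can be illustrated with a small sketch. It assumes options arrive as a flat dictionary keyed by dotted names such as oocp.out_fmt; the helper names split_option_name and options_for_prefix are hypothetical and not part of ulif.openoffice.

```python
def split_option_name(name):
    """Split '<prefix>.<param>' into a (prefix, param) tuple."""
    prefix, sep, param = name.partition('.')
    if not sep or not prefix or not param:
        raise ValueError("expected '<prefix>.<param>', got %r" % (name,))
    return prefix, param


def options_for_prefix(options, prefix):
    """Return only the parameters addressed to the processor `prefix`."""
    result = {}
    for name, value in options.items():
        opt_prefix, param = split_option_name(name)
        if opt_prefix == prefix:
            result[param] = value
    return result


# Hypothetical flat option set addressing two processors.
options = {'oocp.out_fmt': 'pdf', 'meta.procord': 'unzip,oocp,zip'}
print(options_for_prefix(options, 'oocp'))  # {'out_fmt': 'pdf'}
```

Each processor then only sees the parameters carrying its own prefix, which is the idea the paragraph above describes.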

class ulif.openoffice.processor.BaseProcessor(options=None)

A base for self-built document processors.

args = []

The argparser args acceptable by this processor. This list should contain ulif.openoffice.options.Argument instances.

get_options_as_string()

Get a string representation of the options used here.

This is important to get valid hashes for different sets of options. Hashes computed from this representation may be used by the cache manager to find already processed documents.

To make caching effective, each set of option-settings that leads to the same result should have the same string representation.
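One way to obtain such a stable representation is to sort the options before joining them. The following is a minimal sketch under that assumption; the function names and the key=value format are illustrative and may differ from what get_options_as_string() actually returns.

```python
import hashlib


def options_as_string(options):
    """Build an order-independent string from an options dictionary."""
    # Sorting makes the result independent of dictionary ordering.
    items = sorted((str(key), str(value)) for key, value in options.items())
    return ','.join('%s=%s' % item for item in items)


def options_hash(options):
    """Hash the canonical string, e.g. for use as a cache key."""
    canonical = options_as_string(options).encode('utf-8')
    return hashlib.sha256(canonical).hexdigest()


a = {'out_fmt': 'pdf', 'procord': 'oocp'}
b = {'procord': 'oocp', 'out_fmt': 'pdf'}
print(options_as_string(a))                # out_fmt=pdf,procord=oocp
print(options_hash(a) == options_hash(b))  # True
```

Two option sets that differ only in ordering map to the same string and therefore to the same hash.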

prefix = 'base'

The name under which this processor is known. A simple string.

process(input, metadata)

Process the input and return output.

metadata contains data that may be valuable to other processors. Derived classes are encouraged to populate this dictionary, although there is no guarantee that other processors make use of this data.

output is expected to be a tuple

(<OUTPUT>, <METADATA>)

where <OUTPUT> would normally be the path to a file and <METADATA> the (possibly updated) metadata passed in.

The default implementation raises NotImplemented.
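The process() contract might look as follows in a derived class. This sketch does not import ulif.openoffice: SketchBaseProcessor is a stand-in written for illustration only, and UpperCaseProcessor as well as the upper_done metadata key are hypothetical.

```python
import os
import tempfile


class SketchBaseProcessor(object):
    """Stand-in for BaseProcessor, NOT the real class."""
    prefix = 'base'

    def __init__(self, options=None):
        self.options = options or {}

    def process(self, input, metadata):
        # The base class does no work itself; this stand-in signals
        # that with NotImplementedError.
        raise NotImplementedError("Please provide a process() method")


class UpperCaseProcessor(SketchBaseProcessor):
    """Hypothetical processor that upper-cases a text file in place."""
    prefix = 'upper'

    def process(self, input, metadata):
        with open(input) as fd:
            text = fd.read()
        with open(input, 'w') as fd:
            fd.write(text.upper())
        # Leave a hint for later processors; they may ignore it.
        metadata['upper_done'] = True
        return input, metadata


def demo():
    """Run the hypothetical processor on a temporary file."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, 'sample.txt')
    with open(path, 'w') as fd:
        fd.write('hello')
    output, metadata = UpperCaseProcessor().process(path, {'error': False})
    with open(output) as fd:
        return fd.read(), metadata
```

The return value is the documented (<OUTPUT>, <METADATA>) tuple: a file path plus the metadata dictionary that was passed in and possibly extended.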

class ulif.openoffice.processor.CSSCleaner(options=None)

A processor for cleaning up CSS parts of HTML code.

Normal converters leave CSS inside an HTML document. This processor first aggregates these style parts and then puts them into an external CSS file, leaving only a link to that file.

This processor requires HTML/XHTML input.
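The aggregation step can be pictured roughly like this. The sketch uses a regular expression, which is only adequate for simple, well-formed input, and is not how CSSCleaner is necessarily implemented; the function name extract_css and the default file name sample.css are assumptions.

```python
import re

# Matches <style ...>...</style> blocks, including multi-line content.
STYLE_RE = re.compile(r'<style[^>]*>(.*?)</style>', re.IGNORECASE | re.DOTALL)


def extract_css(html, css_href='sample.css'):
    """Return (html_without_style_blocks, collected_css)."""
    css_parts = [part.strip() for part in STYLE_RE.findall(html)]
    html = STYLE_RE.sub('', html)
    link = '<link rel="stylesheet" type="text/css" href="%s" />' % css_href
    # Put the link just before the first closing head tag.
    html = re.sub(r'</head>', lambda match: link + match.group(0),
                  html, count=1, flags=re.IGNORECASE)
    return html, '\n'.join(css_parts)


source = ('<html><head><style type="text/css">p {color: red}</style>'
          '</head><body><p>x</p></body></html>')
new_html, css = extract_css(source)
print(css)  # p {color: red}
```

Writing the collected CSS to the linked file is left out here; a caller would store the second return value under the name passed as css_href.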

ulif.openoffice.processor.DEFAULT_PROCORDER = 'unzip,oocp,tidy,html_cleaner,css_cleaner,zip'

The default order in which processors are run.

class ulif.openoffice.processor.Error(options=None)

A processor that returns an error message.

This is mainly for testing.

class ulif.openoffice.processor.HTMLCleaner(options=None)

A processor for cleaning up HTML produced by OO.org.

Fixes minor issues with HTML code produced by OO.org.

This processor expects XHTML input.

class ulif.openoffice.processor.MetaProcessor(options={})

The meta processor handles general workflow.

When getting certain options, it constructs a pipeline of document processors.

The MetaProcessor is a kind of processor dispatcher that finds, sets up, and calls all requested processors in the requested order.

args = [<ulif.openoffice.options.Argument object at 0x7fd4cf4b0450>]

We support a -meta-procord option, which stands for processor order. The current default order is 'unzip,oocp,zip', which means: maybe unzip the input, then convert it into HTML, and afterwards zip the results.

prefix = 'meta'

The meta processor is named 'meta'.

process(input=None, metadata={'error': False})

Run all processors defined in options.

If all processors run successfully, the output of the last one is returned along with the (possibly modified) metadata.

Each processor is fed with the metadata dict and an input (normally a filepath). Feeding a processor means to call its process method.

If a processor sets the error entry of metadata to True, this indicates a problem and the whole process is aborted, returning None as output together with the metadata, which may contain hints about the reasons.

If all processors work correctly, the output of the last processor is returned along with the last metadata.

The set and order of processors called depends on the procord option passed in. If this option is set to a value like oocp,oocp, then the oocp processor (which is the OOConvProcessor, registered under oocp in setup.py) is called twice.

Note

After each processing step, the (then old) input is removed.
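The workflow described above can be condensed into a short loop. This is a simplified sketch, not the MetaProcessor implementation: processors are plain callables stored in a dictionary, the removal of old input files is omitted, and the names run_pipeline, append_a, fail and the reason key are hypothetical.

```python
def run_pipeline(processors, procord, input, metadata=None):
    """Call the processors named in `procord`, one after another."""
    if metadata is None:
        metadata = {'error': False}
    output = input
    for name in procord.split(','):
        process = processors[name.strip()]
        output, metadata = process(output, metadata)
        if metadata.get('error'):
            # Abort: no output, but keep the metadata with its hints.
            return None, metadata
    return output, metadata


def append_a(input, metadata):
    """Hypothetical processor that always succeeds."""
    return input + 'a', metadata


def fail(input, metadata):
    """Hypothetical processor that always reports an error."""
    metadata['error'] = True
    metadata['reason'] = 'boom'
    return None, metadata


registry = {'a': append_a, 'fail': fail}
print(run_pipeline(registry, 'a,a', 'x'))       # ('xaa', {'error': False})
print(run_pipeline(registry, 'a,fail,a', 'x'))  # (None, {'error': True, 'reason': 'boom'})
```

Listing a name twice in the order string runs that processor twice, which matches the oocp,oocp example in the text.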

class ulif.openoffice.processor.OOConvProcessor(options=None)

A processor that converts office docs into different formats.

XXX: we could support far more options. See

http://wiki.services.openoffice.org/wiki/API/Tutorials/PDF_export#How_to_use_it_from_OOo_Basic

for a list of the PDF export options alone.

formats = {'pdf': 'pdf', 'xhtml': 'xhtml', 'txt': 'text', 'html': 'html'}

mapping: extension <-> format (as accepted by unoconv)

ulif.openoffice.processor.OUTPUT_FORMATS = {'pdf': 'pdf', 'xhtml': 'xhtml', 'txt': 'text', 'html': 'html'}

Output formats supported. Mapping: extension <-> format (as accepted by unoconv). For oocp-out-fmt, only extensions (left column) are allowed.
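A lookup against this mapping could be sketched as below. The dictionary is copied from the value shown above; the function name unoconv_format and the choice of ValueError for unknown extensions are assumptions made for illustration.

```python
OUTPUT_FORMATS = {'pdf': 'pdf', 'xhtml': 'xhtml', 'txt': 'text', 'html': 'html'}


def unoconv_format(extension):
    """Translate an allowed extension into the unoconv format name."""
    try:
        return OUTPUT_FORMATS[extension]
    except KeyError:
        allowed = ', '.join(sorted(OUTPUT_FORMATS))
        raise ValueError(
            'unsupported output format %r (allowed: %s)' % (extension, allowed))


print(unoconv_format('txt'))  # text
```

Note that 'text' itself is rejected by this sketch, because only the keys (extensions) count as valid option values.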

class ulif.openoffice.processor.Tidy(options=None)

A processor for cleaning up HTML code produced by OO.org.

This processor calls tidy in a subshell. That means the tidy command must be installed on the system for this processor to work.
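Calling an external tidy binary might look like the following sketch. It uses subprocess without a shell, and the particular flags (-asxhtml, -utf8, -quiet, -o) are one plausible choice of HTML Tidy options, not necessarily those used by the Tidy processor; both function names are hypothetical.

```python
import shutil
import subprocess


def build_tidy_command(src, dst):
    """Assemble a tidy call that writes XHTML to `dst`."""
    return ['tidy', '-asxhtml', '-utf8', '-quiet', '-o', dst, src]


def run_tidy(src, dst):
    """Run tidy on `src`; fail early if the binary is missing."""
    if shutil.which('tidy') is None:
        raise RuntimeError('the tidy command is not installed')
    result = subprocess.run(build_tidy_command(src, dst))
    # tidy exits with 0 (ok), 1 (warnings) or 2 (errors).
    if result.returncode >= 2:
        raise RuntimeError('tidy failed with exit code %d' % result.returncode)
    return dst
```

Checking for the binary up front gives a clearer message than a failed subprocess call when tidy is not installed.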

class ulif.openoffice.processor.UnzipProcessor(options=None)

A processor that unzips delivered files if applicable.

The .zip file must contain exactly one file.
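The single-file rule could be enforced along these lines. This is a sketch based on the standard zipfile module; the function name unzip_single is hypothetical, and raising ValueError for archives with more than one file is an assumption (the real processor may report the problem through metadata instead).

```python
import os
import tempfile
import zipfile


def unzip_single(path, dst_dir):
    """Extract the only member of a zip file; pass other files through."""
    if not zipfile.is_zipfile(path):
        return path  # not applicable: hand the input on unchanged
    with zipfile.ZipFile(path) as archive:
        # Ignore pure directory entries when counting members.
        names = [n for n in archive.namelist() if not n.endswith('/')]
        if len(names) != 1:
            raise ValueError(
                'expected exactly one file in %r, found %d' % (path, len(names)))
        return archive.extract(names[0], dst_dir)


def demo():
    """Zip one small file, unzip it again, and return its content."""
    workdir = tempfile.mkdtemp()
    zip_path = os.path.join(workdir, 'sample.zip')
    with zipfile.ZipFile(zip_path, 'w') as archive:
        archive.writestr('doc.txt', 'hello')
    extracted = unzip_single(zip_path, os.path.join(workdir, 'out'))
    with open(extracted) as fd:
        return os.path.basename(extracted), fd.read()
```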

class ulif.openoffice.processor.ZipProcessor(options=None)

A processor that zips the directory delivered.

The path must be of type str.
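Zipping a directory can be sketched with shutil.make_archive from the standard library. The function name zip_directory is hypothetical, and writing the archive into a fresh temporary directory is a choice made here so that the archive does not end up inside the directory being zipped.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory(path):
    """Create a .zip archive of the directory `path`; return its path."""
    if not isinstance(path, str):
        raise TypeError('path must be str')
    if not os.path.isdir(path):
        raise ValueError('%r is not a directory' % (path,))
    name = os.path.basename(os.path.normpath(path))
    base = os.path.join(tempfile.mkdtemp(), name)
    # make_archive appends '.zip' and returns the archive path.
    return shutil.make_archive(base, 'zip', root_dir=path)


def demo():
    """Zip a directory holding one file and list the archive members."""
    src = tempfile.mkdtemp()
    with open(os.path.join(src, 'a.txt'), 'w') as fd:
        fd.write('hello')
    archive = zip_directory(src)
    with zipfile.ZipFile(archive) as zf:
        return archive, zf.namelist()
```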