User Documentation¶
The user documentation is split into the following parts:
Prerequisites¶
LibreOffice (or OpenOffice) must be installed on the system. unoconv is also mandatory, and for HTML mangling we use the tidy tool.
On Ubuntu this can be done with:
$ sudo apt-get install libreoffice unoconv tidy
The scripts in this package were tested on Ubuntu.
It is mandatory that the system user running oooctl is a regular user with at least a home directory. LibreOffice relies on that directory to store information, even in headless mode.
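A quick way to check this requirement for the current user is sketched below. The helper name has_usable_home is made up for illustration, and the writability test is an assumption: the text above only says that LibreOffice stores information in the home directory.

```python
import os


def has_usable_home():
    """Return True if the current user has an existing home directory.

    The check for write access is an assumption made for this sketch.
    """
    home = os.path.expanduser("~")
    return os.path.isdir(home) and os.access(home, os.W_OK)


print(has_usable_home())
```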
Recent LibreOffice versions require no X-server for running.
Normally, it should be sufficient to apt-get install unoconv. This should also install all the office packages needed.
Apart from this, you need Python 2.6 or 2.7 installed. The formerly required python-uno package is not needed anymore (but unoconv might require it). Also self-compiled Python variants should work.
Building¶
User Install¶
You can use pip to install ulif.openoffice:
$ pip install ulif.openoffice
will install the latest released version from PyPI.
Developer Install¶
If not done already, install the necessary external packages:
$ sudo apt-get install libreoffice unoconv tidy
It is recommended to set up the sources in a virtual environment:
$ virtualenv py27 # Python 2.6, 2.7 are supported
$ source py27/bin/activate
(py27) $
Get the sources:
(py27) $ git clone https://github.com/ulif/ulif.openoffice.git
(py27) $ cd ulif.openoffice
Install packages for testing:
(py27) $ python setup.py dev
It is recommended to start the oooctl daemon before running tests:
(py27) $ oooctl start
This will make LibreOffice listen in the background and reduce the runtime of tests significantly.
Running tests:
(py27) $ py.test
We also support tox to run tests for all supported Python versions:
(py27) $ pip install tox
(py27) $ tox
Of course you must have the respective Python versions installed (currently: Python 2.6, 2.7).
Running coverage detector:
(py27) $ py.test --cov=ulif.openoffice # for cmdline results
(py27) $ py.test --cov=ulif.openoffice --cov-report=html
The latter will generate HTML coverage reports in a subdirectory.
Install packages for the Sphinx-based documentation:
(py27) $ python setup.py docs
(py27) $ cd doc
(py27) $ make html
This will generate the documentation in a subdirectory.
Using the scripts¶
There are two commandline-oriented scripts that come with ulif.openoffice:
- an oooctl-server that starts a LibreOffice server in the background.
- a converter script called oooclient. It is merely a little test programme that was used during development, but you might have some use for it. The --help option is especially useful for getting an overview of the available document processors and their options.
Assuming you activated the virtual environment where the package was installed (or installed the package so that it is otherwise accessible), you can start the oooctl-server with:
(py27) $ oooctl start
Do:
(py27) $ oooctl --help
to see all options.
You can stop the daemon with:
(py27) $ oooctl stop
The converter script can be called like this:
(py27) $ oooclient sourcefile.doc
to create a sourcefile.html.zip conversion. Besides the generated HTML document, the ZIP file will contain all images and extracted CSS styles.
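To see what such an archive contains, the standard zipfile module is sufficient. The following is a minimal sketch: the helper name list_zip_members is hypothetical, and the in-memory archive with its two member names only stands in for a real sourcefile.html.zip.

```python
import io
import zipfile


def list_zip_members(zip_source):
    """Return the sorted member names of a ZIP archive.

    `zip_source` may be a path (e.g. 'sourcefile.html.zip') or a
    file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(archive.namelist())


# Build a tiny stand-in archive in memory, for demonstration only.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("sample.html", "<html></html>")
    archive.writestr("sample.css", "body {}")
demo.seek(0)

members = list_zip_members(demo)
print(members)
```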
Do:
(py27) $ oooclient -meta-procord=oocp, -oocp-out-fmt=pdf sourcefile.doc
to create a PDF of sourcefile.doc.
Converting Docs via XMLRPC¶
One of the included WSGI apps provides access to unoconv and filters in this package via XMLRPC. More specifically, we provide a WSGI app that can be served by HTTP servers and will then talk to XMLRPC clients, optionally caching result docs.
Setting Up the XMLRPC App With Paste¶
To run the included XMLRPC doc converter WSGI app we can use Paste. The required paster script can be installed locally with:
(py27) $ pip install PasteScript
Then we need a PasteDeploy compatible config file like the following xmlrpc.ini:
# xmlrpc.ini
# A sample config to run WSGI XMLRPC app with paster
[app:main]
use = egg:ulif.openoffice#xmlrpcapp
cache_dir = /tmp/mycache
[server:main]
use = egg:Paste#http
host = localhost
port = 8008
In the [app:main] section we tell Paste to serve the ulif.openoffice WSGI app xmlrpcapp. We additionally set a directory where we allow cached documents to be stored. This entry (cache_dir) is optional. Just leave it out if you do not want caching of result docs.
The [server:main] section simply tells Paste to start an HTTP server on localhost port 8008. host can be set to any local hostname or an IP number. Set it to 0.0.0.0 to be accessible on all IPs assigned to the current machine (but read the security notes below first!).
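If the config file has to be produced by a script, for instance with and without caching, a small generator can help. This is only a sketch: the function name make_xmlrpc_ini and its parameters are invented for illustration, while the section names and values are the ones from the sample config above.

```python
def make_xmlrpc_ini(host="localhost", port=8008, cache_dir=None):
    """Return the text of a PasteDeploy config for the XMLRPC app.

    The cache_dir entry is only written if a directory is given,
    mirroring the fact that the setting is optional.
    """
    lines = ["[app:main]", "use = egg:ulif.openoffice#xmlrpcapp"]
    if cache_dir is not None:
        lines.append("cache_dir = %s" % cache_dir)
    lines += [
        "",
        "[server:main]",
        "use = egg:Paste#http",
        "host = %s" % host,
        "port = %d" % port,
    ]
    return "\n".join(lines) + "\n"


# A config without caching, and one that matches the sample above.
print(make_xmlrpc_ini())
print(make_xmlrpc_ini(cache_dir="/tmp/mycache"))
```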
You can now start an XMLRPC conversion server:
(py27) $ paster serve xmlrpc.ini
and start converting real office documents via XMLRPC on the configured host and port (here: localhost:8008).
While we use the Paste HTTP server here for demonstration, you are not bound to this choice. Of course you can use any HTTP server capable of serving WSGI apps you like. This includes at least Apache and nginx (with appropriate modules loaded).
Securing the XMLRPC app (optional)¶
In this regard, the same applies to the ulif.openoffice XMLRPC app as to the RESTful document converter. See Securing the Document Converter (optional) for details.
Converting Documents via XMLRPC¶
Once the server is running, we can start converting docs via XMLRPC.
With the standard Python xmlrpclib this is very easy:
>>> server = ServerProxy('http://localhost:8008')
The ServerProxy can be imported from xmlrpclib (Python 2.x) or from xmlrpc.client (Python 3.x).
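An import that works on both Python lines could look like the following sketch. The URL is the one from the sample config and assumes a server started as described above.

```python
try:
    from xmlrpc.client import ServerProxy  # Python 3.x
except ImportError:
    from xmlrpclib import ServerProxy  # Python 2.x

# Creating the proxy does not open a connection yet; requests are
# only sent when a method is called on it.
server = ServerProxy("http://localhost:8008")
```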
The ulif.openoffice XML-RPC server provides the following methods:
>>> server.system.listMethods()
['convert_locally', 'get_cached', 'system.listMethods',
'system.methodHelp', 'system.methodSignature']
If the server is running on the same machine as the client, i.e. both components can access the same filesystem, then convert_locally() is the fastest method to convert documents via XMLRPC.
convert_locally takes as arguments a path to a source document and a dictionary of options:
>>> with open('sample.txt', 'w') as fd:
... num = fd.write('Some Content')
>>> result = server.convert_locally('sample.txt', {})
>>> pprint(result)
['/.../sample.html.zip',
'78138d2003f1a87043d65c692fb3a64b_1_1',
{'error': False, 'oocp_status': 0}]
The result consists of a result path, a cache key and a dict with metadata: (<PATH>, <CACHE_KEY>, <METADATA>).
The result path will be in a newly created directory.
Note
It is up to you to remove the result directory after usage.
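Cleaning up can be done with shutil. The sketch below assumes that the parent directory of the result path is the newly created directory and holds nothing else you need; the helper name remove_result_dir is hypothetical. Do not use it on paths that point into the cache.

```python
import os
import shutil
import tempfile


def remove_result_dir(result_path):
    """Remove the directory that holds `result_path`.

    Assumes the parent directory was created for this one conversion
    and contains nothing else worth keeping.
    """
    result_dir = os.path.dirname(os.path.abspath(result_path))
    shutil.rmtree(result_dir)
    return result_dir


# Stand-in for a real conversion result, for demonstration only.
demo_dir = tempfile.mkdtemp()
demo_result = os.path.join(demo_dir, "sample.html.zip")
with open(demo_result, "wb") as fd:
    fd.write(b"PK")

removed = remove_result_dir(demo_result)
print(os.path.exists(demo_dir))
```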
Here the result is a ZIP file that includes any CSS stylesheets, images, etc. generated. You can retrieve a non-zipped version by setting options to something like:
{'oocp-out-fmt': 'html', 'meta-procord': 'oocp'}
which tells the converter to run only the core converter (no post processing, etc.) and to generate HTML output.
The cache key is None if the XMLRPC server was configured without a cache. This can be changed in xmlrpc.ini.
The metadata dict mainly contains information about errors that happened during processing. You can normally ignore it, as failed conversions will be signalled by an xmlrpclib.Fault result.
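On the client side such a fault surfaces as an exception, so a conversion call can be wrapped as in the following sketch. The helper convert_or_none and the FailingServer stand-in are made up for illustration; only Fault and its faultCode and faultString attributes come from the standard library.

```python
try:
    from xmlrpc.client import Fault  # Python 3.x
except ImportError:
    from xmlrpclib import Fault  # Python 2.x


def convert_or_none(server, path, options=None):
    """Call `server.convert_locally` and return its result, or None
    if the server signalled a failure with an XML-RPC fault."""
    try:
        return server.convert_locally(path, options or {})
    except Fault as err:
        print("Conversion failed: %s (code %s)"
              % (err.faultString, err.faultCode))
        return None


class FailingServer(object):
    """Stand-in for a ServerProxy whose conversions always fail."""

    def convert_locally(self, path, options):
        raise Fault(1, "simulated failure")


outcome = convert_or_none(FailingServer(), "sample.txt")
```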
To produce different results, you can pass in a different options dict. In the example above we simply used the default (an empty dict), but we can also produce a PDF file:
>>> options = {'oocp-out-fmt': 'pdf', 'meta-procord': 'oocp'}
>>> result = server.convert_locally('sample.txt', options)
>>> pprint(result)
['/.../sample.pdf',
'78138d2003f1a87043d65c692fb3a64b_1_2',
{'error': False, 'oocp_status': 0}]
Here we used the options oocp-out-fmt and meta-procord. The first one tells LibreOffice to produce PDF output and the latter option tells the converter to call only the oocp processor.
See ulif.openoffice.processor for the names and options of different document processors. You can also run the commandline client:
(py27) $ oooclient --help
to get a list of all supported options. Please note that option keys must be provided without the leading dash.
Retrieving Cached Docs via XMLRPC¶
Besides converting new docs, we can also retrieve already cached docs via XMLRPC using the get_cached() method. For this we need the cache key provided in a conversion result.
>>> result = server.get_cached('78138d2003f1a87043d65c692fb3a64b_1_2')
>>> result
'/.../sample.pdf'
Of course this works only if the XMLRPC server runs on the same machine as the client, but the operation is pretty fast compared to converting.
Note
The result path is located inside the cache! The result file is therefore part of the cache and should not be modified! Instead, please copy the file to a location outside the cache, or your cache will get corrupted.
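Copying a cached result to a working location might be sketched like this. The helper name copy_from_cache and the temporary directories are illustrative stand-ins; in real use the source path would be the one returned by get_cached().

```python
import os
import shutil
import tempfile


def copy_from_cache(cached_path, target_dir):
    """Copy a cached result to `target_dir` and return the new path.

    The cached file itself is left untouched.
    """
    target = os.path.join(target_dir, os.path.basename(cached_path))
    shutil.copy2(cached_path, target)
    return target


# Stand-ins for the cache and a working directory.
fake_cache = tempfile.mkdtemp()
cached = os.path.join(fake_cache, "sample.pdf")
with open(cached, "wb") as fd:
    fd.write(b"%PDF-demo")
work_dir = tempfile.mkdtemp()

working_copy = copy_from_cache(cached, work_dir)
print(working_copy)
```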
Converting Docs via HTTP¶
The included WSGI app provides access to unoconv and filters in this package via HTTP. More specifically, we provide a RESTful WSGI app that can be served by HTTP servers and optionally caches result docs.
Setting Up the WSGI App With Paste¶
To run the included RESTful doc converter WSGI app we can use Paste. The required paster script can be installed locally with:
(py27) $ pip install PasteScript
Then we need a PasteDeploy compatible config file like the following sample.ini:
# sample.ini
# A sample config to run WSGI components with paster
[app:main]
use = egg:ulif.openoffice#docconverter
cache_dir = /tmp/mycache
[server:main]
use = egg:Paste#http
host = localhost
port = 8008
In the [app:main] section we tell Paste to serve the ulif.openoffice WSGI app docconverter. We additionally set a directory where we allow cached documents to be stored. This entry (cache_dir) is optional. Just leave it out if you do not want caching of result docs.
The [server:main] section simply tells Paste to start an HTTP server on localhost port 8008. host can be set to any local hostname or an IP number. Set it to 0.0.0.0 to be accessible on all IPs assigned to the current machine (but read the security notes below first!).
You can now start a conversion server:
(py27) $ paster serve sample.ini
and start converting real office documents via HTTP on the configured host and port (here: localhost:8008).
While we use the Paste HTTP server here for demonstration, you are not bound to this choice. Of course you can use any HTTP server capable of serving WSGI apps you like. This includes at least Apache and nginx (with appropriate modules loaded).
Securing the Document Converter (optional)¶
As mentioned above, you can set the host of the Paste HTTP server to 0.0.0.0, which will make it accessible to everyone from everywhere (unless you are protected by local firewalls, etc.). This might not be what you want.
Therefore ulif.openoffice provides simple authentication (another WSGI app) that requires HTTP basic auth for incoming requests and checks the credentials sent against a local htaccess-like file.
To activate it, you can create a sample.ini like this:
# A sample config to run WSGI components with paster
[app:main]
use = egg:ulif.openoffice#docconverter
filter-with = auth_htaccess
cache_dir = /tmp/mycache
[server:main]
use = egg:Paste#http
host = localhost
port = 8008
[filter:auth_htaccess]
use = egg:ulif.openoffice#htaccess
realm = Sample Realm
htaccess = %(here)s/htaccess
# possible values: plain, sha1, crypt
auth_type = plain
This setup is basically the same as the one above, but with an additional auth_htaccess filter injected that is configured in the [filter:auth_htaccess] section.
The htaccess filter requires three options:
- realm - The authentication realm. Some text that might be shown by web browsers when asking the user for credentials in the basic-auth dialog (normally some popup).
- htaccess - The path to some password file. Here we set the path to a file called htaccess in the local directory.
- auth_type - The encryption type of passwords in the password file. Possible values are plain (clear text passwords), sha1, or crypt for the respective encryption types. Unlike regular Apache htaccess files, md5 is not supported. All passwords in the chosen password file are expected to be encrypted with the encryption type set here. You cannot mix plain, crypt, and SHA1 encrypted passwords.
The password file set by the htaccess option can be some regular Apache htaccess file (given you avoid md5 encryption). It can even be edited using the htpasswd commandline tool (if installed).
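If htpasswd is not at hand, an SHA1 entry can also be computed in Python. The sketch below assumes that auth_type = sha1 expects Apache's {SHA} format (base64 of the raw SHA-1 digest, as written by htpasswd -s); the text above does not spell this out. The function name sha1_entry and the user name are hypothetical.

```python
import base64
import hashlib


def sha1_entry(username, password):
    """Return a password-file line in Apache's {SHA} format.

    Assumption: this is the format expected for auth_type = sha1.
    """
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    return "%s:{SHA}%s" % (username, encoded)


print(sha1_entry("someuser", "password"))
```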
A typical plain text password file could look like this:
# htaccess
# A password file for the document converter.
# Supported encryption types: plain, crypt, sha1
# Not supported: md5
# You can use htpasswd to edit me.
# All passwords must have same encryption type.
bird:bebop
ornette:wayout
This file would allow user bird access when authenticating with the plain password bebop. With this setup, anonymous doc conversions are not possible.
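A client then has to send the credentials with every request. Most HTTP libraries do this for you; as a minimal sketch of what happens underneath, the standard Authorization header for HTTP basic auth can be built as follows (the function name basic_auth_header is made up for illustration).

```python
import base64


def basic_auth_header(username, password):
    """Return the HTTP basic auth header for the given credentials."""
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + token}


# Credentials taken from the sample password file above.
headers = basic_auth_header("bird", "bebop")
print(headers)
```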
Of course you can pick a different WSGI filter to protect your document conversion server, but this one is already included in ulif.openoffice and might serve for simple use-cases.
Converting Documents¶
Once the server runs, we can start converting docs via HTTP.
The ulif.openoffice WSGI app supports the following HTTP-based protocol to create, update, and remove documents:
HTTP method | Path | Params | Semantics |
---|---|---|---|
GET | /docs/new | none | Get an HTML form to trigger a new conversion. |
POST | /docs | doc, [other...] | Create a new conversion. |
GET | /docs/<docid> | none | Get a cached conversion. |
Currently, removal and updating are not supported.
Creating New Resources¶
Via GET to /docs/new you can get an HTML form usable in a browser to send new documents to the server. This form provides a very limited set of options you can set for the conversion.
>>> url = 'http://localhost/docs/new'
>>> print(browser.GET(url))
200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: ...
<html>
<head>
<title>Create a new doc</title>
</head>
<body>
<form method="POST" action="/docs"
enctype="multipart/form-data">
...
</form>
</body>
</html>
Via a POST to /docs you can send a document to the server that will be converted. The result will be the converted document.
>>> url = 'http://localhost/docs'
>>> form = {'doc': ('sample.txt', 'Some Content'),
... 'oocp-out-fmt': 'html'}
>>> response = browser.POST(url, **form)
>>> response.status
'201 Created'
>>> for key in sorted(response.headers.keys()):
... print("%s: %s" % (key, response.headers.get(key)))
Content-Length: ...
Content-Type: application/zip
ETag: "...-...-..."
Last-Modified: ...
Location: http://localhost:80/docs/78138d2003f1a87043d65c692fb3a64b_1_1
>>> response.body.startswith(b"PK")
True
Here we converted a sample.txt file to HTML. To do that we POSTed a request to the server with two parameters:
- doc - the file to be converted.
- oocp-out-fmt - the output format we want the document to be converted to.
While the doc parameter is mandatory, other parameters are optional. The oocp-out-fmt parameter, for instance, is set to html by default and you don’t have to send it with the request. See ulif.openoffice.processor for the options of different document processors.
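The browser object in the examples above is a test helper. A client without it has to send the same multipart/form-data request itself. The following sketch builds such a request body with the standard library only; the function name encode_multipart is hypothetical, and the generic application/octet-stream content type for the upload is an assumption.

```python
import uuid


def encode_multipart(fields, files):
    """Encode form fields and file uploads as multipart/form-data.

    `fields` maps names to text values; `files` maps names to
    (filename, bytes) pairs. Returns (content_type, body_bytes).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            ('--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n'
             '%s\r\n' % (boundary, name, value)).encode("utf-8"))
    for name, (filename, content) in files.items():
        head = ('--%s\r\nContent-Disposition: form-data; name="%s"; '
                'filename="%s"\r\n'
                'Content-Type: application/octet-stream\r\n\r\n'
                % (boundary, name, filename)).encode("utf-8")
        parts.append(head + content + b"\r\n")
    # The closing delimiter carries two extra dashes.
    parts.append(("--%s--\r\n" % boundary).encode("utf-8"))
    return "multipart/form-data; boundary=%s" % boundary, b"".join(parts)


content_type, body = encode_multipart(
    {"oocp-out-fmt": "html"},
    {"doc": ("sample.txt", b"Some Content")})
```

The resulting pair can be handed to any HTTP client, for instance as the data and Content-Type header of a POST request to the /docs URL of your server.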
With the response we not only get the converted document (packed into a ZIP file), but also some helpful information:
By stating 201 Created, the server indicates that the converted document was cached after creation and can be retrieved in the future from the URI given in the Location header.
Note
The cached location for later retrieval of the generated document works only if caching is enabled for the REST server. If it is not, you will get status 200 OK and no Location header instead.
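Client code can tell the two cases apart by looking at the status line. A minimal sketch, with the hypothetical helper cached_location taking a status string and a plain dict of headers:

```python
def cached_location(status, headers):
    """Return the URI of the cached document, or None.

    `status` is a status line such as '201 Created'; `headers` is a
    dict of response headers.
    """
    code = int(status.split()[0])
    if code == 201:
        # Converted and cached: the Location header says where.
        return headers.get("Location")
    if code == 200:
        # Converted, but the server runs without a cache.
        return None
    raise ValueError("unexpected status: %s" % status)


print(cached_location("200 OK", {}))
```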
To get a complete list of supported document processing options you can run:
(py27) $ oooclient --help
The WSGI document converter accepts all short options (the ones with a leading single dash) with the leading dash removed. For example, while oooclient accepts -oocp-out-fmt and --oocp-output-format, the WSGI app accepts only oocp-out-fmt without the leading dash. The same applies to all other options listed by oooclient --help.
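When option names are copied from the oooclient --help output, the renaming rule can be applied mechanically. The helper to_option_keys below is a hypothetical sketch of that rule: it strips the single leading dash and rejects long options, which the WSGI app does not accept.

```python
def to_option_keys(cli_options):
    """Map short CLI options like '-oocp-out-fmt' to request keys."""
    result = {}
    for name, value in cli_options.items():
        if name.startswith("--") or not name.startswith("-"):
            raise ValueError("expected a short option, got %r" % name)
        result[name[1:]] = value
    return result


options = to_option_keys({"-oocp-out-fmt": "pdf", "-meta-procord": "oocp"})
print(options)
```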