ulif.openoffice.cachemanager – A Cache Manager

class ulif.openoffice.cachemanager.Bucket(path)

A bucket in which we store files with the same hash sum.

Warning

Bucket is not thread-safe!

A bucket is a directory in the filesystem in which you can store triples

(source_file, representation_file, key)

where key must be a unique identifier for a certain representation. Each representation (and hence key) belongs to one source_file, while each source_file can be the base for multiple representations and their respective keys.

Keys can be very large, so they can be passed in either as strings (byte streams) or as file-like objects.

The main methods to feed a bucket or retrieve stored data are store_representation() and get_representation().

For each stored document you get a bucket key, which can be used later to retrieve the stored data.
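Because keys may arrive either as strings or as file-like objects, code that handles them has to normalize both forms before comparing them. The following is a minimal sketch of such a normalization, not the implementation used by Bucket; the helper name key_to_bytes and the UTF-8 default are assumptions made for illustration.

```python
import io


def key_to_bytes(repr_key, encoding="utf-8"):
    """Return repr_key as bytes (hypothetical helper, not part of Bucket).

    Accepts a str, a bytes object, or a file-like object opened for reading.
    """
    if hasattr(repr_key, "read"):
        # File-like object: read its whole content.
        data = repr_key.read()
    else:
        data = repr_key
    if isinstance(data, str):
        data = data.encode(encoding)
    return data


# A key given as a string and the same key given as a stream compare equal.
same = key_to_bytes("out_fmt=pdf") == key_to_bytes(io.BytesIO(b"out_fmt=pdf"))
```

For very large keys a real implementation would more likely compare chunk by chunk instead of reading everything into memory.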

create()

Create the default dirs for this bucket.

This method is called when instantiating a bucket.

You should therefore be aware that constructing a bucket will try to modify the file system.

get_current_repr_num(num)

Get current representation num for source number num.

Get the current number of representations stored for source number num. num is expected to be an integer.

Returns an integer.

get_current_source_num()

Get current source num.

Returns an integer.

get_representation(bucket_key)

Get path to representation identified by bucket_key.

If no such representation is stored, None is returned.

get_stored_repr_num(src_num, repr_key)

Find a representation number for source number src_num.

If a representation with key repr_key is already stored in the bucket for source number src_num, the number of that representation is returned.

If no such key can be found for the given source, you will get None.

get_stored_source_num(src_path)

Tell whether a file like that in src_path is already stored.

Stored files and the file in src_path are compared by content. That means that os.stat attributes, filename, etc. do not matter.

Returns the number of the stored source if found, None otherwise.
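Comparison by content can be sketched with the standard library as follows. This only illustrates the idea and is not how Bucket is actually implemented; the function names same_content and find_stored_source, and the plain list of stored paths, are assumptions.

```python
import filecmp


def same_content(path_a, path_b):
    """Return True if both files hold identical bytes."""
    # shallow=False makes filecmp compare the file contents rather than
    # accepting matching os.stat() signatures (type, size, mtime) as proof.
    return filecmp.cmp(path_a, path_b, shallow=False)


def find_stored_source(src_path, stored_paths):
    """Return the position of the first stored file equal to src_path.

    Returns None if no stored file has the same content.
    """
    for num, stored in enumerate(stored_paths):
        if same_content(src_path, stored):
            return num
    return None
```

With this approach two files with different names or timestamps still count as the same source as long as their bytes match.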

keys()

Get a generator of all bucket keys available in this bucket.

set_current_repr_num(num, value)

Set current representation num for source number num to value.

Indicate that source number num has value representations. Both parameters are expected to be integers.

set_current_source_num(num)

Set current source num.

store_representation(src_path, repr_path, repr_key='')

Store a representation for a source under a representation key.

repr_key can be a string or some file-like object already opened for reading.

Sources are only stored if they do not exist already.

A source is considered already stored if the contents of the file given in src_path equal the contents of an already stored file.

Representations and their respective files are created if they do not exist already and overwritten otherwise.

A representation is considered to exist already if a representation with the same repr_key as passed in is already stored.

Returns a bucket key.

class ulif.openoffice.cachemanager.CacheManager(cache_dir, level=1)

A cache manager.

This cache manager caches processed files and their sources. It uses hashes and buckets to find paths of cached files quickly.

Overall it maps input files plus a (possibly huge) key to output files. The cache manager is useful when the computation of an output file is expensive but must be repeated often.

A sample application is caching converted office files: as the conversion is expensive, we can store its results in the cache manager and retrieve them much more quickly whenever we need them. See cachemanager.txt for more information.

It also checks for hash collisions: if two input files give the same hash, they will be handled correctly.

get_cached_file(cache_key)

Get the representation stored for cache_key.

Returns the path to a file represented by cache_key or None if no such representation is stored in cache already.

get_cached_file_by_source(source_path, repr_key='')

Get the representation stored for a source file and a key.

New in version 1.1.

Returns (<path>, <cache_key>) where <path> is the path to a file represented by source_path and repr_key. <cache_key> is the key you can use with get_cached_file() to get cached files much more quickly. Both values are None if no such representation is stored in cache already.

Does basically the same as get_cached_file() but without a cache_key. Instead the source file is examined again and we look for a representation matching the repr_key. In other words: we find docs that have been registered already with source file and repr_key.

Note

This method is much more expensive than get_cached_file(). Please use it only if the cache_key cannot be determined otherwise.

classmethod get_hash(path)

Get the hash of a file stored in path.

Currently we compute the MD5 digest.

Note for derived classes: the hash digest computed by this method should contain only characters that can easily be processed as path elements in URLs. For instance, slashes (which can occur in Base64-encoded strings) could make things difficult.
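As an illustration of a digest that meets this requirement, the sketch below computes an MD5 hex digest with hashlib. It is not taken from the CacheManager source; the function name file_md5 and the chunk size are assumptions. A hex digest consists only of the characters 0-9 and a-f, so it is safe to use as a path element.

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of the file at path (illustrative helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as fd:
        # Read in chunks so that large files need not fit into memory.
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A derived class that switches to another algorithm could keep the same property by returning hexdigest() rather than a Base64 encoding.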

keys()

Get a list of all cache keys currently available.

register_doc(source_path, to_cache, repr_key='')

Store a representation of the file found in source_path in a bucket. The representation itself resides in the path to_cache.

repr_key can be a string or a file-like object opened for reading. It must be unique to that particular representation of the file in source_path.

Returns a marker string which can be used in connection with the appropriate cache manager methods to retrieve the representation later on.
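The methods documented above can be combined into a look-up-then-register workflow. The following sketch relies only on the signatures listed in this section; cached_convert and its convert argument are hypothetical names, and convert stands for whatever expensive conversion an application performs.

```python
def cached_convert(cache_manager, source_path, convert, repr_key=''):
    """Return (path, cache_key) of a representation, converting only on a miss.

    cache_manager is expected to offer the CacheManager methods documented
    above. convert is a hypothetical callable that takes a source path and
    returns the path of the freshly created output file.
    """
    # Expensive lookup by source content and key; (None, None) on a miss.
    path, cache_key = cache_manager.get_cached_file_by_source(
        source_path, repr_key)
    if path is not None:
        return path, cache_key
    # Cache miss: do the expensive work once and register the result.
    result_path = convert(source_path)
    cache_key = cache_manager.register_doc(source_path, result_path, repr_key)
    return cache_manager.get_cached_file(cache_key), cache_key
```

Callers that keep the returned cache_key can use get_cached_file() directly on later requests and avoid the more expensive lookup by source.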

ulif.openoffice.cachemanager.get_marker(options={})

Compute a unique marker for a set of options.

The returned marker is a string suitable for use in filesystems. Different sets of options will result in different markers; the order of options does not matter.

In ulif.openoffice we use the marker to feed the cache manager and to distinguish different results for the same input file, as different option sets will result in different output for the same input.
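One way to obtain an order-independent, filesystem-friendly marker is to sort the options before hashing them. The sketch below shows that technique only; it is an assumption about how such a marker could be built, not the algorithm get_marker() actually uses, and sketch_marker is a hypothetical name.

```python
import hashlib


def sketch_marker(options=None):
    """Return a filesystem-friendly string for a dict of options."""
    options = options or {}
    # Sorting the items makes the result independent of insertion order.
    canonical = repr(sorted((str(k), str(v)) for k, v in options.items()))
    # A hex digest contains only 0-9 and a-f, which is safe in file names.
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```

Note that this sketch converts keys and values with str(), so options that differ only in type (for example 1 and '1') would yield the same marker.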