Python - Our key to Efficiency

eGenix.com · Website 2002 · Python · mxBeeBase - On-disk B+Tree Based Database Kit

mxBeeBase - On-disk B+Tree Based Database Kit

Interface ( BeeDict : BeeStorage : BeeIndex ) Examples : Structure : Support : Download : Copyright & License : History : Home

Version 2.0.2

Introduction

XXX This documentation is unfinished... the software itself has been in use for quite a while, but the documentation still needs to be completed. For now, all I can offer you is to read the code which is well documented.

mxBeeBase is a high performance construction kit for disk based indexed databases. It offers components which you can plug together to easily build your own custom mid-sized databases (the current size limit is sizeof(long) which gives you an address range of around 2GB on 32-bit platforms).

The two basic building blocks in mxBeeBase are storage and index. Storage is implemented as variable record length data storage with integrated data protection features, automatic data recovery and locking for multi process access. Indexes use a high performance optimized B+Tree implementation built on top of Thomas Niemann's Cookbook B+Tree implementation.

Note: Due to the file locking mechanism currently used in mxBeeBase, the package will only work on systems which support symbolic links, e.g. Windows does not support symbolic links and therefore parts of mxBeeBase don't work on that platform.

BeeDict Module

The BeeDict module provides two high level on-disk dictionary implementations. The first (BeeDict) can work with arbitrary hashable key objects, while the second (BeeStringDict) uses limited sized strings as basis providing slightly better performance. Both variants need pickleable Python objects as keys and values.

Data transfer to and from the dictionaries is done in the same way as for in-memory dictionaries, e.g. d['key'] = 1; print d['key']; del d['key], so usage should be transparent to Python programs using either in-memory or on-disk dictionaries. Not all dictionary methods are understood though.

BeeDict Objects

BeeDict objects are on-disk dictionaries which use a hash-to-address index. Both Keys and values must be pickleable and can have arbitrary size (keys shouldn't be too long though); keys have to be hashable.

Hash collisions are treated by sequential reads of all records with the same hash value and testing for equality of keys. This can be expensive !

BeeDicts use a BeeStorage.BeeKeyValueStorage instance as storage object and a BeeIndex.BeeIntegerIndex instance as index object.

BeeDict objects are constructed using:

BeeDict(name,min_recordsize=0,readonly=0,recover=0, autocommit=0,validate=0)

Create an instance using name as basename for the data and index files. Two files will be created: name.dat and name.idx.

min_recordsize is passed to the BeeStorage as indicator of the minimum size for data records. readonly can be set to true to open the files in read-only mode, preventing any disk modifications.

To open the dictionary in recovery mode, pass a keyword recover=1. Then run .recover() and reopen using the normal settings. The AutoRecover() wrapper can take care of this action for you automatically.

If autocommit is true the cache control will do an automatic .commit() whenever the transaction log overflows.

If validate is true, the dictionary will run a validation check after having successfully opened storage and index. RecreateIndexError or RecoverError exceptions could be raised in case inconsistencies are found.

BeeDict Instance Methods

BeeDict objects have the following methods:

flush()

Flush buffers to disk.

close()

Flush buffers and close.

This does not issue a .commit(), so the current transaction is rolled back.

commit()

Commits all changes and starts a new transaction.

rollback()

Rolls back the current transaction. All changes are undone.

changed()

Return true in case the current transaction includes changes to the database, false otherwise.

has_key()

Check if the dictionary has an item indexed by key.

Successfully found items are put in the cache for fast subsequent access.

get(key,default=None)

Get item indexed by key from the dictionary or default if no such item exists.

This first tries to read the item from cache and reverts to the disk storage if it is not found.

cursor(key,default=None)

Return a BeeDictCursor instance for the dictionary.

In case the key is not found, default is returned instead.

Note that cursors operate with the data on disk meaning that any uncommitted changes will not be seen by the cursor.

garbage()

Determine the amount of garbage in bytes that has accumulated in the storage file.

This amount would be freed if .collect() were run.

collect()

Run the storage garbage collector.

Storage collection can only be done for writeable dictionaries and then only if the current transaction does not contain any pending changes.

This can take a while depending on the size of the dictionary.

recover()

Recover all valid records and recreate the index.

keys()

Return a list of keys.

The method does not load any data into the cache, but does take notice of uncommitted changes.

values()

Return a list of values.

The method does not load any data into the cache, but does take notice of uncommitted changes.

items()

Return a list of items.

The method does not load any data into the cache, but does take notice of uncommitted changes.

BeeDict Instance Attributes

BeeDict objects have the following read-only attributes:

name: Basenname of the dictionary. The implementation uses two files for the on-disk representation: name.dat and name.idx.
closed: Closed flag.
readonly: Read-only flag.
autocommit: Autocommit flag.
FirstKey: Special key object useable to represent the first key in the (sorted) B+Tree index.
LastKey: Special key object useable to represent the last key in the (sorted) B+Tree index.

BeeDictCursor Objects

BeeDickCursor objects are intended to iterate over the database one item at a time without the need to read all keys. You can then read/write to the current cursor position and thus modify the dictionary in place.

Note that modifying the targetted dictionary while using a cursor can cause the cursor to skip new entries or fail due to deleted items. Especially deleting the key to which the cursor currently points can cause errors to be raised. In all other cases, the cursor will be repositioned.

BeeDictCursor objects are constructed using the BeeDict .cursor() method.

BeeDictCursor Instance Methods

BeeDictCursor objects have the following methods:

position(key,value=None): Position the index cursor to index[key]. If value is given, index[key] == value is assured.
key may also be FirstKey or LastKey (in which case value has to be None).
next(): Moves to the next entry in the dictionary.
Returns true on success, false if the end-of-data has been reached.
prev(): Moves to the previous entry in the dictionary.
Returns true on success, false if the end-of-data has been reached.
read(): Reads the object from the dict to which the cursor currently points.
read_key(): Reads the key object from the dict to which the cursor currently points.
write(): Writes the object to the dict under the key to which the cursor currently points.
The new data is not written to disk until the dictionary's current transaction is committed.

BeeDictCursor Instance Attributes

BeeDictCursor don't have any useful attributes. Use the instance methods to query the key and value objects.

BeeStringDict Objects

BeeStringDict objects are on-disk dictionaries which use a limited size string to address index. Values must be pickleable and can have arbitrary size.

Since hash collisions cannot occur this dictionary type may have some performance advantages over the standard BeeDict dictionary.

BeeStringDict objects are constructed using:

BeeStringDict(name,keysize=10,min_recordsize=0,readonly=0, recover=0,autocommit=0,validate=0)

Create an instance using name as basename for the data and index files. Two files will be created: name.dat and name.idx.

keysize gives the maximal size of the strings used as index keys. min_recordsize is passed to the BeeStorage as indicator of the minimum size for data records. readonly can be set to true to open the files in read-only mode, preventing any disk modifications.

If autocommit is true the cache control will do an automatic .commit() whenever the transaction log overflows.

Note that the keysize is currently not stored in the dictionary itself -- you'll have to store this information in some other form. This may change in future versions.

BeeStringDict Instance Methods

BeeStringDict objects have the following methods:

commit()

Commits all changes and starts a new transaction.

rollback()

Rolls back the current transaction. All changes are undone.

changed()

Return true in case the current transaction includes changes to the database, false otherwise.

has_key()

Check if the dictionary has an item indexed by key.

Successfully found items are put in the cache for fast subsequent access.

get(key,default=None)

Get item indexed by key from the dictionary or default if no such item exists.

This first tries to read the item from cache and reverts to the disk storage if it is not found.

cursor(key,default=None)

Return a BeeStringDictCursor instance for the dictionary.

In case the key is not found, default is returned instead.

Note that cursors operate with the data on disk meaning that any uncommitted changes will not be seen by the cursor.

garbage()

Determine the amount of garbage in bytes that has accumulated in the storage file.

This amount would be freed if .collect() were run.

collect()

Run the storage garbage collector.

Storage collection can only be done for writeable dictionaries and then only if the current transaction does not contain any pending changes.

This can take a while depending on the size of the dictionary.

recover()

Recover all valid records and recreate the index.

keys()

Return a list of keys.

The method will raise an error if there are uncommitted changes pending. Output is sorted ascending according to keys.

values()

Return a list of values.

The method will raise an error if there are uncommitted changes pending. Output is sorted ascending according to keys.

items()

Return a list of items.

The method will raise an error if there are uncommitted changes pending. Output is sorted ascending according to keys.

BeeStringDict Instance Attributes

BeeDict objects have the following read-only attributes:

name: Basenname of the dictionary. The implementation uses two files for the on-disk representation: name.dat and name.idx.
closed: Closed flag.
readonly: Read-only flag.
autocommit: Autocommit flag.
FirstKey: Special key object useable to represent the first key in the (sorted) B+Tree index.
LastKey: Special key object useable to represent the last key in the (sorted) B+Tree index.

BeeStringDictCursor Objects

BeeStringDictCursor objects are constructed using the BeeStringDict .cursor() method.

BeeStringDictCursor Instance Methods

BeeStringDictCursor objects have the following methods:

position(key,value=None): Position the index cursor to index[key]. If value is given, index[key] == value is assured.
key may also be FirstKey or LastKey (in which case value has to be None).
next(): Moves to the next entry in the dictionary.
Returns true on success, false if the end-of-data has been reached.
prev(): Moves to the previous entry in the dictionary.
Returns true on success, false if the end-of-data has been reached.
read(): Reads the object from the dict to which the cursor currently points.
read_key(): Reads the key object from the dict to which the cursor currently points.
write(): Writes the object to the dict under the key to which the cursor currently points.
The new data is not written to disk until the dictionary's current transaction is committed.

BeeStringDictCursor Instance Attributes

BeeStringDictCursor objects have the following read-only attributes:

key: Key string which dereferences to the current cursor position.

Functions

These functions are available:

AutoRecover(Class,*args,**kws): Wrapper that can be used around the dictionary constructors to provide automatic recovery whenever needed (if possible).
Example: diskdict = AutoRecover(BeeDict.BeeDict, 'test')

Constants

These constants are available:

Error: Baseclass for errors related to this module. It is a subclass of exceptions.StandardError.
RecreateIndexError: This error is raised in case the index for a dictionary was not found and/or needs to be recreated by running recovery. It is a subclass of Error.
RecoverError: This error is raised in case the storage for a dictionary was found to be in an inconsistent state. It is a subclass of Error.
FirstKey: Special index key object representing the key of the first entry in the index (B+Tree's sort their data).
LastKey: Special index key object representing the key of the last entry in the index (B+Tree's sort their data).

BeeStorage Module

XXX Not yet documented. Please refer to the source code.

BeeIndex Module

XXX Not yet documented. Please refer to the source code.

Examples of Use

Here is a very simple one:

from mx import BeeBase

#...XXX not written yet...

More examples will eventually appear in the Examples subdirectory of the package.

Package Structure

[BeeBase]
       Doc/
       [mxBeeBase]
              test.py
       BeeBase.py
       BeeDict.py
       BeeIndex.py
       BeeStorage.py
       showBeeDict.py

Entries enclosed in brackets are packages (i.e. they are directories that include a __init__.py file). Ones without brackets are just simple subdirectories that are not accessible via import.

The package imports all symbols from the BeeBase sub module which in turn imports the extension module, so you only need to 'import BeeBase' to start working.

Support

eGenix.com is providing commercial support for this package. If you are interested in receiving information about this service please see the eGenix.com Support Conditions.

Copyright & License

This software is covered by the eGenix.com Public License Agreement. The text of the license is also included as file "LICENSE" in the package's main directory.

By downloading, copying, installing or otherwise using the software, you agree to be bound by the terms and conditions of the eGenix.com Public License Agreement.

Parts of this package are based on an ANSI C implementation of a B+Tree implementation written by Thomas Nieman, Portland, Oregon. The files in question are btr.c and btr.h which were heavily modified for the purpose of inclusion in this package by the above author.

The original files were extracted from btr.c -- an ANSI C implementation included in the source code distribution of

        SORTING AND SEARCHING ALGORITHMS: A COOKBOOK

	by THOMAS NIEMANN Portland, Oregon 
	home: http://epaperpress.com/

From the cookbook:

Permission to reproduce this document, in whole or in part, is given provided the original web site listed below is referenced, and no additional restrictions apply. Source code, when part of a software project, may be used freely without reference to the author.

History & Future

Things that still need to be done:

Provide more examples.
Test the package some more.
Write more documentation.

There were no significant changes between 2.0.0 and 2.0.3.

Version 2.0.0 was the intial release.

Contact : Impressum : Terms & Conditions : Privacy Policy : Trademarks 2025-04-01