mxTextTools™ is a collection of high-speed string manipulation routines and new Python objects for dealing with common text processing tasks.
One of the major features of this package is the integrated mxTextTools Tagging Engine, which gives you the speed of compiled C programs while maintaining the portability of Python. The Tagging Engine uses byte code "programs" written in the form of Python tuples. These programs are then compiled into an internal binary form which is processed by a very fast virtual machine designed specifically for scanning text data.
As a result, the Tagging Engine allows parsing text at higher speeds than e.g. regular expression packages while still maintaining the flexibility of programming the parser in Python. Callbacks and user-defined matching functions extend this approach far beyond what you could do with other common text processing methods.
The word "tagging" originates from what is done in SGML, HTML and XML, namely marking text with certain extra information. The Tagging Engine abstracts this notion to assigning Python objects to text substrings. Every substring marked in this way carries a 'tag' (the tag object) which can be used to do all kinds of useful things.
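To make the idea of tuple-based programs and tagged substrings more concrete, here is a minimal pure-Python sketch. It is not the mxTextTools API and does not use the C engine: the command names `ALL_IN` and `WORD`, the helper `run_table`, and the three-element result entries are all assumptions made for illustration. It only shows how a "program" expressed as Python tuples can assign tag objects to slices of a text.

```python
# Toy illustration only: this is NOT the mxTextTools API or its C engine.
# The command names and the run_table helper are hypothetical.
import string

ALL_IN = "all_in"   # match one or more characters drawn from a set
WORD = "word"       # match a literal string

def run_table(text, table, pos=0):
    """Apply each (tagobj, command, argument) entry in order.

    Returns (success, taglist, next_pos), where taglist holds
    (tagobj, left, right) triples for every tagged substring.
    """
    taglist = []
    for tagobj, command, argument in table:
        start = pos
        if command == ALL_IN:
            while pos < len(text) and text[pos] in argument:
                pos += 1
            matched = pos > start
        elif command == WORD:
            matched = text.startswith(argument, pos)
            if matched:
                pos += len(argument)
        else:
            raise ValueError(f"unknown command: {command!r}")
        if not matched:
            # Stop at the first entry that fails to match.
            return False, taglist, start
        if tagobj is not None:
            # Record which slice of the text carries this tag object.
            taglist.append((tagobj, start, pos))
    return True, taglist, pos

# A "program" written as plain Python tuples: key, '=', value.
key_value_table = (
    ("key", ALL_IN, string.ascii_letters),
    (None, WORD, "="),
    ("value", ALL_IN, string.digits),
)

text = "width=640"
success, taglist, next_pos = run_table(text, key_value_table)
tagged = [(tag, text[left:right]) for tag, left, right in taglist]
```

In this sketch the tag objects are plain strings, but any Python object could take their place, which is the abstraction described above. The real engine compiles such tables and runs them in C rather than interpreting them in a Python loop.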
The two other major features of mxTextTools are the search and character set objects provided by the package. Both are implemented in C to give you maximum performance on all supported platforms.
At EuroPython 2007, we gave a talk about mxTextTools and how it can be used to parse languages. Please see our Presentations & Talks section for details.
mxTextTools is written in a very portable way and works on pretty much all platforms where you can compile Python.
We provide precompiled versions of mxTextTools for all standard platforms, so all you need is a working Python installation. The package supports all Python versions since Python 2.1.
The only requirement for compiling the package from source is an ANSI C compiler. There are no third-party libraries needed.
mxTextTools is provided as part of the eGenix.com mx Base Distribution. Please see the mx Base Distribution page for details regarding the license.
The following documentation is available for mxTextTools:
mxTextTools User Manual and Reference Guide - HTML and PDF
This manual includes a discussion of the various design principles behind mxTextTools Tagging Engine and the search objects, their implementation, as well as a reference of the available programming interfaces.
The PDF file is also available as part of the installation and can be found in the mx/TextTools/Doc/ folder.
If you are looking for more tutorial-style documentation of mxTextTools, David Mertz's book Text Processing in Python covers mxTextTools and other text-oriented tools at great length.
mxTextTools is provided as part of the eGenix.com mx Base Distribution. Please see the mx Base Distribution page for downloads and installation instructions.
mxTextTools was originally written for the eGenix.com Application Server to allow fast templating of web pages and related resources.
Since then, it has been used in a wide variety of other areas. Some notable and publicly available applications using mxTextTools are BioPython (Andrew Dalke's Martel uses it as its parsing engine) and SimpleParse (Mike Fletcher's parser generator for mxTextTools, which he uses for parsing VRML files). Also see David Mertz's article about it on IBM developerWorks.
Please see the change log for details regarding changes to the package between releases.