mxTextTools

mxTextTools - Fast Text Processing for Python

mxTextTools is an extension package for Python that provides high-performance text manipulation and searching algorithms, in addition to a very flexible and extendable state machine, the Tagging Engine, which allows scanning and processing text at C speeds.
Version: 3.2.0

Introduction

mxTextTools is a collection of high-speed string manipulation routines and new Python objects for dealing with common text processing tasks.

Tagging Engine

One of the major features of this package is the integrated mxTextTools Tagging Engine, which gives you the speed of compiled C programs while maintaining the portability of Python. The Tagging Engine uses byte code "programs" written in the form of Python tuples. These programs are compiled into an internal binary form, which is then processed by a very fast virtual machine designed specifically for scanning text data.

As a result, the Tagging Engine allows parsing text at higher speeds than, for example, regular expression packages, while still maintaining the flexibility of programming the parser in Python. Callbacks and user-defined matching functions extend this approach far beyond what you could do with other common text processing methods.

About the word "tagging": it originates from what is done in SGML, HTML and XML, namely marking text with extra information. The Tagging Engine abstracts this notion to assigning Python objects to text substrings. Every substring marked in this way carries a "tag" (the tag object), which can be used to do all kinds of useful things.
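To make the idea of tuple-based "programs" and tagged substrings more concrete, here is a minimal pure-Python sketch. It is a toy interpreter written for illustration only: the command names (`ALL_IN`, `WORD`), the three-element entry layout, and the `run_table` function are assumptions made for this example and are not the mxTextTools API. The real Tagging Engine compiles its tables and runs them in C; please refer to the User Manual for the actual table format and commands.

```python
# Toy illustration only: a pure-Python interpreter for tuple-based "programs".
# The command names and entry layout are assumptions made for this sketch,
# not the mxTextTools API.

ALL_IN = "all_in"   # match one or more characters taken from a given set
WORD = "word"       # match a literal string


def run_table(text, table, pos=0):
    """Run each (tag, command, argument) entry of the table in order.

    Returns (success, taglist, next_pos). The taglist holds one
    (tag, left, right) triple for every matched substring whose
    tag is not None, so that text[left:right] is the tagged slice.
    """
    taglist = []
    for tag, command, argument in table:
        start = pos
        if command == ALL_IN:
            while pos < len(text) and text[pos] in argument:
                pos += 1
            matched = pos > start
        elif command == WORD:
            matched = text.startswith(argument, pos)
            if matched:
                pos += len(argument)
        else:
            raise ValueError("unknown command: %r" % (command,))
        if not matched:
            # Stop at the first entry that fails to match.
            return False, taglist, start
        if tag is not None:
            taglist.append((tag, start, pos))
    return True, taglist, pos


# A "program" written as a tuple of tuples: key, "=", numeric value.
key_value_table = (
    ("key", ALL_IN, "abcdefghijklmnopqrstuvwxyz"),
    (None, WORD, "="),
    ("value", ALL_IN, "0123456789"),
)

text = "width=120"
success, taglist, next_pos = run_table(text, key_value_table)
tagged = [(tag, text[left:right]) for tag, left, right in taglist]
```

In this sketch, the tag objects are plain strings, but any Python object could be used in their place. Running the table against "width=120" tags the slice 0:5 as "key" and the slice 6:9 as "value"; the "=" is matched but left untagged because its tag is None.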

Search Objects

The two other major features of mxTextTools are the search and character set objects provided by the package. Both are implemented in C to give you maximum performance on all supported platforms.

Using mxTextTools for Language Parsing

At EuroPython 2007, we gave a talk about mxTextTools and how it can be used to parse languages. Please see our Presentations & Talks section for details.

Features

  • Fast, memory efficient, highly customizable.
  • High-performance text scanner that runs compiled byte-code on a portable virtual machine.
    • Allows writing scanners that work at C speed without the need to drop to C for programming.
    • Faster and more flexible than regular expressions.
  • Fast search objects.
  • Efficient character set matching objects.
  • Handy routines for everyday text manipulation work.
  • Works on 8-bit strings as well as Unicode text.
  • Stable, robust and portable.
  • Free to use and redistribute.

System Requirements

mxTextTools is written in a very portable way and works on pretty much all platforms where you can compile Python.

We provide precompiled versions of mxTextTools for all standard platforms, so all you need is a working Python installation. The package supports all Python versions since Python 2.1.

The only requirement for compiling the package from source is an ANSI C compiler. There are no third-party libraries needed.

License

mxTextTools is provided as part of the eGenix.com mx Base Distribution. Please see the mx Base Distribution page for details regarding the license.

Documentation

The following documentation is available for mxTextTools:

mxTextTools User Manual and Reference Guide - HTML and PDF

This manual includes a discussion of the design principles behind the mxTextTools Tagging Engine and the search objects, their implementation, and a reference of the available programming interfaces.

The PDF file is also available as part of the installation and can be found in the mx/TextTools/Doc/ folder.

Books

If you are looking for more tutorial-style documentation of mxTextTools, David Mertz's book Text Processing in Python covers mxTextTools and other text-oriented tools at great length.

Download & Installation

mxTextTools is provided as part of the eGenix.com mx Base Distribution. Please see the mx Base Distribution page for downloads and installation instructions.

References

mxTextTools was originally written for the eGenix.com Application Server to allow fast templating of web pages and related resources.

Since then, it has been used in a wide variety of other areas. Some notable and publicly available applications using mxTextTools are BioPython (Andrew Dalke's Martel uses it as its parsing engine) and SimpleParse (Mike Fletcher's parser generator for mxTextTools, which he uses for parsing VRML files). Also see David Mertz's article about it on IBM developerWorks.

History & Changes

Please see the change log for details regarding changes to the package between releases.