Version 0.3.0
mxTidy provides a Python interface to a thread-safe, library version of the HTML Tidy command line tool.
HTML Tidy helps you clean up coding errors in HTML and XML files and produce well-formed HTML, XHTML or XML as output. This allows you to preprocess web pages for inclusion in XML repositories and to prepare broken XML files for validation. It also makes it possible to write converters from well-known word processing applications such as MS Word to other structured data representations, using XML as an intermediate format.
During the development of this interface, the original HTML Tidy version was significantly modified to turn it from a single run, command line tool into a thread-safe C library which not only interfaces to files, but also to memory buffers.
Most of mxTidy's operations are automatic or can be controlled through a large number of configuration options. It also gives you access to the error and warning information generated by HTML Tidy.
HTML Tidy is very good at trying to restructure the HTML or XML input, but unfortunately not too fast at it. The main reason for this is the single character input/output strategy used in the code which causes quite a few C function calls.
Changing the code to use a buffer and pointer strategy would enhance the performance, but requires a lot of work.
The memory requirements in string to string mode amount to about twice the size of the input string in addition to the parser tree overhead. In file to file mode, only the tree overhead is introduced.
Note that the current releases reconfigure HTML Tidy for every run which causes additional overhead.
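Since file to file mode avoids holding the output in a string, it may be the better choice for large documents. The following is a minimal sketch of that mode, based on the tidy() signature documented in the Functions section. The helper name tidy_file and its tidy_func parameter are illustrative and not part of mxTidy; the parameter only exists so that the sketch can be exercised without the extension module.

```python
def tidy_file(src_path, dst_path, tidy_func=None, **options):
    """Run Tidy in file to file mode and return (nerrors, nwarnings, errordata)."""
    if tidy_func is None:
        # Assumes mxTidy is installed and importable as mx.Tidy.
        from mx.Tidy import tidy as tidy_func
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        # The markup is written to dst, so outputdata is expected to be None.
        # No errors file is passed, so diagnostics come back in errordata.
        nerrors, nwarnings, outputdata, errordata = tidy_func(src, dst, **options)
    return nerrors, nwarnings, errordata
```

A call such as tidy_file("in.html", "out.html", output_xhtml=1) would then leave the cleaned markup in out.html; both file names are placeholders.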
The mxTidy package defines the following interfaces. Most
important are the HTML Tidy options which the interface
functions allow you to pass to the underlying HTML Tidy
engine.
Most of the original HTML Tidy options are also
available in the mxTidy interface; some options have
been removed, though, since they don't map well to an
embedded module, e.g. there is no configuration file
support and the slide bursting options have also been
removed.
The following options are available. The default values
used in mxTidy are given in parentheses. Note that some
options have different defaults than in the command line
version of HTML Tidy.
For more information about the background and workings
of HTML Tidy, please see the HTML Tidy Overview which is
also included in the package.
add_xml_decl: Note that if the input document includes an
<?xml?> declaration then it will appear in the
output regardless of the value of this option.
add_xml_space: This is needed if the whitespace in such elements is to
be parsed appropriately without having access to the
DTD. The default is 0.
assume_xml_procins: The default is 0. This option is automatically
set if the input is in XML.
word_2000: Microsoft has developed its own optional filter for
exporting to HTML, and the 2.0 version is much
improved. You can download the filter free from the
Microsoft Office Update site.
These descriptions were extracted from the HTML Tidy documentation and
fall under the HTML Tidy copyright.
The package defines these functions:
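If an output file is given, the generated markup is written to it;
otherwise it is returned as a string.
The same is true for the error information which Tidy
generates: it is either written to the errors file or returned as a string.
Tidy options can be passed to the function using keyword parameters,
e.g. output_xhtml=1.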
The package defines these constants:
If you find any bugs, please report them to me so that
I can fix them for the next release.
HTML Tidy Options
add_xml_decl (0)
add_xml_space (0)
assume_xml_procins (0)
break_before_br (0)
clean (0)
drop_empty_paras (1)
If set to 1, empty paragraphs are discarded; otherwise they are
replaced by br elements, as HTML4 precludes
empty paragraphs. The default is 1.
drop_font_tags (0)
enclose_block_text (0)
fix_backslash (1)
fix_bad_comments (1)
gnu_emacs (0)
hide_endtags (0)
indent_attributes (0)
input_xml (0)
literal_attributes (0)
logical_emphasis (0)
numeric_entities (0)
output_error (1)
output_markup (1)
output_xhtml (0)
output_xml (0)
quiet (0)
quote_ampersand (1)
quote_marks (0)
quote_nbsp (1)
raw (0)
show_warnings (0)
tidy_mark (0)
uppercase_attributes (0)
uppercase_tags (0)
word_2000 (0)
wrap_asp (1)
wrap_attributes (0)
wrap_jste (1)
wrap_php (1)
wrap_script_literals (0)
wrap_sections (1)
indent_spaces (2)
tab_size (8)
wrap (72)
alt_text (None)
indent ("no")
char_encoding ("ascii")
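Because options are passed as keyword parameters, a misspelled option name is easy to overlook. The sketch below collects the names and defaults from the list above in a dictionary and checks keyword arguments against it before they are handed on. The names TIDY_DEFAULTS and checked_options are illustrative and not part of mxTidy, and how tidy() itself reacts to an unknown keyword is not covered here.

```python
# Flags listed above with a default of 0.
_OFF = """
add_xml_decl add_xml_space assume_xml_procins break_before_br clean
drop_font_tags enclose_block_text gnu_emacs hide_endtags indent_attributes
input_xml literal_attributes logical_emphasis numeric_entities output_xhtml
output_xml quiet quote_marks raw show_warnings tidy_mark
uppercase_attributes uppercase_tags word_2000 wrap_attributes
wrap_script_literals
""".split()

# Flags listed above with a default of 1.
_ON = """
drop_empty_paras fix_backslash fix_bad_comments output_error output_markup
quote_ampersand quote_nbsp wrap_asp wrap_jste wrap_php wrap_sections
""".split()

TIDY_DEFAULTS = dict.fromkeys(_OFF, 0)
TIDY_DEFAULTS.update(dict.fromkeys(_ON, 1))
# Options that take a number, a string or None.
TIDY_DEFAULTS.update(indent_spaces=2, tab_size=8, wrap=72,
                     alt_text=None, indent="no", char_encoding="ascii")


def checked_options(**options):
    """Return options unchanged, or raise TypeError for unlisted names."""
    unknown = sorted(set(options) - set(TIDY_DEFAULTS))
    if unknown:
        raise TypeError("unknown Tidy option(s): " + ", ".join(unknown))
    return options
```

The result can be unpacked into the call, for example tidy(text, **checked_options(output_xhtml=1, indent_spaces=4)).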
Functions
tidy(input, output=None, errors=None, **options)
Returns a tuple (nerrors, nwarnings, outputdata, errordata).
input may be a string or a file open for reading data.
If output is given as a file open for writing, the generated markup is written to this file and outputdata is set to None. Otherwise, output is written to a string which is returned by the function in outputdata.
The same is true for the error information which Tidy generates: it is either written to errors or returned via errordata.
nerrors and nwarnings are integers which are set to the number of errors/warnings which Tidy generated.
Tidy options can be passed to the function using keyword parameters, e.g. output_xhtml=1. Configuration files are not supported by the interface.
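As a minimal sketch of string to string mode, the helper below calls tidy() without output or errors files and unpacks the four-element result. The name tidy_string and the tidy_func parameter are illustrative and not part of mxTidy; the parameter only allows the sketch to run without the extension module. Raising an exception when nerrors is non-zero is a choice made for this sketch, not something the interface prescribes.

```python
def tidy_string(html, tidy_func=None, **options):
    """Clean up markup held in a string and return (markup, diagnostics)."""
    if tidy_func is None:
        # Assumes mxTidy is installed and importable as mx.Tidy.
        from mx.Tidy import tidy as tidy_func
    # With no output or errors file given, both the markup and the
    # diagnostics come back as strings in the result tuple.
    nerrors, nwarnings, outputdata, errordata = tidy_func(html, **options)
    if nerrors:
        raise ValueError("Tidy reported %d error(s):\n%s" % (nerrors, errordata))
    return outputdata, errordata
```

A typical call would be markup, diagnostics = tidy_string(text, output_xhtml=1), where text holds the document to clean up.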
Constants
Error
The package currently does not expose any submodules.
TBD
This snippet is intended to demonstrate typical use
of the mxTidy interface from Python:
>>> from mx.Tidy import *
>>> # To be written...
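As one possible illustration, the sketch below turns the tuple returned by tidy() into a short text report. The function name summarize is illustrative and not part of mxTidy, and the sketch makes no assumption about the format of errordata beyond it being text with one message per line.

```python
def summarize(result):
    """Format a (nerrors, nwarnings, outputdata, errordata) tuple as text."""
    nerrors, nwarnings, outputdata, errordata = result
    lines = ["%d error(s), %d warning(s)" % (nerrors, nwarnings)]
    if errordata:
        # Indent each non-empty diagnostic line below the counts.
        lines.extend("  " + line for line in errordata.splitlines()
                     if line.strip())
    return "\n".join(lines)
```

With mxTidy installed, print(summarize(tidy(text))) would show the counts followed by the diagnostics Tidy produced for text.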
More examples will appear in the Examples subdirectory of the package.
[Tidy] Doc/ [Examples] [mxTidy] libtidy/ test.py Tidy.py
Names with trailing / are plain directories, ones with []-brackets are Python packages, ones with ".py" extension are Python submodules.
The package imports all symbols from the extension module and also registers the types so that they become compatible with the pickle and copy mechanisms in Python.
eGenix.com is providing commercial support for this
package. If you are interested in receiving information
about this service please see the eGenix.com
Support Conditions.
© 2001, Copyright by eGenix.com Software, Skills and
Services GmbH, Langenfeld, Germany; All Rights Reserved.
mailto: info@egenix.com
The mxTidy software and the modifications to the HTML Tidy
source code are covered by the eGenix.com Public License
Agreement. The text of the license is also included
as file "LICENSE" in the package's main directory.
The included HTML Tidy software is covered by the following
license:
By downloading, copying, installing or otherwise using
the software, you agree to be bound by the terms and
conditions of the eGenix.com
Public License Agreement and the above HTML Tidy
license.
Things that still need to be done:
Things that changed from 0.2.0 to 0.3.0:
Things that changed from 0.1.0 to 0.2.0:
Version 0.1.0 was the first public release.
Support
What I'd like to hear from you...
Copyright & License
Copyright (c) 1998-2000 World Wide Web Consortium
(Massachusetts Institute of Technology, Institut National de
Recherche en Informatique et en Automatique, Keio University).
All Rights Reserved.
Contributing Author(s):
Dave Raggett, dsr@w3.org
The contributing author(s) would like to thank all those who
helped with testing, bug fixes, and patience. This wouldn't
have been possible without all of you.
COPYRIGHT NOTICE:
This software and documentation is provided "as is," and
the copyright holders and contributing author(s) make no
representations or warranties, express or implied, including
but not limited to, warranties of merchantability or fitness
for any particular purpose or that the use of the software or
documentation will not infringe any third party patents,
copyrights, trademarks or other rights.
The copyright holders and contributing author(s) will not be
liable for any direct, indirect, special or consequential damages
arising out of any use of the software or documentation, even if
advised of the possibility of such damage.
Permission is hereby granted to use, copy, modify, and distribute
this source code, or portions hereof, documentation and executables,
for any purpose, without fee, subject to the following restrictions:
1. The origin of this source code must not be misrepresented.
2. Altered versions must be plainly marked as such and must
not be misrepresented as being the original source.
3. This Copyright notice may not be removed or altered from any
source or altered source distribution.
The copyright holders and contributing author(s) specifically
permit, without fee, and encourage the use of this source code
as a component for supporting the Hypertext Markup Language in
commercial products. If you use this source code in a product,
acknowledgment is not required but would be appreciated.
History & Future