Jun 04, 2024

State of XML Parsing in Python

by Andrés Barreiro

While JSON is prevalent nowadays as far as data interchange goes, XML retains a crucial role. Its ability to represent complex structures and maintain strict validation through schemas makes it a great fit for some applications. Even outside of the "pure XML" world, some markup parsers can be useful for handling HTML in specific cases.

At Close we handle such content in large amounts. Some examples of common actions:

To handle this data, we settled on lxml, which gave us the combination of features and performance we were looking for.

After using it for some time, we have hit a few hiccups along the way. We are always evaluating our tech stack choices and keeping an eye on the landscape, so now we are looking back at our decisions and wondering: does lxml serve our purposes well? Are there any better alternatives?

In this article, we’ll take a look at the main options for XML parsing in Python, what we don’t like, what we do like and potential future steps.

XML parsing

When looking for an XML parsing library, we want a number of things. Mainly:

Actual parser

The code that can process a string/file containing XML ("deserializing" XML). Its input is considered untrusted, so it must be robust and configurable enough to mitigate security issues while remaining compliant with the specification.

There are multiple preferences when it comes to:

  • How input is read:
    • Non-incremental, everything has to be read in one go. Usually we want to fill a data structure and until that’s done the document is not useful (e.g. a DOM Tree)
    • Incremental, read in chunks. Useful in cases where we want to process data as it’s read (e.g. find matches, fail validation early, process a subtree)
  • Who holds control of parsing:
    • Push/SAX (while reading, parser is in control and feeds data to callbacks)
    • Pull/StAX (reader is in control and requests data from parser)
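To make the push/pull distinction concrete, here is a minimal sketch using only the standard library. Both variants feed the document in chunks (incremental reading); the chunked input and the TagCollector, push_parse and pull_parse names are ours, made up for illustration.

```python
import xml.etree.ElementTree as ET

CHUNKS = ["<items><item>a</item>", "<item>b</item></items>"]


class TagCollector:
    """Push-style target: the parser calls these methods as it reads."""

    def __init__(self):
        self.tags = []

    def start(self, tag, attrib):
        self.tags.append(tag)

    def close(self):
        return self.tags


def push_parse(chunks):
    # The parser is in control and drives the callbacks on the target.
    parser = ET.XMLParser(target=TagCollector())
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()  # returns whatever the target's close() returns


def pull_parse(chunks):
    # The caller is in control and asks the parser for pending events.
    parser = ET.XMLPullParser(events=("end",))
    texts = []
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == "item":
                texts.append(elem.text)
    parser.close()
    return texts


print(push_parse(CHUNKS))
print(pull_parse(CHUNKS))
```

In the push variant the logic lives in callbacks; in the pull variant it stays in an ordinary loop, which tends to be easier to interleave with other work.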

APIs

These define how we'll interface with the document (so e.g. we can query or modify it in a convenient way).

Related to this: the data structure that might be used to hold the resulting artifacts and its API.
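As a rough illustration of what "interfacing with the document" means in practice, this sketch queries and modifies a small tree through the stdlib's ElementTree API. The order/item document is invented for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<order id="42"><item sku="a1">Pen</item><item sku="b2">Ink</item></order>'
)

# Query: collect an attribute from every <item>
skus = [item.get("sku") for item in doc.iter("item")]

# Modify: change the text of the first <item> and append a new one
doc.find("item").text = "Pencil"
ET.SubElement(doc, "item", sku="c3").text = "Paper"

names = [item.text for item in doc.findall("item")]
print(skus, names)
```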

Serialization

A way to generate XML from code ("serializing" to XML).
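A small sketch of serialization with the stdlib's ElementTree follows. The build_contact helper and the contact/email element names are hypothetical; the point is that the library, not the caller, takes care of escaping.

```python
import xml.etree.ElementTree as ET


def build_contact(name, emails):
    """Build a <contact> element and serialize it to a string."""
    contact = ET.Element("contact", {"name": name})
    for address in emails:
        ET.SubElement(contact, "email").text = address
    # encoding="unicode" returns str instead of bytes
    return ET.tostring(contact, encoding="unicode")


xml_text = build_contact("Ada & Co", ["ada@example.com"])
print(xml_text)
```

Parsing the result back yields the original values, so the ampersand in the name survives the round trip.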

Extras

While not part of the core task of "XML parsing", we might be interested in:

  • Schema validation (via XSD, Schematron, etc.)
  • Document transformation (XSLT)
  • Query syntax (XPath)
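To give an idea of what a query syntax buys us, here is a sketch using the limited XPath subset that the stdlib's ElementTree supports (lxml offers a fuller XPath implementation through its own API). The library/book document is made up for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<library>"
    '<book lang="en"><title>Dune</title></book>'
    '<book lang="es"><title>Rayuela</title></book>'
    "</library>"
)

# Select the titles of books whose lang attribute equals "es"
titles_es = [t.text for t in doc.findall(".//book[@lang='es']/title")]
print(titles_es)
```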

Performance is a key consideration too, which applies to all the previous points.

XML parsing in Python

Let’s look at the description of what lxml is, which hints at different pieces at play.

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.

From these concepts let’s build a table with the two main options: lxml and Python Standard Library’s xml module.

Library | Data API     | Data Factory | Parser API               | Actual Parser
xml     | ElementTree  | TreeBuilder  | XMLParser, XMLPullParser | expat
xml     | minidom      | DOMBuilder   | ExpatBuilder, PullDOM    | expat
lxml    | ElementTree* | TreeBuilder* | XMLParser*               | libxml2

* lxml implements its own versions of these

A couple of observations:

  • All of the actual parsing options in Python's stdlib rely on expat
  • xml.dom (minidom’s module) is the "W3C Document Object Model implementation for Python". It translates the spec and the original Java SAX utils more directly to Python, whereas xml.etree (ElementTree's module) offers a more Pythonic interface.
  • Both stdlib/lxml's ElementTree and TreeBuilder are implemented in C (stdlib has a Python implementation for reference purposes, but uses the performant one by default).
  • The stdlib also includes other utils that their parsers build on top of (and can be potentially reused by others), like xml.sax.
  • As it says on the tin, lxml implements (but also tweaks and extends) the stdlib's APIs (for ElementTree, TreeBuilder and XMLParser). It does its best to stay compatible, though.
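The split between the columns of the table can be seen by wiring the stdlib pieces together by hand. This is only a sketch of how they relate (the parse_with_explicit_pieces name is ours); in everyday code ET.fromstring does the same job in one call.

```python
import xml.etree.ElementTree as ET


def parse_with_explicit_pieces(text):
    builder = ET.TreeBuilder()             # data factory: builds Elements
    parser = ET.XMLParser(target=builder)  # parser API, backed by expat
    parser.feed(text)
    return parser.close()                  # root Element (the data API)


root = parse_with_explicit_pieces("<a><b>hi</b></a>")
print(root.tag, root.find("b").text)
```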

What about the other options?

It seems like there are no other real options when it comes to XML parsing in Python. Some related libraries that are commonly listed are:

BeautifulSoup

Delegates the parsing to lxml (available for XML and HTML) and html5lib (for HTML as the name implies). So under the hood, it’s libxml2 and html5lib.

As a side note, libxml2 acknowledges it’s not fit for parsing modern HTML (might be valid for some use cases though) and html5lib is pure Python (so performance may be lacking).

xmltodict

A Python module that makes working with XML feel like you are working with JSON

A nice tool that might fit specific use cases. Under the hood, it’s a combination of Python Standard Library’s utils (xml.sax.saxutils, xml.sax.xmlreader) backed by expat.

What about security?

The default parser configurations are considered non-secure. Python docs recommend using defusedxml.

defusedxml is a pure Python package with modified subclasses of all stdlib XML parsers that prevent any potentially malicious operation.

It is not a parser of its own but a wrapper around the common parsers that sets flags to harden them: forbidding DTDs, external entities, entity expansion, etc.
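To illustrate the kind of hardening involved, here is a minimal sketch that rejects any document carrying a DOCTYPE by hooking expat directly. It is not defusedxml's code, and the DTDForbidden and check_no_dtd names are our own; for real use, the maintained package is the safer choice.

```python
from xml.parsers import expat


class DTDForbidden(ValueError):
    """Raised when a document carries a DOCTYPE declaration."""


def check_no_dtd(xml_text):
    """Parse xml_text with expat, refusing any DOCTYPE declaration."""
    parser = expat.ParserCreate()

    def reject_doctype(name, system_id, public_id, has_internal_subset):
        raise DTDForbidden(f"DOCTYPE {name!r} is not allowed")

    # expat calls this handler as soon as it sees <!DOCTYPE ...>
    parser.StartDoctypeDeclHandler = reject_doctype
    parser.Parse(xml_text, True)


check_no_dtd("<greeting>hello</greeting>")  # passes silently

try:
    check_no_dtd('<!DOCTYPE x [<!ENTITY a "b">]><x>&a;</x>')
except DTDForbidden as exc:
    print(f"rejected: {exc}")
```

Forbidding the DOCTYPE altogether is the bluntest option: without a DTD there are no custom entities to expand and no external subsets to fetch.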

So, in the end it comes down to stdlib vs lxml. And according to at least one member of the Python Steering Council, "the Python standard library XML libraries are long-term-stale and not likely to ever see improvements". So...

What’s the matter?

First off, we’d like to give kudos to the developers of lxml, libxml2 and libxslt. The amount of software that rests on your shoulders is difficult to measure, though lxml's FAQ mentions some of it. The effort that went into developing and maintaining these pieces of open source software is impressive, and the ecosystem owes you a great deal.

As users of the software, we’ll now focus on our existing "pain points" in order to outline future steps, either improving the existing options or building new alternatives.

Things we don’t like

Unstable output when combining parser params

lxml exposes lots of options, but not all of them work (or they don’t work well in combination). One example is entity resolution.

Even with resolve_entities=False, the parser still tries to process entity references and fails on undefined ones unless it is "poked" (e.g. by adding a DOCTYPE that points to a DTD, even if that DTD is never loaded). The behavior comes from libxml2.

from lxml import etree

xml_input = "<span>Hello&nbsp;XML!</span>"


def try_parse(text, parser):
    """Parse text, printing the outcome instead of aborting on errors."""
    try:
        root = etree.fromstring(text, parser)
    except etree.XMLSyntaxError as exc:
        print(f"XMLSyntaxError: {exc}")
    else:
        print(f"Works! {etree.tostring(root)!r}")


# Minimal parser: &nbsp; is not one of XML's predefined entities
try_parse(xml_input, etree.XMLParser())  # XMLSyntaxError

# Let's turn off entity resolution
try_parse(xml_input, etree.XMLParser(resolve_entities=False))  # XMLSyntaxError

# Ok, let's turn off DTD validation _but_ pass a dummy DTD
more_restricted_parser = etree.XMLParser(
    resolve_entities=False, load_dtd=False, dtd_validation=False
)
try_parse(
    '<!DOCTYPE MyDoc SYSTEM "nonexistent.dtd">' + xml_input,
    more_restricted_parser,
)  # Works!

Lack of extensibility/hooks for parser tweaks

Sometimes it would be nice to hook some Python code to the underlying parser while developing (even if it’s temporary) without dealing with C code and recompiling everything. An example: Handling of CDATA sections. We want to check if they are present to fail validation but libxml2 doesn’t expose it.
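For contrast, the stdlib's expat binding does report CDATA boundaries to Python callbacks, which is the kind of hook we have in mind. The sketch below shows that mechanism only; it does not solve the problem for lxml, and contains_cdata is a name we made up.

```python
from xml.parsers import expat


def contains_cdata(xml_text):
    """Return True if the document has at least one CDATA section."""
    found = []
    parser = expat.ParserCreate()
    # Called with no arguments each time a <![CDATA[ section starts
    parser.StartCdataSectionHandler = lambda: found.append(True)
    parser.Parse(xml_text, True)
    return bool(found)


print(contains_cdata("<a><![CDATA[x < y]]></a>"))
print(contains_cdata("<a>x &lt; y</a>"))
```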

libxml2’s inherent issues

Mainly around memory safety issues (https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=libxml2).

lxml + libxml2 build process

Related to the previous points. While the best option is using pre-built wheels, those use statically linked libxml2, which has its pros and cons.

The binary wheels of lxml statically include a (usually recent) version of libxml2, whereas xmlsec often depends on the systemwide installed libraries.

From FAQ.txt

Sometimes we had to build a custom version and we found the process finicky.

Part of it is caused by pulling dependencies from sources like zlib.org or ftp.gnu.org, which might fail/have limited availability. This can be overridden (custom sources can be set, even local ones) but there’s a lot of automated source pulling from the build process and it’s not very clear.

Separately, the C part of lxml relies not on the Cython code in the repo but on the C "snapshots" generated from it:

[...] you do not need Cython to build lxml from the normal release sources. We even encourage you to not install Cython for a normal release build, as the generated C code can vary quite heavily between Cython versions, which may or may not generate correct code for lxml.

The pre-generated release sources were tested and therefore are known to work.

So, if you want a reliable build of lxml, we suggest to a) use a source release of lxml and b) disable or uninstall Cython for the build.

From build.txt

Not a good fit for HTML

Even if processing modern HTML is not our main use case, it’s good to remember that:

  • libxml2 is not fit for the purpose
  • lxml's HTML5-compliant option (the html5parser module in lxml.html) is based on html5lib, which is written in pure Python (not the most performant option)

What we do like and next steps

There are things we like:

  • ElementTree API
  • Good performance, both from the parser and data structures
  • Some extras (XPath, schema validation)

So, if we were looking to replace lxml, does this mean we can switch to stdlib’s ElementTree implementation? No, unless we build some schema validation on top of it. Also, it’s considered stale.

Maybe we should wrap (with PyO3) an existing Rust library? Not a good idea, as existing options don’t seem to be mature enough and also lack the extra features (schema validation, XPath queries) we’d need. We’d have to rewrite it all in Rust (yay?) which doesn’t seem viable.

So lxml still looks like the best fit for the time being... even if the pain points are still there.

Ideally we’d like to have a library that:

  • Exposes the parsed data using ElementTree’s API
  • Is not coupled to a single parser. Maybe libxml2 is a good enough fit, but it would be nice to be able to plug in alternative XML parsers or even HTML parsers. Sometimes you need a spec-compliant parser, other times you need a loose HTML5 parser. These might have Python implementations or more performant ones in other languages
  • Exposes settings to control the underlying parsers while having sensible top-level defaults to handle untrusted input. A basic, secure setup shouldn’t require fiddling with flags or an extra wrapping layer
  • Incorporates extras (such as XPath and schema validation)
  • Is extensible. It should allow hooking Python in parts of the process before moving to specific languages (Rust, C, C++) so we can prototype and debug at the cost of performance while developing
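To make the wishlist a bit more tangible, here is a very rough sketch of what such a front end could look like, built on stdlib pieces only. Every name in it (register_backend, parse, the backend labels) is hypothetical; no existing library exposes this interface, and a real design would need far more than this.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict
from xml.parsers import expat

# Hypothetical names throughout: this is an illustration, not a real library.
Backend = Callable[[str], ET.Element]
_BACKENDS: Dict[str, Backend] = {}


def register_backend(name: str, backend: Backend) -> None:
    """Plug in a parser: any callable that returns an ElementTree Element."""
    _BACKENDS[name] = backend


def parse(text: str, backend: str = "stdlib-strict") -> ET.Element:
    """Parse with the chosen backend; the default one refuses DOCTYPEs."""
    return _BACKENDS[backend](text)


def _stdlib_strict(text: str) -> ET.Element:
    # First pass with expat only to refuse any DOCTYPE, then build the tree.
    checker = expat.ParserCreate()

    def refuse(name, system_id, public_id, has_internal_subset):
        raise ValueError("DOCTYPE declarations are disabled by default")

    checker.StartDoctypeDeclHandler = refuse
    checker.Parse(text, True)
    return ET.fromstring(text)


register_backend("stdlib-strict", _stdlib_strict)
register_backend("stdlib-permissive", ET.fromstring)

print(parse("<a><b/></a>").tag)
```

The caller always gets ElementTree elements back, the safe behavior is the default, and a backend written in Python can later be swapped for a faster one without touching calling code.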

Outside of Python’s ecosystem, a good example that ticks many of those boxes would be Nokogiri:

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby. It provides a sensible, easy-to-understand API for reading, writing, modifying, and querying documents. It is fast and standards-compliant by relying on native parsers like libxml2, libgumbo, and xerces.

Some guiding principles Nokogiri tries to follow:

  • be secure-by-default by treating all documents as untrusted by default
  • be a thin-as-reasonable layer on top of the underlying parsers, and don't attempt to fix behavioral differences between the parsers