bridge
Latest version: 0.3.5
This page is updated for bridge 0.3.5
What is it?
bridge is a Python XML library that aims to provide a high-level, clean interface for manipulating XML documents.
Download it
- easy_install -U bridge
- Tarballs http://www.defuze.org/oss/bridge/
- svn co https://svn.defuze.org/oss/bridge/
W00t another XML library for Python?
Well yes and no. bridge aims at providing a common API to handle XML documents and streams uniformly with CPython and IronPython.
## On CPython
>>> from bridge import Element as E
>>> e = E.load('<root>hello</root>').xml_root
>>> e.xml()
'<?xml version="1.0" encoding="UTF-8"?>\n<root>hello</root>'
>>> E.parser
<class 'bridge.parser.bridge_default.Parser'>

## IronPython (1.1)
>>> from bridge import Element as E
>>> e = E.load('<root>hello</root>').xml_root
>>> e.xml()
'<?xml version="1.0" encoding="utf-8"?>\n<root>hello</root>'
>>> E.parser
<class 'bridge.parser.bridge_dotnet.Parser'>
bridge parses the document with the specified parser and builds its own in-memory representation of the XML document. This means that once the document is loaded into an Element, there is no relationship with the underlying engine. It's just pure Python elements linked together in a tree.
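To illustrate what "no relationship with the underlying engine" can look like, here is a minimal sketch that copies a parsed tree into plain Python objects. It uses the standard library's xml.etree.ElementTree as the parser, and the Node class and build function are hypothetical names chosen for illustration. This is not bridge's actual implementation.

```python
import xml.etree.ElementTree as ET


class Node(object):
    """Hypothetical stand-in for a bridge Element: plain Python, no parser state."""

    def __init__(self, name, text=None, parent=None):
        self.xml_name = name
        self.xml_text = text
        self.xml_parent = parent
        self.xml_children = []
        if parent is not None:
            parent.xml_children.append(self)


def build(source, parent=None):
    """Copy a parser-specific element into a parser-independent Node tree."""
    node = Node(source.tag, source.text, parent)
    for child in source:
        build(child, node)
    return node


# Once build() returns, the ElementTree objects are no longer needed.
root = build(ET.fromstring('<root><msg>hello</msg></root>'))
print(root.xml_name, root.xml_children[0].xml_text)
```

Swapping the parser would only change the build step in such a design. Code that walks the Node tree would stay the same.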
bridge does not try to be the ultimate XML library. What it offers is simply a way to make your applications more portable across Python implementations. bridge does not provide the full XML machinery and therefore supports neither XPath nor XSLT. bridge simply loads an XML document and makes it easy to navigate.
Why?
I wrote bridge because I wanted one of my other products to be portable. I don't plan for bridge to take over the world of Python XML parsers. However, you may find it interesting or find that it meets your needs. If so, I'll be glad to hear your feedback.
Usage
Imagine we have the following document:
<?xml version="1.0" encoding="UTF-8"?>
<a:feed xmlns:a="http://www.w3.org/2005/Atom">
<a:id>urn:uuid:955b74d8-17ad-4995-8e5a-d8e64a1d5cdc</a:id>
<a:title>Hello world</a:title>
<a:updated>2006-11-04T20:25:25.934536Z</a:updated>
<a:entry>
<a:id>urn:uuid:f8bd0164-3e7d-4c0b-b6f8-092ae1e679aa</a:id>
<a:published>2006-11-04T21:14:25.468203Z</a:published>
<a:author>
<a:name>Sylvain</a:name>
</a:author>
<a:title>Blah</a:title>
<a:summary type="text">something</a:summary>
</a:entry>
</a:feed>
Let's load it:
>>> from bridge import Element as E
>>> e = E.load('feed.atom').xml_root
>>> dir(e)
['_Element__update_prefixes', '__class__', '__copy__', '__delattr__',
 '__dict__', '__doc__', '__getattribute__', '__hash__', '__init__',
 '__iter__', '__module__', '__new__', '__reduce__', '__reduce_ex__',
 '__repr__', '__setattr__', '__str__', '__unicode__', '__weakref__',
 '_root', 'as_attribute', 'as_attribute_of_element', 'as_cdata', 'as_list',
 'clone', 'encoding', 'filtrate', 'forget', 'get_attribute',
 'get_attribute_ns', 'get_child', 'get_children', 'get_root', 'has_child',
 'has_element', 'insert_after', 'insert_before', 'load', 'parser',
 'update_prefix', 'validate', 'xml', 'xml_attributes', 'xml_children',
 'xml_name', 'xml_ns', 'xml_parent', 'xml_prefix', 'xml_root', 'xml_text']
>>> e.xml_children
[u' ', <a:id xmlns:a="http://www.w3.org/2005/Atom" element at 0xb7c2e1ecL />, u'\n',
 u' ', <a:title xmlns:a="http://www.w3.org/2005/Atom" element at 0xb7c2e24cL />, u'\n',
 u' ', <a:updated xmlns:a="http://www.w3.org/2005/Atom" element at 0xb7c2e2acL />, u'\n',
 u' ', <a:entry xmlns:a="http://www.w3.org/2005/Atom" element at 0xb7c2e30cL />, u'\n']
As you can see children of each element are stored in the xml_children attribute. But in some cases you may prefer to set some of them as actual attributes of the element. You can do it like this:
>>> from bridge import Element as E
>>> from bridge.common import atom_as_attr, atom_as_list, atom_attribute_of_element
>>> e = E.load('feed.atom', as_attribute=atom_as_attr, as_list=atom_as_list,
...            as_attribute_of_element=atom_attribute_of_element).xml_root
>>> dir(e)
['_Element__update_prefixes', '__class__', '__copy__', '__delattr__',
 '__dict__', '__doc__', '__getattribute__', '__hash__', '__init__',
 '__iter__', '__module__', '__new__', '__reduce__', '__reduce_ex__',
 '__repr__', '__setattr__', '__str__', '__unicode__', '__weakref__',
 '_root', 'as_attribute', 'as_attribute_of_element', 'as_cdata', 'as_list',
 'clone', 'encoding', 'feed', 'filtrate', 'forget', 'get_attribute',
 'get_attribute_ns', 'get_child', 'get_children', 'get_root', 'has_child',
 'has_element', 'insert_after', 'insert_before', 'load', 'parser',
 'update_prefix', 'validate', 'xml', 'xml_attributes', 'xml_children',
 'xml_name', 'xml_ns', 'xml_parent', 'xml_prefix', 'xml_root', 'xml_text']
>>> e.feed
<a:feed xmlns:a="http://www.w3.org/2005/Atom" element at 0xb7c4b14cL />
>>> dir(e.feed)
['_Element__update_prefixes', '__class__', '__copy__', '__delattr__',
 '__dict__', '__doc__', '__getattribute__', '__hash__', '__init__',
 '__iter__', '__module__', '__new__', '__reduce__', '__reduce_ex__',
 '__repr__', '__setattr__', '__str__', '__unicode__', '__weakref__',
 '_root', 'as_attribute', 'as_attribute_of_element', 'as_cdata', 'as_list',
 'clone', 'encoding', 'entry', 'filtrate', 'forget', 'get_attribute',
 'get_attribute_ns', 'get_child', 'get_children', 'get_root', 'has_child',
 'has_element', 'id', 'insert_after', 'insert_before', 'load', 'parser',
 'title', 'update_prefix', 'updated', 'validate', 'xml', 'xml_attributes',
 'xml_children', 'xml_name', 'xml_ns', 'xml_parent', 'xml_prefix',
 'xml_root', 'xml_text']
>>> e.feed.entry
[<a:entry xmlns:a="http://www.w3.org/2005/Atom" element at 0xb7c4b2ccL />]
The key is to set as_attribute and as_list to dictionaries of the form {NAMESPACE: [name1, name2]}.
For instance atom_as_attr, atom_as_list are defined as:
ATOM10_PREFIX = u'atom'
ATOMPUB_PREFIX = u'app'
THR_PREFIX = u'thr'

ATOM10_NS = u'http://www.w3.org/2005/Atom'
ATOMPUB_NS = u'http://purl.org/atom/app#'
THR_NS = u'http://purl.org/syndication/thread/1.0'

atom_as_attr = {ATOM10_NS: ['feed', 'id', 'title', 'updated', 'published',
                            'icon', 'logo', 'generator', 'rights', 'subtitle',
                            'content', 'summary', 'name', 'uri', 'email'],
                ATOMPUB_NS: ['edited', 'accept'],
                THR_NS: ['in-reply-to', 'total']}

atom_as_list = {ATOM10_NS: ['author', 'contributor', 'category', 'link', 'entry'],
                ATOMPUB_NS: ['collection', 'workspace', 'categories']}

# Attributes without a namespace are listed under the None key
atom_attribute_of_element = {None: ['type', 'term', 'href', 'rel', 'scheme',
                                    'label', 'title', 'length', 'hreflang',
                                    'src', 'ref'],
                             THR_NS: ['count']}
The difference between as_attr and as_list is that:
- as_attr values lead to Element instances being attached to the parent as attributes
- as_list values lead to Element instances being appended to a list
As bridge grows, more built-in dictionaries will be added to handle different types of formats.
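The attribute-versus-list distinction can be sketched with the standard library alone. The example below is an assumption-laden illustration, not bridge's code: Node, split_tag and attach are hypothetical names, the mappings are trimmed copies of the {NAMESPACE: [names]} shape, and xml.etree.ElementTree stands in for the parser.

```python
import xml.etree.ElementTree as ET

ATOM10_NS = 'http://www.w3.org/2005/Atom'
# Same {NAMESPACE: [name1, name2]} shape as described above, trimmed down.
as_attr = {ATOM10_NS: ['id', 'title']}
as_list = {ATOM10_NS: ['entry']}


class Node(object):
    """Hypothetical element holder; not bridge's Element class."""

    def __init__(self, name):
        self.xml_name = name
        self.xml_children = []


def split_tag(tag):
    """Split ElementTree's '{namespace}local' tag into (namespace, local)."""
    if tag.startswith('{'):
        ns, local = tag[1:].split('}', 1)
        return ns, local
    return None, tag


def attach(source):
    """Build a Node tree, exposing selected children as attributes or lists."""
    node = Node(split_tag(source.tag)[1])
    for child in source:
        ns, local = split_tag(child.tag)
        child_node = attach(child)
        node.xml_children.append(child_node)
        if local in as_attr.get(ns, []):
            # as_attr: the child itself becomes an attribute of the parent
            setattr(node, local, child_node)
        elif local in as_list.get(ns, []):
            # as_list: the child is appended to a list attribute
            if not hasattr(node, local):
                setattr(node, local, [])
            getattr(node, local).append(child_node)
    return node


doc = ('<a:feed xmlns:a="%s"><a:title>Hello</a:title>'
       '<a:entry/><a:entry/></a:feed>' % ATOM10_NS)
feed = attach(ET.fromstring(doc))
print(feed.title.xml_name, len(feed.entry))
```

Names listed in as_list always yield a list, even when only one such child is present, which keeps calling code uniform.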
Once you have the document loaded you can navigate through it like this:
from bridge import Element, Attribute

for child in e:
    print child

# this is equivalent to
for child in e.xml_children:
    print child

# You can access attributes like this:
e.entry[0].summary.xml_attributes[0].xml_text

# Add elements (xml_prefix and xml_ns are the attributes listed by dir(e) above)
author = Element(name=u'author', prefix=e.xml_prefix, namespace=e.xml_ns, parent=e)
Element(name=u'name', content=u'Jon', prefix=e.xml_prefix, namespace=e.xml_ns, parent=author)
subtitle = Element(name=u'subtitle', content=u"blah", prefix=e.xml_prefix, namespace=e.xml_ns, parent=e)

# Add attributes
Attribute(name=u'type', value=u'text', parent=subtitle)

# Remove an element from the tree
del e.updated

# Serialize into a string only a fragment of the tree:
e.entry[0].xml()
Note that as_attribute, as_list and as_attribute_of_element can also be set directly on the class (for instance E.as_attribute = atom_as_attr) so that each instance of it follows the same pattern.
API
More details here.
bridge provides the following entities: Document, Element, Attribute, Comment and PI.
The Document is always returned when you use the load method and is required in case you have comments or PIs before the root element. To access the root element you can simply call xml_root.
Document/Element's interface
- xml_parent: Parent element or None
- xml_prefix: The XML prefix as a unicode object or None
- xml_ns: The XML namespace as a unicode object or None
- xml_name: The local name of the Element
- xml_text: The content of the element or None. If the element has mixed content this defaults to None and each segment of the content is attached in the correct order in xml_children.
- xml_children: The list of child elements
- xml_root: The root element of the tree
In addition to those attributes, Element provides the following methods:
- xml: Transforms the Element into a string
- load: Loads an XML document (from a string, a file path, a file object) into a tree of Elements
- clone: returns a copy of an element and its children
- update_prefix: Change the prefix of the Element and its children
- filtrate: Filters out elements based on criteria (see below)
- validate: Validates a document based on criteria (see below)
- has_element: whether or not an element has a child set as one attribute of the instance
- has_child: whether or not a specified child is part of this element's children
- get_child: returns a child of the element
- get_children: returns a list of children
- get_attribute: returns an attribute per its local name
- get_attribute_ns: returns an attribute per its qualified name
- forget: gracefully drop an element from the tree
- insert_before: inserts an Element before another
- insert_after: inserts an Element after another
Attribute's interface
- xml_root: Root element
- xml_parent: Parent element or None
- xml_prefix: The XML prefix as a unicode object or None
- xml_ns: The XML namespace as a unicode object or None
- xml_name: The local name of the attribute
- xml_text: The value of the attribute or None.
Updating prefixes
>>> from bridge import Element as E
>>> e = E.load('<root><msg>hello</msg></root>').xml_root
>>> e.xml()
'<?xml version="1.0" encoding="UTF-8"?><root><msg>hello</msg></root>'
>>> e.update_prefix(u'k', None, u'http://blah.com')
>>> e.xml()
'<?xml version="1.0" encoding="UTF-8"?><k:root xmlns:k="http://blah.com"><k:msg>hello</k:msg></k:root>'
>>> e.update_prefix(u'p', u'http://blah.com', u'http://blah.com')
>>> e.xml()
'<?xml version="1.0" encoding="UTF-8"?><p:root xmlns:p="http://blah.com"><p:msg>hello</p:msg></p:root>'
>>> e.update_prefix(None, u'http://blah.com', None)
>>> e.xml()
'<?xml version="1.0" encoding="UTF-8"?><root><msg>hello</msg></root>'
Filtering
You can filter the tree of elements based on whatever criteria you wish. For instance let's retrieve from the document entries that have been published before a certain date:
>>> from bridge import Element as E
>>> from bridge.common import atom_as_attr, atom_as_list
>>> e = E.load('feed.xml', as_attribute=atom_as_attr, as_list=atom_as_list)
>>> from bridge.filter.atom import published_before
>>> import datetime
>>> dt = datetime.datetime.now()
>>> e.filtrate(published_before, dt_pivot=dt, recursive=True)
[<a:entry xmlns:a="http://www.w3.org/2005/Atom" element at -0x484e4574 />]
>>> e.entry[0].filtrate(published_before, dt_pivot=dt)
[<a:entry xmlns:a="http://www.w3.org/2005/Atom" element at -0x484e4574 />]
Basically, the filtrate method takes a callable plus keyword arguments to pass to that callable, and returns whatever the callable returns.
bridge uses the term filter simply because it returns a slice of your tree that matches the criteria set. bridge provides several built-in filters already and you can browse the source to find out more about them.
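The calling convention described above (a callable plus keyword arguments, with the callable's return value handed back) can be sketched as follows. Everything here is hypothetical: a plain dict stands in for a loaded feed, and entries_before is an illustrative filter written for this example, not one of bridge's built-in filters.

```python
import datetime


def filtrate(element, some_filter, **kwargs):
    """Call the filter with the element and return whatever it returns."""
    return some_filter(element, **kwargs)


def entries_before(element, dt_pivot):
    """Hypothetical filter: keep entries published strictly before dt_pivot."""
    return [entry for entry in element['entries']
            if entry['published'] < dt_pivot]


# A plain dict stands in for a loaded feed element.
feed = {'entries': [
    {'title': 'Blah', 'published': datetime.datetime(2006, 11, 4, 21, 14, 25)},
    {'title': 'Later', 'published': datetime.datetime(2007, 1, 1)},
]}

matches = filtrate(feed, entries_before,
                   dt_pivot=datetime.datetime(2006, 12, 1))
print([entry['title'] for entry in matches])
```

Because the filter decides what to return, the same entry point can yield a list, a single element or nothing at all.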
Validating
You can validate elements and their children like this:
>>> from bridge import Element as E
>>> from bridge.common import atom_as_attr, atom_as_list
>>> E.as_attribute = atom_as_attr
>>> E.as_list = atom_as_list
>>> e = E.load('feed.xml')
>>> from bridge.validator import BridgeValidatorException
>>> from bridge.validator.atom import id_as_url # validates that atom:id is a valid URL
>>> try:
...     e.validate(id_as_url)
... except BridgeValidatorException, exc:
...     exc.element
...
<a:id xmlns:a="http://www.w3.org/2005/Atom" element at -0x483b07f4 />
A validator is a callable that should raise a BridgeValidatorException carrying the element at fault.
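As a rough sketch of that contract, the example below defines a validator that raises an exception holding the offending element. ValidatorError, validate and id_is_url are hypothetical stand-ins written for this illustration, and plain dicts replace real elements. bridge's own classes may be organised differently.

```python
class ValidatorError(Exception):
    """Hypothetical stand-in for BridgeValidatorException."""

    def __init__(self, element, message):
        Exception.__init__(self, message)
        self.element = element


def validate(element, validator):
    """Run the validator against an element, then against its children."""
    validator(element)
    for child in element.get('children', []):
        validate(child, validator)


def id_is_url(element):
    """Hypothetical validator: an 'id' element must hold an http(s) URL."""
    if element['name'] != 'id':
        return
    if not element['text'].startswith(('http://', 'https://')):
        raise ValidatorError(element, 'id is not a URL')


# Plain dicts stand in for elements; the id below is a URN, not a URL.
feed = {'name': 'feed', 'text': None, 'children': [
    {'name': 'id', 'text': 'urn:uuid:955b74d8-17ad-4995-8e5a-d8e64a1d5cdc'},
]}

try:
    validate(feed, id_is_url)
    faulty = None
except ValidatorError as exc:
    faulty = exc.element

print(faulty['name'])
```

Attaching the element to the exception lets the caller report or repair the exact node that failed.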
Incremental parsing
bridge offers a way to parse a document gradually, which can be useful for XML streaming protocols such as XMPP. This is done as follows:
>>> from bridge.parser import IncrementalParser
>>> p = IncrementalParser()
>>> p.feed('<r')
>>> p.feed('><b')
>>> p.feed('/></r>')
>>> p.close()
Incremental parsing is nice but being able to pull out bridge Elements while parsing is even better and that's what bridge allows through a simple dispatching mechanism:
>>> from bridge.parser import DispatchParser
>>> p = DispatchParser()
>>> def dispatch(e):
...     print e.xml()
...
>>> p.register_at_level(1, dispatch)
>>> p.feed('<r')
>>> p.feed('><b')
>>> p.feed('/></r>')
<?xml version="1.0" encoding="UTF-8"?>
<b xmlns=""></b>
>>> p.close()
In the previous example we tell bridge to run the dispatch function as soon as any element at the first level of the tree (i.e. those directly under the root element) is complete. In our example the dispatcher is called after the third call to feed.
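The idea of level-based dispatching can be sketched on top of the standard library's expat bindings by tracking the nesting depth and invoking a callback when an element at the registered depth closes. LevelDispatcher is a hypothetical class for illustration only, not bridge's DispatchParser, and it passes the element name to the callback rather than a full element. When exactly the callback fires during feed depends on how the underlying parser buffers input, so the sketch only inspects the result after close.

```python
import xml.parsers.expat


class LevelDispatcher(object):
    """Hypothetical sketch of level-based dispatching over expat."""

    def __init__(self):
        self._parser = xml.parsers.expat.ParserCreate()
        self._parser.StartElementHandler = self._start
        self._parser.EndElementHandler = self._end
        self._depth = -1      # the root element sits at level 0
        self._handlers = {}   # level -> callable

    def register_at_level(self, level, dispatcher):
        self._handlers[level] = dispatcher

    def _start(self, name, attrs):
        self._depth += 1

    def _end(self, name):
        # The element that just closed lives at the current depth.
        handler = self._handlers.get(self._depth)
        if handler is not None:
            handler(name)
        self._depth -= 1

    def feed(self, chunk):
        self._parser.Parse(chunk, False)

    def close(self):
        self._parser.Parse('', True)


completed = []
p = LevelDispatcher()
p.register_at_level(1, completed.append)
p.feed('<r')
p.feed('><b')
p.feed('/></r>')
p.close()
print(completed)
```

Only the element directly under the root is reported. The root itself closes at level 0, where no dispatcher is registered.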
bridge allows for more complex dispatching rules:
- register_at_level(level, dispatcher): the dispatcher will be run as soon as an element of the given level is complete
- register_on_element(element_name, dispatcher, namespace): this time the dispatching is done as soon as the specified element is completed anywhere in the XML stream
- register_on_element_per_level(element_name, level, dispatcher, namespace): a mix of the two above
- register_by_path(path, dispatcher): this time you provide a simple string which represents the path to the expected element (see below).
By default all dispatching is disabled. Each registry is independent of the others. Bear in mind though that the more dispatchers you add, the slower the parsing becomes. This is not a huge problem in most cases, however.
Query by path
bridge offers a way to look up elements in a document based on a very simple path-matching mechanism. A path is a pattern that bridge applies to match elements. A path looks like an XPath query but is not a subset of XPath and does not try to be. It merely looks similar for convenience.
Here is an example:
>>> from bridge import Element as E
>>> e = E.load('feed.atom')
>>> from bridge.filter import lookup
>>> from bridge.common import ATOM10_NS
>>> e.filtrate(lookup, path='/{%s}feed/{%s}entry/{%s}summary[@type="text"]' % (ATOM10_NS, ATOM10_NS, ATOM10_NS))
<a:summary xmlns:a="http://www.w3.org/2005/Atom" element at 0xb7c4b5acL />
The syntax is a bit rough but is of course much cleaner when your document doesn't have namespaces declared.
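To give a feel for how such a path could be matched, here is a small sketch over the standard library's ElementTree. It handles only the form used in the example above: steps written as {namespace}name with an optional [@attr="value"] test. The STEP pattern, parse_path and find_by_path are hypothetical helpers. How bridge's own lookup filter parses paths is not shown in this page, so treat this as one possible approach.

```python
import re
import xml.etree.ElementTree as ET

# One step: '/', an optional '{namespace}', a name, an optional [@attr="value"].
STEP = re.compile(r'/((?:\{[^}]*\})?[^/\[]+)(?:\[@([^=]+)="([^"]*)"\])?')


def parse_path(path):
    """Split a path into (tag, attribute name, attribute value) steps."""
    return [(m.group(1), m.group(2), m.group(3)) for m in STEP.finditer(path)]


def find_by_path(root, path):
    """Return the first element matching every step of the path, or None."""
    steps = parse_path(path)
    candidates = [root]
    for depth, (tag, attr, value) in enumerate(steps):
        matched = [el for el in candidates
                   if el.tag == tag and (attr is None or el.get(attr) == value)]
        if depth == len(steps) - 1:
            return matched[0] if matched else None
        # Descend: the next step is tested against the children of the matches.
        candidates = [child for el in matched for child in el]
    return None


ATOM10_NS = 'http://www.w3.org/2005/Atom'
doc = ('<a:feed xmlns:a="%s"><a:entry>'
       '<a:summary type="html">skip</a:summary>'
       '<a:summary type="text">something</a:summary>'
       '</a:entry></a:feed>' % ATOM10_NS)
path = '/{%s}feed/{%s}entry/{%s}summary[@type="text"]' % ((ATOM10_NS,) * 3)

found = find_by_path(ET.fromstring(doc), path)
print(found.text)
```

The namespace is matched inside braces first because namespace URIs themselves contain slashes, so the path cannot simply be split on "/".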
Mixed content
Here is how bridge handles mixed content:
>>> from bridge import Element as E
>>> s = '<a>hello</a>'
>>> e = E.load(s).xml_root
>>> e.xml_text
u'hello'
>>> e.xml_children
[]
>>> s = '<a>hello<b />there</a>'
>>> e = E.load(s).xml_root
>>> e.xml_text
>>> e.xml_children
[u'hello', <b element at 0xb7ced34cL />, u'there']
As you can see the text is directly accessible as a child when in a mixed content mode. Otherwise the text is accessible via xml_text.
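The rule can be mimicked with the standard library to make it concrete. The function below is a hypothetical sketch built on xml.etree.ElementTree, which stores mixed content in text and tail attributes. It represents child elements by their tag name and is not how bridge itself is implemented.

```python
import xml.etree.ElementTree as ET


def text_and_children(source):
    """Return (xml_text, xml_children) following the rule described above."""
    if len(source) == 0:
        # No child elements: the text stays in xml_text.
        return source.text, []
    # Mixed content: text segments and child tags, in document order.
    children = []
    if source.text:
        children.append(source.text)
    for child in source:
        children.append(child.tag)
        if child.tail:
            children.append(child.tail)
    return None, children


print(text_and_children(ET.fromstring('<a>hello</a>')))
print(text_and_children(ET.fromstring('<a>hello<b />there</a>')))
```

The first call keeps the text in the first slot, while the second returns None for the text and an ordered list of segments instead.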
