bridge
Latest version: 0.4.0
This page is updated for bridge 0.4.0
What is it?
bridge is a Python XML library that aims to provide a clean, high-level interface for manipulating XML documents.
Download it
- easy_install -U bridge
- Tarballs http://www.defuze.org/oss/bridge/
- svn co https://svn.defuze.org/oss/bridge/
API
More details here.
bridge provides the following entities: Document, Element, Attribute, Comment and PI.
The load method always returns a Document, which is needed when comments or PIs appear before the root element. To access the root element, use the xml_root attribute.
Document/Element's interface
- xml_parent: Parent element or None
- xml_prefix: The XML prefix as a unicode object or None
- xml_ns: The XML namespace as a unicode object or None
- xml_name: The local name of the Element
- xml_text: The content of the element or None. If the element has mixed content this is None and each segment of the content is stored, in document order, in xml_children.
- xml_children: The list of child elements
- xml_root: The root element of the tree
In addition to those attributes, Element provides the following methods:
- xml: Transforms the Element into a string
- load: Loads an XML document (from a string, a file path, a file object) into a tree of Elements
- clone: returns a copy of an element and its children
- update_prefix: Change the prefix of the Element and its children
- filtrate: Filters out elements based on criteria (see below)
- validate: Validates a document based on criteria (see below)
- has_element: whether or not an element has a child set as one attribute of the instance
- has_child: whether or not a specified child is part of this element's children
- get_child: returns a child of the element
- get_children: returns a list of children
- get_attribute: returns an attribute per its local name
- get_attribute_ns: returns an attribute per its qualified name
- get_attribute_value: returns an attribute's value
- set_attribute_value: sets the attribute's value and if it doesn't exist, it creates it
- forget: gracefully drop an element from the tree
- insert_before: inserts an Element before another
- insert_after: inserts an Element after another
Attribute's interface
- xml_root: Root element
- xml_parent: Parent element or None
- xml_prefix: The XML prefix as a unicode object or None
- xml_ns: The XML namespace as a unicode object or None
- xml_name: The local name of the Attribute
- xml_text: The value of the attribute or None.
Updating prefixes
>>> from bridge import Element as E
>>> e = E.load('<root><msg>hello</msg></root>').xml_root
>>> e.xml()
'<?xml version="1.0" encoding="UTF-8"?><root><msg>hello</msg></root>'
>>> e.update_prefix(u'k', None, u'http://blah.com')
>>> e.xml()
'<?xml version="1.0" encoding="UTF-8"?><k:root xmlns:k="http://blah.com"><k:msg>hello</k:msg></k:root>'
>>> e.update_prefix(u'p', u'http://blah.com', u'http://blah.com')
>>> e.xml()
'<?xml version="1.0" encoding="UTF-8"?><p:root xmlns:p="http://blah.com"><p:msg>hello</p:msg></p:root>'
>>> e.update_prefix(None, u'http://blah.com', None)
>>> e.xml()
'<?xml version="1.0" encoding="UTF-8"?><root><msg>hello</msg></root>'
Filtering
You can filter the tree of elements on whatever criteria you wish. For instance, let's retrieve the entries of a document that were published before a given date:
>>> from bridge import Element as E
>>> from bridge.common import atom_as_attr, atom_as_list
>>> e = E.load('feed.xml', as_attribute=atom_as_attr, as_list=atom_as_list)
>>> from bridge.filter.atom import published_before
>>> import datetime
>>> dt = datetime.datetime.now()
>>> e.filtrate(published_before, dt_pivot=dt, recursive=True)
[<a:entry xmlns:a="http://www.w3.org/2005/Atom" element at -0x484e4574 />]
>>> e.entry[0].filtrate(published_before, dt_pivot=dt)
[<a:entry xmlns:a="http://www.w3.org/2005/Atom" element at -0x484e4574 />]
Basically, the filtrate method takes a callable plus keyword arguments to pass to that callable, and returns whatever the callable returns.
bridge uses the term filter simply because it returns a slice of your tree that matches the criteria set. bridge provides several built-in filters already and you can browse the source to find out more about them.
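The contract described above can be sketched without bridge installed. The Node class below is a stand-in written for illustration, not bridge's Element, and the named filter is hypothetical. It also assumes that filtrate passes the element itself as the first argument of the callable, which this page does not state.

```python
class Node(object):
    """Stand-in for illustration only; this is NOT bridge's Element class."""

    def __init__(self, name, children=None):
        self.xml_name = name
        self.xml_children = children or []

    def filtrate(self, some_filter, **kwargs):
        # Assumption: the element is handed to the callable as its first
        # argument, followed by the keyword arguments given to filtrate.
        return some_filter(self, **kwargs)


def named(element, name=None, recursive=False):
    """Hypothetical filter: return the children whose local name is `name`."""
    matches = []
    for child in element.xml_children:
        if child.xml_name == name:
            matches.append(child)
        if recursive:
            # Descend into the child and collect its matches as well.
            matches.extend(named(child, name=name, recursive=True))
    return matches


feed = Node('feed', [Node('title'), Node('entry', [Node('title')])])
direct = feed.filtrate(named, name='title')
nested = feed.filtrate(named, name='title', recursive=True)
```

Here direct holds only the title directly under feed, while nested also includes the title inside entry.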
Validating
You can validate elements and their children like this:
>>> from bridge import Element as E
>>> from bridge.common import atom_as_attr, atom_as_list
>>> E.as_attribute = atom_as_attr
>>> E.as_list = atom_as_list
>>> e = E.load('feed.xml')
>>> from bridge.validator import BridgeValidatorException
>>> from bridge.validator.atom import id_as_url  # validates that atom:id is a valid URL
>>> try:
...     e.validate(id_as_url)
... except BridgeValidatorException, exc:
...     exc.element
...
<a:id xmlns:a="http://www.w3.org/2005/Atom" element at -0x483b07f4 />
A validator is a callable that should raise a BridgeValidatorException carrying the element at fault.
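As a rough illustration of that shape, here is a self-contained sketch of a custom validator. ValidatorError stands in for BridgeValidatorException (its constructor signature is an assumption), FakeElement stands in for a bridge Element, and text_not_empty is a hypothetical validator. Only the element attribute on the exception comes from the example above.

```python
from collections import namedtuple

# Stand-in for a bridge Element; only the attributes used below are modelled.
FakeElement = namedtuple('FakeElement', 'xml_name xml_text')


class ValidatorError(Exception):
    """Stand-in for BridgeValidatorException (constructor signature assumed)."""

    def __init__(self, element):
        Exception.__init__(self, 'invalid element: %r' % (element,))
        # The element at fault, exposed the same way as in the example above.
        self.element = element


def text_not_empty(element):
    """Hypothetical validator: reject elements without any text content."""
    if not (element.xml_text or '').strip():
        raise ValidatorError(element)


try:
    text_not_empty(FakeElement('id', None))
    failed = None
except ValidatorError as exc:
    failed = exc.element
```

A validator that finds nothing wrong simply returns without raising.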
Incremental parsing
bridge offers a way to parse a document incrementally, which can be useful for XML streaming protocols such as XMPP. This is done as follows:
>>> from bridge.parser import IncrementalParser
>>> p = IncrementalParser()
>>> p.feed('<r')
>>> p.feed('><b')
>>> p.feed('/></r>')
>>> p.close()
Incremental parsing is nice but being able to pull out bridge Elements while parsing is even better and that's what bridge allows through a simple dispatching mechanism:
>>> from bridge.parser import DispatchParser
>>> p = DispatchParser()
>>> def dispatch(e):
...     print e.xml()
...
>>> p.register_at_level(1, dispatch)
>>> p.feed('<r')
>>> p.feed('><b')
>>> p.feed('/></r>')
<?xml version="1.0" encoding="UTF-8"?>
<b xmlns=""></b>
>>> p.close()
In the previous example we tell bridge to run the dispatch function as soon as any element at the first level of the tree (i.e. directly under the root element) is complete. Our dispatcher is called after the third call to feed in our example.
bridge allows for more complex dispatching rules:
- register_at_level(level, dispatcher): the dispatcher will be run as soon as an element of the given level is complete
- register_on_element(element_name, dispatcher, namespace): this time the dispatching is done as soon as the specified element is completed anywhere in the XML stream
- register_on_element_per_level(element_name, level, dispatcher, namespace): a combination of the two above, where dispatching is done when the specified element is completed at the given level
- register_by_path(path, dispatcher): this time you provide a simple string which represents the path to the expected element (see below).
By default all dispatching is disabled. Each registry is independent of the others. Bear in mind though that the more dispatchers you add, the slower the parsing becomes. In most cases this is not a significant problem.
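To make the idea of level-based dispatching concrete, the sketch below reproduces it with the standard library's xml.etree.ElementTree.XMLPullParser (Python 3.4 and later) rather than with bridge. LevelDispatcher is a hypothetical helper and says nothing about how DispatchParser is implemented. Depending on the underlying expat version, events may only become available on a later feed or on close, so the helper drains events after both.

```python
from xml.etree.ElementTree import XMLPullParser


class LevelDispatcher(object):
    """Hypothetical helper: call `dispatcher` for each element completed at `level`."""

    def __init__(self, level, dispatcher):
        self.level = level
        self.dispatcher = dispatcher
        self.depth = 0
        self.parser = XMLPullParser(events=('start', 'end'))

    def _drain(self):
        # Consume every event the parser has produced so far.
        for event, elem in self.parser.read_events():
            if event == 'start':
                self.depth += 1
            else:
                self.depth -= 1
                # After the decrement, depth is the level of the element that
                # just ended (0 for the root, 1 for its children, and so on).
                if self.depth == self.level:
                    self.dispatcher(elem)

    def feed(self, chunk):
        self.parser.feed(chunk)
        self._drain()

    def close(self):
        self.parser.close()
        # Events still queued when the parser is closed can be read afterwards.
        self._drain()


completed = []
d = LevelDispatcher(1, lambda elem: completed.append(elem.tag))
for chunk in ('<r', '><b', '/></r>'):
    d.feed(chunk)
d.close()
```

Only b is collected in completed: the root r ends at level 0, so a dispatcher registered for level 1 never sees it.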
Query by path
bridge offers a way to look up elements in a document using a very simple path-matching mechanism. A path is a pattern that bridge applies to match elements. A path looks like an XPath query but is not a subset of XPath and does not try to be. The resemblance is only there for convenience.
Here is an example:
>>> from bridge import Element as E
>>> e = E.load('feed.atom')
>>> from bridge.filter import lookup
>>> from bridge.common import ATOM10_NS
>>> e.filtrate(lookup, path='/{%s}feed/{%s}entry/{%s}summary[@type="text"]' % (ATOM10_NS, ATOM10_NS, ATOM10_NS))
<a:summary xmlns:a="http://www.w3.org/2005/Atom" element at 0xb7c4b5acL />
The syntax is a bit rough but is of course much cleaner when your document doesn't have namespaces declared.
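One way to tame the repetition in namespaced paths is a small helper that qualifies every step for you. The qualified_path function below is hypothetical and not part of bridge. The namespace URI is assumed to be the value of bridge.common.ATOM10_NS, based on the xmlns:a declaration shown in the output above.

```python
# Assumption: this matches bridge.common.ATOM10_NS (the Atom 1.0 namespace URI).
ATOM10_NS = 'http://www.w3.org/2005/Atom'


def qualified_path(namespace, *steps):
    """Hypothetical helper: prefix each step with {namespace} and join with '/'."""
    return '/' + '/'.join('{%s}%s' % (namespace, step) for step in steps)


# Builds the same path string as the example above.
path = qualified_path(ATOM10_NS, 'feed', 'entry', 'summary[@type="text"]')
```

The resulting string can then be passed as the path keyword argument, as in the example above.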
Mixed content
Here is how bridge handles mixed content:
>>> from bridge import Element as E
>>> s = '<a>hello</a>'
>>> e = E.load(s).xml_root
>>> e.xml_text
u'hello'
>>> e.xml_children
[]
>>> s = '<a>hello<b />there</a>'
>>> e = E.load(s).xml_root
>>> e.xml_text
>>> e.xml_children
[u'hello', <b element at 0xb7ced34cL />, u'there']
As you can see, the text is directly accessible as a child when the element has mixed content. Otherwise the text is accessible via xml_text.
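For readers more familiar with the standard library, the following sketch rebuilds the same view on top of xml.etree.ElementTree, which stores mixed content in the text and tail attributes instead. The bridge_like_view function is hypothetical and only mimics the behaviour shown in the session above. It is not bridge code.

```python
import xml.etree.ElementTree as ET


def bridge_like_view(elem):
    """Return (text, children) shaped like xml_text and xml_children above."""
    if len(elem) == 0:
        # No child elements: the text stays on the element itself.
        return elem.text, []
    children = []
    if elem.text:
        children.append(elem.text)
    for child in elem:
        children.append(child)
        # ElementTree keeps the text that follows a child in child.tail.
        if child.tail:
            children.append(child.tail)
    # Mixed content: no single text value, segments live among the children.
    return None, children


text, children = bridge_like_view(ET.fromstring('<a>hello</a>'))
mixed_text, mixed_children = bridge_like_view(ET.fromstring('<a>hello<b />there</a>'))
```

In the second case mixed_children holds the string 'hello', the b element and the string 'there', in document order.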
