PHP5 XML Problems & Solutions
Posted by Maciek Wed, 17 Aug 2005 03:37:00 GMT
PHP5 DOM can be tricky at first, and the details are hard to remember. I always find myself digging around for code I wrote in the past to accomplish a particular trick that I know I can do in theory, if I could just remember how.
Some collected problems and solutions follow in no particular order. This is an evolving document, and I will add solutions to problems as I or my friends come across them. If you have any cool tricks or solutions, please leave a comment and let me know. To run the example code, you will need PHP5 with a recent libxml2 installed.
Conventions
To save some space and not repeat myself I follow a few conventions in this document.
$somedoc_xp or $xp – This variable refers to an XPath resolver instantiated on a DOMDocument like this: $somedoc_xp = new DOMXPath($somedoc);
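As a minimal sketch of this convention in use (the sample XML and the extra variable names are made up for illustration):

```php
<?php
// Hypothetical sample document, for illustration only.
$somedoc = new DOMDocument();
$somedoc->loadXML('<list><item>one</item><item>two</item></list>');

// The XPath resolver bound to that document.
$somedoc_xp = new DOMXPath($somedoc);

// query() returns a DOMNodeList; item(0) picks the first match.
$items = $somedoc_xp->query('//item');
$item_count = $items->length;
$first_item_text = $items->item(0)->nodeValue;
```

Every XPath example in the notes assumes a resolver set up this way.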
Special Thanks:
Special thanks go to the following people:
- Corban and CJ for posting this article in earlier times.
- Marek Kuziel for corrections to my stupid mistakes :)
Contents
- Blindly get the root node in the document without node/tagname or namespace knowledge.
- Serialize a document to a string.
- Copy a part of one document into a specific place in another document.
- Delete All Child Elements of some Element whose descendants or immediate children contain a text node with some given text.
- Attempt to parse a possibly-malformed XML document gracefully
The Notes
1. Blindly get the root node in the document without node/tagname or namespace knowledge.
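Using XPath:
// "/*" selects the single element child of the document node, i.e. the root
$doc_nodes = $xp->query("/*");
$root_node = $doc_nodes->item(0);
Using pure DOM:
// documentElement is always the root element, even when a comment or doctype comes first
$root_node = $doc->documentElement;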
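One thing to watch for: a leading comment, processing instruction, or doctype also counts as a child of the document, so the first child is not always the root element. A small sketch with a made-up document that starts with a comment:

```php
<?php
// Hypothetical document whose first child is a comment, not the root element.
$doc = new DOMDocument();
$doc->loadXML('<!-- a leading comment --><root><child/></root>');
$xp = new DOMXPath($doc);

// Both of these find the root element without knowing its name.
$via_xpath = $xp->query('/*')->item(0);
$via_dom = $doc->documentElement;

// firstChild, on the other hand, is the comment here.
$first_child_type = $doc->firstChild->nodeType;
```

Here $first_child_type equals XML_COMMENT_NODE, while both $via_xpath and $via_dom point at the root element.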
2. Serialize a document to a string.
Without an XML header (first get the root node as in note 1)
$document_xml_string = $doc->saveXML($root_node);
With an XML header
$document_xml_string = $doc->saveXML();
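To see the difference between the two calls, here is a short sketch on a made-up one-line document:

```php
<?php
// Hypothetical sample document, for illustration only.
$doc = new DOMDocument();
$doc->loadXML('<note><to>World</to></note>');

// Passing a node serializes just that node: no XML declaration.
$without_header = $doc->saveXML($doc->documentElement);

// With no argument the whole document is serialized, declaration included.
$with_header = $doc->saveXML();
```

The node form is handy when the string is going to be embedded in a larger document, where a second XML declaration would be an error.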
3. Copy a part of one document into a specific place in another document.
Example document 1 ($sourcedoc)
<p>
<span>The Quick Brown Fox..!</span>
</p>
Example document 2 ($targetdoc)
<body>
<div id="targetdiv">
</div>
</body>
Desired document ($targetdoc)
<body>
<div id="targetdiv"><p>
<span>The Quick Brown Fox..!</span>
</p></div>
</body>
The copy:
// note: xpath queries return nodelists, not single nodes,
// which means we have to select the 0th item from the
// resulting nodesets of these queries

// query for our source and target nodes
$target_nodelist = $targetdoc_xp->query("//div[@id='targetdiv']");
$src_nodelist = $sourcedoc_xp->query("//p[position()=1]");

// grab them from the resulting nodelists
$sourcenode = $src_nodelist->item(0);
$targetnode = $target_nodelist->item(0);

// import node from one document into another
$node_copy = $targetdoc->importNode($sourcenode,true);

// place it where we want to put it
$targetnode->appendChild($node_copy);
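For something you can paste and run, here is the same copy as a self-contained sketch. The two documents are loaded from compact one-line strings (my own simplification of the examples above), and $result is a name added only for checking the outcome:

```php
<?php
// Compact versions of the two example documents.
$sourcedoc = new DOMDocument();
$sourcedoc->loadXML('<p><span>The Quick Brown Fox..!</span></p>');
$targetdoc = new DOMDocument();
$targetdoc->loadXML('<body><div id="targetdiv"></div></body>');

$sourcedoc_xp = new DOMXPath($sourcedoc);
$targetdoc_xp = new DOMXPath($targetdoc);

// Pick the node to copy and the place to put it.
$sourcenode = $sourcedoc_xp->query('/p')->item(0);
$targetnode = $targetdoc_xp->query("//div[@id='targetdiv']")->item(0);

// importNode() makes a deep copy owned by $targetdoc; the source is untouched.
$node_copy = $targetdoc->importNode($sourcenode, true);
$targetnode->appendChild($node_copy);

$result = $targetdoc->saveXML($targetdoc->documentElement);
```

Note that the original p element stays in $sourcedoc, since importNode copies rather than moves.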
4. Delete All Child Elements of some Element whose descendants or immediate children contain a text node with some given text
This one is a little tricky, so a complete, testable example follows. The rows we want removed are the ones whose text mentions “fox” or “superman”. Specifically, each text node in this document that contains the word “fox” or “superman” will have its closest ancestor “tr” element removed. In XPath lingo, that means selecting the node at position()=1 on the ancestor axis of these particular text nodes. (whew!)
<?php ob_start(); ?>
<body>
  <table>
    <tr>
      <td>
        <table>
          <tr>
            <td> <span>Now is the time for all good men to come to the aid of the party.</span> </td>
          </tr><tr>
            <td> <span>the <strong>quick brown</strong> fox jumps over the lazy dog.</span> </td>
          </tr><tr>
            <td>
              <table>
                <tr>
                  <td> <span>... Hello There ! .. </span> </td>
                </tr><tr>
                  <td> <span>superman never made any money</span> </td>
                </tr></table>
              <span>It was the best of times, it was the worst of times...</span>
            </td>
          </tr>
        </table>
      </td>
    </tr>
  </table>
</body>
<?php
$xmlstr = ob_get_clean();
$doc = new DOMDocument();
$doc->loadXML($xmlstr);
$doc_xp = new DOMXPath($doc);

// stuff the unwanted nodes into a nodelist
$delNodelist = $doc_xp->query("//tr/descendant-or-self::*/text()[contains(.,'fox') or contains(.,'superman')]/ancestor::*[name()='tr'][position()=1]");

// iterate through that nodelist and remove each node from its parent
foreach($delNodelist as $delNode){
    $removednode = $delNode->parentNode->removeChild($delNode);
}

header("Content-Type: application/xml");
print $doc->saveXML();
?>
Note that you can get very sophisticated with your XPath query very easily. For example, you can quickly expand your path of destruction by telling the XPath query to also include “div” tags. Instead of selecting //tr, change the XPath to select //*[name()='tr'], which is functionally equivalent for a document without namespaces. Then just add the “div” parts with an or. (The XPath function name() returns the qualified name of the context node, that is, the node the XPath engine is currently evaluating.)
$delNodelist = $doc_xp->query("
//*[name()='tr' or name()='div']/descendant-or-self::*/text()[contains(.,'fox')
or contains(.,'superman')]/ancestor::*[name()='tr' or name()='div'][position()=1]
");
5. Attempt to parse a possibly-malformed XML document gracefully
As of writing, PHP5 has a rather inconsistent exception and error handling system, and this inconsistency extends to its DOM and XML features as well. While in other languages you can expect to catch an exception when loading malformed XML, PHP5’s XML parser instead dumps an error message to your screen (depending on how you have configured your errors in PHP). We can get around this, superficially at least, by silencing PHP while we attempt to parse and then doing a sanity check ourselves:
//
// Attempt to parse the document, then check if it loaded
// the @ symbol quietly muffles any error output from loadXML
//
$doc = new DOMDocument();
$loaded = @$doc->loadXML($badxml);

// check that the load succeeded and that the doc has a root element (ie. tag)
if($loaded && $doc->documentElement){
    // we've loaded the document and it's good
    print "Document is valid";
} else {
    print "Document could not be parsed!";
}
Note that in certain situations, libxml2 will go ahead and parse a document that is “semi-bad” (for example, containing unresolvable entities and such), so you can still technically end up with a rather shady document. The idea here is to check whether you have been given completely unparseable trash vs. something you can work with and process with DOM, XPath, and XSLT.
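If you would rather see what went wrong than just silence the parser, libxml's own error buffer is an alternative to the @ operator. This is a different technique from the one above, sketched under the assumption that your PHP has libxml_use_internal_errors() (PHP 5.1 and later); the malformed input and the $messages and $parsed_ok names are made up for illustration:

```php
<?php
// Hypothetical malformed input: the <b> element is never closed.
$badxml = '<root><b>oops</root>';

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadXML($badxml);

// Grab whatever the parser complained about, then reset the buffer
// and restore the previous error-handling setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$messages = array();
foreach ($errors as $error) {
    $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
}

// Same sanity check as before: the load succeeded and there is a root element.
$parsed_ok = $loaded && $doc->documentElement !== null;
```

With this approach you can log $messages or show them to whoever supplied the document, instead of only reporting that parsing failed.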







