Monday, December 19, 2011

DTD Resolution Be Gone!

I just read Cay Horstmann's post on The Sordid Tale of XML Catalogs. I couldn't agree more: it's a mess.

Not so long ago, I had to write some software for pulling down XML content from a web site. I used Nathan Hamblen's Dispatch library and Scala's built-in support for traversing an XML document the easy way, and immediately noticed that parsing the document took far more time than it should. It turns out that while parsing the document, the parser pulled down a whole slew of DTDs and entity definitions.

My first response was to find a way to bake a catalog-based resolver in somewhere. But then, I'm not particularly fond of the current implementations. (Are they even maintained?) And since this was production work, I had to get it working as quickly as possible. So this is what I did to get out of the mess. First of all, I defined a trait:
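import javax.xml.parsers.SAXParserFactory
import scala.xml.Node
import scala.xml.factory.XMLLoader

trait NoDtdResolution extends XMLLoader[Node] {

  // Hand out a SAX parser whose reader has the feature below switched off
  override def parser = {
    val f = SAXParserFactory.newInstance()
    val result = f.newSAXParser()
    val reader = result.getXMLReader
    reader.setFeature("", false)
    result
  }
}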
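
The feature name passed to setFeature is blank in the snippet above. As a minimal sketch of the same idea using only the JDK's SAX classes, the block below assumes a Xerces-based parser (the JDK default) and the Xerces feature `http://apache.org/xml/features/nonvalidating/load-external-dtd`. Both are assumptions, not necessarily what the trait used. `NoDtdDemo` and `countElements` are hypothetical names made up for the illustration.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource, XMLReader}
import org.xml.sax.helpers.DefaultHandler

object NoDtdDemo {
  // Assumption: a Xerces-based parser that recognizes this feature.
  // Other parsers may throw SAXNotRecognizedException here.
  val LoadExternalDtd =
    "http://apache.org/xml/features/nonvalidating/load-external-dtd"

  // Build a non-validating reader that will not fetch the external DTD subset.
  def newReader(): XMLReader = {
    val f = SAXParserFactory.newInstance()
    f.setValidating(false)
    val reader = f.newSAXParser().getXMLReader
    reader.setFeature(LoadExternalDtd, false)
    reader
  }

  // Parse the document and count its start tags.
  def countElements(xml: String): Int = {
    var n = 0
    val reader = newReader()
    reader.setContentHandler(new DefaultHandler {
      override def startElement(uri: String, localName: String,
                                qName: String, attrs: Attributes): Unit =
        n += 1
    })
    reader.parse(new InputSource(new StringReader(xml)))
    n
  }
}
```

With the feature switched off, a document whose DOCTYPE points at an unreachable host still parses, because the DTD is never requested. The flip side is that entities declared only in that external DTD are not expanded.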

Normally, in Scala, you would load an XML document with one of the load methods on the scala.xml.XML object.
However, if you do it like that, you get a parser that downloads the DTDs as well, and I don't want that. With the trait I created, I could load an XML document like this:
val loader = new NoBindingFactoryAdapter with NoDtdResolution
... without the DTDs being resolved. (I guess you could do something similar to install a catalog.) Which is great, but unfortunately, it doesn't make Dispatch aware of it. In Dispatch, if you use the <> operator, it will still download the XML file and parse it with DTD loading activated. So I needed another operator: one that uses a NoBindingFactoryAdapter without DTD resolution:
trait ImplicitXmlHandlers extends ImplicitHandlerVerbs {
  implicit def handlerToXmlHandlers(r: HandlerVerbs) = new XmlHandlers(r)
  implicit def requestToXmlHandlers(r: Request) = new XmlHandlers(r)
}

object XmlHandlers extends ImplicitXmlHandlers

class XmlHandlers(subject: HandlerVerbs) {

  private val tolerantAdapter = new NoBindingFactoryAdapter with NoDtdResolution

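  // Like <>, but parses the response with the adapter that skips DTD resolution
  def <|>[T](fn: NodeSeq => T) = subject >>~ { reader =>
    fn(tolerantAdapter.load(reader))
  }
}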

With that in place, I can now use Dispatch without worrying about DTDs getting loaded:
Http(url(...) <|> {
  xml => ...
})
... where I was using this before:
Http(url(...) <> {
  xml => ...
})
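
On the aside above about installing a catalog in a similar way: a rough sketch of that idea is a plain SAX EntityResolver that serves DTDs from a local table and hands back an empty DTD for anything it does not know, so nothing is downloaded. This is not a real XML Catalog implementation. The `LocalResolverDemo` object, the `localDtds` table, the URL and the entity in it are all hypothetical, and the code uses only the JDK's SAX classes.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{EntityResolver, InputSource}
import org.xml.sax.helpers.DefaultHandler

object LocalResolverDemo {
  // Hypothetical "catalog": system identifier -> DTD text kept locally.
  val localDtds: Map[String, String] = Map(
    "http://example.invalid/note.dtd" -> "<!ENTITY greeting \"hello\">"
  )

  // Serve known DTDs from the table; answer everything else with an
  // empty DTD so the parser never goes to the network.
  val resolver: EntityResolver = new EntityResolver {
    def resolveEntity(publicId: String, systemId: String): InputSource =
      new InputSource(new StringReader(localDtds.getOrElse(systemId, "")))
  }

  // Parse the document and collect its character data.
  def textOf(xml: String): String = {
    val sb = new StringBuilder
    val reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader
    reader.setEntityResolver(resolver)
    reader.setContentHandler(new DefaultHandler {
      override def characters(ch: Array[Char], start: Int, length: Int): Unit =
        sb.append(new String(ch, start, length))
    })
    reader.parse(new InputSource(new StringReader(xml)))
    sb.toString
  }
}
```

Unlike switching DTD loading off altogether, this keeps entities from known DTDs working, at the cost of maintaining the table by hand.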