To get good performance out of your implementation, you'll need to do a few things to keep XML parsing fast.

  • Ensure you're not validating documents on every page request.
  • Write efficient XSL and measure its performance using a variety of tools - some XSL is expensive.
  • If you must parse, use SAX or TRaX - do not use DOM APIs, which require documents to be fully loaded in memory.
  • Whenever possible, stream content - don't render the whole page to a buffer. Doing so means you lose all of the benefits of event-based XML processing gained by using SAX.
  • Ensure that remote entity references are always either cached locally or referenced from local files; see the following section for more details.
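
As a minimal sketch of the first and third points, assuming the standard JAXP SAX API, the following streams a document through a non-validating parser and handles events as they arrive. The class name StreamingElementCounter and the element-counting handler are illustrative only; a real handler would write output as it goes.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class StreamingElementCounter {
    /** Streams the document through a non-validating SAX parser and counts its elements. */
    static int countElements(InputStream xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);    // already the JAXP default; stated explicitly here
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();

        final int[] count = {0};
        parser.parse(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                count[0]++; // each event is handled as it arrives; no document tree is built
            }
        });
        return count[0];
    }
}
```

Because the handler sees one event at a time, memory use stays roughly constant regardless of document size, which is the property the DOM APIs give up.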

Accessing DTDs and Schemas

XML parsers rely on DTDs and schemas when performing two kinds of operations:

  • when validating XML documents, and
  • when resolving entity references.

Validation is a slow process; in general, ensure that your production builds do not need to validate documents in order to show content. Skipping validation conserves CPU on your production systems and shortens the time it takes to generate content, which keeps latency to a minimum.

One of the most important things to recognize about XML parsers, however, is that even when validation is disabled, the parser still needs access to DTDs and to any external documents that declare entities; in its default configuration it will not ignore them.

Take, for example, a doctype tag: <!DOCTYPE wowDestination SYSTEM "http://xml.whatsonwhen.com/whatsonwhen.destination.dtd">

The DTD named by a doctype tag can declare entities. An entity such as e-acute, which appears in browsers as é but is written in XML as &eacute;, has to be expanded by the XML parser. In order to know how to expand that entity reference, the parser needs to load every foreign entity declaration that the document refers to.

Practically speaking, this means that a document which contains either a doctype or a foreign entity reference may refer to a remote website, as the DOCTYPE tag above does. The DTD it names may contain further references to other foreign entities, all of which have to be loaded over the internet via HTTP on every parse operation, slowing the system to a crawl and damaging the performance of your production systems.
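
To see this behaviour for yourself, here is a small probe, written as a sketch against the standard SAX API. It parses with validation switched off and records every system ID the parser asks to load. The class name ExternalLookupProbe is hypothetical, and the resolver returns an empty DTD so that the probe itself never touches the network. With the parser bundled in the JDK and its default settings, the DOCTYPE URL is expected to show up in the returned list; other parsers or hardened configurations may behave differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic helper, not part of any library.
class ExternalLookupProbe {
    /** Parses without validation and returns every system ID the parser asked to load. */
    static List<String> requestedSystemIds(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        List<String> requested = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler());
        reader.setEntityResolver((publicId, systemId) -> {
            requested.add(systemId);
            // Return an empty DTD so the probe never performs a real HTTP request.
            return new InputSource(new StringReader(""));
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return requested;
    }
}
```

Running the probe over a sample of production documents gives a quick inventory of the remote resources a parse would otherwise fetch.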

Parsers provide a variety of mechanisms for handling these problems; you can also take steps yourself, without writing code, to ensure these problems don't happen to your production systems.

In short, these are:

  • Implement your parser's DTD/EntityHandler in code; when you encounter a reference to a SystemID that is not on your local filesystem, or that appears to be a URL starting with "http://", perform the HTTP request in code, but cache the results of that request in memory.
  • If your parser supports XCatalog, provide your parser with a local catalog - this contains a list of SystemIDs, each with a reference to where, on your filesystem, the parser can find a local copy of the remote resource.
  • If you do not want to write code, make local copies of all remotely referenced DTDs and foreign entity documents, and modify your documents so that they make no further web requests: point all of your XML at the local copy of the DTD on your filesystem, then examine the DTD for foreign entity references and do the same for it, making a local copy of each foreign entity document it attempts to load.

Using DTD/EntityHandler API to cache content

Simple and straightforward: both Java and .NET provide APIs for implementing custom DTD and entity handlers (for example, SAX's EntityResolver in Java and XmlResolver in .NET); the parser calls a method such as 'resolveEntity', passing a public ID and a system ID. On the first request, perform the HTTP request, read the response, and cache it in memory; on later requests, return the cached copy to the parser. No document modification is required, and in many ways this is the high-performance option, since the content to be parsed is always held in memory, ready for use.
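
A minimal sketch of such a handler in Java, assuming the SAX EntityResolver interface and Java 9 or later. The class name CachingEntityResolver and its fetch method are illustrative, and the cache here is unbounded and never expires, which a production version would probably want to change.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class: caches every resolved DTD or entity in memory, keyed by system ID.
class CachingEntityResolver implements EntityResolver {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException {
        if (systemId == null) {
            return null; // let the parser fall back to its default behaviour
        }
        byte[] content = cache.get(systemId);
        if (content == null) {
            content = fetch(systemId);      // only the first request reaches the remote site
            cache.put(systemId, content);
        }
        InputSource source = new InputSource(new ByteArrayInputStream(content));
        source.setPublicId(publicId);
        source.setSystemId(systemId);       // keeps relative references inside the DTD resolvable
        return source;
    }

    /** Reads the resource behind the system ID; protected so it can be replaced in tests. */
    protected byte[] fetch(String systemId) throws IOException {
        try (InputStream in = URI.create(systemId).toURL().openStream()) {
            return in.readAllBytes();
        }
    }
}
```

To use it, register one shared instance on each parser, for example with XMLReader.setEntityResolver(resolver), so that all parses draw on the same cache.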

Using XCatalog

For parsers that support it, this is also straightforward, and provides a universal mechanism for mapping foreign entity and DTD references across your entire application to local copies. If you're using XCatalog today, add the SystemIDs of all entity references which refer to remote sites to your catalog document, and copy the content from those remote sites to your hard drive; no document modification is required.
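
As an illustration only, the sketch below uses the javax.xml.catalog API that ships with JDK 9 and later, which reads catalogs in the OASIS XML Catalogs format; treating that as a stand-in for XCatalog is an assumption, and your parser's catalog support may look different. The class name LocalCatalogDemo and the file names catalog.xml and whatsonwhen.destination.dtd (the local copy) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.Catalog;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;

// Hypothetical demo class, not part of any library.
class LocalCatalogDemo {
    static final String REMOTE_DTD = "http://xml.whatsonwhen.com/whatsonwhen.destination.dtd";

    /** Writes a one-entry catalog mapping the remote DTD to a file in the same directory. */
    static Path writeCatalog(Path dir) throws IOException {
        String xml =
            "<?xml version=\"1.0\"?>\n"
            + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
            + "  <system systemId=\"" + REMOTE_DTD + "\" uri=\"whatsonwhen.destination.dtd\"/>\n"
            + "</catalog>\n";
        Path catalogFile = dir.resolve("catalog.xml");
        Files.write(catalogFile, xml.getBytes(StandardCharsets.UTF_8));
        return catalogFile;
    }

    /** Returns the local URI the catalog maps the system ID to. */
    static String lookup(Path catalogFile, String systemId) {
        Catalog catalog = CatalogManager.catalog(CatalogFeatures.defaults(), catalogFile.toUri());
        return catalog.matchSystem(systemId);
    }
}
```

In an application, CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalogFile.toUri()) returns a resolver that can be passed to XMLReader.setEntityResolver, so the mapping applies to every parse without touching the documents.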

Modifying Content To Avoid Remote References

This approach is straightforward, but it requires a process to ensure that new content is modified before it goes live. It means following a simple procedure on each XML document.

Make a list of each XML document to be processed; starting with the first document in the list:

  1. If this document contains a DTD reference that refers to a remote site:
    • If you have not already downloaded it, do so, and add it to the list of documents to process.
    • Change the remote HTTP reference in the System ID to refer to the path of the local file.
  2. If this document contains an Entity reference that refers to a remote site:
    • If you have not already downloaded it, do so, and add it to the list of documents to process.
    • Change the remote HTTP reference in the System ID to refer to the path of the local file.
  3. If this document contains a Schema reference that refers to a remote site:
    • If you have not already downloaded it, do so, and add it to the list of documents to process.
    • Change the remote HTTP reference in the schema location to refer to the path of the local file.
  4. When processing of the current document is done, move to the next in the list.

When finished, you should have made one pass through each XML document provided to you; in addition, you should have made one pass through each document which you downloaded, including any downloaded documents which themselves referred to other documents that had to be downloaded.

For example, a WhatsOnWhen guide document refers to a DTD, and that DTD refers to three files on W3C servers which define foreign character entities. Processing it therefore means modifying the source XML, downloading and modifying a copy of the WhatsOnWhen DTD, and downloading the three W3C files that the DTD refers to.
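
The find-and-rewrite step of this procedure can be sketched in Java as follows. This is a rough illustration, not a complete tool: the class RemoteReferenceRewriter and its methods are hypothetical, it uses a regular expression rather than a real parser (so it would also match declarations inside comments), it handles only SYSTEM and PUBLIC identifiers in DOCTYPE and ENTITY declarations (not schema references), and it names local files after the last path segment of the URL, which assumes those names do not collide.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class RemoteReferenceRewriter {
    // group 1: SYSTEM or PUBLIC "public-id" prefix; group 2: quote character; group 3: remote URL
    private static final Pattern REMOTE_SYSTEM_ID = Pattern.compile(
        "(SYSTEM\\s+|PUBLIC\\s+(?:\"[^\"]*\"|'[^']*')\\s+)([\"'])(https?://[^\"']+)\\2");

    /** Lists the remote system IDs in a document, in order; these are the files to download. */
    static Set<String> remoteSystemIds(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = REMOTE_SYSTEM_ID.matcher(text);
        while (m.find()) {
            found.add(m.group(3));
        }
        return found;
    }

    /** Local file name for a remote system ID: the last path segment of the URL. */
    static String localName(String systemId) {
        return systemId.substring(systemId.lastIndexOf('/') + 1);
    }

    /** Rewrites every remote system ID to point at a file under localDir. */
    static String rewrite(String text, String localDir) {
        Matcher m = REMOTE_SYSTEM_ID.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String quote = m.group(2);
            String replacement = m.group(1) + quote + localDir + "/" + localName(m.group(3)) + quote;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Applied to the list of documents, remoteSystemIds tells you what to download and add to the list, and rewrite produces the modified copy; repeating both on each downloaded file covers the nested references.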

Build A High-Performance XSLT Implementation

Whenever possible, use the TRaX-based API to run your XSLT transformations. Recent XSLTC compilers will compile XSLT stylesheets into high-performance document transformers; although XSLTC has some semantic differences and a few known problems, use it when you can.

Even when you can't do that, ensure that parsed XSL stylesheets are cached whenever possible. The JAXP API offers a method on TransformerFactory, newTemplates, which returns a Templates object; this is a parsed, reusable representation of the stylesheet, and reusing it can result in much higher performance because the XSLT doesn't have to be parsed into memory for each run. It also gives XSLT processors an opportunity to do JIT-style optimisations on the in-memory representation, using profile information taken over multiple document runs (though not all implementations can; indeed, this is one of the fundamental benefits of an XSLTC implementation).
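
A minimal sketch of such a cache, assuming the standard JAXP transform API. The class name TemplatesCache, the string key, and the choice to pass the stylesheet as text are all illustrative; a real application would more likely key on the stylesheet's path and load it from disk.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical cache class, not part of any library.
class TemplatesCache {
    private final TransformerFactory factory = TransformerFactory.newInstance();
    private final Map<String, Templates> cache = new HashMap<>();

    /** Parses the stylesheet on first use and returns the same Templates object afterwards. */
    synchronized Templates get(String key, String stylesheetText) throws TransformerConfigurationException {
        Templates templates = cache.get(key);
        if (templates == null) {
            templates = factory.newTemplates(new StreamSource(new StringReader(stylesheetText)));
            cache.put(key, templates);
        }
        return templates;
    }

    /** Runs one transformation using the cached stylesheet. */
    String transform(String key, String stylesheetText, String xml) throws TransformerException {
        // A Transformer is not thread-safe, so create one per run from the shared Templates.
        Transformer transformer = get(key, stylesheetText).newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

The StringWriter keeps the sketch easy to test; in production, pass a StreamResult that wraps the response's output stream so the result is streamed rather than buffered.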