Lessons learned:
Think four times before doing stream-based XML processing, even though it appears to be more efficient than tree-based.
But if you have to do stream-based processing, make sure to use robust, fairly scalable tools like XML::Templates, not sgmlspl. Of course it cannot be as pleasant as tree-based XML processing, but look at db2x_manxml and db2x_texixml to see what can be done.
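The trade-off can be sketched in a few lines of Python (a hypothetical illustration, not docbook2X code): the stream-based version must track parser state by hand across callbacks, while the tree-based version holds the whole document in memory and queries it directly.

```python
# Illustration only: extracting a title stream-wise (SAX) vs tree-wise.
import io
import xml.sax
import xml.etree.ElementTree as ET

DOC = "<book><title>Lessons</title><para>text</para></book>"

# Stream-based: callbacks fire in document order, so state is kept by hand.
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.in_title = False
        self.title = ""
    def startElement(self, name, attrs):
        self.in_title = (name == "title")
    def endElement(self, name):
        if name == "title":
            self.in_title = False
    def characters(self, content):
        if self.in_title:
            self.title += content

handler = TitleHandler()
xml.sax.parse(io.BytesIO(DOC.encode()), handler)

# Tree-based: the whole document is in memory, so the query is one line.
tree_title = ET.fromstring(DOC).findtext("title")

assert handler.title == tree_title == "Lessons"
```

Even for this toy query the stream version needs a state machine; real conversions multiply that state for every nesting context they care about.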
Do not use XML::DOM directly for stylesheets. Your “stylesheet” would become seriously unmanageable. At least take a look at some of the XPath modules out there. Ideally, use a real stylesheet language like XSLT. A C-based implementation of XSLT is faster than any Perl hack you can come up with.
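The difference between hand-rolled DOM walking and path expressions can be seen even with the limited XPath subset in Python's standard library (a hypothetical sketch, not docbook2X code):

```python
# Illustration only: manual DOM-style recursion vs a declarative path query.
import xml.etree.ElementTree as ET

DOC = """<article>
  <section><title>One</title></section>
  <section><title>Two</title></section>
</article>"""
root = ET.fromstring(DOC)

# DOM-style: explicit recursion over every node, query logic spread out.
def titles_by_walking(node, out=None):
    if out is None:
        out = []
    if node.tag == "title":
        out.append(node.text)
    for child in node:
        titles_by_walking(child, out)
    return out

# Path-expression style: the same query, stated declaratively in one line.
titles_by_path = [t.text for t in root.findall(".//section/title")]

assert titles_by_walking(root) == titles_by_path == ["One", "Two"]
```

A stylesheet is essentially hundreds of such queries; written DOM-style, each one becomes its own little traversal function.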
Avoid XSLT extensions whenever possible. I don't think there is anything wrong with them intrinsically, but it is a headache to have to compile your own XSLT processor. (libxslt is written in C, and the extensions must be compiled in; they cannot be loaded dynamically at runtime.) Not to mention that there seem to be a thousand different set-ups for different XSLT processors.
Perl is not as good at XML as it is hyped to be. Too many !@#$% characters when using objects in Perl. But what bites the most is that Perl SAX does not seem to be well-maintained. It also seems that no one else has seriously used Perl SAX for robust applications — by which I mean not your occasional hacks, but programs that must perform error diagnostics on their input, among other things.
Don’t be afraid to use XML intermediate formats (e.g. Man-XML and Texi-XML) for converting to other markup languages. The rules of the target markup languages were made for human consumption, not from purely logical considerations, so it is difficult for XML tools to write “perfect” output in them: standard stylesheet languages (XSLT, DSSSL) cannot easily be used for converting to such markup, and embedding the markup rules into the conversion tool increases its complexity to unmanageable proportions.
You might think that we could, instead, make a separate class (in the Java sense) that hides all this complexity from the rest of the conversion program. Theoretically you would get the same result, but it would be harder. Firstly, it is far easier to write plain text manipulation code in Perl than in Java or C or XSLT, which is what you would otherwise be restricted to. Secondly, if the intermediate format is hidden behind a Java class or C API, it is harder to debug errors. Whereas with the approach we have taken, we can visually examine the textual output of the XSLT processor and fix the Perl script as we go along.
Finally, another advantage of using intermediate XML formats processed by a Perl script is that we can often eliminate the use of XSLT extensions. In particular, all the way back when XSLT stylesheets first went into docbook2X, the extensions related to Texinfo node handling could have been easily moved to the Perl script, but I didn't realize it! I feel stupid now.
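The two-stage pipeline is easy to sketch: a stylesheet emits an easy-to-process intermediate XML, and a small script serializes it into the target markup. The element names below are invented for illustration; they are not the real Man-XML vocabulary, and this is Python rather than the Perl docbook2X actually uses.

```python
# Illustration only: second stage of the pipeline, turning a (made-up)
# intermediate XML into roff-style man-page markup.
import xml.etree.ElementTree as ET

INTERMEDIATE = "<manpage><section>NAME</section><para>foo - do things</para></manpage>"

def to_roff(xml_text):
    out = []
    for el in ET.fromstring(xml_text):
        if el.tag == "section":
            out.append(".SH " + el.text)    # roff section heading
        elif el.tag == "para":
            out.append(".PP\n" + el.text)   # roff paragraph
    return "\n".join(out)

print(to_roff(INTERMEDIATE))
```

Because the intermediate format is ordinary XML text, the output of the first stage can be inspected by eye at any time, which is exactly the debugging advantage described above.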
Design the XML intermediate format to be easy to use from the standpoint of the conversion tool, and similar to how XML document types work in general. For example, represent the paragraphs of a document as elements, rather than marking their paragraph breaks (the latter is typical of traditional markup languages, but not of XML).
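A minimal sketch of that design point (hypothetical element names, not the real intermediate formats): when paragraphs are container elements, the break convention of the target language becomes a trivial, local serialization decision.

```python
# Illustration only: container-style paragraphs vs break markers.
import xml.etree.ElementTree as ET

doc = ET.fromstring("<doc><para>first</para><para>second</para></doc>")

# Emitting traditional blank-line paragraph breaks is now one join:
text = "\n\n".join(p.text for p in doc.findall("para"))
assert text == "first\n\nsecond"
```

Going the other way — recovering paragraph containers from bare break markers — would require the tool to re-parse the breaks, which is why the container form is the better intermediate.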
If I had known this in the very beginning, it would have saved a lot of development time, and docbook2X would be much more advanced by now.
I'm quite impressed by some of the things that people make XSLT 1.0 do. Things that I thought were impossible, or at least unworkable without a “real” scripting language. (db2x_manxml and db2x_texixml fall into the category of things that can be done in XSLT 1.0, but only inelegantly.)
Internationalize as soon as possible. That is much easier than adding it in later.
The same advice applies to the build system.
Writing good documentation takes skill. This manual has been revised substantially at least four times [3], with the author consciously trying to condense information each time.