Converting Documents to ThML

Perl programs are available for converting HTML to ThML (especially HTML documents saved from Word 2000) and for converting ThML to plain text and to a linked web of HTML pages. You will need a Perl interpreter and, probably, Microsoft Word 2000.

To prepare a document in Word, create a named character style called XML. If you like, you can import that style and others from the sample16.rtf document. Then put in <divn> tags as required, and add other markup as desired, using the XML character style. If you are using the predefined styles, you could mark a scripture reference by selecting it and changing it to the scripref style, perhaps with a keyboard equivalent of alt-s. For more information on markup, see the simple markup guidelines or ... [pending]

The conversion scripts, available as h2h.gz, are split into a number of steps, both for speed and ease of development. The steps are these:

All of these are performed by the "h2h" script. On a Unix system, type "h2h bookID", where bookID is the name of the HTML file without its .htm extension. The script does the rest. Be prepared to wait a few minutes for large files.
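As a rough illustration of the invocation, here is a minimal sketch of a wrapper that accepts a name with or without .htm and passes the bare bookID on. The names book_id and run_h2h are hypothetical and not part of the distribution; the sketch assumes h2h is on your PATH and that the input file sits in the current directory.

```shell
#!/bin/sh
# book_id: strip a trailing .htm so the argument has the form h2h expects.
book_id() {
    printf '%s\n' "${1%.htm}"
}

# run_h2h: hypothetical wrapper, not shipped with the scripts.
# Assumes h2h is on PATH and <bookID>.htm is in the current directory.
run_h2h() {
    id=$(book_id "$1")
    if [ ! -f "$id.htm" ]; then
        echo "missing input file: $id.htm" >&2
        return 1
    fi
    h2h "$id"
}

# Show the bookID that would be passed to h2h for a given file name.
book_id mybook.htm
```

With this in place, "run_h2h mybook.htm" and "run_h2h mybook" would both end up calling "h2h mybook".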

If you have HTML files from another source and you can't first read them into Microsoft Word, you can try the process, but I haven't tested it for that use.

Known bugs: This software is still under development. It works reasonably well on a couple of documents, but there are some bugs.

  • The HTML output is not valid
  • The generated ThML is not always valid either
  • And more, no doubt. You will probably be frustrated if you try to use this software unless you are a bit of a Perl hacker.

The shell script provided is for Unix, though conversion to MS-DOS is straightforward: add "perl " before each script name and replace $1 with %1. I regularly use the scripts under both MS-DOS and Linux.
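The MS-DOS conversion described above can be sketched as a small sed filter. This is only an illustration under stated assumptions: it assumes every non-blank, non-comment line of the driver script invokes exactly one Perl script, and the step names step1.pl and step2.pl are made up for the example, since the real script names inside h2h are not listed here.

```shell
#!/bin/sh
# to_batch: read a Unix driver script on stdin, write an MS-DOS batch version.
# Assumption: each remaining line starts with the name of a Perl script.
to_batch() {
    # Drop comment lines (including the #! line) and blank lines,
    # prefix each command with "perl ", and turn $1 into %1.
    sed -e '/^#/d' \
        -e '/^[[:space:]]*$/d' \
        -e 's/^/perl /' \
        -e 's/\$1/%1/g'
}

# Hypothetical two-step driver; the real steps in h2h may differ.
printf '%s\n' \
    '#!/bin/sh' \
    'step1.pl $1.htm > $1.tmp' \
    'step2.pl $1.tmp > $1.xml' | to_batch
```

For the hypothetical driver above, the filter prints "perl step1.pl %1.htm > %1.tmp" and "perl step2.pl %1.tmp > %1.xml". Redirecting that output to a .bat file gives a starting point that you would still want to check by hand.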