TEI Dictionaries

From CrossWire Bible Society
Revision as of 11:02, 16 May 2008 by Osk (Talk | contribs) (Introduction: updated link to latest custom TEI schema, which adds numerous modules that are useful in dictionaries)


In Sword, the "LD" module type is used for modules with non-hierarchical keys. Such modules include dictionaries (indexed by words or numbers), glossaries (simple one-word translation dictionaries), and daily devotionals (indexed by dates).

For lexicons and dictionaries, the use of TEI P5 markup is encouraged. TEI P5 is an XML standard, quite similar to OSIS and ThML, intended for encoding all types of electronic documents. Since TEI is modular, it is possible to ignore the majority of its modules and use only the small set of tags necessary for our needs.

For the purpose of using TEI P5 in Sword, we have developed a special XML Schema that includes the basic set of P5 modules necessary for dictionaries and adds osisID and osisRef attributes (with their normal OSIS syntax) to many elements. This permits cross-referencing with OSIS modules and the use of standard Bible references in TEI documents. Our customized TEI schema is available at http://www.crosswire.org/osis/teiP5osis.1.2.xsd.


Sword uses a strict binary search to find entries and nearest entries in an "LD" module. This places two restrictions upon a TEI dictionary:

1) Keys are unique and cannot be repeated. This poses problems with some dictionaries that have more than one entry for a key; such entries will need to be merged.

2) Entries will be re-ordered (by the importer utility) based upon their UTF-8 code points. Fortunately, this turns out to be identical to a simple byte-wise (8-bit) comparison of the UTF-8 encoded keys.
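As a rough illustration of the two restrictions above, the following Python sketch merges entries that share a key, orders the keys by their UTF-8 bytes, and then performs a binary-search lookup. The function names, the newline used to join merged entries, and the rule that the "nearest" entry is the first key at or after the query are all assumptions made for illustration; they are not taken from the Sword importer.

```python
import bisect


def prepare_entries(entries):
    """Merge entries that share a key, then order them by UTF-8 bytes."""
    merged = {}
    for key, text in entries:
        merged.setdefault(key, []).append(text)
    ordered = sorted(merged, key=lambda k: k.encode("utf-8"))
    # Joining with a newline is an assumption; merge however suits the data.
    return [(key, "\n".join(merged[key])) for key in ordered]


def find_nearest(keys, query):
    """Binary search over byte-ordered keys (hypothetical 'nearest' rule)."""
    encoded = [k.encode("utf-8") for k in keys]
    index = bisect.bisect_left(encoded, query.encode("utf-8"))
    # Clamp so that a query past the last key returns the last key.
    return keys[min(index, len(keys) - 1)]


sample = [
    ("beta", "second letter"),
    ("alpha", "first letter"),
    ("Zion", "a hill"),
    ("alpha", "beginning"),
    ("ἀγάπη", "love"),
]

entries = prepare_entries(sample)
keys = [key for key, _ in entries]
print(keys)                     # ['Zion', 'alpha', 'beta', 'ἀγάπη']
print(find_nearest(keys, "b"))  # beta
```

Note that upper-case keys sort before lower-case ones and non-ASCII keys sort last, which is a direct consequence of ordering by code point rather than by a language-aware collation.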


Linking Entries

There are three kinds of references in a Biblical dictionary:

Internal references to entries in the same dictionary.

<xr type="see">See: <ref target="key">key text</ref></xr>

Note: the key text may be any representation of the key. For example, the key may be G0019a and the key text may be 19a.

External references to entries in another work.

<xr type="xref"><ref target="work:key">key text</ref></xr>

Note: work is the short module name as found between square brackets, as in [BDB].

Biblical references to scripture passages. These use osisRef and are discussed below.
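The internal and external reference forms can also be generated programmatically. The Python sketch below uses the standard library's ElementTree to build xr elements. It assumes that the link is carried by a TEI ref element with a target attribute and that an external target is written as work:key; both are assumptions, as are the helper name make_xr and the external key shown, which is purely hypothetical.

```python
import xml.etree.ElementTree as ET


def make_xr(xr_type, target, key_text, lead=None):
    """Build an <xr> cross-reference wrapping a single <ref> (assumed form)."""
    xr = ET.Element("xr", {"type": xr_type})
    xr.text = lead  # optional text before the <ref>, e.g. "See: "
    ref = ET.SubElement(xr, "ref", {"target": target})
    ref.text = key_text
    return ET.tostring(xr, encoding="unicode")


# Internal reference: key G0019a displayed as 19a.
internal = make_xr("see", "G0019a", "19a", lead="See: ")

# External reference: the key after "BDB:" is a made-up placeholder.
external = make_xr("xref", "BDB:example-key", "example key")

print(internal)
print(external)
```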

OSIS References

For the purpose of facilitating the marking of Bible references and linking with OSIS documents, two attributes have been borrowed from OSIS:

osisID exists on virtually all elements and contains osisID(s) (optionally with work IDs). An osisID might be used to link from an OSIS document to your TEI dictionary entry.

osisRef exists on the <ref> element and contains osisRef(s) in the normal OSIS syntax, marking Biblical references to scripture passages.
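For readers unfamiliar with the OSIS reference syntax, the following Python sketch splits a simple reference into its parts. It handles only the common shapes (an optional work prefix followed by a colon, dot-separated book, chapter, and verse, and a hyphenated range) and ignores the rest of the OSIS grammar; the function name parse_osis_ref and the returned dictionary layout are hypothetical.

```python
def parse_osis_ref(osis_ref):
    """Split a simple OSIS reference into work, start, and end (sketch only)."""
    work, sep, rest = osis_ref.partition(":")
    if not sep:
        # No work prefix present: the whole string is the reference.
        work, rest = None, osis_ref

    def split_point(point):
        pieces = point.split(".")
        # First piece is the book abbreviation; the rest are numbers.
        return (pieces[0], *[int(p) for p in pieces[1:]])

    start, _, end = rest.partition("-")
    return {
        "work": work,
        "start": split_point(start),
        "end": split_point(end) if end else None,
    }


single = parse_osis_ref("Gen.1.1")
ranged = parse_osis_ref("Bible:John.3.16-John.3.17")
print(single)
print(ranged)
```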


Use of the following XML header will assist in automated validation with most XML Schema Validators:

<?xml version="1.0" encoding="utf-8"?>
<TEI xmlns="http://www.crosswire.org/2008/TEIOSIS/namespace"


The program tei2mod is used to create a Sword Lexicon/Dictionary/Daily Devotional/Glossary module from valid TEI.

This is a work in progress, so please report any problems found.

The current usage is:

usage: ./tei2mod <output/path> <teiDoc> [OPTIONS]
  -z                     use ZIP compression (default no compression)
  -Z                     use LZSS compression (default no compression)
  -s <2|4>               max text size per entry (default 4):
  -c <cipher_key>        encipher module using supplied key
                                 (default no enciphering)
  -N                     Do not convert UTF-8 or normalize UTF-8 to NFC
                                 (default is to convert to UTF-8, if needed, and then normalize to NFC)
                                 Note: all UTF-8 texts should be normalized to NFC
-z, -Z, and -s are mutually exclusive

At this time, enciphering does not work.

Currently, all compressed SWORD modules use ZIP compression.
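When tei2mod is driven from a script, it can help to assemble the argument list in one place and reject conflicting options before running anything. The Python sketch below uses only the flags listed in the usage text above and enforces the stated mutual exclusion; the function name, its parameters, and the example paths are hypothetical, and the -c option is left out because enciphering does not currently work.

```python
import shlex


def build_tei2mod_command(output_path, tei_doc, zip_compress=False,
                          lzss_compress=False, entry_size=None,
                          skip_normalization=False, program="./tei2mod"):
    """Return an argv list for tei2mod, validating the exclusive options."""
    exclusive = [zip_compress, lzss_compress, entry_size is not None]
    if sum(exclusive) > 1:
        raise ValueError("-z, -Z, and -s are mutually exclusive")
    if entry_size not in (None, 2, 4):
        raise ValueError("-s accepts only 2 or 4")

    command = [program, output_path, tei_doc]
    if zip_compress:
        command.append("-z")
    if lzss_compress:
        command.append("-Z")
    if entry_size is not None:
        command += ["-s", str(entry_size)]
    if skip_normalization:
        command.append("-N")
    return command


# Example paths are placeholders, not real module locations.
command = build_tei2mod_command("modules/mydict", "mydict.tei.xml",
                                zip_compress=True)
print(shlex.join(command))  # ./tei2mod modules/mydict mydict.tei.xml -z
```

The resulting list can be passed to subprocess.run(command, check=True) on a system where tei2mod is installed.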

Alternatively, imp2ld can be used, but at this time it is not recommended because it will not convert the text to NFC-normalized UTF-8.
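If text must be normalized outside of tei2mod, Python's standard unicodedata module can do it. The short sketch below shows why normalization matters for key lookup: the same visible character can be stored as two different byte sequences until it is normalized to NFC. The helper name normalize_key is hypothetical.

```python
import unicodedata


def normalize_key(key):
    """Return the key in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", key)


decomposed = "e\u0301"   # 'e' followed by a combining acute accent
precomposed = "\u00e9"   # the single code point for 'é'

# The two spellings look alike but differ byte for byte.
print(decomposed.encode("utf-8") == precomposed.encode("utf-8"))  # False

# After NFC normalization they are identical.
print(normalize_key(decomposed) == precomposed)                   # True
```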

Note: Having keys in the proper order may noticeably improve import time and may affect the module's lookup performance for "adjacent" entries.
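A quick pre-import check can report whether a list of keys is already in the expected order. The Python sketch below compares neighbouring keys by their UTF-8 bytes and returns the position of the first key that is out of order or repeated; the function name is hypothetical, and the check assumes that byte order is the order the importer wants.

```python
def first_out_of_order(keys):
    """Return the index of the first misplaced or duplicate key, else None."""
    encoded = [k.encode("utf-8") for k in keys]
    for i in range(1, len(encoded)):
        # A key must be strictly greater than its predecessor.
        if encoded[i - 1] >= encoded[i]:
            return i
    return None


print(first_out_of_order(["Zion", "alpha", "beta"]))  # None
print(first_out_of_order(["alpha", "Zion", "beta"]))  # 1
```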