Difference between revisions of "TEI Dictionaries"

From CrossWire Bible Society

Revision as of 12:45, 19 March 2009

Introduction

In SWORD, the "LD" module type is used to contain modules keyed to non-hierarchical keys. Such modules include: dictionaries (indexed by words or numbers), glossaries (simple one-word translation dictionaries), and daily devotionals (indexed by dates).

For lexicons and dictionaries, the use of TEI P5 markup is encouraged. TEI P5 is an XML standard, quite similar to OSIS and ThML, intended for encoding all types of electronic documents. Since TEI is modular, it is possible to ignore the majority of its modules and use only the smaller set of tags that our needs require.

For the purpose of using TEI P5 in SWORD, we have developed a special XML Schema that includes the basic set of P5 modules necessary for dictionaries and adds osisID and osisRef attributes (with their normal OSIS syntax) to many elements. This will permit cross-referencing with OSIS modules and the use of standard Bible references in TEI documents. Our customized TEI schema is available at http://www.crosswire.org/osis/teiP5osis.1.3.xsd.

Order

SWORD uses a strict binary search to find entries and nearest entries in an "LD" module. This places two restrictions on a TEI dictionary:

1) Keys are unique and cannot be repeated. This poses problems with some dictionaries that have more than 1 entry for a key. These will need to be merged.

2) Entries will be re-ordered (by the importer utility) based upon their UTF-8 code points. Fortunately, this order is identical to a simple byte-by-byte (8-bit) comparison of the UTF-8 encoded keys.
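The two restrictions above can be illustrated with a minimal Python sketch. It is not the importer's code: the function name merge_and_sort and the choice to join merged bodies with a newline are assumptions made for illustration only.

```python
def merge_and_sort(entries):
    """Merge bodies that share a key, then order keys by their UTF-8 bytes.

    entries: iterable of (key, body) pairs. Joining merged bodies with a
    newline is an assumption; a real dictionary may need a different merge.
    """
    merged = {}
    for key, body in entries:
        merged.setdefault(key, []).append(body)
    # Sorting on the encoded bytes is a plain byte-wise comparison.
    ordered = sorted(merged, key=lambda k: k.encode("utf-8"))
    return [(k, "\n".join(merged[k])) for k in ordered]


entries = [
    ("zeal", "first sense"),
    ("abba", "father"),
    ("zeal", "second sense"),
    ("\u1f00\u03b3\u03ac\u03c0\u03b7", "love"),  # Greek key, sorts after Latin keys
]
result = merge_and_sort(entries)
keys = [k for k, _ in result]

# Python compares str values by code point, so this checks that byte order
# and code point order agree for these keys.
assert keys == sorted(keys)
```

Each key appears once in the result, and the keys come out in the order a byte-wise comparison would produce.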

Note: Some characters can be composed or decomposed. The tei2mod importer will normalize to NFC using ICU. Any program that does not normalize its search request in exactly the same way cannot be expected to find entries.
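To show why the lookup key has to be normalized the same way as the stored keys, here is a small Python sketch. It uses the standard unicodedata module rather than ICU, and the helper names nfc_bytes and find_entry are hypothetical; it does not reproduce SWORD's actual search code.

```python
import bisect
import unicodedata


def nfc_bytes(key):
    """Normalize a key to NFC and encode it as UTF-8."""
    return unicodedata.normalize("NFC", key).encode("utf-8")


def find_entry(sorted_keys, query):
    """Binary search over NFC UTF-8 keys; returns the index or None."""
    needle = nfc_bytes(query)
    i = bisect.bisect_left(sorted_keys, needle)
    if i < len(sorted_keys) and sorted_keys[i] == needle:
        return i
    return None


# Stored keys are normalized to NFC and sorted byte-wise.
index = sorted(nfc_bytes(k) for k in ["caf\u00e9", "cafe", "zebra"])

# The same word typed with a combining acute accent (decomposed form).
decomposed = "cafe\u0301"

found = find_entry(index, decomposed)              # normalized lookup
raw_found = decomposed.encode("utf-8") in index    # unnormalized lookup
```

With normalization the decomposed query matches the stored composed key; without it, the raw bytes differ and the entry is not found.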

Markup

Linking Entries

There are three kinds of references in a dictionary:

Internal references to entries in the same dictionary.

<xr type="see">See: <ref target="self:key">key text</ref></xr>

Note: the key text may be any representation of the key. For example, the key may be G0019a and the key text may be 19a.

External references to entries in another work.

<xr type="xref"><ref target="work:key">key text</ref></xr>

Note: work is the short module name as found between square brackets as in [Strong]

Biblical references to scripture passages. These use osisRef and are discussed below.
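As an illustration of the target syntax used by the first two kinds of reference, the following Python sketch splits a target into a module name and a key. The function name split_target and the idea of passing the current module name as a parameter are assumptions for this example, not part of SWORD.

```python
def split_target(target, current_module):
    """Split a <ref> target of the form "work:key" into (module, key).

    "self" as the work refers to the dictionary that contains the reference.
    Only the first colon separates work from key.
    """
    work, sep, key = target.partition(":")
    if not sep or not key:
        raise ValueError("expected 'work:key', got %r" % target)
    module = current_module if work == "self" else work
    return module, key


internal = split_target("self:G0019a", "MyLexicon")
external = split_target("Strong:G0019a", "MyLexicon")
```

Here "MyLexicon" is a made-up module name standing in for the dictionary being processed.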

OSIS References

For the purpose of facilitating the marking of Bible references and linking with OSIS documents, two attributes have been borrowed from OSIS:

osisID exists on virtually all elements and contains osisID(s) (optionally with work IDs). An osisID might be used to link from an OSIS document to your TEI dictionary entry.

osisRef exists on the <ref> element. A biblical reference occurring in an entry might be marked as:

<xr type="Bible"><ref osisRef="KJV:Gen.1.5-Gen.1.8">Genesis 1:5-8</ref></xr>
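The structure of an osisRef value such as the one above can be sketched in Python as follows. This is a simplified reading that handles only an optional work prefix and a single reference or range; the names OsisRef and parse_osis_ref are hypothetical, and the full OSIS syntax has more features than this covers.

```python
from typing import NamedTuple, Optional


class OsisRef(NamedTuple):
    work: Optional[str]  # e.g. "KJV", or None when no work prefix is given
    start: str           # first osisID of the range
    end: str             # last osisID (equal to start for a single reference)


def parse_osis_ref(value):
    """Split "work:start-end" into its parts (simplified sketch)."""
    work = None
    if ":" in value:
        work, value = value.split(":", 1)
    start, _, end = value.partition("-")
    return OsisRef(work, start, end or start)


ranged = parse_osis_ref("KJV:Gen.1.5-Gen.1.8")
single = parse_osis_ref("Gen.1.5")
```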

Rendering Instructions

TEI is focused on semantic markup, but supports rendering instructions on most elements via the rend attribute. The rend attribute contains a description (or recommendation) of how the enclosed text should be rendered. In addition, if a segment of text should be marked as primarily significant because of its differentiated rendering, it may be marked by either the <emph> or <hi> elements. <emph> indicates that the text is emphasized, whereas the more general and semantically neutral <hi> element simply acts as a place to hang the rend attribute:

I was <emph rend="italic">extremely</emph> excited by the new TEI filters. Yay.
This text is <hi rend="bold">bold</hi>

The rend attribute may contain a list of values, but these values are not specified by the TEI P5 specification itself and so are not standardized. As such, for the purpose of interoperability and consistency, it is important that the values used in SWORD be enumerated. Some of these values come from the set of allowed values on the type attribute of OSIS <hi> elements, which in turn borrows from CSS. Other values will generally borrow from CSS conventions.

CrossWire values for the TEI P5 rend attribute:

bold          'from OSIS'
illuminated   'from OSIS; an illuminated letter or drop-cap, rendered very large, preferably across multiple subsequent lines of text'
italic        'from OSIS'
line-through  'from OSIS; used for strike-through text'
normal        'from OSIS; used to switch off special rendering while in the midst of a string of special rendering'
small-caps    'from OSIS'
sub           'from OSIS; subscript text'
super         'from OSIS; superscript text'
underline     'from OSIS'
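A document author could check rend attributes against this list with a few lines of Python. The sketch below assumes that multiple values are separated by whitespace; the names CROSSWIRE_REND and unknown_rend_values are invented for this example.

```python
# The CrossWire rend values enumerated in the list above.
CROSSWIRE_REND = frozenset({
    "bold", "illuminated", "italic", "line-through", "normal",
    "small-caps", "sub", "super", "underline",
})


def unknown_rend_values(rend):
    """Return the values in a rend attribute that are not in the list.

    Assumes the attribute holds whitespace-separated values.
    """
    return [value for value in rend.split() if value not in CROSSWIRE_REND]


ok = unknown_rend_values("bold italic")
bad = unknown_rend_values("bold blink")
```

An empty result means every value in the attribute is one of the enumerated CrossWire values.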

Validation

Use of the following XML header will assist in automated validation with most XML Schema Validators:

<?xml version="1.0" encoding="utf-8"?>
<TEI xmlns="http://www.crosswire.org/2008/TEIOSIS/namespace"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.crosswire.org/2008/TEIOSIS/namespace
       http://www.crosswire.org/osis/teiP5osis.1.4.xsd">
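Before running a full schema validator, it can be useful to confirm that a document's root element actually carries this namespace and schema location. The Python sketch below does only that quick check with the standard library; it does not validate against the schema, and the function name check_header and the empty sample document are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.crosswire.org/2008/TEIOSIS/namespace"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def check_header(xml_bytes):
    """Return (root_is_tei, schema_url) for a TEI document.

    schema_url is the location paired with the TEI namespace in
    xsi:schemaLocation, or None if no such pair is present.
    """
    root = ET.fromstring(xml_bytes)
    root_is_tei = root.tag == "{%s}TEI" % TEI_NS
    parts = root.get("{%s}schemaLocation" % XSI_NS, "").split()
    # schemaLocation is a list of namespace/location pairs.
    pairs = dict(zip(parts[0::2], parts[1::2]))
    return root_is_tei, pairs.get(TEI_NS)


# A deliberately empty document that uses the header shown above.
sample = b"""<?xml version="1.0" encoding="utf-8"?>
<TEI xmlns="http://www.crosswire.org/2008/TEIOSIS/namespace"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.crosswire.org/2008/TEIOSIS/namespace
       http://www.crosswire.org/osis/teiP5osis.1.4.xsd">
</TEI>"""

root_is_tei, schema_url = check_header(sample)
```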

Importing

The program tei2mod is used to create a SWORD Lexicon/Dictionary/Daily Devotional/Glossary module from valid TEI.

This is a work in progress, so please report any problems found.

The current usage is:

usage: ./tei2mod <output/path> <teiDoc> [OPTIONS]
  -z                     use ZIP compression (default no compression)
  -Z                     use LZSS compression (default no compression)
  -s <2|4>               max text size per entry (default 4):
  -c <cipher_key>        encipher module using supplied key
                                 (default no enciphering)
  -N                     Do not convert UTF-8 or normalize UTF-8 to NFC
                                 (default is to convert to UTF-8, if needed, and then normalize to NFC)
                                 Note: all UTF-8 texts should be normalized to NFC
-z, -Z, and -s are mutually exclusive

At this time, enciphering does not work.

Currently, all compressed SWORD modules use ZIP compression.
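When tei2mod is called from a build script, the option rules in the usage text can be enforced before the tool is invoked. The following Python sketch only builds an argument list from the options shown above and does not run anything; the function name tei2mod_args and its parameter names are assumptions for illustration.

```python
def tei2mod_args(output_path, tei_doc, zip_=False, lzss=False,
                 entry_size=None, cipher_key=None, normalize=True):
    """Build a tei2mod argument list from the options in the usage text."""
    # The usage text states that -z, -Z, and -s are mutually exclusive.
    if sum([zip_, lzss, entry_size is not None]) > 1:
        raise ValueError("-z, -Z, and -s are mutually exclusive")
    if entry_size is not None and entry_size not in (2, 4):
        raise ValueError("-s accepts 2 or 4")

    args = ["tei2mod", output_path, tei_doc]
    if zip_:
        args.append("-z")
    if lzss:
        args.append("-Z")
    if entry_size is not None:
        args += ["-s", str(entry_size)]
    if cipher_key is not None:
        args += ["-c", cipher_key]
    if not normalize:
        args.append("-N")  # skip UTF-8 conversion and NFC normalization
    return args


zip_args = tei2mod_args("modules/mydict/", "mydict.tei.xml", zip_=True)
```

The paths in the example call are placeholders. The resulting list could be passed to a process launcher such as subprocess.run once tei2mod is installed.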

Alternatively, imp2ld can be used. At this time, however, it is not recommended, because it does not normalize UTF-8 text to NFC.

Note: Having keys in the proper order may noticeably improve import time and may affect the module's lookup performance for "adjacent" entries.

Other helpful sites

The TEI Guidelines and, in particular, the documentation for <entryFree>, on which most dictionaries should be based.