Difference between revisions of "TEI Dictionaries"

From CrossWire Bible Society

Revision as of 19:26, 25 January 2012

Introduction

In SWORD, the "LD" module type is used to contain modules keyed to non-hierarchical keys. Such modules include: dictionaries (indexed by words or numbers), glossaries (simple one-word translation dictionaries), and daily devotionals (indexed by dates).

For lexicons and dictionaries, the use of Text Encoding Initiative (TEI, http://www.tei-c.org/) P5 markup is encouraged. TEI P5 is an XML standard, quite similar to OSIS and ThML, intended for encoding all types of electronic documents. Since TEI is modular, it is possible to ignore the majority of its modules and use only the smaller set of tags that we need.

For the purpose of using TEI P5 in SWORD, we have developed a special XML Schema that includes the basic set of P5 modules necessary for dictionaries and adds osisID and osisRef attributes (with their normal OSIS syntax) to many elements. This permits cross-referencing with OSIS modules and the use of standard Bible references in TEI documents. Our customized TEI schema is available at http://www.crosswire.org/osis/teiP5osis.1.4.xsd.

Order

SWORD uses a strict binary search to find entries and nearest entries in an "LD" module. This places two restrictions on a TEI dictionary:

1) Keys are unique and cannot be repeated. This poses problems with some dictionaries that have more than one entry for a key. Such entries will need to be merged.

2) Entries will be re-ordered (by the importer utility) based upon their Unicode code points. Fortunately, for UTF-8 text this turns out to be identical to a simple byte-by-byte (8-bit) comparison of the encoded keys.
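The two restrictions above can be sketched in Python. This is only an illustration, not the importer's code: the function name prepare_entries, the sample data, and the merge policy (joining the texts with a blank line) are assumptions made for the example.

```python
def prepare_entries(entries):
    """Merge entries that share a key, then sort by UTF-8 byte sequence.

    `entries` is an iterable of (key, text) pairs. Joining duplicate
    entries with a blank line is an assumed policy for illustration.
    """
    merged = {}
    for key, text in entries:
        if key in merged:
            merged[key] = merged[key] + "\n\n" + text
        else:
            merged[key] = text
    # Comparing the encoded bytes gives the same order as comparing
    # code points, because UTF-8 preserves code point order.
    return sorted(merged.items(), key=lambda item: item[0].encode("utf-8"))


sample = [
    ("beta", "second"),
    ("alpha", "first"),
    ("beta", "another sense"),
    ("\u03b1", "Greek alpha"),
]
prepared = prepare_entries(sample)
keys = [key for key, _ in prepared]
```

After this step every key occurs once, and a binary search over the keys behaves consistently whether it compares bytes or code points.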

Note: Some characters can be composed or decomposed. The tei2mod importer will normalize to NFC using ICU. Any program that does not normalize its search request in exactly the same way cannot be expected to find entries.
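A front-end that looks up keys would need to apply the same normalization. The sketch below uses Python's standard unicodedata module as a stand-in for ICU, which is what tei2mod uses; the helper name normalize_key and the sample index are hypothetical.

```python
import unicodedata


def normalize_key(key):
    """Return the key in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", key)


decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "caf\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE

# A toy index standing in for a module whose keys were stored in NFC.
index = {normalize_key(composed): "entry text"}

found_raw = decomposed in index                        # lookup without normalizing
found_normalized = normalize_key(decomposed) in index  # lookup after normalizing
```

The two strings look identical on screen, but only the normalized lookup matches the stored key.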

Markup

Linking Entries

There are three kinds of references in a dictionary:

Internal references to entries in the same dictionary.

<xr type="see">See: <ref target="self:key">key text</ref></xr>

Note: the key text may be any representation of the key. For example, the key may be G0019a and the key text may be 19a.

External references to entries in another work.

<xr type="xref"><ref target="work:key">key text</ref></xr>

Note: work is the short module name, as found between square brackets, for example [Strong].

Biblical references to scripture passages. These use osisRef and are discussed below.
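As a rough illustration of how a program might tell the first two kinds of reference apart, the following Python sketch splits the target attribute at its first colon. The function names, the <sense> wrapper element, and the sample keys are assumptions made for the example; this is not how any SWORD filter is known to be implemented.

```python
import xml.etree.ElementTree as ET


def split_target(target):
    """Split a 'work:key' target into (work, key) at the first colon."""
    work, sep, key = target.partition(":")
    if not sep:
        raise ValueError("target has no 'work:' prefix: %r" % target)
    return work, key


def classify_refs(fragment):
    """Return (kind, work, key, label) for each <ref target="..."> found."""
    root = ET.fromstring(fragment)
    results = []
    for ref in root.iter("ref"):
        target = ref.get("target")
        if target is None:
            continue  # for example, a Bible reference that uses osisRef
        work, key = split_target(target)
        kind = "internal" if work == "self" else "external"
        results.append((kind, work, key, ref.text))
    return results


sample = (
    "<sense>"
    '<xr type="see">See: <ref target="self:G0019a">19a</ref></xr>'
    '<xr type="xref"><ref target="Strong:G0018">18</ref></xr>'
    "</sense>"
)
refs = classify_refs(sample)
```

Here the key (G0019a) and the displayed key text (19a) differ, as the note on internal references allows.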

OSIS References

For the purpose of facilitating the marking of Bible references and linking with OSIS documents, two attributes have been borrowed from OSIS:

osisID exists on virtually all elements and contains osisID(s) (optionally with work IDs). An osisID might be used to link from an OSIS document to your TEI dictionary entry.

osisRef exists on the <ref> element. A biblical reference occurring in an entry might be marked as:

<xr type="Bible"><ref osisRef="KJV:Gen.1.5-Gen.1.8">Genesis 1:5-8</ref></xr>
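The structure of such an osisRef value can be illustrated with a small Python sketch. It handles only the simple shapes seen here (an optional work prefix, a single verse, or a range of two verses); the full OSIS reference syntax is richer, and the function names are hypothetical.

```python
def parse_osis_ref(osis_ref):
    """Split an osisRef into (work, start, end).

    Covers 'Book.C.V' and 'Book.C.V-Book.C.V' with an optional 'Work:'
    prefix. For a single verse, start and end are the same.
    """
    work = None
    if ":" in osis_ref:
        work, osis_ref = osis_ref.split(":", 1)
    start, sep, end = osis_ref.partition("-")
    return work, start, (end if sep else start)


def split_verse(osis_id):
    """Split 'Gen.1.5' into ('Gen', 1, 5)."""
    book, chapter, verse = osis_id.split(".")
    return book, int(chapter), int(verse)


work, start, end = parse_osis_ref("KJV:Gen.1.5-Gen.1.8")
first_verse = split_verse(start)
last_verse = split_verse(end)
```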

Rendering Instructions

TEI is focused on semantic markup, but supports rendering instructions on most elements via the rend attribute. The rend attribute contains a description (or recommendation) of how the enclosed text should be rendered. In addition, if a segment of text is significant primarily because of its distinct rendering, it may be marked by either the <emph> or <hi> elements. <emph> indicates that the text is emphasized, whereas the more general and semantically neutral <hi> element simply acts as a place to hang the rend attribute:

I was <emph rend="italic">extremely</emph> excited by the new TEI filters. Yay.
This text is <hi rend="bold">bold</hi>

The rend attribute may contain a list of values, but these values are not specified by the TEI P5 specification itself. As such, for the purpose of interoperability and consistency, it is important that the values for use in SWORD be enumerated. Some of these values come from the set of allowed values on the type attribute of OSIS <hi> elements, which in turn borrows from CSS. Other values will generally borrow from CSS conventions.

CrossWire values for the TEI P5 rend attribute:

bold          'from OSIS'
illuminated   'from OSIS; an illuminated letter or drop-cap, rendered very large, preferably across multiple subsequent lines of text'
italic        'from OSIS'
line-through  'from OSIS; used for strike-through text'
normal        'from OSIS; used to switch off special rendering while in the midst of a string of special rendering'
small-caps    'from OSIS'
sub           'from OSIS; subscript text'
super         'from OSIS; superscript text'
underline     'from OSIS'
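A module author could check rend values against this list before importing. The Python sketch below is one way to do so; it assumes that multiple values in a rend attribute are separated by whitespace, and the names used are hypothetical.

```python
# The CrossWire values for the TEI P5 rend attribute, as listed above.
CROSSWIRE_REND_VALUES = frozenset([
    "bold", "illuminated", "italic", "line-through", "normal",
    "small-caps", "sub", "super", "underline",
])


def unknown_rend_values(rend):
    """Return the tokens of a rend attribute that are not in the list.

    Assumes the tokens are whitespace-separated.
    """
    return [token for token in rend.split()
            if token not in CROSSWIRE_REND_VALUES]


ok = unknown_rend_values("bold italic")
bad = unknown_rend_values("bold blink")
```

An empty result means every token is one of the enumerated values.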

Validation

Use of the following XML header will assist in automated validation with most XML Schema Validators:

 <?xml version="1.0" encoding="utf-8"?>
 <TEI xmlns="http://www.crosswire.org/2008/TEIOSIS/namespace"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.crosswire.org/2008/TEIOSIS/namespace
        http://www.crosswire.org/osis/teiP5osis.1.4.xsd">
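Before handing a file to a schema validator, a quick sanity check of the header can catch a wrong or missing namespace. The Python sketch below uses only the standard library, so it tests well-formedness and the root element but does not perform XML Schema validation. The function name and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEIOSIS_NS = "http://www.crosswire.org/2008/TEIOSIS/namespace"


def has_expected_root(xml_bytes):
    """Return True if the document parses and its root is TEI in TEIOSIS_NS."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False  # not well-formed XML
    return root.tag == "{%s}TEI" % TEIOSIS_NS


# A minimal document that uses the header shown above (body omitted).
sample = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<TEI xmlns="http://www.crosswire.org/2008/TEIOSIS/namespace"\n'
    b'     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n'
    b'     xsi:schemaLocation="http://www.crosswire.org/2008/TEIOSIS/namespace\n'
    b'      http://www.crosswire.org/osis/teiP5osis.1.4.xsd">\n'
    b'</TEI>\n'
)
header_ok = has_expected_root(sample)
```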

Importing

The program tei2mod is used to create a SWORD Lexicon/Dictionary/Daily Devotional/Glossary module from valid TEI.

This is a work in progress, so please report any issues at http://www.crosswire.org/bugs/browse/API

The current usage is:

You are running utils\tei2mod: $Rev: 2138 $
TEI Lexicon/Dictionary/Daily Devotional/Glossary module creation tool for
	The SWORD Project

usage: utils\tei2mod <output/path> <teiDoc> [OPTIONS]
  -z			 use ZIP compression (default no compression)
  -Z			 use LZSS compression (default no compression)
  -s <2|4>		 max text size per entry(default 4):
  -c <cipher_key>	 encipher module using supplied key
				 (default no enciphering)
  -N			 Do not convert UTF-8 or normalize UTF-8 to NFC
				 (default is to convert to UTF-8, if needed,
				  and then normalize to NFC. Note: all UTF-8
				  texts should be normalized to NFC.)

	The options -z, -Z, and -s are mutually exclusive.
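For scripted builds, the argument list can be assembled from the usage text above. The following Python sketch only builds the command and does not run it; the function name, its parameters, and the sample paths are hypothetical, and the -c option is left out.

```python
def build_tei2mod_command(output_path, tei_doc, compression=None,
                          entry_size=None, normalize=True,
                          program="tei2mod"):
    """Return the argument list for a tei2mod run.

    compression: None, "zip" (-z) or "lzss" (-Z)
    entry_size:  None, 2 or 4 (-s)
    normalize:   False adds -N (skip UTF-8 conversion and NFC)
    """
    if compression is not None and entry_size is not None:
        # The usage text says -z, -Z and -s are mutually exclusive.
        raise ValueError("-z, -Z and -s are mutually exclusive")
    args = [program, output_path, tei_doc]
    if compression == "zip":
        args.append("-z")
    elif compression == "lzss":
        args.append("-Z")
    elif compression is not None:
        raise ValueError("unknown compression: %r" % compression)
    if entry_size is not None:
        if entry_size not in (2, 4):
            raise ValueError("entry_size must be 2 or 4")
        args.extend(["-s", str(entry_size)])
    if not normalize:
        args.append("-N")
    return args


command = build_tei2mod_command("modules/lexdict/mydict", "mydict.tei.xml",
                                compression="zip")
```

The resulting list can be passed to subprocess.run(command, check=True) on a system where tei2mod is installed.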

Alternatively, imp2ld can be used. Usage:

imp2ld 1.0 Lexicon/Dictionary/Daily Devotional/Glossary module creation tool for the SWORD Project
  usage:
   utils\imp2ld <filename> [modname] [ 4 (default) | 2 | z - module driver] [entries per compression block]

Notes:

  1. At this time, enciphering does not work for tei2mod.
  2. Currently, all compressed SWORD modules use ZIP compression.
  3. Having keys in the proper order may noticeably improve import time and may also affect the module's lookup performance for "adjacent" entries.
  4. At this time, imp2ld is not recommended, as it does not convert to NFC UTF-8.

Other helpful sites

The TEI Guidelines and, in particular, the documentation for <entryFree>, on which most dictionaries should be based.

Sample TEI P5 documents