TEI Dictionaries

From CrossWire Bible Society
Revision as of 11:03, 11 April 2019 by Refdoc (talk | contribs) (Body)

Introduction

In SWORD, the "LD" module type is used to contain modules keyed to non-hierarchical keys. Such modules include: dictionaries (indexed by words or numbers), glossaries (simple one-word translation dictionaries), and daily devotionals (indexed by dates).

For lexicons and dictionaries, the use of Text Encoding Initiative P5 markup is encouraged. TEI P5 is an XML standard, quite similar to OSIS and ThML, intended for encoding all types of electronic documents. Since TEI is modular, it is possible to ignore the majority of its modules and use only a smaller set of tags necessary to our needs.

For the purpose of using TEI P5 in SWORD, we have developed a special XML Schema that includes the basic set of P5 modules necessary for dictionaries and adds osisID and osisRef attributes (with their normal OSIS syntax) to many elements. This permits cross-referencing with OSIS modules and the use of standard Bible references in TEI documents. Our customized TEI schema is available at http://www.crosswire.org/OSIS/teiP5osis.2.5.0.xsd.

Order

SWORD uses a strict binary search to find entries and nearest entries in an "LD" module. This places two restrictions upon a TEI dictionary:

1) Keys are unique and cannot be repeated. This poses problems with some dictionaries that have more than 1 entry for a key. These will need to be merged.

2) Entries will be re-ordered (by the importer utility) based upon their UTF-8 code points. Fortunately, this order is identical to a simple byte-wise (8-bit) comparison of the UTF-8 encoded keys.

Note: Some characters can be composed or decomposed. The tei2mod importer will normalize to NFC using ICU. Any program that does not normalize its search request in exactly the same way cannot be expected to find entries.
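
The ordering and normalization rules above can be sketched in a few lines of Python. This is only an illustration, not the tei2mod implementation: it uses Python's standard unicodedata module rather than ICU, and the function names normalize_key and sort_keys are hypothetical.

```python
import unicodedata


def normalize_key(key: str) -> str:
    """Return the key in Unicode NFC form (assumed to match the importer's NFC step)."""
    return unicodedata.normalize("NFC", key)


def sort_keys(keys):
    """Normalize keys to NFC, then order them by their UTF-8 byte sequence."""
    normalized = [normalize_key(k) for k in keys]
    return sorted(normalized, key=lambda k: k.encode("utf-8"))


if __name__ == "__main__":
    decomposed = "e\u0301"  # 'e' followed by a combining acute accent
    composed = "\u00e9"     # the precomposed character
    print(normalize_key(decomposed) == composed)
    print(sort_keys(["ABADDON", "aaron", "AARON"]))
```

A front-end that applies the same normalization to a search request before looking it up would compare like with like; note that upper-case ASCII keys sort before lower-case ones under this ordering.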

General Structure

To produce a Lexicon/Dictionary with our customized TEI Schema, you can use this template:

<?xml version="1.0" encoding="utf-8"?>
 <TEI xmlns="http://www.crosswire.org/2013/TEIOSIS/namespace"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.crosswire.org/2013/TEIOSIS/namespace
                          http://www.crosswire.org/OSIS/teiP5osis.2.5.0.xsd">

Header

A header minimally contains information about the title of the electronic text in <titleStmt>, about its publication in <publicationStmt>, and bibliographic information about the source document from which it is derived in <sourceDesc>. A minimal TEI header would look as follows:

<teiHeader>
 <fileDesc>
  <titleStmt>
   <title>A title statement about the electronic text</title>
  </titleStmt>
  <publicationStmt>
   <p>Information on the publication of the electronic text.</p>
  </publicationStmt>
  <sourceDesc>
   <p>A bibliographic description of the source for the electronic text.</p>
  </sourceDesc>
 </fileDesc>
</teiHeader>

Body

Here is the general structure of the body content:

<text>
 <body>
  <entryFree sortKey="AARON">
   <def>The son of Amram and Jochebed, and the older brother of Moses and Miriam.</def>
  </entryFree>
  <entryFree sortKey="ABADDON">
   <def>Destroyer, the name given to the king of the hosts represented by the locusts.</def>
  </entryFree>
  ....
 </body>
</text>

1. sortKeys are unique and cannot be repeated.

2. The container for a dictionary element in TEI is <entry> for structured entries, <entryFree> for entries that allow any child elements, and <superEntry> to collect entries into a larger one.

  • <entry> is strongly structured, more like a database row definition. Elements must appear in a particular order with particular nesting, whitespace is insignificant, and text may not be allowed in places where one would want it. <entry> seems more appropriate for original works.
  • <entryFree> is weakly structured, more like a document. Elements can come in any order and nest in any fashion, and text can be interspersed as desired, so whitespace is significant. <entryFree> is therefore highly recommended for transforming e-texts into TEI.
  • <superEntry> can be used as a collector for several <entry> or <entryFree>. The tei2mod module creator will create an entry for the <superEntry> and one for each <entry> and <entryFree> in it.

3. The engine is designed to support multiple keys in a single dictionary module, so an entry such as <entryFree n="ἀγαπάω|agapaō|G25"> would be feasible. A user could then look up the same word in different languages without having to switch dictionaries.

An example:

<entryFree n="H0002|אב">
 <title>H2</title>
 <orth>אב</orth>
 <orth type="trans" rend="bold">'ab</orth>
 <pron rend="italic">ab</pron><lb/>
 <def>(Chaldee); corresponding to <ref target="Strong:H0001">H1</ref>: - father.</def>
</entryFree>

And here is a more complicated example from our Webster1913 module:

<entryFree sortKey="A" split="A|A per se">
<form type="headword"><orth rend="bold">A</orth></form> <pron>(<hi rend="italic">named ā in the English, and most commonly ä in other languages</hi>)</pron>. <def>The first letter of the English and of many other alphabets. The capital A of the alphabets of Middle and Western Europe, as also the small letter (a), besides the forms in Italic, black letter, etc., are all descended from the old Latin A, which was borrowed from the Greek <name type="biologicalSpecies" rend="italic"><ref target="alpha">Alpha</ref></name>, of the same form; and this was made from the first letter (𐤀) of the Phœnician alphabet, the equivalent of the Hebrew <hi rend="italic">Aleph</hi>, and itself from the Egyptian origin. The <hi rend="italic">Aleph</hi> was a consonant letter, with a guttural breath sound that was not an element of Greek articulation; and the Greeks took it to represent their vowel <hi rend="italic">Alpha</hi> with the ä sound, the Phœnician alphabet having no vowel symbols.</def><lb/>
This letter, in English, is used for several different vowel sounds. See <hi rend="italic">Guide to pronunciation</hi>, §§ 43-74. The regular long <hi rend="italic">a</hi>, as in <hi rend="italic">fate</hi>, etc., is a comparatively modern sound, and has taken the place of what, till about the early part of the 17th century, was a sound of the quality of ä (as in <hi rend="italic">far</hi>).<lb/>
<sense n="2"><num type="sense">2.</num> <seg type="specialization" rend="italic">(Mus.)</seg> <def>The name of the sixth tone in the model major scale (that in C), or the first tone of the minor scale, which is named after it the scale in A minor. The second string of the violin is tuned to the A in the treble staff. — A sharp (A♯) is the name of a musical tone intermediate between A and B. — A flat (A♭) is the name of a tone intermediate between A and G.</def></sense><lb/>
<re type="colloc" rend="smaller"><term type="colloc" rend="bold smaller">A per se</term> <etym>(L. <oVar rend="italic">per se</oVar> by itself)</etym>, <def rend="narrow-spacing">one preëminent; a nonesuch.</def> <usg rend="italic"></usg></re><lb/>
<cit type="quotation"><quote>O fair Creseide, the flower and <oVar rend="italic">A per se</oVar><lb/>
Of Troy and Greece.<lb/>
<hi rend="text-align(right)"><persName type="author" rend="italic">Chaucer.</persName></hi></quote></cit><lb/>
</entryFree>
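
Because keys must be unique, it can help to check a TEI source for repeated keys before importing it. The following is a minimal sketch using Python's standard library. It assumes that the key is taken from the n attribute (split on "|" for multiple keys) or, failing that, from sortKey, since both appear in the examples on this page; the helper names are hypothetical and this is not how tei2mod itself reads keys.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Element names that hold dictionary entries, as listed in the Body section.
ENTRY_TAGS = {"entry", "entryFree", "superEntry"}


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def entry_keys(elem):
    """Return the keys of one entry: n (split on '|') or, if n is absent, sortKey."""
    raw = elem.get("n") or elem.get("sortKey") or ""
    return [key for key in raw.split("|") if key]


def duplicate_keys(tei_xml: str):
    """Return a sorted list of keys that occur more than once in the document."""
    root = ET.fromstring(tei_xml)
    counts = Counter(
        key
        for elem in root.iter()
        if local_name(elem.tag) in ENTRY_TAGS
        for key in entry_keys(elem)
    )
    return sorted(key for key, count in counts.items() if count > 1)
```

Any key reported by duplicate_keys points at entries that would need to be merged by hand before import.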

Markup

Linking Entries

There are three kinds of references in a dictionary:

Internal references to entries in the same dictionary.

<xr type="see">See: <ref target="self:key">key text</ref></xr>

Note: the key text may be any representation of the key. For example, the key may be G0019a and the key text may be 19a.

External references to entries in another work.

<xr type="xref"><ref target="work:key">key text</ref></xr>

Note: work is the short module name, as found between square brackets in the module's configuration, for example [Strong].

Biblical references to scripture passages. These use osisRef and are discussed below.

OSIS References

For the purpose of facilitating the marking of Bible references and linking with OSIS documents, two attributes have been borrowed from OSIS:

osisID exists on virtually all elements and contains osisID(s) (optionally with work IDs). An osisID might be used to link from an OSIS document to your TEI dictionary entry.

osisRef exists on the <ref> element. A biblical reference occurring in an entry might be marked as:

<xr type="Bible"><ref osisRef="KJV:Gen.1.5-Gen.1.8">Genesis 1:5-8</ref></xr>
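
As an illustration of the three kinds of references, the sketch below classifies a <ref> element by its attributes. It assumes the conventions described above (a self: prefix for internal targets, a work: prefix for external targets, and osisRef for Bible references); classify_ref is a hypothetical helper, not part of any SWORD tool, and it ignores XML namespaces for brevity.

```python
import xml.etree.ElementTree as ET


def classify_ref(ref):
    """Return (kind, work, key) for a <ref> element.

    kind is 'bible', 'internal', 'external', or 'unknown';
    work is None when no work prefix applies.
    """
    osis_ref = ref.get("osisRef")
    if osis_ref is not None:
        # An osisRef may carry an optional work prefix, e.g. "KJV:Gen.1.5".
        work, sep, passage = osis_ref.partition(":")
        return ("bible", work, passage) if sep else ("bible", None, osis_ref)

    target = ref.get("target", "")
    work, sep, key = target.partition(":")
    if not sep:
        # No "work:" prefix at all, e.g. target="alpha".
        return ("unknown", None, target)
    if work == "self":
        return ("internal", None, key)
    return ("external", work, key)


if __name__ == "__main__":
    ref = ET.fromstring('<ref target="Strong:H0001">H1</ref>')
    print(classify_ref(ref))
```

Targets without any prefix, such as target="alpha" in the Webster1913 example, are reported as 'unknown' here; how a front-end resolves them is outside the scope of this sketch.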

Rendering Instructions

TEI is focused on semantic markup, but supports rendering instructions on most elements via the rend attribute. This attribute contains a description (or recommendation) of how the enclosed text should be rendered. In addition, if a segment of text should be marked as primarily significant because of its differentiated rendering, it may be marked by either the <emph> or <hi> elements. <emph> indicates that the text is emphasized, whereas the more general and semantically neutral <hi> element simply acts as a place to hang the rend attribute:

I was <emph rend="italic">extremely</emph> excited by the new TEI filters. Yay.
This text is <hi rend="bold">bold</hi>

The rend attribute may contain a list of values, but these values are not specified by the TEI P5 specification itself. For the sake of interoperability and consistency, it is therefore important that the values used in SWORD be enumerated. Some of these values come from the set of allowed values on the type attribute of OSIS <hi> elements, which in turn borrows from CSS. Other values will generally borrow from CSS conventions.

CrossWire values for the TEI P5 rend attribute:

bold          from OSIS
illuminated   from OSIS; an illuminated letter or drop-cap, rendered very large, preferably across multiple subsequent lines of text
italic        from OSIS
line-through  from OSIS; used for strike-through text
normal        from OSIS; used to switch off special rendering while in the midst of a string of special rendering
small-caps    from OSIS
sub           from OSIS; subscript text
super         from OSIS; superscript text
underline     from OSIS
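
A quick way to see which rend values in a document fall outside this list is sketched below. It assumes that rend holds a whitespace-separated list of values; the set simply mirrors the list above, and unknown_rend_values is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The CrossWire rend values enumerated on this page.
CROSSWIRE_REND_VALUES = {
    "bold", "illuminated", "italic", "line-through", "normal",
    "small-caps", "sub", "super", "underline",
}


def unknown_rend_values(xml_text: str):
    """Return a sorted list of rend values that are not in the CrossWire list."""
    root = ET.fromstring(xml_text)
    unknown = set()
    for elem in root.iter():
        # A missing rend attribute yields an empty string, hence no values.
        for value in elem.get("rend", "").split():
            if value not in CROSSWIRE_REND_VALUES:
                unknown.add(value)
    return sorted(unknown)
```

Run against the Webster1913 entry shown earlier, such a check would flag values like smaller and narrow-spacing, which are not in the list above and so may not be rendered consistently.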

Images

You can easily place images in your TEI file using the <graphic /> element. This element is "milestoned," meaning it isn't a container. The forward slash near the end signals that fact. Use the "url" attribute to define the location of the image relative to the compiled module. In the example below, the image "crosswire.jpg" resides in a folder "images" in the same folder as the compiled module. (SVN version)

<graphic url="images/crosswire.jpg" />

Tables

Tables require a bit of work to set up but can be useful for some purposes. The entire table is contained in a <table> element, and each row is then contained in a <row> element. For each column in each row, a <cell> element contains the text of that cell. The following table creates column labels in bold type and includes two columns and two rows below the label row. (SVN version)

<table>
   <row>
      <cell><hi rend="bold">Column 1 Label</hi></cell>
      <cell><hi rend="bold">Column 2 Label</hi></cell>
   </row>
   <row>
      <cell>Column 1, Row 1</cell>
      <cell>Column 2, Row 1</cell>
   </row>
   <row>
      <cell>Column 1, Row 2</cell>
      <cell>Column 2, Row 2</cell>
   </row>
</table>

Validation

Use of the above XML header will assist in automated validation with most XML Schema Validators.

See also the TEI Validator online: http://teibyexample.org/xquery/TBEvalidator.xq
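
Before running a schema validator, it is worth confirming that the file is well-formed XML at all. The sketch below uses only Python's standard library, which checks well-formedness but does not validate against the XML Schema; schema validation still requires a separate validator. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text: str):
    """Return None if xml_text parses, else a description of the first parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        line, column = exc.position  # location reported by the parser
        return f"line {line}, column {column}: {exc}"
    return None


if __name__ == "__main__":
    # The <def> element is never closed, so an error message is printed.
    print(well_formedness_error("<entryFree><def>text</entryFree>"))
```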

Importing

The program tei2mod is used to create a SWORD Lexicon/Dictionary/Daily Devotional/Glossary module from valid TEI.

This is a work in progress, so please report any problems found.

The current usage is:

You are running tei2mod:
TEI Lexicon/Dictionary/Daily Devotional/Glossary module creation tool for
	The SWORD Project

usage: tei2mod <output/path> <teiDoc> [OPTIONS]
  -z			 use ZIP compression (default no compression)
  -Z			 use LZSS compression (default no compression)
  -s <2|4>		 max text size per entry(default 4):
  -c <cipher_key>	 encipher module using supplied key
				 (default no enciphering)
  -N			 Do not convert UTF-8 or normalize UTF-8 to NFC
				 (default is to convert to UTF-8, if needed,
				  and then normalize to NFC. Note: all UTF-8
				  texts should be normalized to NFC.)

	The options -z, -Z, and -s are mutually exclusive.

Notes:

  1. At this time, enciphering does not work for tei2mod.
  2. Currently, all compressed SWORD modules use ZIP compression.
  3. Having keys in the proper order may noticeably improve import time and may affect the module's lookup performance of "adjacent" entries.
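
For scripted imports, the command line can be assembled programmatically. The sketch below uses only the options shown in the usage text above (-z, -s, and -N) and leaves out -c, since enciphering is noted as not working. The function names and the example paths are hypothetical, and tei2mod is assumed to be on the PATH.

```python
import subprocess


def tei2mod_command(output_path, tei_doc, zip_compress=False,
                    entry_size=None, normalize=True):
    """Build the argument list for tei2mod from the options in its usage text."""
    if zip_compress and entry_size is not None:
        # The usage text lists -z, -Z, and -s as mutually exclusive.
        raise ValueError("-z and -s are listed as mutually exclusive")
    if entry_size not in (None, 2, 4):
        raise ValueError("-s accepts 2 or 4")

    cmd = ["tei2mod", output_path, tei_doc]
    if zip_compress:
        cmd.append("-z")
    if entry_size is not None:
        cmd += ["-s", str(entry_size)]
    if not normalize:
        cmd.append("-N")
    return cmd


def run_tei2mod(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(cmd, check=True)
```

For example, run_tei2mod(tei2mod_command("modules/mydict/", "mydict.xml", zip_compress=True)) would invoke tei2mod with ZIP compression on a hypothetical source file.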

Troubleshooting

If a module compiles but causes a front-end to crash:

  • Double-check the entry id (@n) to make sure there is a unique id for each entry.
  • For modules keyed to Strong's numbers, ensure there is "zero-padding," meaning all Strong's numbers should be five digits: four digits plus a zero at the beginning. So "G1" should be "00001". If any numbered id is not five digits, it may cause front-ends to crash.
  • Remove all <div> elements.
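
The zero-padding rule can be applied mechanically. The sketch below pads the numeric part of a Strong's number to five digits and drops the G/H prefix by default, following the "G1" to "00001" example above. Since other examples on this page keep the prefix (for instance H0002 and G0019a), the width and keep_prefix parameters are provided as assumptions to cover both styles; pad_strongs is a hypothetical helper.

```python
import re

# Optional G/H prefix, digits, and an optional single-letter suffix (as in "G0019a").
STRONGS_PATTERN = re.compile(r"([GHgh]?)(\d+)([a-z]?)")


def pad_strongs(key: str, width: int = 5, keep_prefix: bool = False) -> str:
    """Zero-pad the numeric part of a Strong's number to the given width."""
    match = STRONGS_PATTERN.fullmatch(key)
    if match is None:
        raise ValueError(f"not a Strong's number: {key!r}")
    prefix, digits, suffix = match.groups()
    padded = digits.zfill(width) + suffix
    return prefix.upper() + padded if keep_prefix else padded


if __name__ == "__main__":
    print(pad_strongs("G1"))
    print(pad_strongs("H2", width=4, keep_prefix=True))
```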

Other helpful sites

The TEI Guidelines, and in particular the documentation for <entryFree>, on which most dictionaries should be based.

Tutorial modules accompanied with a dedicated examples section: http://teibyexample.org/TBE.htm

Sample TEI P5 documents