Whiteboard/TEI Dictionary Proposal

From CrossWire Bible Society
Revision as of 02:25, 23 April 2012 by Dowens (Talk | contribs) (Proposed Features of a New Model)


Summary

The current SWORD lexicon model supports only flat dictionaries. That suits Strong's well but not more recent lexica. A new model needs to account for page numbers, non-alphabetic ordering, and hierarchical entries. Ideally, it would also let users browse a dictionary more like a book.

Problems with the Current Model

  • Dictionaries are flat. Hierarchical dictionaries must be flattened to fit the model. BDB (forthcoming) is hierarchical: roots form super-entries, and the lexicon as a whole is not strictly alphabetical. See the example document below, which is abstracted from BDB.
  • Entries are sorted by Unicode code point. This leads to a number of problems.
    • In many languages and scripts (including some Latin scripts), sorting by Unicode code point does not preserve the proper alphabetic order.
    • If BDB were sorted in this way, the information about how words are connected through their roots would be lost.
  • There is no way to record the page numbers of the print edition that a module is based on. This information is particularly important for those doing academic work.
  • In practice, front-ends usually display lexicon entries as isolated containers. This model works well for Strong's because Strong's-tagged texts take you directly to the correct entry. It does not work for all dictionaries, though. When looking up a word, you might want to scan up and down the "page" to find the entry you are looking for. This is especially important if natural language keys are used, because the text might get you to roughly the right place in the dictionary but not necessarily to the exact entry you need. Dictionaries need to allow for fuzzy lookup.
  • Lexicon modules in effect have a single key. That means the Bible module, for example, must be marked up in exactly the same way as the lexicon module for the two to work well. Even with Strong's this creates problems. G0001 and 00001 should lead to the same place, but they do not. Strong's-tagged texts use 00001, but the Strong's TEI file uses G0001.
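
Two of the problems above can be illustrated with a short Python sketch. It is not SWORD code. The word list is made up for the illustration, and normalize_strongs is a hypothetical helper that assumes the caller knows which language (G or H) an unprefixed number belongs to.

```python
import re

# Code point order is not alphabetic order: "é" (U+00E9) has a higher
# code point than "z" (U+007A), so it sorts after every unaccented word.
words = ["zebra", "\u00e9clair", "apple"]
codepoint_sorted = sorted(words)  # ['apple', 'zebra', 'éclair']


def normalize_strongs(key, default_lang):
    """Reduce a Strong's key such as 'G0001' or '00001' to one canonical form.

    Hypothetical helper: default_lang ('G' or 'H') is used when the key
    carries no language prefix, as in Strong's-tagged Bible texts.
    """
    match = re.fullmatch(r"([GgHh]?)0*(\d+)", key.strip())
    if match is None:
        raise ValueError(f"not a Strong's key: {key!r}")
    lang = (match.group(1) or default_lang).upper()
    return f"{lang}{int(match.group(2))}"


# Both spellings now resolve to the same key.
print(normalize_strongs("G0001", "G"))  # G1
print(normalize_strongs("00001", "G"))  # G1
```

Normalizing on both the text side and the lexicon side is only one possible approach. The point is that more than one spelling of a key should be able to reach the same entry.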

Proposed Features of a New Model

  • The order of the lexicon should follow the order of the entries in the XML file used to compile the module.
  • Each entry could perhaps be given an arbitrary (numeric?) key, hidden from the user, to make life easier for developers. Topic maps could then connect corresponding entries in numbered dictionaries (Strong's) and dictionaries keyed to natural language (BDB, etc.). This could be extended over the long term.
  • Continuous scrolling would facilitate displaying page numbers and browsing entries.
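
As a rough sketch of how these features might fit together, the Python below keeps entries in document order, uses the position in that order as the hidden numeric key, and builds a simple alias table standing in for a topic map. The names ENTRIES, build_topic_map, lookup, and neighbours are assumptions made for this illustration and are not part of any existing SWORD API. The data comes from the example file in this proposal.

```python
# Entries in the order they appear in the source XML (document order).
ENTRIES = [
    {"lemma": "אב", "trans": "ab", "strong": "H3"},
    {"lemma": "אביב", "trans": "abib", "strong": "H24"},
    {"lemma": "אבד", "trans": "abd", "strong": "H8"},
    {"lemma": "אבדה", "trans": "abdh", "strong": "H9"},
    {"lemma": "אבדון", "trans": "abdwn", "strong": "H10"},
]


def build_topic_map(entries):
    """Map every alias (lemma, transliteration, Strong's number) to a hidden key.

    The hidden key is simply the entry's position in document order.
    If two entries share an alias, the first one wins.
    """
    topic_map = {}
    for hidden_key, entry in enumerate(entries):
        for alias in (entry["lemma"], entry["trans"], entry["strong"]):
            topic_map.setdefault(alias, hidden_key)
    return topic_map


def lookup(entries, topic_map, alias):
    """Return the entry for an alias, or None if the alias is unknown."""
    hidden_key = topic_map.get(alias)
    return None if hidden_key is None else entries[hidden_key]


def neighbours(entries, hidden_key, radius=1):
    """Return the entries around hidden_key, for scanning up and down the page."""
    start = max(0, hidden_key - radius)
    return entries[start:hidden_key + radius + 1]


TOPIC_MAP = build_topic_map(ENTRIES)
print(lookup(ENTRIES, TOPIC_MAP, "H24")["trans"])       # abib
print([e["trans"] for e in neighbours(ENTRIES, 1)])     # ['ab', 'abib', 'abd']

# Sorting by code point would move the second entry to the end,
# separating it from the other word under the same root.
print(sorted(e["lemma"] for e in ENTRIES) == [e["lemma"] for e in ENTRIES])  # False
```

Because the hidden key is just a position, a front-end could scroll continuously by stepping the key up or down, while users only ever see lemmas or Strong's numbers.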

Example TEI File

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.crosswire.org/2008/TEIOSIS/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.crosswire.org/2008/TEIOSIS/namespace http://www.crosswire.org/OSIS/teiP5osis.1.4.xsd">
<text>
<body>
  <pb n="1"/>
  <div1><head>א</head>
    <superEntry id="אבב" trans="abb">Entry text
      <entry id="אב" trans="ab" strong="H3">Entry text</entry>
      <entry id="אביב" trans="abib" strong="H24">Entry text</entry>
    </superEntry>
    <superEntry id="אבד" trans="abd" strong="H6">
      <pb n="2"/>
      <entry id="אבד" trans="abd" strong="H8">Entry text</entry>
      <entry id="אבדה" trans="abdh" strong="H9">Entry text</entry>
      <entry id="אבדון" trans="abdwn" strong="H10">Entry text</entry>
    </superEntry>
  </div1>
</body>
</text>
</TEI>
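
A file like the one above could be read in document order with Python's standard xml.etree.ElementTree module, as in the sketch below. This only illustrates the idea and is not a proposed module compiler. SAMPLE is an abridged copy of the example (no XML declaration or schema location), and index_entries and the row fields are names invented here. A page break is assumed to apply to everything that starts after it.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example file.
SAMPLE = """<TEI xmlns="http://www.crosswire.org/2008/TEIOSIS/namespace">
<text><body>
  <pb n="1"/>
  <div1><head>א</head>
    <superEntry id="אבב" trans="abb">Entry text
      <entry id="אב" trans="ab" strong="H3">Entry text</entry>
      <entry id="אביב" trans="abib" strong="H24">Entry text</entry>
    </superEntry>
    <superEntry id="אבד" trans="abd" strong="H6">
      <pb n="2"/>
      <entry id="אבד" trans="abd" strong="H8">Entry text</entry>
      <entry id="אבדה" trans="abdh" strong="H9">Entry text</entry>
      <entry id="אבדון" trans="abdwn" strong="H10">Entry text</entry>
    </superEntry>
  </div1>
</body></text>
</TEI>"""


def index_entries(xml_text):
    """Walk the file in document order and list every entry and super-entry.

    Each row records a sequence number, the print page on which the
    element starts, and the transliterated root it is nested under.
    """
    rows = []
    page = None  # most recent <pb n="..."/> seen so far

    def walk(elem, root):
        nonlocal page
        tag = elem.tag.split("}")[-1]  # drop the "{namespace}" prefix
        if tag == "pb":
            page = elem.get("n")
        elif tag in ("superEntry", "entry"):
            rows.append({
                "seq": len(rows),
                "kind": tag,
                "trans": elem.get("trans"),
                "strong": elem.get("strong"),
                "page": page,
                "root": root,
            })
            if tag == "superEntry":
                root = elem.get("trans")  # children belong to this root
        for child in elem:
            walk(child, root)

    walk(ET.fromstring(xml_text), None)
    return rows


for row in index_entries(SAMPLE):
    print(row["seq"], row["kind"], row["trans"], row["strong"], row["page"], row["root"])
```

In this sketch the second super-entry is recorded on page 1, because it begins before the page break, while its three entries are recorded on page 2.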