Osis2mod

From CrossWire Bible Society
Revision as of 19:23, 11 September 2008 by Dmsmith (talk | contribs) (some planned functionality now implemented.)

Introduction

osis2mod transforms an OSIS encoded Bible or commentary into a SWORD module.

History of Changes

The following lists the major changes to osis2mod in reverse chronological order. When several changes were made over the span of a few days, they are grouped under the most recent date. Bug fixes are not mentioned.

Date Feature

2008-09-11

  • Words of Christ (WoC) can be marked up naturally. osis2mod does the right thing.
  • Container elements converted to their milestoned form.

2008-02-29

  • Now supports commentaries in addition to Bibles.
  • Added UTF-8 detection and automatic conversion from cp1252 and iso8859-1 to UTF-8, making that the default behavior.
  • Added NFC normalization, making it the default
  • Now handles the full definition of an osisID and an osisRef, including workID and grain.

2007-09-27

  • Changed command-line parsing from positional parameters to flags.

2007-05-13

  • All whitespace tokens are converted into blanks and adjacent spaces are merged into one. Leading whitespace on each verse is still removed.

2007-05-01

  • Added version identification to the running of osis2mod.

2007-04-24

  • Added ignoring of unknown books

2006-07-15

  • Optimized writing to module.
  • Added simple transformations of <q> and <p> into milestones.
  • Now allow empty attributes, such as marker=""

2006-07-04

  • Updated to support OSIS 2.1
  • Added validation that verses in isolation are well formed

2005-12-22

  • Added command-line options for compress and encipher.

2005-04-29

  • Removed <verse> and </verse>

2005-01-23

  • Reverted pre-verse title handling

2004-06-12

  • Removed pre-verse title handling
  • Added <verse> and </verse>
  • Added support for milestoned version of some OSIS elements
  • Added support for verses that are not in the KJV versification by appending them to the closest prior verse.
  • Added support for osisIDs with sub-verse references (i.e. osisID grains)

2004-05-19

  • Added support for linked verses

2003-11-20

  • Added recognition for <chapter> element

2003-05-26

  • Initial version

Transformations

Osis2mod performs the following transformations:

  • Whitespace -- Allows for human-readable OSIS files.
    • Leading whitespace on books, chapters and verses is removed
    • Whitespace is normalized into blanks
    • Multiple adjacent whitespace characters are reduced to a single space
  • Unicode handling - All modules should be UTF-8, NFC.
    • Latin-1 (cp1252 or iso8859-1) is converted into UTF-8
    • UTF-8 is normalized into NFC
  • Milestone conversion - necessary for frontends to show a verse at a time.
    (note: genX is unique for an sID/eID pair, where X is a number.)
    • <q ...>...</q> is converted into <q sID="genX" .../>...<q eID="genX" .../>. Note: Quotes with who="Jesus" are not transformed at this time.
    • <p> is converted into <lb type="x-begin-paragraph"/> and </p> is transformed into <lb type="x-end-paragraph"/>.
    • <verse ...>...</verse> becomes <verse sID="genX" .../>...<verse eID="genX" .../>
    • <chapter ...>...</chapter> becomes <chapter sID="genX" .../>...<chapter eID="genX" .../>
    • <closer ...>...</closer> becomes <closer sID="genX" .../>...<closer eID="genX" .../>
    • <div ...>...</div> becomes <div sID="genX" .../>...<div eID="genX" .../>
    • <l ...>...</l> becomes <l sID="genX" .../>...<l eID="genX" .../>
    • <lg ...>...</lg> becomes <lg sID="genX" .../>...<lg eID="genX" .../>
    • <salute ...>...</salute> becomes <salute sID="genX" .../>...<salute eID="genX" .../>
    • <signed ...>...</signed> becomes <signed sID="genX" .../>...<signed eID="genX" .../>
    • <speech ...>...</speech> becomes <speech sID="genX" .../>...<speech eID="genX" .../>
  • Words of Christ - necessary for front-ends to appropriately highlight the WOC, a verse at a time.
    • <q sID="XXX" who="Jesus" .../>...<q eID="XXX" who="Jesus" .../> becomes <q who="Jesus" marker=""><q sID="XXX" .../>...<q eID="XXX" .../></q>
    • <q who="Jesus" ...>...</q> becomes <q who="Jesus" marker=""><q sID="genX" .../>...<q eID="genX" .../></q>
    • Within the following construct, <q who="Jesus" marker="">...</q> will surround verse text.
  • Pre-Verse Titles
    • Titles immediately preceding a verse are converted into <title type="section" subType="x-preverse">...</title>
    • Interverse tags not in titles are appended to prior verse.
    • (planned) <div sID="genX" type="YYY" subType="x-preverse"/>...<div eID="genX" type="YYY" subType="x-preverse"/> will replace preverse titles.

Note: Other than Pre-Verse Titles, these transformations can be reversed to produce the original elements.
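
The whitespace and milestone rules above can be illustrated with a short Python sketch. This is not osis2mod's implementation (osis2mod is a C++ program with its own parser); the function names normalize_whitespace and milestone are hypothetical, the regular expression only handles simple, well-formed input, and copying attributes onto the start milestone only is an assumption made for brevity.

```python
import itertools
import re


def normalize_whitespace(text):
    """Turn every run of whitespace into one blank and drop leading whitespace."""
    return re.sub(r"\s+", " ", text).lstrip()


def milestone(xml, tag, counter=None):
    """Rewrite <tag ...>...</tag> into <tag sID="genX" .../>...<tag eID="genX"/>.

    Illustrative only: a regex stands in for a real XML parser, and the
    attributes are copied to the start milestone only (an assumption).
    """
    if counter is None:
        counter = itertools.count(1)
    open_ids = []  # stack of generated ids for elements that are still open
    pattern = re.compile(rf"<(/?){tag}\b([^<>]*?)(/?)>")

    def replace(match):
        closing, attrs, self_closed = match.groups()
        if self_closed:
            return match.group(0)  # already in milestone form, leave untouched
        if closing:
            return f'<{tag} eID="{open_ids.pop()}"/>'
        gen_id = f"gen{next(counter)}"
        open_ids.append(gen_id)
        return f'<{tag} sID="{gen_id}"{attrs}/>'

    return pattern.sub(replace, xml)


if __name__ == "__main__":
    print(normalize_whitespace("  In the\n\tbeginning"))
    print(milestone('<l level="1">The Lord is my shepherd</l>', "l"))
```

Because each sID/eID pair shares one generated id, the original container can be rebuilt by matching the ids, which is why the note above calls these transformations reversible.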

Exclusions

Only content within <div>...</div> is retained. All other content is excluded.

Usage

It is best to use the most recent version of osis2mod, ideally compiled from SVN.

After the SWORD 1.5.9 release, osis2mod was changed to take flags rather than positional arguments.

usage: ./osis2mod <output/path> <osisDoc> [OPTIONS]
  -a                     augment module if exists (default is to create new)
  -z                     use ZIP compression (default no compression)
  -Z                     use LZSS compression (default no compression)
  -b <2|3|4>             compression block size (default 4):
                                 2 - verse; 3 - chapter; 4 - book
  -c <cipher_key>        encipher module using supplied key
                                 (default no enciphering)
  -N                     do not convert to UTF-8 and normalize to NFC
                                 (default is to convert to UTF-8 and normalize to NFC)
                                 Note: all UTF-8 texts should be normalized to NFC

<output/path>
This is a path to any existing directory. It is best for it to be empty.

<osisDoc>
This is a single, well-formed, valid OSIS document.

-a
Osis2mod can create a Bible all at once or incrementally, depending on the presence of the -a flag. This allows for two workflows:

  1. Assembling a Bible from book files:
    mkdir /tmp/mymodule
    osis2mod /tmp/mymodule matt.xml
    osis2mod /tmp/mymodule mark.xml -a
    ...
    osis2mod /tmp/mymodule rev.xml -a
    

    Note: The book files can be in any order. SWORD will order them correctly in the index.

  2. Adding corrections to a Bible:
    osis2mod /tmp/mymodule fixes.xml -a
    

    Note: When fixes are put into the module they are appended to the data file and do not actually replace the verses. The index file is adjusted to point to the new place in the data file.
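
The incremental workflow can also be scripted. The following Python sketch builds one osis2mod command per book file and adds -a to every command after the first. It follows the usage synopsis, which places options after the two positional arguments. The helper names build_commands and assemble are hypothetical, and the sketch assumes osis2mod is on the PATH.

```python
import subprocess
from pathlib import Path


def build_commands(module_dir, book_files, exe="osis2mod"):
    """Return one command per book; all but the first augment the module."""
    commands = []
    for index, book in enumerate(book_files):
        cmd = [exe, str(module_dir), str(book)]
        if index > 0:
            cmd.append("-a")  # augment the module created by the first command
        commands.append(cmd)
    return commands


def assemble(module_dir, book_files):
    """Create the output directory if needed and run each command in turn."""
    Path(module_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(module_dir, book_files):
        subprocess.run(cmd, check=True)  # stop at the first failing book


if __name__ == "__main__":
    for cmd in build_commands("/tmp/mymodule", ["matt.xml", "mark.xml", "rev.xml"]):
        print(" ".join(cmd))
```

Running the file only prints the commands; call assemble() to actually execute them.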

-z|-Z
A SWORD Bible can be compressed with Zip (-z) or LZSS (-Z). All of SWORD's Bible modules are compressed with Zip. This saves significant space over an uncompressed module. Uncompressed modules are useful for debugging.

-b 2|3|4
This setting is only useful for a compressed module. The choice of Verse (2), Chapter (3) or Book (4, the default) level compression depends upon the amount of data in the block. A typical Bible is best compressed book by book; a commentary, chapter by chapter. If the commentary is very extensive and the amount of text per verse is very large, then verse compression might make sense.

All of SWORD's compressed Bible modules are compressed by book. Basically, all of the verses in a block are compressed and appended to the data file. For this reason, the data file cannot be uncompressed by anything other than the SWORD and JSword libraries.

When creating the module by appending, it is important to do so by whole compression block. That is, if blockType is Chapter, then the osisDoc needs to contain one or more whole chapters.
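
As a small illustration, the mapping between compression method, block type and flags can be captured in a helper. This is only a sketch; compression_options and its keyword names are hypothetical, and the flag values are taken from the usage text above.

```python
# Block sizes as listed in the usage text: 2 - verse; 3 - chapter; 4 - book.
BLOCK_TYPES = {"verse": 2, "chapter": 3, "book": 4}
METHOD_FLAGS = {"zip": "-z", "lzss": "-Z"}


def compression_options(method=None, block="book"):
    """Return the osis2mod options for the chosen compression settings."""
    if method is None:
        return []  # uncompressed module; -b would have no effect
    if method not in METHOD_FLAGS:
        raise ValueError(f"unknown compression method: {method!r}")
    if block not in BLOCK_TYPES:
        raise ValueError(f"unknown block type: {block!r}")
    return [METHOD_FLAGS[method], "-b", str(BLOCK_TYPES[block])]


if __name__ == "__main__":
    # A commentary compressed with Zip, chapter by chapter.
    print(compression_options("zip", "chapter"))
```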

-c cipherKey
This is typically 16 characters in length, having no leading or trailing spaces, consisting of alternating sets of 4 alpha and 4 numeric characters, such as Aduf0274PjNq0328.
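
A key in this typical shape can be generated and checked with a few lines of Python. This is a sketch of the format described above, not a requirement of osis2mod; the names make_cipher_key and looks_like_typical_key are hypothetical.

```python
import re
import secrets
import string

# Two repetitions of 4 letters followed by 4 digits, 16 characters in total.
KEY_PATTERN = re.compile(r"(?:[A-Za-z]{4}[0-9]{4}){2}")


def make_cipher_key():
    """Generate a random key shaped like Aduf0274PjNq0328."""
    parts = []
    for _ in range(2):
        parts.append("".join(secrets.choice(string.ascii_letters) for _ in range(4)))
        parts.append("".join(secrets.choice(string.digits) for _ in range(4)))
    return "".join(parts)


def looks_like_typical_key(key):
    """True if the whole key matches the typical 4-alpha/4-numeric pattern."""
    return KEY_PATTERN.fullmatch(key) is not None


if __name__ == "__main__":
    key = make_cipher_key()
    print(key, looks_like_typical_key(key))
```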

-N
All OSIS modules should be UTF-8, and all UTF-8 modules should also be NFC. The default is to automatically detect the presence of Latin-1 (either cp1252 or iso8859-1), convert it to UTF-8, and normalize UTF-8 to NFC. This flag turns off this behavior and is useful for creating Latin-1 modules or for modules that are already UTF-8 and NFC.

Note: this was added in late February 2008 and requires ICU support when compiling.
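
The default behavior can be approximated in Python with the standard library, which may be useful for checking a source file before conversion. This is a sketch of the idea only: osis2mod itself relies on ICU, and the detection order used here (UTF-8 first, then cp1252, then iso8859-1) and the function name to_utf8_nfc are assumptions.

```python
import unicodedata


def to_utf8_nfc(raw):
    """Decode bytes as UTF-8 or Latin-1 and return NFC-normalized UTF-8 bytes."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        try:
            # cp1252 leaves a few byte values undefined, so it can fail too.
            text = raw.decode("cp1252")
        except UnicodeDecodeError:
            # iso8859-1 maps every byte value, so this decode always succeeds.
            text = raw.decode("iso8859-1")
    return unicodedata.normalize("NFC", text).encode("utf-8")


if __name__ == "__main__":
    latin1_bytes = b"caf\xe9"                    # "cafe" with e-acute in Latin-1
    decomposed = "e\u0301".encode("utf-8")       # e followed by combining acute
    print(to_utf8_nfc(latin1_bytes))
    print(to_utf8_nfc(decomposed))
```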