Difference between revisions of "Osis2mod"
(→History of Changes: clarified the div in header comment.)
Revision as of 17:55, 27 April 2012
- 1 Introduction
- 2 Current status
- 3 History of Changes
- 4 Transformations
- 5 Handling of Introductions, Titles and Inter-Verse Material
- 6 Exclusions
- 7 Usage
- 8 Messages
- 9 See also
Software bugs relating to osis2mod should be reported in http://www.crosswire.org/bugs/browse/API
- Please describe current status of osis2mod, including a list of any outstanding issues or unsolved difficulties.
History of Changes
The following outlines, in reverse chronological order, the major changes to osis2mod. When several changes were made over the span of a few days, they are grouped under the most recent date. Bug fixes are not mentioned.
Transformations
Osis2mod performs the following transformations:
- Whitespace -- Allows for human-readable OSIS files.
- Leading whitespace on books, chapters and verses is removed
- Whitespace is normalized into blanks
- Runs of adjacent whitespace are reduced to a single space
- Unicode handling - All modules should be UTF-8, NFC.
- Latin-1 (cp1252 and iso8859-1) are converted into UTF-8
- UTF-8 is normalized into NFC, unless specified otherwise.
- Milestone conversion - necessary for frontends to show a verse at a time.
(note: genX is unique for an sID/eID pair, where X is a number.)
- <q ...>...</q> is converted into <q sID="genX" .../>...<q eID="genX" .../>. Note: Quotes with who="Jesus" are not transformed at this time.
- <p ...>...</p> becomes <div type="paragraph" sID="genX" .../>...<div type="paragraph" eID="genX" .../>
- <chapter ...>...</chapter> becomes <chapter sID="genX" .../>...<chapter eID="genX" .../>
- <closer ...>...</closer> becomes <closer sID="genX" .../>...<closer eID="genX" .../>
- <div ...>...</div> becomes <div sID="genX" .../>...<div eID="genX" .../>
- <l ...>...</l> becomes <l sID="genX" .../>...<l eID="genX" .../>
- <lg ...>...</lg> becomes <lg sID="genX" .../>...<lg eID="genX" .../>
- <salute ...>...</salute> becomes <salute sID="genX" .../>...<salute eID="genX" .../>
- <signed ...>...</signed> becomes <signed sID="genX" .../>...<signed eID="genX" .../>
- <speech ...>...</speech> becomes <speech sID="genX" .../>...<speech eID="genX" .../>
- <verse ...>...</verse> becomes, only when -d 2 is used for debugging, <milestone resp="v" sID="genX" ... />...<milestone resp="v" eID="genX" ... />
- Words of Christ - necessary for front-ends to appropriately highlight the WOC, a verse at a time.
- <q sID="XXX" who="Jesus" .../>...<q eID="XXX" who="Jesus" .../> becomes <q who="Jesus" marker=""><q sID="XXX" .../>...<q eID="XXX" .../></q>
- <q who="Jesus" ...>...</q> becomes <q who="Jesus" marker=""><q sID="genX" .../>...<q eID="genX" .../></q>
- Within the following construct, <q who="Jesus" marker="">...</q> will surround verse text.
- Pre-Verse Titles (obsolete with SVN revision 2358 for the SWORD 1.6.0 release)
- Titles immediately preceding a verse are converted into <title type="section" subType="x-preverse">...</title>
- Interverse tags not in titles are appended to prior verse.
- (In 1.6.0) <div sID="pvX" type="x-milestone" subType="x-preverse"/>...<div eID="pvX" type="x-milestone" subType="x-preverse"/> will replace preverse titles.
Note: With 1.6.0, these transformations can be reversed to produce the original elements.
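The container-to-milestone rewriting listed above can be illustrated with a minimal Python sketch. It is not osis2mod's implementation (which is C++). The function name milestone_containers, the regular expression and the default tag list are assumptions made for illustration. The input is assumed to be well formed, attributes are kept only on the start milestone, and the special cases for <p>, <verse> and who="Jesus" quotes are left out.

```python
import itertools
import re

# Matches a start tag, end tag or empty tag: (slash)(name)(attributes)(slash)
TAG = re.compile(r"<(/?)([A-Za-z]+)([^<>]*?)(/?)>")


def milestone_containers(osis: str, names=("q", "l", "lg")) -> str:
    """Rewrite <name ...>...</name> pairs as sID/eID milestone pairs (sketch)."""
    counter = itertools.count(1)
    stack = []  # generated ids of the containers that are currently open

    def rewrite(match: re.Match) -> str:
        closing, name, attrs, empty = match.groups()
        if name not in names or empty:
            # Other elements and existing milestones are left untouched.
            return match.group(0)
        if closing:
            # Assumes well-formed input, so the innermost open id matches.
            gen_id = stack.pop()
            return f'<{name} eID="{gen_id}"/>'
        gen_id = f"gen{next(counter)}"
        stack.append(gen_id)
        return f'<{name} sID="{gen_id}"{attrs}/>'

    return TAG.sub(rewrite, osis)
```

Each generated id is shared by exactly one start/end pair, which is the property the genX note above describes.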
Handling of Introductions, Titles and Inter-Verse Material
SWORD provides for module, testament, book and chapter introductory material. Those introductions can have appropriate titles as well. In SWORD 1.6.0 the handling of this material has changed.
- In the following, the effects of the above transformations are not shown.
Module and Testament Introductions
At this time, osis2mod does not fully support module and testament introductions.
- It may include the first testament introduction, but not the second.
Book Introductions and Titles
Book introductions and titles are straightforward. The book introduction includes the start of the book and everything following it up to, but not including, the start of the first chapter. See OSIS Bibles for best practices in marking up titles and introductions.
<div type="book" ...> ... introductory material ... <chapter ...>
will put the following into the book introduction:
<div type="book" ...> ... introductory material ...
Chapter introductions and titles are a bit problematic. Between the start of a chapter and its first verse, we could have a chapter title, a chapter introduction and/or the start of a section of verses or a titled verse. Osis2mod now handles this in a predictable fashion. From the start of the chapter up to, but not including, a section div or a title whose type is not main, chapter or sub, the content is chapter introduction. After that, it is part of the verse.
Specifically, the following list gives the possible first elements following the chapter introduction:
- <div type="section" ...>
- <title type="yyy" ...> where yyy is not main, chapter or sub.
<chapter ...>
<title>Chapter Title</title> (or <title type="chapter">Chapter Title</title> or <title type="main">Chapter Title</title>)
<title type="sub">Chapter Subtitle</title>
<div type="introduction">... intro ...</div>
<p>
<lg>
<div type="section"> or <title type="yyy">
will put the following into the chapter introduction:
<chapter ...>
<title>Chapter Title</title>
<title type="sub">Chapter Subtitle</title>
<p>
<lg>
<div type="introduction">... intro ...</div>
Note: Prior to 1.6.0, osis2mod would change the order of some of these elements.
The material starting with:
<div type="section"> or <title type="yyy">
and including everything up to the <verse ...> will be put into the following construct and prepended to the verse content.
<div type="x-milestone" subType="x-preverse" sID="pvXXX"/>
<div type="section"> or <title type="yyy">
<div type="x-milestone" subType="x-preverse" eID="pvXXX"/>
... verse content ...
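The rule for where a chapter introduction ends can be sketched in Python as follows. This is only an illustration of the rule as stated above, not osis2mod's code. The function name split_chapter_intro and the (tag, type) tuple representation of elements are assumptions, and an untyped <title> is treated as a chapter title, as in the example above.

```python
# Title types that still belong to the chapter introduction.
# None stands for a <title> without a type attribute.
CHAPTER_TITLE_TYPES = {None, "main", "chapter", "sub"}


def split_chapter_intro(elements):
    """Split the elements between <chapter> and the first <verse>.

    elements is a list of (tag, type) pairs in document order.
    Returns (chapter_intro, preverse): the second list starts at the first
    section div or at the first title that is not a chapter title.
    """
    for index, (tag, type_) in enumerate(elements):
        is_section = tag == "div" and type_ == "section"
        is_verse_title = tag == "title" and type_ not in CHAPTER_TITLE_TYPES
        if is_section or is_verse_title:
            return elements[:index], elements[index:]
    return elements, []
```

Everything in the second list is what would be wrapped in the x-preverse milestones and prepended to the first verse.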
Between verses we may have closing tags to finish off what was started earlier, structural opening tags (e.g. line groups, divisions, paragraphs, ...), titles and/or introductory material.
Upon finding the close of a verse, osis2mod will append all adjacent closing tags to it. Once it finds a start tag, it will attach that to the following verse, marking it up in the same fashion.
For example, the following would be prepended to the verse content:
<div type="x-milestone" subType="x-preverse" sID="pvXXX"/>
<div type="section">
<title>Section title</title>
<p>
<lg>
<div type="x-milestone" subType="x-preverse" eID="pvXXX"/>
... verse content ...
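The split of inter-verse material between the prior and the following verse can be sketched as below. This is a simplified model, not osis2mod's logic: the function name split_interverse is hypothetical, and the material is assumed to be already tokenized into a list of tag strings.

```python
def split_interverse(tokens):
    """Split the tokens found between </verse> and the next <verse>.

    Leading closing tags go with the prior verse. From the first token that
    is not a closing tag onward, everything goes with the following verse.
    """
    for index, token in enumerate(tokens):
        if not token.startswith("</"):
            return tokens[:index], tokens[index:]
    return tokens, []
```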
The material following the last verse of a chapter is appended to that verse. You might find:
... verse content ...
</chapter>
<div type="colophon">... colophon text ...</div>
</div>
</div>
Exclusions
Only content from the first <div> to the last </div> is retained. Everything else is excluded. From a practical perspective, this excludes the OSIS header information.
SWORD's utilxml.cpp does not support XML comments. Because osis2mod calls this component, osis2mod does not support XML comments either; currently, it does not even look for comments in order to ignore them. This can sometimes lead to FATAL stack errors when there are too many comments in the source text. See API-140.
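One possible workaround is to remove comments from the OSIS source before it reaches osis2mod. The following Python sketch is an assumption about how that could be done, not something osis2mod provides. The helper name strip_xml_comments is hypothetical, and the regular expression does not protect comment-like text inside CDATA sections.

```python
import re

# Non-greedy match so that each comment is removed separately.
# DOTALL lets a comment span several lines.
XML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_xml_comments(xml_text: str) -> str:
    """Return xml_text with every <!-- ... --> comment removed."""
    return XML_COMMENT.sub("", xml_text)
```

The cleaned text could then be written to a file or piped to osis2mod on standard input.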
Usage
It is always preferable to use the most recent version of osis2mod; compiling it from SVN is best.
- After the SWORD 1.5.9 release, osis2mod was changed to take flags rather than positional arguments.
You are running osis2mod: $Rev: 2562 $
OSIS Bible/commentary module creation tool for The SWORD Project
usage: utils\osis2mod <output/path> <osisDoc> [OPTIONS]
  <output/path>    an existing folder that the module will be written
  <osisDoc>        path to the validated OSIS document, or '-' to read from standard input
  -a               augment module if exists (default is to create new)
  -z               use ZIP compression (default no compression)
  -Z               use LZSS compression (default no compression)
  -b <2|3|4>       compression block size (default 4):
                     2 - verse; 3 - chapter; 4 - book
  -c <cipher_key>  encipher module using supplied key (default no enciphering)
  -N               do not convert UTF-8 or normalize UTF-8 to NFC
                   (default is to convert to UTF-8, if needed, and then normalize to NFC)
                   Note: UTF-8 texts should be normalized to NFC.
  -s <2|4>         bytes used to store entry size (default is 2).
                   Note: useful for commentaries with very large entries
                   in uncompressed modules (2 bytes to store size equal 65535 characters)
  -v <v11n>        specify a versification scheme to use (default is KJV)
                   Note: The following are valid values for v11n:
                     Catholic
                     Catholic2
                     German
                     KJV
                     KJVA
                     Leningrad
                     Luther
                     MT
                     NRSV
                     NRSVA
                     Synodal
                     SynodalP
                     Vulg
  -d <flags>       turn on debugging (default is 0)
                   Note: This flag may change in the future.
                   Flags: The following are valid values:
                     0   - no debugging
                     1   - writes to module, very verbose
                     2   - verse start and end
                     4   - quotes, esp. Words of Christ
                     8   - titles
                     16  - inter-verse material
                     32  - BSP to BCV transformations
                     64  - v11n exceptions
                     128 - parsing of osisID and osisRef
                     256 - internal stack
                     512 - miscellaneous
                   This argument can be used more than once. (Or
                   the flags may be added together.)
See http://www.crosswire.org/wiki/osis2mod for more details.
Parameters and Options
The <output/path> argument is a path to any existing directory. It is best for it to be empty.
The <osisDoc> argument is a single, well-formed, valid OSIS document.
If - is used instead of a file name, the document will be read from standard input. This allows for two constructs:
osis2mod ./modules/texts/ztext/KJV - < kjv.xml
cat kjv.xml | osis2mod ./modules/texts/ztext/KJV -
Osis2mod can create a Bible all at once or incrementally, depending on the presence of the -a flag. This provides two capabilities:
- Assembling a Bible from book files:
mkdir /tmp/mymodule
osis2mod /tmp/mymodule matt.xml
osis2mod /tmp/mymodule -a mark.xml
...
osis2mod /tmp/mymodule -a rev.xml
Note: The book files can be in any order. SWORD will order them correctly in the index.
- Adding corrections to a Bible:
osis2mod /tmp/mymodule -a fixes.xml
Note: When fixes are put into the module they are appended to the data file and do not actually replace the verses. The index file is adjusted to point to the new place in the data file.
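The incremental assembly shown above can be scripted. The sketch below only builds the argument lists, mirroring the argument order of the example commands (first book without -a, every later book with -a). The function name assembly_commands is hypothetical, and it assumes that osis2mod is on the PATH.

```python
def assembly_commands(output_path, book_files):
    """Build one osis2mod argument list per book file.

    The first file creates the module and the remaining files augment it
    with -a, as in the example commands above.
    """
    commands = []
    for index, book_file in enumerate(book_files):
        command = ["osis2mod", output_path]
        if index > 0:
            command.append("-a")  # augment the module created by the first run
        command.append(book_file)
        commands.append(command)
    return commands
```

Each list can be handed to subprocess.run(command, check=True), which raises an exception when osis2mod returns a non-zero exit status.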
A SWORD Bible can be compressed with Zip (-z) or LZSS (-Z). All of SWORD's Bible modules are compressed with Zip. This saves significant space over an uncompressed module. Uncompressed modules are useful for debugging.
The -b setting is only useful for a compressed module. The choice of Verse (2), Chapter (3) or Book (4, the default) level compression depends upon the amount of data in the block. A typical Bible is best compressed book by book; a commentary, chapter by chapter. If the commentary is very extensive and the amount of text per verse is really huge, then verse compression might make sense.
All of SWORD's compressed Bible modules are compressed by book. Basically, all of the verses in a block are compressed and appended to the data file. For this reason, the data file cannot be uncompressed by anything other than the SWORD and JSword libraries.
When creating the module by appending it is important to do so by whole compression block. That is, if blockType is Chapter, then the osisDoc needs to contain one or more whole chapters.
The cipher key given with -c is typically 16 characters in length, having no leading or trailing spaces, consisting of alternating sets of 4 alpha and 4 numeric characters, such as Aduf0274PjNq0328. The key is case-sensitive.
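A key of the typical shape described above can be generated with a few lines of Python. This is a sketch under assumptions: the source only says that keys are typically of this form, not that the form is required, and the function name make_cipher_key is hypothetical.

```python
import secrets
import string


def make_cipher_key(groups: int = 4) -> str:
    """Return alternating groups of 4 letters and 4 digits (16 chars by default)."""
    parts = []
    for index in range(groups):
        # Even groups are alphabetic (mixed case), odd groups are numeric.
        alphabet = string.ascii_letters if index % 2 == 0 else string.digits
        parts.append("".join(secrets.choice(alphabet) for _ in range(4)))
    return "".join(parts)
```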
All OSIS modules should be UTF-8, and all that are UTF-8 should also be NFC. The default is to automatically detect the presence of Latin-1 (either cp1252 or iso8859-1), convert it to UTF-8, and normalize UTF-8 to NFC. The -N flag turns off this behavior and is useful for creating Latin-1 modules or for modules whose source text is already UTF-8 and NFC.
Note: this was added late Feb 2008 and requires ICU support when compiling.
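The default conversion described above can be approximated in Python to check a source text before running osis2mod. This is a stand-in using the Python standard library rather than ICU, so it is only a sketch. The function name to_utf8_nfc is hypothetical, and treating every non-UTF-8 input as cp1252 is an assumption.

```python
import unicodedata


def to_utf8_nfc(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, falling back to cp1252, then normalize to NFC."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Latin-1 (cp1252).
        text = raw.decode("cp1252", errors="replace")
    return unicodedata.normalize("NFC", text)
```

For example, an "e" followed by a combining acute accent comes back as the single precomposed character U+00E9.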
For -s, a value of 2, the default, restricts raw, uncompressed modules to 64K bytes per entry. A value of 4 breaks this barrier. This is needed for Bibles having large introductory materials and for commentaries with large entries. All compressed OSIS modules can handle large entries.
Note: this was added late Apr 2009 and will be part of the SWORD 1.6.0 release (formerly known as 1.5.11).
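Whether the larger entry size is needed can be estimated before building a module. The sketch below assumes that the limit applies to the UTF-8 byte length of each entry and that entries are available as a dictionary keyed by osisID. Both are assumptions for illustration, and the function name entry_size_bytes is hypothetical.

```python
MAX_TWO_BYTE_ENTRY = 2**16 - 1  # 65535, the largest size a 2-byte field can store


def entry_size_bytes(entries) -> int:
    """Return 2 if every entry fits a 2-byte size field, otherwise 4."""
    largest = max(
        (len(text.encode("utf-8")) for text in entries.values()),
        default=0,
    )
    return 2 if largest <= MAX_TWO_BYTE_ENTRY else 4
```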
By default, osis2mod uses the KJV versification. The practical implication of this is that only books in the KJV canon are allowed and any text in an allowed book is retained. However, if the verse reference of a supported book falls outside of the versification, it is appended to the prior verse in the canon. The -v flag allows for an alternate versification.
Note: this was added late Apr 2009 and will be part of the SWORD 1.6.0 release (formerly known as 1.5.11). With that release, only the Leningrad Codex will be supported, with -v Leningrad.
The -d flag can be used more than once, or the flag values can be added together. For example,
-d 2 -d 4
is the same as
-d 6
To do verbose debugging use:
For the most part these flags are not intended for debugging modules, but rather for debugging problems in osis2mod.
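Because the -d values are powers of two, adding them is the same as combining bits. A small Python sketch makes this concrete. The member names are hypothetical labels for the values listed in the usage text, not identifiers from osis2mod.

```python
from enum import IntFlag


class DebugFlag(IntFlag):
    """Values accepted by -d, as listed in the usage text (names are assumed)."""
    WRITE = 1
    VERSE = 2
    QUOTE = 4
    TITLE = 8
    INTERVERSE = 16
    XFORM = 32
    V11N = 64
    REF = 128
    STACK = 256
    OTHER = 512


def d_value(*flags: int) -> int:
    """Combine several flags into the single number to pass to -d."""
    total = 0
    for flag in flags:
        total |= flag
    return total


def enabled(value: int):
    """List the names of the flags that are set in a combined -d value."""
    return [flag.name for flag in DebugFlag if flag & value]
```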
The -d 2 flag produces no output but puts milestones into the module where verses start and end. The form of the milestone is:
<milestone resp="v" [attributes from verse] />
The milestone will contain the osisID from the verse and also a valid sID or eID. The sID/eID indicates the start or end of the verse.
Note: the -d 2 flag might change at any time, or may even be removed.
Messages
Osis2mod has robust, mind-boggling messages. These are provided here in the hope that they will help with problem diagnosis.
When an error occurs that causes osis2mod to exit without processing the entire input file, a non-zero exit status is supplied to the caller. Here are the codes that osis2mod uses:
const int EXIT_BAD_ARG     = 1; // Bad parameter given for program
const int EXIT_NO_WRITE    = 2; // Could not open the module for writing
const int EXIT_NO_CREATE   = 3; // Could not create the module
const int EXIT_NO_READ     = 4; // Could not open the input file for reading.
const int EXIT_BAD_NESTING = 5; // BSP or BCV nesting is bad or improper XML comment
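A calling script can translate these exit codes into readable text. The mapping below copies the codes and comments from the constants above. The function name describe_exit is hypothetical, and treating 0 as success is an inference from the statement that errors produce a non-zero status.

```python
EXIT_CODES = {
    0: "success (assumed: no error status reported)",
    1: "EXIT_BAD_ARG: bad parameter given for program",
    2: "EXIT_NO_WRITE: could not open the module for writing",
    3: "EXIT_NO_CREATE: could not create the module",
    4: "EXIT_NO_READ: could not open the input file for reading",
    5: "EXIT_BAD_NESTING: BSP or BCV nesting is bad or improper XML comment",
}


def describe_exit(code: int) -> str:
    """Return a readable description of an osis2mod exit status."""
    return EXIT_CODES.get(code, f"unknown exit status {code}")
```

The code would typically come from the returncode attribute of the result of subprocess.run.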
In the following, example values are given in [...]. The brackets do not actually appear in the message. Also, the messages are a bit prettier here than in reality.
WARNING(UTF8): [ osisID ]: Should be converted to UTF-8 ([ text ])
The program will always check for text that is not UTF-8.
INFO(UTF8): [ osisID ]: Converting to UTF-8 ([ text before conversion ])
Text that is converted to UTF-8 is noted.
ERROR(UTF8): [ osisID ]: Converting to UTF-8 ([ text after first conversion ])
It is an error if after a conversion it still is not UTF-8.
WARNING(UTF8): osis2mod is not compiled with support for ICU. Ignoring -n flag.
Normalization was requested, but since osis2mod was not compiled for it, it cannot honor the default request.
INFO(V11N): [ osisID ] is not in the [ v11n ] versification.
Indicates that a verse is not in the versification.
INFO(V11N): [ osisID ] is not in the [ v11n ] versification. Appending content to [ osisID ]
This, like the previous message, indicates a versification problem, but it also shows where the text will be found. Osis2mod preserves all module content for supported books.
WARNING(V11N): New book is [ name ] and is not in [ v11n ] versification, ignoring
The name of the book was not recognized as belonging to the chosen versification; the book and all of its content are ignored.
INFO(WRITE): Appending entry: [ osisID ]: [ text so far ]
Osis2mod encountered text that needs to be appended to a verse that is already in the module. This could indicate that:
- the reference is in the input twice. This typically indicates a problem.
- more text was found that needs to be added to the prior verse.
- osis2mod is being run in append mode to fix a verse in the module.
INFO(LINK): Linking [ osisID ] to [ osisID ]
An osisID such as "Gen.1.1 Gen.1.2 Gen.1.3" was used and the latter are linked to the first.
ERROR(REF): Invalid osisID/annotateRef: [ invalid attribute value ]
This indicates that the SWORD library was unable to parse the osisID or annotateRef.
FATAL(NESTING): [ currentOsisID ]: tag expected
This indicates that the specified verse is not balanced with regard to its tags. Building a raw text module, looking in the module for the verse and pairing begin/end tags will help find the problem. Typically, this indicates an end tag that did not have a matching begin tag and all tags before it were properly paired.
FATAL(NESTING): [ currentOsisID ]: Expected [ topToken.getName() ] found [ tokenName ]
This also indicates that the specified verse is not balanced with regard to its tags. Building a raw text module, looking in the module for the verse and pairing begin/end tags will help find the problem. It could be either a begin or an end tag problem.
WARNING(NESTING): verse [ currentOsisID ] is not well formed:([ verseDepth ],[ tagDepth ])
This indicates that the verse probably will not show properly in some front-ends in some circumstances. Typically, it shows the problem if the verse is shown in isolation.
ERROR(NESTING): improper nesting [ currentOsisID ]: matching (sID,eID) not found. Looking at ([ sID ],[ eID ])
OSIS specifies that every sID has a matching eID. Osis2mod is checking that BSP elements are properly nested.
FATAL(COMMENTS): unknown commentstate on comment start: [ comment state ]
This indicates that the comment is not of the form <!-- ... -->.
FATAL(COMMENTS): unknown commentstate on comment end: [ comment state ]
This indicates that the comment is not of the form <!-- ... -->.
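Since the diagnostic messages above share a LEVEL(CATEGORY): prefix, a log of osis2mod output can be summarized with a short script. This is a sketch under assumptions: the messages shown here are described as prettier than the real output, so the pattern may need adjusting, and the names parse_message and summarize are hypothetical.

```python
import re
from collections import Counter

# LEVEL(CATEGORY): message text
MESSAGE = re.compile(r"^(INFO|WARNING|ERROR|FATAL|DEBUG)\(([A-Z0-9]+)\):\s*(.*)$")


def parse_message(line: str):
    """Return (level, category, text) for a message line, or None."""
    match = MESSAGE.match(line.strip())
    return match.groups() if match else None


def summarize(lines):
    """Count messages per (level, category), skipping unrecognized lines."""
    parsed = (parse_message(line) for line in lines)
    return Counter(item[:2] for item in parsed if item)
```

A summary with any FATAL or ERROR entries is a good place to start when a module build fails.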
The following are shown in the same form as the diagnostic messages above. They are given without comment.
Output of what is being written to the module. The two osisIDs should always be the same.
DEBUG(WRITE): [ osisID ]:[ osisID ]: [ text so far ]
A stack is maintained to represent the Words of Christ on a per-verse basis. This is an internal diagnostic of that stack.
DEBUG(QUOTE): [ currentOsisID ]: quote top([ quote stack size ]) [ token ]
DEBUG(QUOTE): [ currentOsisID ]: quote pop([ quote stack size ]) [ topToken ] -- [ token ]
DEBUG(QUOTE): [ currentOsisID ]: ([ quote stack size ]) [ topToken ] -- [ token ]
Identifies when book and chapter introductions are being determined.
DEBUG(TITLE): [ currentOsisID ]: OOPS INTRO inChapterIntro = [ inChapterIntro ] inBookIntro = [ inBookIntro ]
DEBUG(TITLE): [ currentOsisID ]: Looking for book introduction
DEBUG(TITLE): [ currentOsisID ]: Done looking for book introduction
DEBUG(TITLE): [ currentOsisID ]: BOOK INTRO [ heading ]
DEBUG(TITLE): [ currentOsisID ]: Looking for chapter introduction
DEBUG(TITLE): [ currentOsisID ]: Done looking for chapter introduction
DEBUG(TITLE): [ currentOsisID ]: CHAPTER INTRO [ heading ]
Inter-verse material goes either with the prior "verse" or with the next. These messages help diagnose problems related to that split.
DEBUG(INTERVERSE): [ currentOsisID ]: interverse start token [ token ]:[ text ]
DEBUG(INTERVERSE): [ currentOsisID ]: interverse end tag: [ tokenName ]([ tagDepth ],[ chapterDepth ],[ bookDepth ])
DEBUG(INTERVERSE): [ currentOsisID ]: appending interverse end tag: [ tokenName ]([ tagDepth ],[ chapterDepth ],[ bookDepth ])
The following messages relate to the transformations of containers to milestones.
DEBUG(XFORM): [ currentOsisID ]: xform empty [ token ]
DEBUG(XFORM): [ currentOsisID ]: xform push ([ bspTagStack.size() ]) [ token ] (tagname=[ tagName ])
DEBUG(XFORM): [ currentOsisID ]: xform top([ bsp stack size ]) [ topToken ]
DEBUG(XFORM): [ currentOsisID ]: xform pop([ bsp stack size ]) [ topToken ]
Occasionally a verse reference is outside of the chosen versification. These messages help to understand difficulties that osis2mod has in storing extra-canonical material in the module.
DEBUG(V11N): [ currentOsisID ] normalizes to [ after ]
DEBUG(V11N): Chapter max:[ chapterMax ], Verse Max:[ verseMax ]
OSIS ids and references can be of a form that SWORD cannot parse. Osis2mod contains a routine that munges these into a form that SWORD can understand.
DEBUG(REF): Copy range marker: [ marker ]
DEBUG(REF): Found a work prefix [ workPrefix ]
DEBUG(REF): Copy osisID:[ osisID ]
DEBUG(REF): Found a grain suffix [ grain ]
DEBUG(REF): Found a range
DEBUG(REF): replacing space with ;. Remaining: [ text ]
DEBUG(REF): shortended keyVal to`[ text ]`
Osis2mod contains two stacks to validate proper nesting of BSP and BCV, respectively. This is an internal representation of the BCV stacks. It provides additional information to understand the diagnostic nesting messages.
DEBUG(STACK): [ currentOsisID ]: push ([ stack size ]) [ tokenName ]
DEBUG(STACK): [ currentOsisID ]: pop([ tagDepth ]) [ topToken.getName() ]
These are general debug messages.
DEBUG(FOUND): Found first div and pitching prior material: [ text ]
DEBUG(FOUND): New book is [ currentOsisID ]
DEBUG(FOUND): Current chapter is [ currentOsisID ] ([ osisID attribute value ])
DEBUG(FOUND): Entering verse
DEBUG(FOUND): New current verse is [ currentOsisID ]
DEBUG(FOUND): osisID/annotateRef is adjusted to: [ simpler osisID or osisRef ]
DEBUG(COMMENTS): in comment
DEBUG(COMMENTS): out of comment
The following gives the input arguments.
DEBUG(ARGS):
  path: [ path ]
  osisDoc: [ osisDoc ]
  create: [ append ]
  compressType: [ compType ]
  blockType: [ iType ]
  cipherKey: [ cipherKey ]
  normalize: [ normalize ]