WBTI Bible discussion
General Remarks on Conversion
A long, long time ago, we were given 43 Bibles by Wycliffe/SIL. Some of the Bibles belong to WBTI, others to the Bible League.
The 43 Bibles were delivered as zips. Each zip contained OSIS files, a separate file for each book of the Bible. The OSIS files were not all valid at the time of delivery. The contents of the files were encoded as UTF-8, but made extensive use of the SIL PUA for characters not present within Unicode at the time of encoding.
The major task in developing a conversion system for these Bibles was to find a means of mass-converting the whole set that could, hopefully, be performed by someone from SIL with a crosswire.org login. The basic conversion process was:
- Unzip the provided zips and output a modifiable table-of-contents file listing the OSIS books in their canonical order. (The converter outputs a list based on alphabetical order of the files, which sometimes needs to be modified. Neither the extracted files nor the table-of-contents file will be overwritten by subsequent executions of the build script, permitting corrections.)
- Create a single OSIS file from the separate book files, using the header of the first file and the body of all files.
- Use TECkit with SIL's own PUA to Unicode 5.1 converter to convert as much of the PUA content as possible to Unicode 5.1. (Since complete conversion remains unlikely and Unicode 5.1 fonts are uncommon, use of recent builds of Charis SIL is recommended and included in the .conf.)
- Determine a module name based on the language, owner, & year of publication: (lowercase language code)_(uppercase owner)_(4-digit year)
- Construct .confs based on header information from the first provided OSIS file.
- Run osis2mod on the combined, validated OSIS file.
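Two of the steps above are easy to sketch in Python: the naming rule, and a check for PUA characters that survive the TECkit pass. This is only an illustration, not the actual build script. The function names are hypothetical, and treating every character in Unicode category "Co" (private use) as unconverted SIL PUA content is an assumption.

```python
import re
import unicodedata


def module_name(language: str, owner: str, year: int) -> str:
    """Build (lowercase language code)_(uppercase owner)_(4-digit year)."""
    if not re.fullmatch(r"\d{4}", str(year)):
        raise ValueError(f"year must have four digits, got {year!r}")
    return f"{language.lower()}_{owner.upper()}_{year}"


def leftover_pua(text: str) -> list[str]:
    """Return the distinct private-use code points still present in text.

    Assumption: anything in general category 'Co' is PUA content that
    the PUA-to-Unicode mapping did not convert.
    """
    found = {ch for ch in text if unicodedata.category(ch) == "Co"}
    return [f"U+{ord(ch):04X}" for ch in sorted(found)]


if __name__ == "__main__":
    print(module_name("AMU", "bl", 1999))
    print(leftover_pua("a\uf170b\ue000"))
```

A non-empty result from leftover_pua for a converted file would indicate characters that still depend on a PUA-aware font such as Charis SIL.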
At this point a few additional steps are needed:
- Convert SIL Ethnologue codes to ISO 639-3. For the most part, especially with minority languages like these, these will be the same, but there are exceptions.
- Perform validation on the OSIS files (either with xmllint or, preferably, Xerces).
- Port the whole thing to Linux so that it can be run on the server. (At the moment TECkit is the only obstacle, but the source code will probably compile under Linux.)
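As a rough sketch of the validation step, xmllint could be driven from Python as shown here. The function names are hypothetical, and the schema path passed in (for example a local copy of the OSIS 2.1.1 schema) is an assumption; the only xmllint options used are --noout and --schema.

```python
import shutil
import subprocess


def xmllint_command(osis_path: str, schema_path: str) -> list[str]:
    """Build the xmllint invocation that validates one OSIS file."""
    return ["xmllint", "--noout", "--schema", schema_path, osis_path]


def validate(osis_path: str, schema_path: str) -> tuple[bool, str]:
    """Run xmllint and return (is_valid, diagnostics)."""
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint is not installed or not on PATH")
    result = subprocess.run(
        xmllint_command(osis_path, schema_path),
        capture_output=True,
        text=True,
    )
    # xmllint exits with status 0 only when the file validates.
    return result.returncode == 0, result.stderr
```

Xerces may report problems that xmllint does not, so a clean xmllint run is best treated as a first pass.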
Since the 43 Bibles from WBTI are encoded using similar practices and will likely have similar problems in Sword software, we should keep discussion in a single place. Nevertheless, if you do see a problem, it would be good to note not only the location (which verse) where you spot a problem but also the particular Bible that it appears in.
Our discussion should focus on how we can make our software work better with this content as it is, rather than how we can fix the content. The content itself validates against the OSIS 2.1.1 schema (after some corrections), according to Xerces.
--Osk 20:42, 30 June 2008 (MDT)
(moved from Wycliffe Bibles section of beta modules page by Osk, original by Dmsmith)
Many modules have a problem with notes tied to headers. This may not be a module problem, but an osis2mod problem or a SWORD engine problem. For example, using SW to view Matt 5:21, there is a note marker that stands before the verse number and the heading appears after the verse number. Perhaps as a side effect, there is far too much whitespace (blank lines) surrounding the note. The note also has no content.
(moved from amu_BL_1999 section of beta modules page by Osk)
Verses are not well-formed. References are not within notes and are not proper.
E.g. Matt 5:21
<title subType="x-preverse" type="section"> <reference>(Lc. 12:57-59)</reference> </title> <div scope="Matt.5.21-Matt.5.26" type="section"> <title>Maˈmo̱ⁿ Jesús na tilaˈwja̱a̱ya ncˈiaaya</title> ’ˈO jnda̱ jndyeˈyoˈ ñˈoom na tyolue nnˈaⁿ nda̱a̱ welooya na matsonaˈ: “Tintseicueˈ xˈiaˈ ndoˈ meiⁿcwiˈñeeⁿcheⁿ tsˈaⁿ na nntsˈaa na ljoˈ, maxjeⁿ nntˈuiityeⁿnaˈ juu.”
This example shows the problem with the osis2mod pre-verse hack: there are simply too many valid conditions for osis2mod to anticipate everything. I'd be interested in what the original for this was. My guess is that it is really good OSIS.
However, osis2mod and the SWORD engine are not that amenable to all good OSIS inputs.
With regard to <div>, osis2mod needs to change this to a milestoned version. Also, because the title is within the <div> it is not a candidate to become "x-preverse". This probably needs to be fixed too.
With regard to <reference> it is perfectly valid OSIS. However, SWORD expects it to have an osisRef, and because it does not, it displays a non-functional link. Further, SWORD expects references to be within <note type="crossReference">...</note>. I may be wrong on this last point but every OSIS module having references wraps them in <note>. Further, SWORD does not expect a <reference> to be between two verses. To which one should it be attached?
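To find how widespread the <reference> issue is, a small scan along these lines might help. It is only a sketch: it flags references that lack an osisRef or that are not inside a <note type="crossReference">, which is what SWORD appears to expect according to the comment above. The function name and the second reference in the sample (with its osisRef value) are hypothetical illustrations, not taken from the module.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip an XML namespace prefix such as '{...}reference'."""
    return tag.rsplit("}", 1)[-1]


def problem_references(xml_text: str) -> list[tuple[str, list[str]]]:
    """List <reference> elements SWORD is unlikely to handle well."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}
    problems = []
    for el in root.iter():
        if local(el.tag) != "reference":
            continue
        issues = []
        if not el.get("osisRef"):
            issues.append("missing osisRef")
        node, in_note = el, False
        while node in parent:
            node = parent[node]
            if local(node.tag) == "note" and node.get("type") == "crossReference":
                in_note = True
                break
        if not in_note:
            issues.append("not inside a crossReference note")
        if issues:
            problems.append(((el.text or "").strip(), issues))
    return problems


SAMPLE = """<div>
  <title type="section" subType="x-preverse"><reference>(Lc. 12:57-59)</reference></title>
  <note type="crossReference"><reference osisRef="Luke.12.57-Luke.12.59">Lc. 12:57-59</reference></note>
</div>"""

if __name__ == "__main__":
    for text, issues in problem_references(SAMPLE):
        print(text, issues)
```

Only the first reference in the sample is reported; the second, wrapped in a note and carrying an osisRef, passes both checks.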
--Dmsmith 18:33, 30 June 2008 (MDT)