WBTI Bible discussion
General Remarks on Conversion
A long, long time ago, we were given 43 Bibles by Wycliffe/SIL. Some of the Bibles belong to WBTI, others to the Bible League.
The 43 Bibles were delivered as zips. Each zip contained OSIS files, a separate file for each book of the Bible. The OSIS files were not all valid at the time of delivery. The contents of the files were encoded as UTF-8, but made extensive use of the SIL PUA for characters not present within Unicode at the time of encoding.
The main task in building a conversion system for these Bibles was to enable mass conversion of the whole set, ideally in a form that someone from SIL with a crosswire.org login could run. The basic conversion process was:
- Unzip the provided zips and output a modifiable table-of-contents file listing the OSIS books in their canonical order. (The converter outputs a list based on alphabetical order of the files, which sometimes needs to be modified. Neither the extracted files nor the table-of-contents file will be overwritten by subsequent executions of the build script, permitting corrections.)
- Create a single OSIS file from the separate book files, using the header of the first file and the body of all files.
- Use TECkit with SIL's own PUA-to-Unicode 5.1 converter to convert as much of the PUA content as possible to Unicode 5.1. (Since complete conversion remains unlikely and Unicode 5.1 fonts are uncommon, a recent build of Charis SIL is recommended, and that font is named in the .conf.)
- Determine a module name based on the language, owner, & year of publication: (lowercase language code)_(uppercase owner)_(4-digit year)
- Construct .confs based on header information from the first provided OSIS file.
- Run osis2mod on the combined, validated OSIS file.
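Three of the steps above (combining the book files, naming the module, and checking how much PUA content survives conversion) can be sketched in Python. This is only an illustration under stated assumptions, not the build script that was actually used: the function names are made up here, each book file is assumed to hold its content as children of a single osisText element in the standard OSIS namespace, and the PUA check only looks at the BMP Private Use Area (U+E000 to U+F8FF).

```python
import xml.etree.ElementTree as ET
from typing import List

# Standard OSIS namespace; registering it keeps the output free of ns0: prefixes.
OSIS_NS = "http://www.bibletechnologies.net/2003/OSIS/namespace"
ET.register_namespace("", OSIS_NS)


def module_name(lang: str, owner: str, year: int) -> str:
    """Build (lowercase language code)_(uppercase owner)_(4-digit year)."""
    return f"{lang.lower()}_{owner.upper()}_{year:04d}"


def combine_books(book_xml: List[str]) -> str:
    """Merge per-book OSIS documents into one document.

    The first document is kept whole, header included. From every later
    document, only the non-header children of <osisText> are appended.
    Assumes (hypothetically) that every file has exactly one <osisText>.
    """
    first = ET.fromstring(book_xml[0])
    target = first.find(f"{{{OSIS_NS}}}osisText")
    for text in book_xml[1:]:
        source = ET.fromstring(text).find(f"{{{OSIS_NS}}}osisText")
        for child in source:
            if child.tag != f"{{{OSIS_NS}}}header":
                target.append(child)
    return ET.tostring(first, encoding="unicode")


def count_pua(text: str) -> int:
    """Count characters still in the BMP Private Use Area (U+E000-U+F8FF)."""
    return sum(1 for ch in text if "\ue000" <= ch <= "\uf8ff")
```

The list passed to combine_books is expected to be in canonical book order already, i.e. the order recorded in the (possibly hand-corrected) table-of-contents file. Running count_pua on the text before and after the TECkit step gives a rough measure of how much PUA content remains unconverted.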
At this point a few additional steps are needed:
- Convert SIL Ethnologue codes to ISO 639-3. For the most part, especially with minority languages like these, these will be the same, but there are exceptions.
- Perform validation on the OSIS files (either with xmllint or, preferably, Xerces).
- Port the whole thing to Linux so that it can be run on the server. (At the moment TECkit is the only obstacle, but the source code will probably compile under Linux.)
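For the validation step, a minimal sketch of driving xmllint from Python might look like the following. The file names are placeholders and the helper names are invented for this example; it assumes a local copy of the OSIS schema is available and that xmllint is on the PATH. Only the standard xmllint options --noout and --schema are used.

```python
import shutil
import subprocess
from typing import List


def xmllint_command(osis_path: str, schema_path: str) -> List[str]:
    """Build the xmllint call: validate against the schema, print no document."""
    return ["xmllint", "--noout", "--schema", schema_path, osis_path]


def validate(osis_path: str, schema_path: str) -> bool:
    """Return True if xmllint reports the file as valid against the schema."""
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint was not found on PATH")
    result = subprocess.run(
        xmllint_command(osis_path, schema_path),
        capture_output=True,
        text=True,
    )
    # xmllint exits with 0 on success and a non-zero code on any error.
    return result.returncode == 0
```

A call such as validate("combined.osis.xml", "osisCore.2.1.1.xsd") would then gate the osis2mod step; both paths here are hypothetical.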
Since the 43 Bibles from WBTI are encoded using similar practices and will likely have similar problems in Sword software, we should keep discussion in a single place. Nevertheless, if you do see a problem, it would be good to note not only the location (which verse) where you spot a problem but also the particular Bible that it appears in.
Our discussion should focus on how we can make our software work better with this content as it is, rather than how we can fix the content. The content itself validates against the OSIS 2.1.1 schema (after some corrections), according to Xerces. Osk 20:42, 30 June 2008 (MDT)