Difference between revisions of "Converting SFM Bibles to OSIS"

From CrossWire Bible Society

Revision as of 07:19, 23 April 2012

Introduction

Standard Format Markers (SFM) and its more standardized derivative USFM have been used for decades to store Bibles for printing and display in programs like UBS Paratext. The format is popular among Bible translation agencies and Bible societies. The basic format of SFM is simply plain text with backslash (\) codes. For example, a new Bible verse is signaled by \v followed by the number of the verse.

Older texts in various dialects of SFM can still be found. It was common for various agencies, Bible societies, and even regional offices of such groups to have their own SFM standards. Unified Standard Format Markers (USFM) was developed to standardize SFM and encourage interoperability so that Bibles from one agency could be reasonably expected to operate with the software and stylesheets employed by another. The current version of USFM is 2.3, defined at http://confluence.ubs-icap.org/display/USFM/Home.

Preparing (U)SFM files for conversion

The simplicity of writing SFM also makes it easy to write poor SFM that fails to correspond to any kind of standard. The first task in preparing to convert SFM files to OSIS is to clean the text. The more regular your source files are, the more likely the conversion process will operate correctly.

One method of cleaning up your files is to import them into an SFM editor such as Bibledit (which runs on Linux, Mac OS X, and Windows) or SIL FieldWorks Translation Editor (which runs on Windows). These editors frequently perform some basic corrections to the SFM syntax, and Bibledit in particular can perform a number of checks to correct specific errors common to SFM.
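A quick way to judge how regular a set of source files is before opening them in an editor is to count how often each marker occurs. The following is a minimal sketch, not part of any CrossWire tool: the function name marker_inventory is hypothetical, it assumes markers consist of lower-case letters optionally followed by digits, and it relies on the -o option found in GNU and BSD grep.

```shell
#!/bin/sh
# Count how often each backslash marker occurs in the given SFM files.
# Assumption: a marker is a backslash, lower-case letters, then optional
# digits (for example \v, \q1, \mt2). A closing marker such as \nd* is
# counted together with its opening form.
marker_inventory() {
    grep -oh '\\[a-z][a-z]*[0-9]*' "$@" | sort | uniq -c | sort -rn
}

# Example call (file names are placeholders):
#   marker_inventory *.sfm
```

Markers that appear only once or twice in a whole Bible are often typing mistakes or leftovers from a local SFM dialect, so the bottom of the list is usually the place to start looking.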

Checking USFM files for versification issues

Rather than waiting until you have built the end result, a SWORD module, and then running the general utility emptyvss, you can check for versification issues earlier with a Windows-based utility called GoBibleCreator USFM Preprocessor. Although it was originally developed to support the Go Bible project, its use is not limited to preparing documents for Go Bible applications, and it has some useful general-purpose features. The Search for Versification Issues, found under the "Check for Consistency" tab, is one of them. It is particularly helpful for Bibles structured primarily as "verse per line", rather than as paragraphs containing a verse range. Other useful features are the identification of non-standard markers and the ability to export the USFM Bible into BQ/DigiStudy format.
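On systems where that utility is not available, a rough versification check can be approximated with awk. The sketch below is an assumption-laden illustration, not a reimplementation of GoBibleCreator USFM Preprocessor: the function name check_verse_gaps is hypothetical, and it only works for "verse per line" files in which every \c and \v marker starts a line. It reports any verse whose number does not follow on from the previous one, and it treats a range such as \v 2-3 as covering both verses.

```shell
#!/bin/sh
# Report gaps or repeats in verse numbering for verse-per-line USFM files.
# Assumption: \c and \v always appear as the first field on their line.
check_verse_gaps() {
    awk '
        FNR == 1    { chapter = "?"; expected = 1 }       # new file: reset state
        $1 == "\\c" { chapter = $2; expected = 1; next }   # new chapter
        $1 == "\\v" {
            n = split($2, range, "-")    # "2-3" gives first=2, last=3
            first = range[1] + 0
            last = range[n] + 0
            if (first != expected)
                printf "%s: chapter %s: expected verse %d, found %s\n", FILENAME, chapter, expected, $2
            expected = last + 1
        }
    ' "$@"
}

# Example call (file names are placeholders):
#   check_verse_gaps *.sfm
```

A reported gap is not necessarily an error, since some versifications legitimately omit or merge verses, so each line of output still needs to be reviewed by hand.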

Converting USFM files to OSIS

This section gives details of software utilities capable of converting USFM files to OSIS:

usfm2osis.pl

usfm2osis.pl is a simple Perl script intended only for converting USFM files to OSIS. usfm2osis.pl was written by CrossWire, so its OSIS output is geared towards the use of OSIS documents in preparing to make SWORD modules. usfm2osis.pl is actively maintained by CrossWire, and bug reports and feature requests are welcomed at sword-support.

usfm2osis.pl requires a Perl interpreter in the system path. Then you can run:

perl usfm2osis.pl <osisWork> [-o OSIS-file] [-e USFM encoding] <USFM filenames|wildcard>

If usfm2osis.pl is not in the current directory, use its full path.

osisWork should be a value such as Bible.en.WEB.2007.

If you include an OSIS-file value, the output will be written there. Otherwise, it will be written to a file name based on your osisWork.

The USFM encoding argument should indicate the character encoding found in the source files. If none is given, UTF-8 is the default. The list of available encodings depends on your system. Executing the script with no arguments will print the list (as will executing it with an invalid encoding value).

The final argument is a list of filenames or a wildcard value such as *.sfm containing the SFM data.

Notes:

  1. In Windows, a wildcard as one of the usfm2osis.pl parameters does not work, because the Windows command shell (cmd.exe) does not expand wildcard parameters. If you have Cygwin installed, the bash shell solves this problem, because it expands the wildcard filename specifier.
  2. In Windows, the following command line syntax will create an OSIS file for each SFM file. This assumes the current directory is the folder where the SFM files are stored, and that there is a same-level directory called osis. It also illustrates the use of a substitute drive, p:, for the full path to usfm2osis.pl, and omits the USFM encoding parameter merely for clarity. Including %f within the output filenames gives each OSIS XML file a unique name. (Inside a batch file, write %%f instead of %f.)
for %f in (*.SFM) do perl p:\usfm2osis.pl <osisWork> -o ..\osis\<osisWork>.%f.osis.xml %f

Markers Not Yet Supported by usfm2osis.pl

(2012-04-21) The following list was extracted and reformatted from usfm2osis.pl (version 1.7.3) and is reproduced here for handy reference.

#### Markers Not Yet Supported: \ipi, \im, \imi, \ipq, \imq, \ipr, \iq#, \ib, \ili, \ior...\ior*, \iex, \imte
#### Markers Not Yet Supported: \mte#, \mr, \sr 
#### Markers Not Yet Supported: \ca...\ca*, \cp, \cd, \va...\va*
#### Markers Not Yet Supported: \pmo, \pm, \pmc, \pmr, \pi#, \mi, \li#, \pc, \pr, \ph#
#### Markers Not Yet Supported: \thr#
#### Markers Not Yet Supported: \fe...\fe*, \fr, \fl, \fp, \fdc...\fdc*, \fm...\fm*
#### Markers Not Yet Supported: \xdc...\xdc*
#### Markers Not Yet Supported: Special Text:  \k...\k*, \lit, \ord...\ord*, \sig...\sig*,
#### Markers Not Yet Supported: Character Styling: \em...\em*,  \bdit...\bdit*, \no...\no*
#### Markers Not Yet Supported: Spacing and Breaks: !$, // 
#### Markers Not Yet Supported: Special Features: \fig...\fig*, \ndx...\ndx*, \pro...\pro*, \w...\w*, \wg...\wg*, \wh...\wh*

If you come across USFM files containing markers that are not yet supported, please contact one of the Perl script developers in CrossWire.
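Before running the converter, it can help to find out whether your files use any of the markers listed above. The following sketch searches for a subset of them. The function name find_unsupported and the variable UNSUPPORTED are hypothetical, and the subset deliberately leaves out the numbered (#) and paired (...) markers to keep the pattern simple, so extend it to suit your texts.

```shell
#!/bin/sh
# Illustrative subset of the markers listed above as not yet supported.
UNSUPPORTED='ipi|im|imi|ipq|imq|ipr|ib|ili|iex|imte|mr|sr|cp|cd|pmo|pm|pmc|pmr|mi|pc|pr|fr|fl|fp|lit'

# Print file:line:text for every line that uses one of those markers.
# A marker only counts if it is followed by a space or the end of the line,
# so \pm matches but \pmo is not mistaken for it.
# /dev/null is added so grep always prints the file name, even for one file.
# Exit status follows grep: 0 if something was found, 1 if nothing was.
find_unsupported() {
    grep -nE '\\('"$UNSUPPORTED"')( |$)' "$@" /dev/null
}

# Example call (file names are placeholders):
#   find_unsupported *.sfm
```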

Fixing the output of usfm2osis.pl

usfm2osis.pl does not produce valid, working cross references. If your text contains cross references, run xreffix.pl, which sits alongside usfm2osis.pl, on the output of usfm2osis.pl. Using it requires that the SWORD Perl bindings and a SWORD locale be installed.

Unfortunately, at the time of writing, an OSIS file generated by usfm2osis.pl and then processed with osis2mod does not produce working titles. A further Perl script, title-cleanup.pl, adds the attribute subType="x-preverse" to some titles.[1]

Finally, and largely unrelated to usfm2osis.pl itself, many texts in non-Latin scripts (and some in Latin scripts) require batch conversion of characters and numbers. There are several scripts to assist with this without harming the USFM or OSIS markup.

These supplementary Perl scripts are located in various sub-directories under sword-tools.

Notes:

  1. At the time of writing, there is also an unsolved issue that causes pre-verse titles to be misplaced. The symptoms are described in [1].

UBS Paratext

Since version 6, UBS Paratext has a menu option to export to OSIS.

Checking | Paratext 6 Checks | Publishing | Convert USFM to OSIS (Best Practice)
See Paratext to OSIS for illustration of this menu option.

This option generates XML files with the following schema definition file referenced.

osisCore.2.0_UBS_SIL_BestPractice.xsd

UBS SIL Best Practice OSIS makes use of several XML attributes that are not used within CrossWire, and some of these are custom (i.e. x-prefix named attributes).

Contrast the above with the standard definition file used by CrossWire.

http://www.bibletechnologies.net/osisCore.2.1.1.xsd

osis-converters

osis-converters is the location for the open source (U)SFM-to-OSIS and OSIS-to-SWORD module converters developed independently by the main programmer of xulsword. Some of the OSIS elements used in these Perl scripts differ from those generated by the CrossWire-maintained scripts. See [2].

Creation of publication ready modules from USFM and submission to CrossWire

Before you import your OSIS files to SWORD format, you should check that they are valid OSIS.
Before you submit any files to modules@crosswire.org, you must ensure that your files are valid OSIS. Invalid OSIS files will not be accepted.
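One common way to check validity from the command line is xmllint, which is part of libxml2. The sketch below is only an illustration under stated assumptions: xmllint is installed, the schema has already been saved locally, and both the function name validate_osis and the default path in OSIS_SCHEMA are hypothetical.

```shell
#!/bin/sh
# Validate each OSIS file against a local copy of the OSIS schema.
# Assumption: the schema was downloaded beforehand; this path is a placeholder.
OSIS_SCHEMA=${OSIS_SCHEMA:-./osisCore.2.1.1.xsd}

validate_osis() {
    status=0
    for file in "$@"; do
        # --noout suppresses the document dump; --schema selects the XSD.
        if xmllint --noout --schema "$OSIS_SCHEMA" "$file" >/dev/null 2>&1; then
            printf 'valid:   %s\n' "$file"
        else
            printf 'INVALID: %s\n' "$file"
            status=1
        fi
    done
    return $status
}

# Example call (file names are placeholders):
#   validate_osis *.osis.xml
```

The function hides the messages from xmllint so that the summary stays short. To see why a file was rejected, run the same xmllint command on that file without the redirection.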

Our process to create publication ready modules from USFM looks like this:

Stage 1

The command line examples in this section are for Unix. In Windows, it's not as simple to make equivalent command lines for Perl, especially in regard to globbing wildcards for filenames.

  • Run usfm2osis.pl on each USFM file in turn to produce one OSIS file per USFM file
for biblebook in *usfm; do usfm2osis.pl "$biblebook" "$biblebook"; done
  • Check each OSIS file for XML validity; this will expose many non-obvious USFM encoding problems
for biblebook in *osis.xml; do checkbible "$biblebook"; done
  • Correct the USFM files according to the mistakes found

Rerun this stage until you have clean and validating OSIS files.

Stage 2

  1. Check the OSIS files for untransformed USFM markers.[1][2]
  2. Fix usfm2osis.pl and add the missing tags for correct OSIS transformation.
  3. Rerun the steps above until you have a clean, validating collection of OSIS files with no leftover USFM tags.[3]
  4. Send the updates to usfm2osis.pl to CrossWire's SVN repository.
  5. Run usfm2osis.pl over all the USFM files and create a single large OSIS file.
  6. Check again for XML validity.[4]
  7. Run the OSIS file through osis2mod[5], create a conf file, and check the resultant module for problems in a variety of front-ends.[6]
  8. Create a SWORD locale file and store it in your /usr/share/sword/locales.d/ directory.[7]
  9. If cross references are part of the USFM files, fix these with the help of xreffix.pl (also in sword-tools).[8]
  10. Run again through osis2mod and check again.
  11. Submit the final OSIS file and the conf file to CrossWire together with the locale file.[9]
  12. Send any corrections on the USFM files back to the translation team. Chances are these are all valid and necessary corrections which help to improve the USFM.[10]

Notes:

  1. Few common markers remain unsupported, but you may encounter one or two new ones when you receive files from a new source.
  2. Some supported USFM markers may remain unconverted due to anomalies in how the USFM files were edited.
    Paratext does not check all potential mismatches against the USFM reference manual, and moreover, the manual itself leaves room for some ambiguities.
  3. Many of the repetitive steps can be handled by shell scripts, so you do not need to run usfm2osis.pl by hand each time.
  4. See OSIS Bible validation for further instructions on OSIS validation.
  5. Instructions for running osis2mod are available at osis2mod usage.
  6. Fixes should be made in the USFM files at the very start and then cascaded down into the module following the overall process again.
  7. Or the equivalent directory on your system.
  8. This requires at least a SWORD locale in the local installation.
  9. Include advice regarding which repository the module is intended for.
  10. Both Paratext and Bibledit are a bit more forgiving than our tools, but some problems you encounter could have affected the final paper print.
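
Step 1 of Stage 2 can be partly automated by searching the generated OSIS files for anything that still looks like a USFM marker. This is a rough sketch rather than an official CrossWire check: the function name find_leftover_markers is hypothetical, it assumes that a backslash followed by lower-case letters never occurs legitimately in your OSIS text, and it uses the -o option of GNU and BSD grep.

```shell
#!/bin/sh
# List every place where a backslash marker survived the conversion.
# Output format is file:line:marker, one line per occurrence.
# /dev/null is added so grep always prints the file name.
# Exit status follows grep: 0 if leftovers were found, 1 if the files are clean.
find_leftover_markers() {
    grep -noE '\\[a-z]+[0-9]*\*?' "$@" /dev/null
}

# Example call (file names are placeholders):
#   find_leftover_markers *.osis.xml
```

An empty result means that no marker-like text remains. It does not prove that every marker was converted correctly, so the front-end checks in step 7 are still needed.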