Difference between revisions of "Converting SFM Bibles to OSIS"

From CrossWire Bible Society
(intro)
m (Checking USFM files for versification issues: Updated GoBibleCreator USFM Preprocessor link URL)
 
(326 intermediate revisions by 6 users not shown)
Line 1: Line 1:
=Introduction=
+
==Module Development==
 +
Please first visit this page [[Module Development Collaboration]] – it may save you a lot of time.
  
Standard Format Markers (SFM) and its more standardized derivative USFM have been used for decades to store Bibles for printing and display in programs like UBS' [http://paratext.ubs-translations.org/Home.html Paratext]. The format is popular among Bible translation agencies and Bible societies. The basic format of SFM is simply plaintext with backslash(\) codes. For example, a new Bible verse is signaled by \v followed by the number of the verse. The simplicity of writing SFM also makes it easy to write poor SFM that fails to correspond to any kind of standard.
+
==Introduction==
  
Older texts in various dialects of SFM can still be found. It was common for various agencies, Bible societies, and even regional offices of such groups to have their own SFM standards. Unified Standard Format Markers (USFM) was developed to standardize SFM and encourage interoperability so that Bibles from one agency could be reasonably expected to operate with the software and stylesheets employed by another. The current version of USFM is 2.1, defined at http://confluence.ubs-icap.org/display/USFM/Home.
+
'''Standard Format Markers''' (SFM) and its more standardized derivative USFM have been used for decades to store Bibles for printing and display in programs like [http://paratext.org/ UBS Paratext]. The format is popular among Bible translation agencies and Bible societies. The basic format of SFM is simply plaintext with backslash (\) codes. For example, a new Bible verse is signaled by '''\v''' followed by the number of the verse.
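To illustrate the marker syntax described above, here is a minimal Python sketch that splits a fragment of SFM text into verses at each '''\v''' marker. The function name <tt>split_verses</tt> and the sample text are assumptions made for illustration; real SFM files contain many more markers than this handles.

```python
import re

# Matches a verse marker: backslash, "v", whitespace, then the verse number
# (a verse bridge such as "1-2" is kept as a single string).
VERSE_RE = re.compile(r"\\v\s+(\d+(?:-\d+)?)\s+")


def split_verses(sfm_text):
    """Return a list of (verse_number, verse_text) pairs found in sfm_text."""
    parts = VERSE_RE.split(sfm_text)
    # parts[0] is whatever precedes the first \v; the remaining items
    # alternate between verse numbers and verse text.
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


# Hypothetical sample, not taken from any real project.
sample = (
    "\\c 1\n"
    "\\p\n"
    "\\v 1 In the beginning God created the heaven and the earth.\n"
    "\\v 2 And the earth was without form, and void.\n"
)

for number, text in split_verses(sample):
    print(number, text)
```

The sketch ignores chapter, paragraph and character markers; it is only meant to show how little structure plain SFM imposes.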
  
=Preparing (U)SFM files for conversion=
+
Older texts in various dialects of SFM can still be found. It was common for various agencies, Bible societies, and even regional offices of such groups to have their own SFM standards. Unified Standard Format Markers (USFM) was developed to standardize SFM and encourage interoperability so that Bibles from one agency could be reasonably expected to operate with the software and style sheets employed by another. [http://paratext.org/usfm USFM] version 2.4 is the latest release, though USFM [http://ubsicap.github.io/usfm/about/releasenotes.html#about-release-3-0 version 3.0] is already documented. Version 3.0 is expected to be released sometime in 2017 (probably alongside [https://pt8.paratext.org/2016/11/09/paratext-8-beta-is-released/ ParaTExt version 8.0]).
  
 +
USFM 3.0 will introduce some new markers. Support for these will need to be added to the USFM to OSIS conversion utilities described below.
  
 +
==Preparing (U)SFM files for conversion==
  
=Converting (U)SFM files to OSIS=
+
The simplicity of writing SFM also makes it easy to write poor SFM that fails to correspond to any kind of standard. The first task in preparing to convert SFM files to OSIS is to clean the text. The more regular your source files are, the more likely the conversion process will operate correctly.
  
 +
One method of cleaning up your files is to import them into an SFM editor such as [http://bibledit.org/ Bibledit] (which now runs on Android, iOS, Linux, Mac OS X, Windows and Cloud) or [http://www.sil.org/computing/fieldworks/BTE-FW_downloads.htm SIL FieldWorks Translation Editor] (which runs on Windows). The editors will frequently perform some basic corrections to the SFM syntax, but Bibledit in particular can perform a number of checks to correct specific errors common to SFM.
  
 +
As of March 2013, after a formal agreement was signed with UBS, anyone who is a member of CrossWire's publishing staff is allowed to register and download [http://paratext.org/ Paratext] software - subject to individual approval by CrossWire's director. This agreement covers the software only, without giving access to the copyrighted resources associated with Paratext that are of greater use for Bible translators.
  
=Importing OSIS files into Sword=
+
===Checking USFM files for versification issues===
 +
Rather than waiting until you have built a SWORD module and then running the general utility '''emptyvss''', you can check the USFM files beforehand with a Windows-based utility called [https://github.com/GeoDirk/gbcpreprocessor GoBibleCreator USFM Preprocessor]. Although it was originally developed to support the Go Bible project, its use is not limited to preparing documents for Go Bible applications, and it has some useful general features. The versification check is one of them, and is found under the "Check for Consistency" tab. The Search for Versification Issues is particularly helpful for Bibles structured primarily as "verse per line", rather than as paragraphs containing a verse range. Other useful features are the identification of non-standard markers and the ability to export the USFM Bible into BQ/DigiStudy format.
 +
 
 +
=== USFM tag statistics ===
 +
Before embarking on any USFM conversion to another format, it's often very useful to list all the different USFM markers (tags) that are used in the project SFM files. [[User:osk|Osk]] has developed a Python script called usfmtags.py to perform this task. It can be downloaded from his GitHub repository [https://github.com/chrislit/usfm2osis].
 +
 
 +
[[User:David Haslam|David Haslam]] has developed a bespoke [http://www.datamystic.com/textpipe TextPipe] filter that performs a similar task. The tabbed output text file has columns for Count, Tag & Description. TextPipe is for Windows only. Details available on request.
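The kind of marker inventory described in this section can be sketched in a few lines of Python. This is an independent illustration, not the code of usfmtags.py or of the TextPipe filter; the regular expression is an assumption about what counts as a marker, and closing markers such as <tt>\nd*</tt> are counted separately from their opening forms.

```python
import re
from collections import Counter

# A USFM marker: backslash, optional "+" (nested character markers),
# lowercase letters, optional digits, and an optional closing "*".
MARKER_RE = re.compile(r"\\\+?[a-z]+\d*\*?")


def count_markers(texts):
    """Count every marker occurring in an iterable of USFM strings."""
    counts = Counter()
    for text in texts:
        counts.update(MARKER_RE.findall(text))
    return counts


# Hypothetical sample text for illustration only.
sample = (
    "\\id GEN\n"
    "\\c 1\n"
    "\\p\n"
    "\\v 1 In the beginning \\nd God\\nd* created...\n"
    "\\v 2 And the earth...\n"
)

# Print a tab-separated Count / Tag listing, sorted by tag.
for tag, count in sorted(count_markers([sample]).items()):
    print(f"{count}\t{tag}")
```

To run it over a project, read each SFM file with the correct encoding and pass the file contents to <tt>count_markers</tt>.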
 +
 
 +
==Converting USFM files to OSIS==
 +
 
 +
This section gives details of software utilities capable of converting USFM files to OSIS.
 +
 
 +
All such conversions require the use of [[List of eXtensions to OSIS used in SWORD|eXtensions to OSIS]] of one form or another. Whether these are supported by the SWORD engine should be researched before users conclude that there are faults with the particular conversion method.
 +
 
 +
Unfortunately, none of the methods listed below is perfect. Sometimes, as here, the plethora of tools demonstrates the lack of an all-round satisfying solution. At the moment CrossWire uses usfm2osis.py in John Austin's development version as the most useful and likely most successful conversion tool.
 +
 
 +
 
 +
=== usfm2osis ===
 +
Some time ago CrossWire volunteer [[User:Osk|Osk]] developed a Python script to convert USFM to OSIS. Since 2014 it has been developed as a more modular Python module rather than a single script. It is maintained at [https://github.com/chrislit/usfm2osis GitHub], but most users should install it directly from [https://pypi.python.org/pypi/usfm2osis PyPI] via the pip command:
 +
pip install usfm2osis
 +
 
 +
==== Usage ====
 +
Once the usfm2osis module is installed, the usfm2osis.py conversion script itself may be invoked by calling:
 +
usfm2osis.py
 +
 
 +
=== usfm2osis.py ===
 +
The last GPL2 version of usfm2osis (above) has been branched by others and significantly further developed. It is published at  [https://github.com/refdoc/Module-tools].
 +
 
 +
One of the original stated aims was to achieve '''full coverage''' of USFM 2.35 according to the [http://paratext.org/usfm ParaTExt spec].<ref>Strangely, it's not even updated to refer to USFM 2.40 yet.</ref><ref>The developers have been advised that USFM 3.0 is on its way. See [https://github.com/refdoc/Module-tools/issues/1 issue #1].</ref>
 +
 
 +
'''Notes:'''
 +
<references />
 +
 
 +
'''Observations:'''
 +
# usfm2osis.py version 0.6 is the current release. Refer to the roadmap within the script.
 +
# usfm2osis.py will always only generate best practice OSIS, to the best of its ability, so it won't generate stuff that SWORD requires due to its own shortcomings.
 +
# usfm2osis.py generates OSIS with the '''milestone''' form of both chapter and verse elements.
 +
# usfm2osis.py doesn't yet use SWORD bindings, so it's very limited in its knowledge of versification systems. This may mean that for the time being, expanding cross-references would typically have to be done by other means. ''See below''.
 +
# In general, anyone interested in converting USFM to OSIS for use with SWORD ought now to employ usfm2osis.py as it's now the only supported pathway for USFM->OSIS->SWORD, inasmuch as its author has a fairly close connection with SWORD and regularly commits stuff to SWORD filters and whatnot. He will definitely not make the least effort to accommodate any non-standard output from converters other than usfm2osis.py.
 +
# It is advised to use Python 2.7.x rather than Python 3.x as there are still some issues in getting the script to work with Python 3.
 +
# Windows users who encounter problems running the script with ordinary CPython from python.org are advised to use the Python interpreter that can be installed with Cygwin. This also helps when specifying the input files using wildcards.
 +
# Useful tip: Cygwin's bash shell recognizes [http://en.wikipedia.org/wiki/NTFS_symbolic_link NTFS symbolic links]. This means that the USFM input files and the Python script don't need to be copied to the cygwin/home/''user'' directory. Just make suitable links by means of the Windows mklink command.
 +
# Bug reports and feature requests are welcomed at either [https://github.com/chrislit/usfm2osis/issues], [http://www.crosswire.org/tracker/browse/MODTOOLS] or [mailto:sword-support@crosswire.org sword-support].
 +
 
 +
==== Usage ====
 +
usfm2osis.py requires a Python interpreter in the system path.
 +
 
 +
Optional: If you wish the script to validate the OSIS output, then you'll need to install [http://lxml.de/ lxml].
 +
 
 +
The usage of usfm2osis.py is output when the script is run without parameters, i.e. with the shell command line
 +
python usfm2osis.py
 +
The syntax is very close to that for usfm2osis.pl &ndash; ''see below''.
 +
 
 +
==== XML Syntax Checking ====
 +
 
 +
Until the Python script is fully debugged, it's possible that it may generate an OSIS file that fails the XML syntax check. When this occurs, it is useful to isolate the location(s) in the XML file where the syntax is incorrect. This can be done by making a separate OSIS XML file for each USFM file, or at least by excluding suspect files from the build by temporarily renaming their file extensions so that they do not match the wildcard pattern used in the command line.
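Once there is one OSIS file per USFM file, the failing files and the position of the first syntax error in each can be listed with a short script. The following is a minimal sketch using only the Python standard library; the function name <tt>find_malformed</tt> and the <tt>*.osis.xml</tt> naming pattern are assumptions. It checks well-formedness only, not validity against the OSIS schema.

```python
import glob
import xml.etree.ElementTree as ET


def find_malformed(pattern):
    """Return {filename: (line, column, message)} for files that fail to parse."""
    problems = {}
    for path in sorted(glob.glob(pattern)):
        try:
            ET.parse(path)
        except ET.ParseError as err:
            # ParseError.position holds the (line, column) of the first error.
            line, column = err.position
            problems[path] = (line, column, str(err))
    return problems


if __name__ == "__main__":
    # Assumed naming convention: one "<something>.osis.xml" per USFM file.
    for path, (line, column, message) in find_malformed("*.osis.xml").items():
        print(f"{path}: line {line}, column {column}: {message}")
```

Only the first error in each file is reported, because the parser stops there; fix it and rerun to find the next one.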
 +
 
 +
==== OSIS Validation ====
 +
Until the Python script is fully debugged, it's quite possible that even though you have an OSIS file that passes the XML syntax check, it may still fail to validate to the OSIS schema. Similar techniques are useful here too. Make an individual OSIS file from each USFM file, so that the valid files can be distinguished from the invalid files. This facilitates further investigation to more easily locate the invalid elements.
 +
 
 +
===== OSIS schema =====
 +
In CrossWire, we can now validate OSIS files using our own custom adaptation of the OSIS schema. See [[OSIS 211 CR]] for details.
 +
 
 +
There's also a copy of the OSIS 2.1.1 schema at http://ebible.org/osisCore.2.1.1.xsd which can be used in the event that http://www.bibletechnologies.net/ is ever inaccessible.
 +
 
 +
==== Compatibility with osis2mod ====
 +
Until the Python script is fully debugged, it's quite possible that even though you have an OSIS file that validates to the OSIS schema, it may still lead to problems when used as the input file for osis2mod. Much will depend on the complexities of the original USFM files, in terms of which markers were used and in what sequences. Some post-processing of the OSIS XML may be necessary to fix such potential problems.
 +
 
 +
==== Unsupported Markers ====
 +
Please refer to the roadmap within the script itself if you encounter any unsupported USFM tags.<ref>e.g. We recently encountered the as then undocumented use of 'nested' tags with the syntax '''\+tag_...\+tag*''' <BR>These are now documented in USFM Reference 2.4</ref>
 +
 
 +
'''Note:'''
 +
<references />
 +
 
 +
=== Other osis-converters ===
 +
==== John Austin ====
 +
[https://github.com/johnaustindev/osis-converters osis-converters] is the location for the open source (U)SFM-to-OSIS and OSIS-to-SWORD module converters originally developed independently by the main programmer of [https://github.com/JohnAustinDev/xulsword xulsword]. He now uses his own maintained version of usfm2osis.py and has commit privilege to the one maintained for CrossWire by [[User:refdoc|RefDoc]]. ''See above''.
 +
 
 +
==== Adyeths ====
 +
Adyeths has developed a Python 3 script that is ostensibly faster than CrossWire's usfm2osis.py<BR>
 +
Visit https://github.com/adyeths/u2o
 +
 
 +
==== Haiola ====
 +
Michael Johnson's [http://haiola.org/ Haiola] Windows software includes the conversion from USFM to OSIS as one of its features. This software is used for making the Bible modules hosted at [[Official and Affiliated Module Repositories#eBible.org|eBible.org]].
 +
 
 +
=== usfm2osis.pl ===
 +
[http://crosswire.org/svn/sword-tools/trunk/modules/perlconverters/usfm2osis.pl usfm2osis.pl] is a simple Perl script intended only for converting USFM files to OSIS. It does not provide comprehensive cover for all the tags in the USFM Reference.
 +
 
 +
'''Notes:'''
 +
# usfm2osis.pl is now '''deprecated'''. Please use the Python script instead. ''See above''.
 +
# Its output was geared towards the use of OSIS documents in preparing to make SWORD modules.
 +
# usfm2osis.pl generates OSIS with the '''milestone''' form of both chapter and verse elements.
 +
# usfm2osis.pl is no longer being actively maintained by CrossWire.
 +
 
 +
==== Usage ====
 +
usfm2osis.pl requires a Perl interpreter in the system path. Then you can run:
 +
perl usfm2osis.pl <osisWork> [-o OSIS-file] [-e USFM encoding] <USFM filenames|wildcard>
 +
 
 +
If usfm2osis.pl is not in the current directory, use its full path.
 +
 
 +
osisWork should be a value such as Bible.en.WEB.2007.
 +
 
 +
If you include an OSIS-file value, the output will be written there. Otherwise, it will be written to a file name based on your osisWork.
 +
 
 +
The USFM encoding argument should indicate the character encoding found in the source files. If none is given, UTF-8 is the default. The list of available encodings depends on your system. Executing the script with no arguments will print the list (as will executing it with an invalid encoding value).
 +
 
 +
The final argument is a list of filenames or a wildcard value such as '''*.sfm''' containing the SFM data.
 +
 
 +
'''Notes:'''
 +
# In Windows, a wildcard as one of the <tt>usfm2osis.pl</tt> parameters does not work. This is because the Windows command shell (cmd.exe) does not expand wildcard parameters. If you have [http://cygwin.com/ Cygwin] installed, the bash shell solves this problem, because it expands the wildcard filename specifier.
 +
# In Windows, the following command line syntax will create an OSIS file for each SFM file. This assumes the current directory is the folder where the SFM files are stored, and that there is a same-level directory called osis. It also illustrates the use of a (substitute) drive p: for the full path to usfm2osis.pl, and omits the USFM encoding parameter merely for clarity. The additional use of %f within the output filenames gives each OSIS XML file a unique name.
 +
 
 +
for %f in (*.SFM) do perl p:\usfm2osis.pl <osisWork> -o ..\osis\<osisWork>.%f.osis.xml %f
 +
 
 +
Similarly, in Cygwin or another bash environment you can use
 +
 
 +
for f in *.SFM; do perl usfm2osis.pl <osisWork> -o "../osis/<osisWork>.$f.osis.xml" "$f"; done
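Because wildcard expansion differs between cmd.exe and bash, the same per-file loop can also be driven from Python, which expands the wildcard itself. The sketch below only builds and prints the command lines; the function name <tt>build_commands</tt>, the directory layout and the work name are assumptions, and the argument order simply follows the usage line given above.

```python
import glob
import os


def build_commands(osis_work, pattern, out_dir, script="usfm2osis.pl"):
    """Return one perl argument list per SFM file matching pattern."""
    commands = []
    for path in sorted(glob.glob(pattern)):
        name = os.path.basename(path)
        # Including the input file name keeps each OSIS file name unique.
        out_file = os.path.join(out_dir, f"{osis_work}.{name}.osis.xml")
        commands.append(["perl", script, osis_work, "-o", out_file, path])
    return commands


if __name__ == "__main__":
    # Assumed layout: SFM files in the current directory, output in ../osis.
    for command in build_commands("Bible.en.WEB.2007", "*.SFM", os.path.join("..", "osis")):
        # To actually run each conversion, pass the list to
        # subprocess.run(command, check=True) instead of printing it.
        print(" ".join(command))
```

Note that the pattern is matched case-sensitively on most Unix file systems, so <tt>*.SFM</tt> and <tt>*.sfm</tt> are not interchangeable there.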
 +
 
 +
==== Unsupported Markers ====
 +
 
 +
Not all USFM tags are currently supported, just those so far seen "in the wild" by the maintainers of the script. The comments in the script document the tags currently covered. In view of the many omissions, it is advised to use the new Python script instead.
 +
 
 +
=== Fixing the OSIS XML files ===
 +
This section attempts to address some of the current deficiencies in SWORD, and lists tools to work around these issues. Eventually (it is hoped) these will not be necessary, as the fixes will be implemented within usfm2osis.py during its further development, or as fixes are made to the SWORD engine and/or front-ends, as appropriate.
 +
 
 +
==== OSIS hacks ====
 +
OSIS 2.1.1 still lacks some desirable features or improvements. Until these are forthcoming, it may sometimes be necessary to apply hacks to the OSIS generated by scripted OSIS converters to work around these "deficiencies" in the OSIS schema. You can read about OSIS change requests (or judiciously add to the page) [[OSIS 211 CR|here]].
 +
 
 +
==== Cross-references ====
 +
<tt>usfm2osis.py</tt> and <tt>usfm2osis.pl</tt> do not produce [[OSIS Bibles#Marking_cross-references_notes|cross-references]] (or [[OSIS Bibles#Marking_parallel_passage_headings|parallel passage headings]]) that work properly as links with SWORD.<ref>A solution for this was envisaged in the original roadmap for <tt>usfm2osis.py</tt></ref> If your text contains such, you may need <tt>xreffix.pl</tt> which sits in directory [https://crosswire.org/svn/sword-tools/trunk/modules/crossreferences/]. Using it requires the SWORD Perl bindings be installed and a [[DevTools:Locale Files|sword-locale]]<ref>If the locale file has any errors or omissions, some references may end up with an incorrect osisRef. Thorough checking of the converted file is paramount.</ref> for the language of the module to be created.<ref>Special care is required when the cross-reference (or heading) includes additional text which is not a scripture reference. Until all these issues are fixed, it becomes almost impossible to parse the reference elements correctly such that the OSIS references can be added automatically.</ref>
 +
 
 +
===== Fixing references =====
 +
There are two conceptually distinct process steps required in order to fix vernacular reference strings:
 +
# To create a separate '''reference''' element for each ''noncontiguous'' reference.
 +
# To assign the correct '''osisRef''' attribute to each reference element.
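The two steps listed above can be sketched in Python for the simplest case. This is an illustration only and is not how <tt>xreffix.pl</tt> or <tt>orefs.py</tt> work: the <tt>BOOKS</tt> mapping is a hypothetical stand-in for a locale file, and only references of the form "Book chapter:verse" or "Book chapter:verse-verse" separated by semicolons are recognized. Anything else is left unmarked.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical mapping from vernacular book names to OSIS book IDs; a real
# project would derive this from the locale file for the language.
BOOKS = {"Mat": "Matt", "Luk": "Luke"}

# Book name, chapter, first verse, optional last verse of a range.
REF_RE = re.compile(r"^(\w+)\s+(\d+):(\d+)(?:-(\d+))?$")


def mark_references(ref_string):
    """Wrap each noncontiguous reference in its own <reference> element."""
    out = []
    # Step 1: split the string into noncontiguous references.
    for piece in (p.strip() for p in ref_string.split(";")):
        match = REF_RE.match(piece)
        if not match or match.group(1) not in BOOKS:
            out.append(escape(piece))  # leave unrecognized text unmarked
            continue
        # Step 2: build the osisRef attribute for this reference.
        book, chapter, first, last = match.groups()
        osis_ref = f"{BOOKS[book]}.{chapter}.{first}"
        if last:
            osis_ref += f"-{BOOKS[book]}.{chapter}.{last}"
        out.append(f'<reference osisRef="{osis_ref}">{escape(piece)}</reference>')
    return "; ".join(out)


print(mark_references("Mat 5:3; Luk 6:20-23"))
```

Real cross-references also include chapter-only references, ranges across chapters, and continuation references that omit the book name, which is why a locale-aware tool is needed in practice.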
 +
 
 +
===== orefs.py =====
 +
* [[Converting SFM Bibles to OSIS#Adyeths|Adyeths]] has an ancillary script <tt>orefs.py</tt> to add the proper '''osisRef''' attribute to OSIS reference elements.<ref>This is fairly new (as of 2017-12-12) &ndash; it does not require SWORD bindings for Python.</ref>
 +
 
 +
'''Notes:'''
 +
<references />
 +
 
 +
==== Miscellaneous ====
 +
Many texts in non-Latin scripts (and even for some Latin scripts) require batch conversion of characters and numbers. There are several scripts to assist with this, without harming the USFM or OSIS markup.<ref>These supplementary Perl scripts are located in various sub-directories under [https://crosswire.org/svn/sword-tools/trunk/modules/ sword-tools].</ref><ref>This has little to do with converting USFM to OSIS</ref>
 +
 
 +
'''Notes:'''
 +
<references/>
 +
 
 +
==== Soft hyphens ====
 +
OSIS XML files should ideally not contain any [https://en.wikipedia.org/wiki/Soft_hyphen soft hyphens].
 +
 
 +
Sometimes, Bibles that have already been published using some form of desktop publishing are retrospectively converted to USFM. When this happens, the exported text used as the starting point for deriving the USFM files may contain a lot of soft hyphens, i.e. in places where they were used to control hyphenated word wrap in the printed Bibles.
 +
 
 +
The soft hyphen is a typographical device. It is not a semantic construction. It is therefore advised that soft hyphens be removed completely from the USFM files before converting to OSIS XML. At the very least, they should be removed entirely from the converted OSIS XML file.
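Removing soft hyphens is a plain character deletion, since the soft hyphen is the single code point U+00AD. A minimal Python sketch follows; the function names and the sample string are assumptions for illustration.

```python
# The soft hyphen is the single Unicode code point U+00AD.
SOFT_HYPHEN = "\u00ad"


def count_soft_hyphens(text):
    """Return how many soft hyphens the text contains."""
    return text.count(SOFT_HYPHEN)


def strip_soft_hyphens(text):
    """Return the text with every soft hyphen removed."""
    return text.replace(SOFT_HYPHEN, "")


# Hypothetical sample line containing two soft hyphens.
sample = "\\v 1 In the be\u00adgin\u00adning"
print(count_soft_hyphens(sample))
print(strip_soft_hyphens(sample))
```

When applying this to files, read and write them with the encoding actually used by the source, so that no other characters are altered.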
 +
 
 +
=== ParaTExt (export to OSIS) ===
 +
 
 +
:''This section may require updating''.
 +
 
 +
Since version 6, [http://paratext.org/ ParaTExt] has a menu option to export to OSIS.
 +
 
 +
Checking | Paratext 6 Checks | Publishing | Convert USFM to OSIS (Best Practice)
 +
 
 +
This option generates XML files with the following schema definition file referenced.
 +
 
 +
osisCore.2.0_UBS_SIL_BestPractice.xsd
 +
 
 +
UBS SIL Best Practice OSIS makes use of several XML attributes that are not used within CrossWire, and some of these are custom (i.e. x-prefix named attributes). An accessible online copy does not seem to be available. Furthermore, the only undated copy of this .xsd file (40KB) that has been obtained includes the following lines:
 +
<pre>
 +
  <!--    WARNING  WARNING  WARNING
 +
  THIS SCHEMA IS DEVELOPMENTAL AND SHOULD BE USED WITH THAT UNDERSTANDING -->
 +
</pre>
 +
Contrast the above with the standard definition file (92KB) used by CrossWire.
 +
 
 +
http://www.bibletechnologies.net/osisCore.2.1.1.xsd
 +
 
 +
Notwithstanding the above concerns, based on one sample exported from a Paratext project, it has been verified that, after replacing line 2 with the corresponding line output by usfm2osis.pl, such a converted XML file can be validated against the standard definition file.
 +
 
 +
Nevertheless, further testing (using <tt>osis2mod</tt>) of some OSIS files exported from Paratext revealed that they generated WARNING(NESTING) errors. Some of these were due to the eID milestone for a verse occurring BEFORE the sID milestone for the same verse! Others were due to <tt></note></tt> being placed AFTER the eID milestone for the verse where the note occurs.
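The first kind of nesting problem mentioned above, an eID milestone that precedes its sID, can be detected before running <tt>osis2mod</tt>. The following Python sketch is an assumption-laden illustration: it scans the raw text with a regular expression instead of parsing the XML, looks only at <tt>verse</tt> milestones, and does not check where notes are closed.

```python
import re

# Matches a verse milestone and captures which kind it is (sID or eID)
# and its identifier. The \b stops "sID" matching inside "osisID".
MILESTONE_RE = re.compile(r'<verse\b[^>]*?\b(sID|eID)="([^"]+)"[^>]*/>')


def check_milestones(osis_text):
    """Return a list of problem descriptions for verse milestones."""
    problems = []
    open_ids = set()
    for kind, ident in MILESTONE_RE.findall(osis_text):
        if kind == "sID":
            open_ids.add(ident)
        elif ident in open_ids:
            open_ids.discard(ident)
        else:
            problems.append(f"eID {ident} occurs before its sID")
    # Any start milestone still open at the end was never closed.
    problems.extend(f"sID {ident} is never closed" for ident in sorted(open_ids))
    return problems


# Hypothetical fragment with the end milestone in the wrong place.
bad = '<verse eID="Gen.1.1"/><verse osisID="Gen.1.1" sID="Gen.1.1"/>text'
for problem in check_milestones(bad):
    print(problem)
```

A misplaced eID is reported twice by this sketch: once as occurring too early, and once because the later sID is then left unclosed.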
 +
 
 +
Hence I conclude that what Paratext outputs as OSIS is not really "best practice", and would require some improvements to fix issues such as these. However, we know that work on OSIS export from Paratext has virtually stopped - so this may now prove to be of mere historical interest.
 +
 
 +
The bottom line is that the OSIS export from Paratext is not ready for publishing a SWORD module.
 +
 
 +
==Creation of publication ready modules from USFM and submission to CrossWire==
 +
 
 +
=== Checking and validating OSIS files ===
 +
When checking OSIS XML files there are 3 steps:
 +
# Is the OSIS well-formed XML?
 +
# Is the file valid to the defined OSIS schema?<ref>e.g. Using [http://lxml.de/ lxml] or any other suitable means (e.g. the XML Tools plugin for Notepad++ on Windows).</ref>
 +
# Is the file "fit for purpose" (i.e. suitable for immediate use by the SWORD conversion tool called <tt>osis2mod</tt>)
 +
Step #1 does not guarantee step #2 and step #2 does not guarantee step #3. See above and below and [[OSIS Bibles#Limitations_of_XML_validators|limitations of XML validators]].
 +
 
 +
Before you convert your OSIS files to SWORD format, you should always check that they are valid OSIS.<BR>Before you submit any files to modules@crosswire.org, you '''must''' ensure that your files are valid OSIS. Invalid OSIS files will not be accepted.
 +
 
 +
'''Note:'''
 +
<references />
 +
 
 +
=== Module creation process ===
 +
:''This section requires updating to reflect the move to using the Python script <tt>usfm2osis.py</tt> in preference to its Perl predecessor''.
 +
 
 +
Our process to create publication ready modules from USFM looks like this:
 +
 
 +
==== Stage 1 ====
 +
''The command line examples in this section are for Unix. In Windows, it's not as simple to make equivalent command lines for Perl, especially in regard to [http://en.wikipedia.org/wiki/Glob_%28programming%29 globbing] wildcards for filenames''.
 +
*Iterate <tt>usfm2osis.pl</tt> over each single USFM file to produce an OSIS file per USFM file
 +
for biblebook in *usfm; do usfm2osis.pl "$biblebook" "$biblebook"; done
 +
*Check each OSIS file for XML validity - this will throw up a lot of non-obvious USFM encoding problems
 +
for biblebook in *osis.xml; do checkbible "$biblebook"; done
 +
*Correct USFM files as per mistakes found
 +
 
 +
Rerun this stage until you have clean and validating OSIS files.
 +
 
 +
==== Stage 2 ====
 +
 
 +
# Check the OSIS files for untransformed USFM markers.<ref>Not many common ones are left unsupported, but you may usually encounter one or two new ones if you get files from a different source.</ref><ref>Some supported USFM markers may remain unconverted due to anomalies in how the USFM files were edited.<br>Paratext does not check all potential mismatches against the USFM reference manual, and moreover, the manual itself leaves room for some ambiguities.</ref>
 +
# Fix <tt>usfm2osis.pl</tt> and add the missing tags for correct OSIS transformation.
 +
# Rerun the above until you have a clean, validating collection of OSIS files with no leftover USFM tags.<ref>A lot of the repetitive aspects can be done with the help of some shell scripts, so do not worry that you have to run <tt>usfm2osis.pl</tt> by hand many times on each pass.</ref>
 +
# Send the updates to usfm2osis.pl to CrossWire's SVN repository.
 +
# Run <tt>usfm2osis.pl</tt> over all the USFM files and create a single large OSIS file.
 +
# Check again for XML validity.<ref>See [[OSIS_Bibles#Valid_OSIS_test|OSIS Bible validation]] for further instructions on OSIS validation.</ref>
 +
# Run the OSIS  file through <tt>osis2mod</tt><ref>Instructions for running <tt>osis2mod</tt> are available at [[osis2mod#Usage|osis2mod usage]].</ref>, create a [[DevTools:conf Files|conf file]], and check the resultant module for problems in a variety of front-ends.<ref>Fixes should be made in the USFM files at the very start and then cascaded down into the module following the overall process again.</ref>
 +
# Create a [[DevTools:SWORD#Locale_file_layout|sword locale file]] and store in your <tt>/usr/share/sword/locales.d/</tt> directory<ref>or the equivalent directory on your system</ref>
 +
# If cross-references are part of the USFM files, fix these with the help of <tt>xreffix.pl</tt> (also in sword-tools).<ref>This requires a sword locale in the local installation at least.</ref>
 +
# Run again through <tt>osis2mod</tt> and check again.
 +
# Submit the final OSIS file and the conf file to CrossWire together with the locale file.<ref>Include advice regarding which [[Official and Affiliated Module Repositories|repository]] the module is intended for.</ref>
 +
# Send any corrections on the USFM files back to the translation team. Chances are these are all valid and necessary corrections which help to improve the USFM.<ref>Both Paratext and Bibledit are a bit more forgiving than our tools, but some problems you encounter could have affected the final paper print.</ref>
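The first step in the list above, checking the OSIS files for untransformed USFM markers, can be automated. A minimal Python sketch, assuming that any backslash followed by lowercase letters in the OSIS text is a leftover marker (the function name <tt>find_leftover_markers</tt> is hypothetical):

```python
import re

# Backslash, optional "+", lowercase letters, optional digits, optional "*".
LEFTOVER_RE = re.compile(r"\\\+?[a-z]+\d*\*?")


def find_leftover_markers(osis_text):
    """Return (line_number, marker) pairs for USFM markers left in OSIS text."""
    hits = []
    for number, line in enumerate(osis_text.splitlines(), start=1):
        hits.extend((number, marker) for marker in LEFTOVER_RE.findall(line))
    return hits


# Hypothetical OSIS fragment in which one character marker was not converted.
sample = (
    '<verse osisID="Gen.1.1" sID="Gen.1.1"/>In the beginning\n'
    "\\zz God\\zz* created<verse eID=\"Gen.1.1\"/>\n"
)
for number, marker in find_leftover_markers(sample):
    print(f"line {number}: {marker}")
```

An empty result does not prove the conversion is correct; it only shows that no backslash codes survived into the output.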
 +
 
 +
'''Notes:'''
 +
<references />
 +
 
 +
[[Category:Guides|Converting SFM Bibles to OSIS]]
 +
[[Category:OSIS]]
 +
[[Category:USFM]]
 +
[[Category:Bibledit]]
 +
[[Category:Paratext]]

Latest revision as of 11:22, 18 March 2019

Module Development

Please first visit this page Module Development Collaboration – it may save you a lot of time.

Introduction

Standard Format Markers (SFM) and its more standardized derivative USFM have been used for decades to store Bibles for printing and display in programs like UBS Paratext. The format is popular among Bible translation agencies and Bible societies. The basic format of SFM is simply plaintext with backslash(\) codes. For example, a new Bible verse is signaled by \v followed by the number of the verse.

Older texts in various dialects of SFM can still be found. It was common for various agencies, Bible societies, and even regional offices of such groups to have their own SFM standards. Unified Standard Format Markers (USFM) was developed to standardize SFM and encourage interoperability so that Bibles from one agency could be reasonably expected to operate with the software and style sheets employed by another. USFM version 2.4 is the latest release, though USFM version 3.0 is already documented. Version 3.0 is expected to be released sometime in 2017 (probably alongside ParaTExt version 8.0).

USFM 3.0 will introduce some new markers. Support for these will need to be added to the USFM to OSIS conversion utilities described below.

Preparing (U)SFM files for conversion

The simplicity of writing SFM also makes it easy to write poor SFM that fails to correspond to any kind of standard. The first task in preparing to convert SFM files to OSIS is to clean the text. The more regular your source files are, the more likely the conversion process will operate correctly.

One method of cleaning up your files is to import them into an SFM editor such as Bibledit (which now runs on Android, iOS, Linux, Mac OS X, Windows and Cloud) or SIL FieldWorks Translation Editor (which runs on Windows). The editors will frequently perform some basic corrections to the SFM syntax, but Bibledit in particular can perform a number of checks to correct specific errors common to SFM.

As of March 2013, after a formal agreement was signed with UBS, anyone who is a member of CrossWire's publishing staff is allowed to register and download Paratext software - subject to individual approval by CrossWire's director. This agreement covers the software only, without giving access to the copyrighted resources associated with Paratext that are of greater use for Bible translators.

Checking USFM files for versification issues

Rather than waiting until you have built a SWORD module and then running the general utility emptyvss, you can check the USFM files beforehand with a Windows-based utility called GoBibleCreator USFM Preprocessor. Although it was originally developed to support the Go Bible project, its use is not limited to preparing documents for Go Bible applications, and it has some useful general features. The versification check is one of them, and is found under the "Check for Consistency" tab. The Search for Versification Issues is particularly helpful for Bibles structured primarily as "verse per line", rather than as paragraphs containing a verse range. Other useful features are the identification of non-standard markers and the ability to export the USFM Bible into BQ/DigiStudy format.

USFM tag statistics

Before embarking on any USFM conversion to another format, it's often very useful to list all the different USFM markers (tags) that are used in the project SFM files. Osk has developed a Python script called usfmtags.py to perform this task. It can be downloaded from his GitHub repository [1].

David Haslam has developed a bespoke TextPipe filter that performs a similar task. The tabbed output text file has columns for Count, Tag & Description. TextPipe is for Windows only. Details available on request.

Converting USFM files to OSIS

This section gives details of software utilities capable of converting USFM files to OSIS.

All such conversions require the use of eXtensions to OSIS of one form or another. Whether these are supported by the SWORD engine should be researched before users conclude that there are faults with the particular conversion method.

Unfortunately, none of the methods listed below is perfect. Sometimes, as here, the plethora of tools demonstrates the lack of an all-round satisfying solution. At the moment CrossWire uses usfm2osis.py in John Austin's development version as the most useful and likely most successful conversion tool.


usfm2osis

Some time ago CrossWire volunteer Osk developed a Python script to convert USFM to OSIS. Since 2014 it has been developed as a more modular Python module rather than a single script. It is maintained at GitHub, but most users should install it directly from PyPI via the pip command:

pip install usfm2osis

Usage

Once the usfm2osis module is installed, the usfm2osis.py conversion script itself may be invoked by calling:

usfm2osis.py

usfm2osis.py

The last GPL2 version of usfm2osis (above) has been branched by others and significantly further developed. It is published at [2].

One of the original stated aims was to achieve full coverage of USFM 2.35 according to the ParaTExt spec.[1][2]

Notes:

  1. Strangely, it's not even updated to refer to USFM 2.40 yet.
  2. The developers have been advised that USFM 3.0 is on its way. See issue #1.

Observations:

  1. usfm2osis.py version 0.6 is the current release. Refer to the roadmap within the script.
  2. usfm2osis.py will only ever generate best-practice OSIS, to the best of its ability, so it won't generate non-standard markup that SWORD requires because of SWORD's own shortcomings.
  3. usfm2osis.py generates OSIS with the milestone form of both chapter and verse elements.
  4. usfm2osis.py doesn't yet use SWORD bindings, so it's very limited in its knowledge of versification systems. This may mean that for the time being, expanding cross-references would typically have to be done by other means. See below.
  5. In general, anyone converting USFM to OSIS for use with SWORD ought now to employ usfm2osis.py, as it is the only supported pathway for USFM->OSIS->SWORD: its author has a fairly close connection with SWORD and regularly commits to the SWORD filters. He will not make any effort to accommodate non-standard output from converters other than usfm2osis.py.
  6. It is advised to use Python 2.7.x rather than Python 3.x as there are still some issues in getting the script to work with Python 3.
  7. Windows users who encounter problems running the script with ordinary CPython from python.org are advised to use the Python interpreter that can be installed with Cygwin. This also helps when specifying the input files using wildcards.
  8. Useful tip: Cygwin's bash shell recognizes NTFS symbolic links. This means that the USFM input files and the Python script don't need to be copied to the cygwin/home/user directory. Just make suitable links by means of the Windows mklink command.
  9. Bug reports and feature requests are welcomed at either [3], [4] or sword-support.

Usage

usfm2osis.py requires a Python interpreter in the system path.

Optional: If you wish the script to validate the OSIS output, then you'll need to install lxml.

The usage of usfm2osis.py is output when the script is run without parameters, i.e. with the shell command line

python usfm2osis.py

The syntax is very close to that for usfm2osis.pl – see below.

XML Syntax Checking

Until the Python script is fully debugged, it may generate an OSIS file that fails the XML syntax check. When this occurs, it is useful to isolate the location(s) in the XML file where the syntax is incorrect. This can be done by making a separate OSIS XML file for each USFM file, or at least by excluding suspect files from the build by temporarily renaming their file extensions so that they no longer match the wildcard pattern used in the command line.
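One way to isolate the offending files is to test each per-book OSIS file for well-formedness separately. The sketch below uses only Python's standard library xml.etree.ElementTree; the function names are hypothetical, and it checks XML syntax only, not validity against the OSIS schema.

```python
import xml.etree.ElementTree as ET


def xml_syntax_error(path):
    """Return None if the file is well-formed XML, else a short error message."""
    try:
        ET.parse(path)
    except ET.ParseError as err:
        # ParseError.position holds the (line, column) of the first error.
        line, column = err.position
        return f"{path}: line {line}, column {column}: {err}"
    return None


def check_files(paths):
    """Return the error messages for all files that are not well-formed."""
    messages = []
    for path in paths:
        message = xml_syntax_error(path)
        if message is not None:
            messages.append(message)
    return messages
```

Only the first syntax error in each file is reported, so a file may need several passes before it parses cleanly.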

OSIS Validation

Until the Python script is fully debugged, it's quite possible that even though you have an OSIS file that passes the XML syntax check, it may still fail to validate to the OSIS schema. Similar techniques are useful here too. Make an individual OSIS file from each USFM file, so that the valid files can be distinguished from the invalid files. This facilitates further investigation to more easily locate the invalid elements.

OSIS schema

In CrossWire, we can now validate OSIS files using our own custom adaptation of the OSIS schema. See OSIS 211 CR for details.

There's also a copy of the OSIS 2.1.1 schema at http://ebible.org/osisCore.2.1.1.xsd which can be used in the event that http://www.bibletechnologies.net/ is ever inaccessible.

Compatibility with osis2mod

Until the Python script is fully debugged, it's quite possible that even though you have an OSIS file that validates to the OSIS schema, it may still lead to problems when used as the input file for osis2mod. Much will depend on the complexities of the original USFM files, in terms of which markers were used and in what sequences. Some post-processing of the OSIS XML may be necessary to fix such potential problems.

Unsupported Markers

Please refer to the roadmap within the script itself if you encounter any unsupported USFM tags.[1]

Note:

  1. e.g. We recently encountered the then-undocumented use of 'nested' tags with the syntax \+tag_...\+tag*
    These are now documented in USFM Reference 2.4

Other osis-converters

John Austin

osis-converters is the location for the open source (U)SFM-to-OSIS and OSIS-to-SWORD module converters originally developed independently by the main programmer of xulsword. He now uses his own maintained version of usfm2osis.py and has commit privilege to the one maintained for CrossWire by RefDoc. See above.

Adyeths

Adyeths has developed a Python 3 script that is ostensibly faster than CrossWire's usfm2osis.py.
Visit https://github.com/adyeths/u2o

Haiola

Michael Johnson's Haiola Windows software includes the conversion from USFM to OSIS as one of its features. This software is used for making the Bible modules hosted at eBible.org.

usfm2osis.pl

usfm2osis.pl is a simple Perl script intended only for converting USFM files to OSIS. It does not provide comprehensive coverage of all the tags in the USFM Reference.

Notes:

  1. usfm2osis.pl is now deprecated. Please use the Python script instead. See above.
  2. Its output was geared towards the use of OSIS documents in preparing to make SWORD modules.
  3. usfm2osis.pl generates OSIS with the milestone form of both chapter and verse elements.
  4. usfm2osis.pl is no longer being actively maintained by CrossWire.

Usage

usfm2osis.pl requires a Perl interpreter in the system path. Then you can run:

perl usfm2osis.pl <osisWork> [-o OSIS-file] [-e USFM encoding] <USFM filenames|wildcard>

If usfm2osis.pl is not in the current directory, use its full path.

osisWork should be a value such as Bible.en.WEB.2007.

If you include an OSIS-file value, the output will be written there. Otherwise, it will be written to a file name based on your osisWork.

The USFM encoding argument should indicate the character encoding found in the source files. If none is given, UTF-8 is the default. The list of available encodings depends on your system. Executing the script with no arguments will print the list (as will executing it with an invalid encoding value).

The final argument is a list of filenames or a wildcard value such as *.sfm containing the SFM data.
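Before choosing an encoding value, it can help to confirm whether the source files really are UTF-8. A minimal Python sketch, with hypothetical function names, that reports the files failing to decode in a given encoding:

```python
def decodes_as(data, encoding="utf-8"):
    """Return True if the raw bytes decode cleanly with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def files_not_in_encoding(paths, encoding="utf-8"):
    """List the files whose contents fail to decode with the given encoding."""
    bad = []
    for path in paths:
        with open(path, "rb") as handle:
            if not decodes_as(handle.read(), encoding):
                bad.append(path)
    return bad
```

A clean decode does not prove the encoding is right, since many legacy 8-bit encodings accept any byte sequence, but a failure to decode as UTF-8 is a reliable sign that the encoding argument is needed.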

Notes:

  1. In Windows, a wildcard as one of the usfm2osis.pl parameters does not work, because the Windows command shell (cmd.exe) does not expand wildcard parameters. If you have Cygwin installed, the bash shell solves this problem, because it expands the wildcard filename specifier.
  2. In Windows, the following command line syntax will create an OSIS file for each SFM file. This assumes the current directory is the folder where the SFM files are stored, and that there is a same-level directory called osis. It also illustrates the use of a substitute drive (p:) for the full path to usfm2osis.pl, and omits the USFM encoding parameter merely for clarity. The additional use of %f within the output filenames is to give each OSIS XML file a unique name.
for %f in (*.SFM) do perl p:\usfm2osis.pl <osisWork> -o ..\osis\<osisWork>.%f.osis.xml %f

Similarly in Cygwin or another bash environment you can use

for f in *.SFM; do perl usfm2osis.pl <osisWork> -o "../osis/<osisWork>.$f.osis.xml" "$f"; done
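Where neither shell loop is convenient, the wildcard can be expanded in Python instead, which behaves the same on Windows and Unix. This sketch assumes the usfm2osis.pl argument order shown above; build_commands and convert_all are hypothetical helper names, not part of any CrossWire tool.

```python
import glob
import os
import subprocess


def build_commands(osis_work, sfm_files, out_dir, script="usfm2osis.pl"):
    """Build one 'perl usfm2osis.pl' argument list per SFM file."""
    commands = []
    for sfm in sfm_files:
        name = os.path.basename(sfm)
        # Including the input file name keeps each output file unique.
        out_file = os.path.join(out_dir, f"{osis_work}.{name}.osis.xml")
        commands.append(["perl", script, osis_work, "-o", out_file, sfm])
    return commands


def convert_all(osis_work, pattern="*.SFM", out_dir=os.path.join("..", "osis")):
    """Expand the wildcard in Python, then run the converter once per file."""
    files = sorted(glob.glob(pattern))
    for command in build_commands(osis_work, files, out_dir):
        subprocess.run(command, check=True)
```

For example, convert_all("Bible.en.WEB.2007") run from the SFM folder would mirror the loops above, provided Perl and the script are reachable from that directory.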

Unsupported Markers

Not all USFM tags are currently supported, just those so far seen "in the wild" by the maintainers of the script. The comments in the script document the set of tags currently covered. In view of the many omissions, it is advised to use the new Python script instead.

Fixing the OSIS XML files

This section attempts to address some of the current deficiencies in SWORD, and lists tools to work around these issues. Eventually (it is hoped) these will not be necessary, as the fixes will be implemented within usfm2osis.py during its further development, or as fixes are made to the SWORD engine and/or front-ends, as appropriate.

OSIS hacks

OSIS 2.1.1 still lacks some desirable features or improvements. While these are not yet forthcoming, it may sometimes be necessary to apply some hacks to the OSIS generated by scripted OSIS converters to work around these "deficiencies" in the OSIS schema. You can read about OSIS change requests (or judiciously add to the page) here.

Cross-references

usfm2osis.py and usfm2osis.pl do not produce cross-references (or parallel passage headings) that work properly as links with SWORD.[1] If your text contains such, you may need xreffix.pl which sits in directory [5]. Using it requires the SWORD Perl bindings be installed and a sword-locale[2] for the language of the module to be created.[3]

Fixing references

There are two conceptually distinct process steps required in order to fix vernacular reference strings:

  1. To create a separate reference element for each noncontiguous reference.
  2. To assign the correct osisRef attribute to each reference element.
orefs.py
  • Adyeths has an ancillary script orefs.py to add the proper osisRef attribute to OSIS reference elements.[4]

Notes:

  1. A solution for this was envisaged in the original roadmap for usfm2osis.py
  2. If the locale file has any errors or omissions, some references may end up with an incorrect osisRef. Thorough checking of the converted file is paramount.
  3. Special care is required when the cross-reference (or heading) includes additional text which is not a scripture reference. Until all these issues are fixed, it becomes almost impossible to parse the reference elements correctly such that the OSIS references can be added automatically.
  4. This is fairly new (as of 2017-12-12) – it does not require SWORD bindings for Python.
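The first of the two steps listed above, splitting one reference element into several, can be sketched in Python as follows. This is not how xreffix.pl or orefs.py work; it assumes that the reference elements carry no attributes yet and that noncontiguous references are separated by a semicolon, which will not hold for every vernacular text.

```python
import re

# Assumption: plain <reference> elements without attributes.
REFERENCE_RE = re.compile(r"<reference>(.*?)</reference>", re.DOTALL)


def split_reference(match, separator=";"):
    """Rewrite one <reference> element as one element per separated part."""
    parts = [part.strip() for part in match.group(1).split(separator)]
    parts = [part for part in parts if part]
    if not parts:
        # Leave empty elements untouched rather than deleting them.
        return match.group(0)
    wrapped = [f"<reference>{part}</reference>" for part in parts]
    return (separator + " ").join(wrapped)


def split_references(osis_text):
    """Apply split_reference to every <reference> element in the text."""
    return REFERENCE_RE.sub(split_reference, osis_text)
```

The second step, assigning an osisRef to each resulting element, still depends on knowing the vernacular book names and is not attempted here.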

Miscellaneous

Many texts in non-Latin scripts (and even for some Latin scripts) require batch conversion of characters and numbers. There are several scripts to assist with this, without harming the USFM or OSIS markup.[1][2]

Notes:

  1. These supplementary Perl scripts are located in various sub-directories under sword-tools.
  2. This has little to do with converting USFM to OSIS

Soft hyphens

OSIS XML files should ideally not contain any soft hyphens.

Sometimes, Bibles that have already been published using some form of desktop publishing are retrospectively converted to USFM. When this happens, the exported text used as the starting point for deriving the USFM files may contain a lot of soft hyphens (U+00AD), where they were used to control hyphenated word wrap in the printed Bibles.

The soft hyphen is a typographical device. It is not a semantic construction. It is therefore advised that soft hyphens be removed completely from the USFM files before converting to OSIS XML. At the very least, they should be removed entirely from the converted OSIS XML file.
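Removing soft hyphens is a simple character deletion. A minimal Python sketch (the function names are hypothetical) that works equally on USFM or OSIS files:

```python
SOFT_HYPHEN = "\u00ad"


def strip_soft_hyphens(text):
    """Return text with every soft hyphen (U+00AD) removed."""
    return text.replace(SOFT_HYPHEN, "")


def count_soft_hyphens(text):
    """Return how many soft hyphens the text contains."""
    return text.count(SOFT_HYPHEN)


def clean_file(path, encoding="utf-8"):
    """Remove soft hyphens from a file in place; return how many were removed."""
    # newline="" keeps the file's existing line endings unchanged.
    with open(path, encoding=encoding, newline="") as handle:
        text = handle.read()
    removed = count_soft_hyphens(text)
    if removed:
        with open(path, "w", encoding=encoding, newline="") as handle:
            handle.write(strip_soft_hyphens(text))
    return removed
```

This handles only the literal character; soft hyphens written as XML character references would need a separate replacement.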

ParaTExt (export to OSIS)

This section may require updating.

Since version 6, ParaTExt has a menu option to export to OSIS.

Checking | Paratext 6 Checks | Publishing | Convert USFM to OSIS (Best Practice)

This option generates XML files with the following schema definition file referenced.

osisCore.2.0_UBS_SIL_BestPractice.xsd

UBS SIL Best Practice OSIS makes use of several XML attributes that are not used within CrossWire, and some of these are custom (i.e. x-prefix named attributes). An accessible online copy does not seem to be available. Furthermore, the only undated copy of this .xsd file (40KB) that has been obtained includes the following lines:

   <!--    WARNING   WARNING   WARNING
  THIS SCHEMA IS DEVELOPMENTAL AND SHOULD BE USED WITH THAT UNDERSTANDING -->

Contrast the above with the standard definition file (92KB) used by CrossWire.

http://www.bibletechnologies.net/osisCore.2.1.1.xsd

Notwithstanding the above concerns, it has been verified, based on one sample exported from a Paratext project, that after replacing line 2 with the corresponding line output by usfm2osis.pl, such a converted XML file can be validated against the standard definition file.

Nevertheless, further testing (using osis2mod) of some OSIS files exported from Paratext revealed that they generated WARNING(NESTING) errors. Some of these were due to the eID milestone for a verse occurring BEFORE the sID milestone for the same verse! Others were due to </note> being placed AFTER the eID milestone for the verse where the note occurs.

Hence I conclude that what Paratext outputs as OSIS is not really "best practice", and would require some improvements to fix issues such as these. However, we know that work on OSIS export from Paratext has virtually stopped - so this may now prove to be of mere historical interest.

The bottom line is that the OSIS export from Paratext is not ready for publishing a SWORD module.
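The first kind of nesting problem mentioned above, an eID milestone preceding its sID, can be detected before running osis2mod. The following Python sketch is an illustration under stated assumptions, not part of osis2mod: it scans the text with a regular expression rather than a real XML parser, expects double-quoted attributes, and looks only at milestone verse elements.

```python
import re

# Matches a self-closing <verse .../> milestone with an sID or eID attribute.
# The \b stops "sID" from matching inside the osisID attribute name.
MILESTONE_RE = re.compile(r'<verse\b[^>]*?\b([se]ID)="([^"]*)"[^>]*/>')


def misordered_verses(osis_text):
    """Return the IDs of verses whose eID milestone appears before the sID."""
    opened = set()
    problems = []
    for kind, verse_id in MILESTONE_RE.findall(osis_text):
        if kind == "sID":
            opened.add(verse_id)
        elif verse_id not in opened:
            # An eID was seen for a verse that has not been started yet.
            problems.append(verse_id)
    return problems
```

The misplaced </note> case is not covered by this check and would need a parser that tracks open elements.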

Creation of publication ready modules from USFM and submission to CrossWire

Checking and validating OSIS files

When checking OSIS XML files there are 3 steps:

  1. Is the OSIS well-formed XML?
  2. Is the file valid to the defined OSIS schema?[1]
  3. Is the file "fit for purpose" (i.e. suitable for immediate use by the SWORD conversion tool called osis2mod)

Passing step #1 does not guarantee passing step #2, and passing step #2 does not guarantee passing step #3. See above and below, and the limitations of XML validators.

Before you convert your OSIS files to SWORD format, you should always check that they are valid OSIS.
Before you submit any files to modules@crosswire.org, you must ensure that your files are valid OSIS. Invalid OSIS files will not be accepted.

Note:

  1. e.g. Using lxml or any other suitable means (e.g. the XML Tools plugin for Notepad++ on Windows).

Module creation process

This section requires updating to reflect the move to using the Python script usfm2osis.py in preference to its Perl predecessor.

Our process to create publication ready modules from USFM looks like this:

Stage 1

The command line examples in this section are for Unix. In Windows, it's not as simple to make equivalent command lines for Perl, especially in regard to globbing wildcards for filenames.

  • Iterate usfm2osis.pl over each single USFM file to produce an OSIS file per USFM file
for biblebook in *usfm; do usfm2osis.pl "$biblebook" "$biblebook"; done
  • Check each OSIS file for XML validity - this will throw up a lot of non-obvious USFM encoding problems
for biblebook in *osis.xml; do checkbible "$biblebook"; done
  • Correct USFM files as per mistakes found

Rerun this stage until you have clean and validating OSIS files.

Stage 2

  1. Check the OSIS files for untransformed USFM markers.[1][2]
  2. Fix usfm2osis.pl and add the missing tags for correct OSIS transformation.
  3. Rerun above until you have a clean validating collection of OSIS files with no left over USFM tags.[3]
  4. Send the updates to usfm2osis.pl to CrossWire's SVN repository.
  5. Run usfm2osis.pl over all the USFM files and create a single large OSIS file.
  6. Check again for XML validity.[4]
  7. Run the OSIS file through osis2mod[5], create a conf file, and check the resultant module for problems in a variety of front-ends.[6]
  8. Create a sword locale file and store in your /usr/share/sword/locales.d/ directory[7]
  9. If cross references are part of the USFM files fix these with the help of xreffix.pl (also in sword-tools).[8]
  10. Run again through osis2mod and check again.
  11. Submit the final OSIS file and the conf file to CrossWire together with the locale file.[9]
  12. Send any corrections on the USFM files back to the translation team. Chances are these are all valid and necessary corrections which help to improve the USFM.[10]

Notes:

  1. Not many common ones are left unsupported, but you will usually encounter one or two new ones if you get files from a new source.
  2. Some supported USFM markers may remain unconverted due to anomalies in how the USFM files were edited.
    Paratext does not check all potential mismatches against the USFM reference manual, and moreover, the manual itself leaves room for some ambiguities.
  3. A lot of the repetitive aspects can be done with the help of some shell scripts, so do not worry that you have to run usfm2osis.pl by hand multiple times.
  4. See OSIS Bible validation for further instructions on OSIS validation.
  5. Instructions for running osis2mod are available at osis2mod usage.
  6. Fixes should be made in the USFM files at the very start and then cascaded down into the module following the overall process again.
  7. or the equivalent directory on your system
  8. This requires a sword locale in the local installation at least.
  9. Include advice regarding which repository the module is intended for.
  10. Both Paratext and Bibledit are a bit more forgiving than our tools, but some problems you encounter could have affected the final paper print.
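Step 1 of Stage 2, looking for USFM markers that survived conversion, lends itself to a small script. This Python sketch treats any backslash followed by letters as a possible leftover marker; the pattern and the function name are assumptions for illustration.

```python
import re

# Assumed marker shape: backslash, optional "+", letters, optional digits,
# optional closing "*".
LEFTOVER_RE = re.compile(r"\\\+?[A-Za-z]+[0-9]*\*?")


def leftover_markers(osis_text):
    """Return (line number, marker) pairs for USFM-like markers in OSIS text."""
    found = []
    for number, line in enumerate(osis_text.splitlines(), start=1):
        for marker in LEFTOVER_RE.findall(line):
            found.append((number, marker))
    return found
```

Any hit should be traced back to the USFM source file, in keeping with the advice in note 6 that fixes belong at the start of the process.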