Difference between revisions of "Talk:Converting SFM Bibles to OSIS"

From CrossWire Bible Society
(Using SFMToOSIS: new section)
(usfm2osis.py moved to github: new section)
 
(10 intermediate revisions by the same user not shown)
Line 44: Line 44:
 
== Using SFMToOSIS ==
 
  
'':Section being removed from main page - links to the software are now broken''. [[User:David Haslam|David Haslam]] 02:14, 13 January 2012 (MST)
+
:''Section being removed from main page - links to the software are now broken''. [[User:David Haslam|David Haslam]] 02:14, 13 January 2012 (MST)
  
 
[http://www.virtualstorehouse.org/index.php/scripture-preservation-tool.html SFMToOSIS] is a robust program, though somewhat difficult to set up and use, produced by [http://www.snowfallsoftware.com/ Snowfall Software]. The program requires that you install a [http://www.python.org/ Python] interpreter and that the interpreter be within the system path. The program can also convert OSIS files back to SFM. SFMToOSIS is more robust than usfm2osis.pl and more likely to produce valid OSIS files automatically. The program has been renamed as '''OSIS Conversion Utility''' but comes under the heading '''Scripture Preservation Utility'''.
Line 67: Line 67:
  
 
Next, update the ''run.bat'' file that came with SFMToOSIS to point to your SFM files and SSF file. Then run ''run.bat''.
 
 +
 +
== checkbible  ? ==
 +
 +
In Stage 1, the command <tt>checkbible</tt> is generally unknown and requires explaining. [[User:David Haslam|David Haslam]] 09:51, 14 January 2012 (MST)
 +
 +
:At a guess, it's the name of a script file that calls <tt>xmllint</tt> with command line parameters supplied to validate the specified OSIS XML file against the schema. [[User:David Haslam|David Haslam]] 09:56, 14 January 2012 (MST)
 +
 +
::See http://xmlsoft.org/xmllint.html [[User:David Haslam|David Haslam]] 09:58, 14 January 2012 (MST)
 +
 +
== The Every Tribe Every Nation initiative ==
 +
 +
As there is a real prospect for CrossWire to eventually become an approved End-User-Ministry-Partner (EUMP) within the framework of the [http://everytribeeverynation.org/ Every Tribe Every Nation] initiative, it makes good sense to gain a better knowledge of how UBS & SIL are using OSIS. This is why I have just added further details about the OSIS files exported from Paratext. We need to learn more. [[User:David Haslam|David Haslam]] 01:25, 23 April 2012 (MDT)
 +
 +
== Haiola? ==
 +
 +
Should we include anything about Michael Johnson's [http://haiola.org/ Haiola Scripture Publishing Software]?
 +
[[User:David Haslam|David Haslam]] 01:50, 8 May 2013 (MDT)
 +
 +
== Module creation process - improving this section ==
 +
 +
Need to replace all the references to '''usfm2osis.pl''' by '''usfm2osis.py''', making further changes as and where required. [[User:David Haslam|David Haslam]] 06:56, 17 May 2013 (MDT)
 +
 +
== Removed subsections  ==
 +
 +
I just removed these two subsections that were misleading developers. [[User:David Haslam|David Haslam]] 04:43, 18 May 2013 (MDT)
 +
 +
:Scribe wrote in sword devel...
 +
::Please forget about x-preverse.  It was never intended for module developers.  It is an internal attribute we add to help the SWORD engine process OSIS.  Brief overview without details: SWORD must keep all data in 'verse' chunks.  Hence, the title for a verse goes WITHIN the verse chunk and we need to mark it somehow that it was really meant to be rendered 'pre-verse'.  This should never be known to anyone but a developer of the internals of the SWORD engine.
 +
::I realize it was added manually by module developers to pragmatically get titles to display how they wished when their markup was not showing properly because their titles were not getting the x-preverse tag added by osis2mod.  This was the wrong long-term solution (though I don't need to be lectured on WHY it was used by module developers; I sympathize that you just wanted to get something working).  The correct solution is 2 fold: to use correct OSIS and make sure osis2mod "Does The Right Thing"(tm).
 +
:[[User:David Haslam|David Haslam]] 04:48, 18 May 2013 (MDT)
 +
 +
==== Titles ====
 +
Unfortunately, when a module is built with osis2mod from an OSIS file generated in this way, certain titles are not (right now) guaranteed to display correctly with SWORD.<ref>The main issue as regards whether titles are displayed by any front-end seems to be for titles at the start of a chapter.</ref><ref>Modules made from OSIS files generated by usfm2osis.py do display titles when they are not at the start of a chapter, and they are in the right position.</ref>
 +
 +
A further Perl script, <tt>title-cleanup.pl</tt>, may be used (with care) to add the attribute <tt>subType="x-preverse"</tt> to some titles.<ref>This attribute should only be used when the title element is placed within the verse. As it stands, <tt>title-cleanup.pl</tt> does not restrict the changes to this pattern.</ref><ref>We have encountered a few modules in which this script has been mistakenly applied where it was unnecessary. The effect has been to misplace some titles to above the verse prior to the one they should be displayed as headings for. In some cases, this shifted the titles to before the last verse of the previous chapter!</ref><ref>There is still an unsolved issue that caused some preverse titles to be misplaced in '''TurNTB''' version 2.0. The symptoms are described in [http://www.crosswire.org/bugs/browse/MOD-189 modules issue 189].</ref>
 +
 +
'''Notes:'''
 +
<references />
 +
 +
==== New lines ====
 +
usfm2osis.py does not insert a new line above each section title. That shouldn't have mattered, but SWORD mysteriously appends the title text to the end of the previous verse text. The following workaround should fix this, but the replacements should be restricted to titles other than those at the start of a chapter.
 +
 +
Replace
 +
<pre>
 +
<div type="section"><title
 +
</pre>
 +
by
 +
<pre>
 +
<div type="section"><p></p>
 +
<title
 +
</pre>
 +
using any compatible text editor.<ref>The omission of the ">" for the title element was just in case any of these had an attribute.</ref><ref>Similar workarounds are required for other kinds of title (such as acrostic stanza headings).</ref><ref>Care must be taken to ensure that the XML remains valid.</ref>
 +
 +
'''Notes:'''
 +
<references />
 +
 +
== Nested tags ==
 +
 +
These (as yet undocumented) 'nested' USFM tags were encountered in the source files for the complete Welsh Beibl project (2013). [[User:David Haslam|David Haslam]] 05:09, 18 May 2013 (MDT)
 +
<pre>
 +
Count SFM tag Description
 +
----- -------------------
 +
00004 \+it Italics text style begin (nested)
 +
00004 \+it* Italics text style end  (nested)
 +
00026 \+nd Name of deity begin (nested)
 +
00026 \+nd* Name of deity end  (nested)
 +
00003 \+qt Quoted text begin (nested)
 +
00003 \+qt* Quoted text end  (nested)
 +
00206 \+sc Small-cap text begin (nested)
 +
00206 \+sc* Small-cap text end  (nested)
 +
00122 \+tl Transliterated (or foreign) word[s] begin (nested)
 +
00122 \+tl* Transliterated (or foreign) word[s] end  (nested)
 +
</pre>
 +
 +
== usfm2osis.py moved to github ==
 +
 +
[[User:Osk|Osk]] moved development to GitHub in April: https://github.com/chrislit/usfm2osis
 +
[[User:David Haslam|David Haslam]] 03:02, 30 December 2014 (MST)

Latest revision as of 10:02, 30 December 2014

Paratext USFM stylesheet, version 2.2

There has been an update to the Paratext USFM stylesheet, version 2.2, Updated October 17, 2008. See [1]. This update post-dates the making of Go Bible Creator version 2.3.2 which was the first to provide some support for USFM as source text format. CrossWire programmers who make use of the USFM stylesheet standards will need to take note of the changelog. I have also flagged this to the SWORD Dev mailing list. David Haslam 14:37, 8 December 2008 (UTC)

I suggest that CrossWire programmers who make reference to external standards make use of the excellent service provided by http://www.changedetection.com/ to be alerted by email when there are changes to websites like this. David Haslam 14:46, 8 December 2008 (UTC)

Paratext export to XML

The UBS page about Paratext includes this sentence, "Paratext also provides function for exporting text to RTF and XML formats." I have been informed that these XML export formats include OSIS in either the milestoned form (BSP) or the containerized form (BCV). Has anyone in CrossWire compared the direct Paratext export XML formats with what we can obtain using either the Snowfall software conversion tool or the CrossWire conversion script? David Haslam 13:07, 5 June 2009 (UTC)

One of my contacts emailed me (2009-05-15) as follows, "I was talking with the UBS Paratext folks yesterday about OSIS output and they let me know that they can do that to a degree. I tried using Paratext 6.1's "Convert from USFM to OSIS (Best Practices)" facility, which is found under the "Checking=>Publishing" menu. That generated cleaner OSIS that validated and didn't require an intervening XSLT (it has an option to not do milestones)." The latter option is relevant in the context of using Go Bible Creator, which for OSIS requires BCV, rather than BSP with milestones. He continued, "... Inasmuch as the Paratext conversion facility seems to work better than SFMToOSIS, I'll stick with that for now. Paratext 7 will also support this.". David Haslam 16:00, 6 June 2009 (UTC)

Who May Use Paratext

The UBS page also states,

The Paratext program and its associated text files have been developed by the United Bible Societies for use by its translations project teams. Paratext is also made available for use by other Bible agencies for whom a formal licensing arrangement with UBS has been established. Persons qualified to use Paratext will include translation consultants, translators, reviewers, and those directly engaged in the translation's production. Copyright restrictions, especially those determined by third party copyright holders, require that use and distribution of Paratext be strictly limited to those engaged in Bible translation (as outlined above).

Seeing as CrossWire could be properly classified under "other Bible agencies", has there been any move towards negotiating a formal licensing arrangement for CrossWire members? David Haslam 13:14, 5 June 2009 (UTC)

Converting OSIS to USFM

Just followed the link to the SnowFall Software Scripture Preservation Utility, and observed that it now also provides a means to convert OSIS to USFM. David Haslam 14:32, 1 May 2010 (UTC)

Has anyone tried using it in this direction? David Haslam 12:04, 4 May 2010 (UTC)
Description also includes, "A newer version is being developed with a graphical user interface for non-technical users that includes workflow tracking, keyboarding from scratch, and converting from formats other than USFM/SFM files." David Haslam 12:07, 4 May 2010 (UTC)

Broken links for UBS-ICAP

The link to http://confluence.ubs-icap.org/display/USFM/Home is broken. I can't even reach http://confluence.ubs-icap.org/ David Haslam 09:42, 21 June 2010 (UTC)

xreffix.pl

Please provide a link for xreffix.pl David Haslam 09:47, 11 December 2010 (UTC)

It's not in the CrossWire ftpmirror for perl utilities. David Haslam 14:35, 11 December 2010 (UTC)

Converting files received in miscellaneous formats to USFM

Based on the experience of several CrossWire volunteers, it would be useful to start a new page with advice on converting files received in miscellaneous formats to USFM. David Haslam 20:14, 13 December 2010 (UTC)

Do we have much experience in that? Even if so, is it really our place to provide this information or to encourage movement of data to USFM? --Osk 22:06, 13 December 2010 (UTC)

Using SFMToOSIS

Section being removed from main page - links to the software are now broken. David Haslam 02:14, 13 January 2012 (MST)

SFMToOSIS is a robust program, though somewhat difficult to set up and use, produced by Snowfall Software. The program requires that you install a Python interpreter and that the interpreter be within the system path. The program can also convert OSIS files back to SFM. SFMToOSIS is more robust than usfm2osis.pl and more likely to produce valid OSIS files automatically. The program has been renamed as OSIS Conversion Utility but comes under the heading Scripture Preservation Utility.

The program also requires a Paratext .ssf file. If you do not have such a file, you can use the following sample and adjust it to the specifics of your own text:

<ScriptureText>
<BooksPresent>111111111111111111111111111111111111111111111111111111111111111111000000000000000000000000000000000</BooksPresent>
<Encoding>UTF-8</Encoding>
<ChapterMarker>c</ChapterMarker>
<Copyright></Copyright>
<FileNameForm>41MAT</FileNameForm>
<FileNamePostPart>.SFM</FileNamePostPart>
<FileNamePrePart></FileNamePrePart>
<FullName>The Bible in English</FullName>
<Language>ENGLISH</Language>
<LeftToRight>T</LeftToRight>
<Name>EN</Name>
<VerseMarker>v</VerseMarker>
<Versification>4</Versification>
</ScriptureText>

Next, update the run.bat file that came with SFMToOSIS to point to your SFM files and SSF file. Then run run.bat.
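
Before editing run.bat it can help to check that an adapted .ssf file is still well-formed and carries the settings shown in the sample. The following is only a rough sketch: the function names are made up for illustration, and the reading of BooksPresent as one "1" per book present is an assumption based on the sample, not on any Paratext documentation.

```python
import xml.etree.ElementTree as ET


def read_ssf(ssf_text):
    """Return the children of <ScriptureText> as a dict of tag -> text.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed XML.
    """
    root = ET.fromstring(ssf_text)
    if root.tag != "ScriptureText":
        raise ValueError("root element is not <ScriptureText>")
    # Empty elements such as <Copyright></Copyright> become empty strings.
    return {child.tag: (child.text or "") for child in root}


def count_books_present(settings):
    """Count the '1' flags in BooksPresent (assumed: one flag per book)."""
    return settings.get("BooksPresent", "").count("1")


if __name__ == "__main__":
    # Shortened, made-up example; a real file has a much longer BooksPresent.
    sample = (
        "<ScriptureText>"
        "<BooksPresent>1100</BooksPresent>"
        "<Encoding>UTF-8</Encoding>"
        "<FileNamePostPart>.SFM</FileNamePostPart>"
        "</ScriptureText>"
    )
    settings = read_ssf(sample)
    print(settings["Encoding"], count_books_present(settings))
```

A check like this only confirms that the file parses and that the expected elements are there; whether SFMToOSIS accepts the values is a separate matter.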

checkbible  ?

In Stage 1, the command checkbible is generally unknown and requires explaining. David Haslam 09:51, 14 January 2012 (MST)

At a guess, it's the name of a script file that calls xmllint with command line parameters supplied to validate the specified OSIS XML file against the schema. David Haslam 09:56, 14 January 2012 (MST)
See http://xmlsoft.org/xmllint.html David Haslam 09:58, 14 January 2012 (MST)
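
If that guess is right, a script of that kind might look something like the sketch below. This is not the actual checkbible script, which is still unidentified; the function names and file paths are hypothetical. The only parts taken from xmllint itself are its <tt>--noout</tt> and <tt>--schema</tt> options.

```python
import subprocess
import sys


def xmllint_command(osis_path, schema_path):
    """Build the xmllint call that validates one OSIS file against a schema."""
    # --noout suppresses the echoed document; --schema selects the XSD file.
    return ["xmllint", "--noout", "--schema", schema_path, osis_path]


def validate_osis(osis_path, schema_path):
    """Run xmllint and return (is_valid, messages). Requires xmllint on the PATH."""
    result = subprocess.run(
        xmllint_command(osis_path, schema_path),
        capture_output=True,
        text=True,
    )
    # xmllint exits with 0 when the document validates.
    return result.returncode == 0, result.stderr


if __name__ == "__main__":
    # Hypothetical usage: python checkbible.py mybible.osis.xml osisCore.2.1.1.xsd
    ok, messages = validate_osis(sys.argv[1], sys.argv[2])
    print(messages)
    sys.exit(0 if ok else 1)
```

The schema path is assumed to point at a local copy of the OSIS XSD.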

The Every Tribe Every Nation initiative

As there is a real prospect for CrossWire to eventually become an approved End-User-Ministry-Partner (EUMP) within the framework of the Every Tribe Every Nation initiative, it makes good sense to gain a better knowledge of how UBS & SIL are using OSIS. This is why I have just added further details about the OSIS files exported from Paratext. We need to learn more. David Haslam 01:25, 23 April 2012 (MDT)

Haiola?

Should we include anything about Michael Johnson's Haiola Scripture Publishing Software? David Haslam 01:50, 8 May 2013 (MDT)

Module creation process - improving this section

Need to replace all the references to usfm2osis.pl by usfm2osis.py, making further changes as and where required. David Haslam 06:56, 17 May 2013 (MDT)

Removed subsections

I just removed these two subsections that were misleading developers. David Haslam 04:43, 18 May 2013 (MDT)

Scribe wrote in sword devel...
Please forget about x-preverse. It was never intended for module developers. It is an internal attribute we add to help the SWORD engine process OSIS. Brief overview without details: SWORD must keep all data in 'verse' chunks. Hence, the title for a verse goes WITHIN the verse chunk and we need to mark it somehow that it was really meant to be rendered 'pre-verse'. This should never be known to anyone but a developer of the internals of the SWORD engine.
I realize it was added manually by module developers to pragmatically get titles to display how they wished when their markup was not showing properly because their titles were not getting the x-preverse tag added by osis2mod. This was the wrong long-term solution (though I don't need to be lectured on WHY it was used by module developers; I sympathize that you just wanted to get something working). The correct solution is 2 fold: to use correct OSIS and make sure osis2mod "Does The Right Thing"(tm).
David Haslam 04:48, 18 May 2013 (MDT)

Titles

Unfortunately, when a module is built with osis2mod from an OSIS file generated in this way, certain titles are not (right now) guaranteed to display correctly with SWORD.[1][2]

A further Perl script, title-cleanup.pl, may be used (with care) to add the attribute subType="x-preverse" to some titles.[3][4][5]

Notes:

  1. The main issue as regards whether titles are displayed by any front-end seems to be for titles at the start of a chapter.
  2. Modules made from OSIS files generated by usfm2osis.py do display titles when they are not at the start of a chapter, and they are in the right position.
  3. This attribute should only be used when the title element is placed within the verse. As it stands, title-cleanup.pl does not restrict the changes to this pattern.
  4. We have encountered a few modules in which this script has been mistakenly applied where it was unnecessary. The effect has been to misplace some titles to above the verse prior to the one they should be displayed as headings for. In some cases, this shifted the titles to before the last verse of the previous chapter!
  5. There is still an unsolved issue that caused some preverse titles to be misplaced in TurNTB version 2.0. The symptoms are described in modules issue 189.

New lines

usfm2osis.py does not insert a new line above each section title. That shouldn't have mattered, but SWORD mysteriously appends the title text to the end of the previous verse text. The following workaround should fix this, but the replacements should be restricted to titles other than those at the start of a chapter.

Replace

<div type="section"><title

by

<div type="section"><p></p>
<title

using any compatible text editor.[1][2][3]

Notes:

  1. The omission of the ">" for the title element was just in case any of these had an attribute.
  2. Similar workarounds are required for other kinds of title (such as acrostic stanza headings).
  3. Care must be taken to ensure that the XML remains valid.
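
For anyone who would rather script the replacement than use a text editor, a minimal sketch follows. It performs only the plain string replacement described above; it does not exclude titles at the start of a chapter, so that restriction would still have to be handled separately, and the function name is made up for illustration.

```python
OLD = '<div type="section"><title'
NEW = '<div type="section"><p></p>\n<title'


def add_break_before_section_titles(osis_text):
    """Insert an empty paragraph and a new line before each section title.

    The search string stops before the closing ">" of <title so that
    titles carrying attributes are matched as well.
    """
    return osis_text.replace(OLD, NEW)


if __name__ == "__main__":
    before = '<div type="section"><title type="x-demo">A heading</title>'
    print(add_break_before_section_titles(before))
```

The result should still be validated afterwards, as the notes above advise.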

Nested tags

These (as yet undocumented) 'nested' USFM tags were encountered in the source files for the complete Welsh Beibl project (2013). David Haslam 05:09, 18 May 2013 (MDT)

Count	SFM tag	Description
-----	-------------------
00004	\+it	Italics text style begin (nested)
00004	\+it*	Italics text style end   (nested)
00026	\+nd	Name of deity begin (nested)
00026	\+nd*	Name of deity end   (nested)
00003	\+qt	Quoted text begin (nested)
00003	\+qt*	Quoted text end   (nested)
00206	\+sc	Small-cap text begin (nested)
00206	\+sc*	Small-cap text end   (nested)
00122	\+tl	Transliterated (or foreign) word[s] begin (nested)
00122	\+tl*	Transliterated (or foreign) word[s] end   (nested)
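
A tally like the one above can be produced with a few lines of Python. This is only a sketch and not the tool that generated the table; it assumes that a nested marker is a backslash, a plus sign, and a lowercase marker name, optionally closed by an asterisk.

```python
import re
from collections import Counter

# Backslash, plus sign, marker name, optional closing asterisk (assumed pattern).
NESTED_MARKER = re.compile(r"\\\+[a-z0-9]+\*?")


def count_nested_markers(usfm_text):
    """Return a Counter of nested markers such as \\+nd and \\+nd* in the text."""
    return Counter(NESTED_MARKER.findall(usfm_text))


if __name__ == "__main__":
    # Made-up line of USFM for demonstration only.
    line = r"\v 1 \wj He said, \+nd Lord\+nd* and \+it so\+it*\wj*"
    for marker, count in sorted(count_nested_markers(line).items()):
        print(f"{count:05d}\t{marker}")
```

Opening and closing markers are counted separately, as in the table.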

usfm2osis.py moved to github

Osk moved development to GitHub in April: https://github.com/chrislit/usfm2osis David Haslam 03:02, 30 December 2014 (MST)