Talk:Modules in the beta repository

From CrossWire Bible Society
Revision as of 14:46, 27 June 2008 by Dmsmith (Talk | contribs) (ABU)


I thought I'd move discussion about the implementation of modules here. It was cluttering the other page and when we'd get a new version or address a problem, we'd reset the row. Here we can keep the info until we are done.--Dmsmith 05:35, 22 June 2008 (MDT)

ABU

In Matt 5 there is a WoC (Words of Christ) display problem. The WoC starts in verse 3 and ends at the end of the last chapter. Fortunately, the WoC start and finish in this module fall on chapter boundaries. If it had started in chapter 5 and finished in chapter 7, the display of chapter 6 would never highlight the WoC.

The SWORD Engine currently terminates WoC at a verse boundary, regardless of how it is encoded. This is because it does not keep state regarding WoC from one verse to the next. No frontend will display it correctly, because it is not a frontend problem.
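To make the verse-boundary behaviour concrete, here is a toy model in Python. It is not SWORD code: the render_verse function, the [WoC:...] output notation and the two milestoned verse strings are assumptions made up for illustration. Each verse is rendered in isolation, so the WoC state always starts off and is thrown away at the end of the verse.

```python
import re

# Matches a milestoned <q .../> and captures whether it is a start (sID) or an end (eID).
Q_MILESTONE = re.compile(r'<q\b[^>]*?\b(sID|eID)="[^"]*"[^>]*/>')

# Hypothetical verse entries: the quote starts in 5.3 and ends in 5.4.
VERSES = {
    "Matt.5.3": '<q who="Jesus" sID="q1"/>Happy the poor in spirit;',
    "Matt.5.4": 'Happy they that mourn;<q who="Jesus" eID="q1"/>',
}

def render_verse(text):
    """Render one verse on its own; no WoC state is carried in or out."""
    in_woc = False  # always reset, which is the behaviour described above
    out, pos = [], 0
    for m in Q_MILESTONE.finditer(text):
        chunk = text[pos:m.start()]
        if chunk:
            out.append(f"[WoC:{chunk}]" if in_woc else chunk)
        in_woc = m.group(1) == "sID"
        pos = m.end()
    tail = text[pos:]
    if tail:
        out.append(f"[WoC:{tail}]" if in_woc else tail)
    return "".join(out)

for ref, text in VERSES.items():
    print(ref, render_verse(text))
```

In this model only 5.3 comes out highlighted. 5.4 is inside the quote, but nothing in its own text says so before the end marker.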

I (DM) see several solutions (there may be others):

  • Change osis2mod to accept a wider range of valid OSIS inputs, creating a module that the SWORD Engine can handle, specifically allowing how this is encoded. Then it can work with 1.5.9.
  • Change the SWORD engine to handle this input at a chapter level. This is not a complete solution. Currently, when the KJV is searched, the WoC are highlighted in the search results list. This input will only highlight Matt 5:3 for any hit in Matt 5. Changing the SWORD Engine will require this module to be 1.5.12.
  • Change the module so that it is encoded as recommended in the wiki for OSIS Bibles. Then it can work with 1.5.9.

The easiest is to change the module. If not that, I'd suggest changing osis2mod, which probably is the best solution, resulting in easier module definition. I don't like the SWORD Engine change, because it is incomplete.


  • It looks to me as if it doesn't really matter how I encode the text, it won't render correctly on some set of frontends. ... was the original encoding, and that got complaints, so I switched to milestoned ... encoding (1.2), and that got more complaints, so I switched back to containers (1.3). Play with both. Tell me which is less wrong. Osk 14:03, 21 June 2008 (MDT)

There have been 3 versions:

The version 1.1 had for Matt 5:3-4:
(Note: my comment on this was that the sIDs and the eIDs were not properly encoded.)

5.3:
<q marker="" sID="q1" who="Jesus"/><br/>
   Happy the poor in spirit;
   for theirs is the kingdom of heaven.
<q eID="q1++" marker="" who="Jesus"/>
5.4:
<q marker="" sID="q1" who="Jesus"/>
   Happy they that mourn;
   for they shall be comforted.
<q eID="q1++" marker="" who="Jesus"/>

Version 1.2 has:

5.3:
<q marker="" who="Jesus" sID="q7"/>
   Happy the poor in spirit;
   for theirs is the kingdom of heaven.
5.4:
   Happy they that mourn;
   for they shall be comforted.

Version 1.3 has:

<q marker="" who="Jesus">
   Happy the poor in spirit;
   for theirs is the kingdom of heaven.
5.4:
   Happy they that mourn;
   for they shall be comforted.

Of the above, 1.1 is, in my opinion, the best. It can work in the search result.

The following variant of 1.1 will work for all frontends and, given how simple the ABU is, it should produce well-formed valid XML. The way to think about this is that <q> is not a quotation marker but a WoC marker, as in <woc>...</woc>, that has to be started and stopped in each verse and surround each word/phrase that Jesus uttered.

5.3:
<q marker=""  who="Jesus">
   Happy the poor in spirit;
   for theirs is the kingdom of heaven.
</q>
5.4:
<q marker=""  who="Jesus">
   Happy they that mourn;
   for they shall be comforted.
</q>
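A rough sketch of how the per-verse form above could be derived mechanically from a milestoned encoding. This is not osis2mod code; split_woc_per_verse, the (reference, text) input shape and the hard-coded who="Jesus" wrapper are assumptions for illustration only.

```python
import re

# Matches a milestoned <q .../> and captures whether it is a start (sID) or an end (eID).
Q_MILESTONE = re.compile(r'<q\b[^>]*?\b(sID|eID)="[^"]*"[^>]*/>')
OPEN = '<q marker="" who="Jesus">'  # assumed wrapper, copied from the variant above
CLOSE = "</q>"

def _wrap(chunk, in_woc):
    # Only wrap chunks that contain actual text.
    return f"{OPEN}{chunk}{CLOSE}" if in_woc and chunk.strip() else chunk

def split_woc_per_verse(verses):
    """Turn milestoned WoC spans into containers that start and stop in every verse.

    `verses` is an ordered list of (reference, text) pairs; the WoC state is
    carried from one verse to the next, which is the part the engine does not do.
    """
    in_woc = False
    result = []
    for ref, text in verses:
        parts, pos = [], 0
        for m in Q_MILESTONE.finditer(text):
            parts.append(_wrap(text[pos:m.start()], in_woc))
            in_woc = m.group(1) == "sID"
            pos = m.end()
        parts.append(_wrap(text[pos:], in_woc))
        result.append((ref, "".join(parts)))
    return result

verses = [
    ("Matt.5.3", '<q marker="" who="Jesus" sID="q7"/>Happy the poor in spirit;'),
    ("Matt.5.4", "Happy they that mourn;"),
]
for ref, text in split_woc_per_verse(verses):
    print(ref, text)
```

Because the state is tracked across the whole input, 5.4 gets its own <q>...</q> even though it contains no marker of its own.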

Dmsmith 19:14, 21 June 2008 (MDT)


There's some confusion here.

  • 1.0 was encoded with ....
  • 1.1 was encoded with .... That was just a bug in that I forgot to execute the ++ expression, so it got copied into the text. If I ever did ... in a way that was contained within the verse, it was probably at this stage and was an error stemming from the ++ execution bug.
  • 1.2 just executed the ++ iterator, so it has ... and iterates the number.
  • 1.3 goes back to 1.0 (it's quite possibly bit-for-bit identical) with ....
  • ABU_1_2 is identical to 1.2 because it is literally the same files. I just moved the directory on the server from ABU to ABU_1_2 before uploading the new version.
  • Whenever milestones are used, <verse/> appears as a container, and vice versa.
  • The basic hierarchy of OSIS Bibles is Book-Section-Paragraph rather than Book-Chapter-Verse. We asked SIL/Wycliffe (and maybe some UBS guys); they said use BSP, not BCV. We chose to prefer BSP. That's why <div/> and <p/> (the BSP hierarchy elements) aren't milestonable but <chapter/> and <verse/> are.

Osk 22:35, 21 June 2008 (MDT)


OK, I've got my numbers off. The version that had q++, I thought was 1.2. I guess I never saw 1.2. I've corrected my statement above to your information.

Regarding your comment about BCV, the <div> element is milestonable.

If BSP is the proper way to encode an OSIS Bible, then I think:

  1. <verse> should always be milestoned.
  2. osis2mod should preserve the verse element (start and end) in the text and get rid of the pre-verse hack. With BSP, this will occur more and more.

Whether we encode OSIS Bible texts as BCV or BSP, the resulting module needs to work for Bible applications. In the SWORD engine, the verse is the indexable unit. All of our applications display verses in isolation, at least in the search result, and some elsewhere.

I think the following is the best short term solution (which is a minor variation of 1.1):

  1. If quotation marks are to be displayed in the module, mark the beginning of the quote in chapter 5 with , and the end of the quote in Matt 7 with . Add marker attribute to be UTF-8 curly quotes if desired. Also, if the quote is interrupted, such that quotation marks should appear in the span of the Sermon on the Mount, then put the same there.

    If quotations are not needed then these are entirely unnecessary for our code as it stands today, but they might come in handy if we had each quote in the scripture marked with who as we could analyze the text for who said what.
  2. Within each verse surround the actual words of Christ with <woc>...</woc>. Obviously, if these cross a BSP boundary, then they stop and start on either side of the boundary. Finally, to make valid OSIS replace those with and respectively. The milestoned version (i.e. your 1.1) should have worked for all SWORD apps as of 1.5.9. But it didn't.
    It does not work for JSword because it uses xslt to do the processing, which cannot handle it.
  3. If the module should not show quote marks, use OSISQToTick=false (From memory. So, I may have goofed this.) This makes the empty marker="" unnecessary.

Ultimately, it is the responsibility of osis2mod to placate the SWORD Engine by transforming modules to what it wants to hear. I think the best long term solution is for osis2mod to handle all properly encoded documents, such as 1.2 and 1.3. (Version 1.1 was a placation.) Obviously, if one can tediously encode 1.1, that processing can be put into osis2mod.

Dmsmith 05:35, 22 June 2008 (MDT)

---

One of the longstanding principles of our employment of OSIS has been that we should accept any valid OSIS, but that we need not maintain valid OSIS in our data. So osis2mod should accept anything that is valid, but the contents of the modules themselves need not be valid OSIS. (I'm more concerned with actual markup here. The cases where people actually want to use UTF-16 encoded OSIS or single quotes instead of double in attribute values aren't significant enough for me to care.)

  • <verse> should generally be the first element that gets milestoned any time there is a well-formedness problem. The purpose for allowing both milestoned and container forms was to allow simple Bibles to be encoded simply.
  • I agree that <verse> elements should be preserved. That's why I made the change, committed it, and posted a Bible or two using this format. Troy had objections and rolled back the changes, though they had no negative effects on any existing or future content. If you want to pursue this, talk to Troy. I see not preserving <verse> as universally bad. I don't see any problem with maintaining the pre-verse title system.
  • There should be no problem with using marker="". This is actual OSIS, whereas OSISQToTick=false was added to handle OSIS docs that don't actually conform to the standard. The marker attribute came in one of the last OSIS releases (2.1 or 2.1.1) and solved (in a more official capacity) a problem for which we had developed a hack (OSISQToTick). If we're concerned with accepting any valid OSIS, we should definitely accept marker="" and should probably deprecate OSISQToTick.

We should probably define a non-standard method of encoding that fits within <verse> elements but that can be easily derived (by osis2mod) from either standard, valid encoding. I suspect we should just use <q/> (though <milestone/> is another possibility). Given the following input:

<verse>
  cdata
  <q osisID="q1" sID="q1" who="Jesus" marker=""/>
    cdata
</verse>
<verse>
    cdata
  <q osisID="q1" eID="q1" who="Jesus" marker=""/>
  cdata
</verse>

We could generate:

<verse>
  cdata
  <q osisID="q1" sID="q1" who="Jesus" marker=""/>
    cdata
  <q type="x-continuation" eID="" who="Jesus" marker=""/>
</verse>
<verse>
  <q type="x-continuation" sID="" who="Jesus" marker=""/>
    cdata
  <q osisID="q1" eID="q1" who="Jesus" marker=""/>
  cdata
</verse>
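The milestone case above could be generated along these lines. This is only a sketch of the idea, not a patch for osis2mod: add_continuations and its list-of-verse-bodies input are hypothetical, and the continuation markers are copied from the example output.

```python
import re

# Matches a milestoned <q .../> and captures whether it is a start (sID) or an end (eID).
# The \b before sID/eID keeps the "sID" inside "osisID" from matching.
Q_MILESTONE = re.compile(r'<q\b[^>]*?\b(sID|eID)="[^"]*"[^>]*/>')

CONT_START = '<q type="x-continuation" sID="" who="Jesus" marker=""/>'
CONT_END = '<q type="x-continuation" eID="" who="Jesus" marker=""/>'

def add_continuations(verse_bodies):
    """Close a still-open quote at the end of a verse and reopen it in the next."""
    open_quote = False
    out = []
    for body in verse_bodies:
        prefix = CONT_START if open_quote else ""
        # The last milestone seen in this verse decides the state at its end.
        for m in Q_MILESTONE.finditer(body):
            open_quote = m.group(1) == "sID"
        suffix = CONT_END if open_quote else ""
        out.append(prefix + body + suffix)
    return out

bodies = [
    'cdata<q osisID="q1" sID="q1" who="Jesus" marker=""/>cdata',
    'cdata<q osisID="q1" eID="q1" who="Jesus" marker=""/>cdata',
]
for body in add_continuations(bodies):
    print(body)
```

The sketch handles a single level of quotation only; nested quotes would need a stack rather than one flag.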

And given the following input:

<verse sID="v1"/>
  cdata
  <q osisID="q1" who="Jesus" marker="">
    cdata
<verse eID="v1"/>
<verse sID="v2"/>
    cdata
  </q>
  cdata
<verse eID="v2"/>

We could generate:

<verse sID="v1"/>
  cdata
  <q osisID="q1" who="Jesus" marker="">
    cdata
  <q type="x-continuation" eID="" who="Jesus" marker=""/>
<verse eID="v1"/>
<verse sID="v2"/>
  <q type="x-continuation" sID="" who="Jesus" marker=""/>
    cdata
  </q>
  cdata
<verse eID="v2"/>

I believe these will validate as OSIS and would work in BibleCS without modification. I suspect they will work in HTMLHREF frontends with little (if any) modification, but I haven't looked at the code lately. It seems to me that XSLT ought to be able to convert milestone elements to container element starts/ends for JSword, but I haven't touched XSLT in about 6 years.

Osk 20:49, 26 June 2008 (MDT)

---

I think we agree that osis2mod is the proper place to solve the problem and I think we agree on what osis2mod should accept. While I am not overly concerned whether the resulting OSIS is valid, I'd caveat that by saying that each element and attribute should be one that is defined in OSIS.

In my opinion, osis2mod should mark up the text in such a way that we know what is created by osis2mod. That way we can reasonably reconstruct the original. We can use x- values for the type and subType attributes for this purpose. In this example, the start and end of the Sermon on the Mount could be marked as original, with x-start and x-end.
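As a small illustration of why marking generated elements helps, here is a sketch that strips them again to get back to the input. It assumes the generated markers are the type="x-continuation" milestones from the earlier example; strip_generated is a made-up name, not an existing utility.

```python
import re

# Any empty <q .../> whose type attribute marks it as generated (assumed convention).
GENERATED = re.compile(r'<q\b[^>]*\btype="x-continuation"[^>]*/>')

def strip_generated(text):
    """Remove generated continuation markers, leaving the original markup."""
    return GENERATED.sub("", text)

stored = ('cdata<q osisID="q1" sID="q1" who="Jesus" marker=""/>cdata'
          '<q type="x-continuation" eID="" who="Jesus" marker=""/>')
print(strip_generated(stored))
```

Without a distinguishing type value there would be no reliable way to tell a generated marker from one the module author wrote.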

Currently the milestoned version of the WoC is not handled by MS, GS or BD. That it is handled by SW indicates that the problem may be in MS or GS not using the HTMLHREF filter or that there is a minor difference between it and what SW actually uses. BD is altogether a different issue. The point is that a filter change is a 1.5.12 change.

We have a couple of choices in changing osis2mod (as discussed above):

  1. Change it to produce <q> as a container. This is how ESV and KJV are encoded and it works with 1.5.9 and with JSword/BD.
  2. Change it to produce <q> as a milestone and also fix HTMLHREF to accept it and not release the modules until 1.5.12 is released. I might be able to figure out how to get JSword/BD to handle it.

In both of these the location of the start and end markers are the same. The difference is the form.

Regarding XSLT, the guarantee of XSLT is that all output is well-formed. I have not found a way to output a begin element without also outputting an end element. The processing that is necessary is to collect the content and elements between two arbitrary points (represented by milestones) and style it. I simply don't know how to do it or whether it can be done, let alone whether it can perform well.

I may be able to pre-filter and do a transformation.
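Such a pre-filter might look roughly like the following. It is written in Python only to show the idea (JSword itself is Java), milestones_to_containers is a hypothetical name, and it assumes each verse fragment arrives as a string in which the start and end milestones are already balanced within the verse. Fragments that are not balanced are passed through untouched.

```python
import re

Q_EMPTY = re.compile(r'<q\b([^>]*)/>')  # any empty <q .../> element
ATTR = re.compile(r'([\w:-]+)="([^"]*)"')  # name="value" pairs

def _balanced(fragment):
    """True if every sID is closed by the matching eID, properly nested."""
    stack = []
    for m in Q_EMPTY.finditer(fragment):
        attrs = dict(ATTR.findall(m.group(1)))
        if "sID" in attrs:
            stack.append(attrs["sID"])
        elif "eID" in attrs:
            if not stack or stack.pop() != attrs["eID"]:
                return False
    return not stack

def _to_container(m):
    attrs = dict(ATTR.findall(m.group(1)))
    if "eID" in attrs:
        return "</q>"
    if "sID" in attrs:
        # Keep the other attributes, drop the milestone id.
        kept = "".join(f' {k}="{v}"' for k, v in attrs.items() if k != "sID")
        return f"<q{kept}>"
    return m.group(0)  # an empty <q/> that is not a milestone

def milestones_to_containers(fragment):
    """Rewrite milestoned <q/> pairs as <q>...</q> so an XSLT can style them."""
    if not _balanced(fragment):
        return fragment
    return Q_EMPTY.sub(_to_container, fragment)

print(milestones_to_containers(
    '<q who="Jesus" sID="q1"/>Happy they that mourn;<q who="Jesus" eID="q1"/>'))
```

This only checks that the <q> milestones nest among themselves. A real pre-filter would also have to make sure the new containers do not cross other elements in the fragment.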

On a different note: Regarding the inclusion of <verse> and the pre-verse handling:

Right now the pre-verse handling is solely for the sake of headings. These are pulled out of the stream and stored as verse heading attributes. It cannot handle other kinds of pre-verse content, such as white-space (e.g. new lines) and notes.

With regard to whitespace before a verse, generally it is appended to the prior verse. If it occurs after the heading, osis2mod yanks it and treats it as if it came before the heading.

With regard to notes, it is perfectly valid OSIS to have a note attached to a heading by immediately following it. SWORD cannot handle it. One way to handle it is for osis2mod to move the note into the heading. There used to be a problem with having notes in a heading, but it may be fixed now.

--Dmsmith 08:46, 27 June 2008 (MDT)