Difference between revisions of "File Formats"

From CrossWire Bible Society
(ThML)
m (The SWORD Project Utilities: <BR>)
 
(95 intermediate revisions by 5 users not shown)
Line 1: Line 1:
Bible study programs use a plethora of markup formats. Even more have been suggested for use in creating Bibles and other religious material.
+
This page lists some of the more common file formats ''relevant'' to The SWORD Project, associated utilities, and other CrossWire projects.
  
 
CrossWire Bible Society respects [[copyright]].  As such, conversion of material that is under copyright without permission from the copyright holders is not supported by The SWORD Project.
 
  
This page lists some of the more common file formats ''relevant'' to The SWORD Project and associated utilities.
+
== SWORD modules ==
 +
Other than the source code for the SWORD API, there is no documentation for the file format of a '''SWORD module'''. The intention is that the [[DevTools:SWORD|SWORD API]] (or the [[DevTools:JSword|JSword]] implementation) is used directly or via other language bindings.
  
==SWORD Input formats==
+
Our module file format is proprietary in the sense that we see no need to document it and certainly no need to stick to it. We change it when we need to. We therefore do not encourage direct interaction with it, but firmly recommend use of the API (either C++ or Java). This is the place where we seek stability and consistency.
The SWORD Project supports the following markup: OSIS, ThML, GBF and plain text.
 
  
===OSIS===
+
The SWORD Project currently and actively supports the following markup for module creation: OSIS, [https://tei-c.org/ TEI], ThML and plain text.
''Open Scripture Information Standard''
 
 
 
The Open Scripture Information Standard (OSIS) is "a common format for many visions." It is an XML format for marking up scripture and related text, part of an initiative composed of translators, publishers, scholars, software manufacturers, and technical experts, coordinated by the [http://www.bibletechnologies.net/ Bible Technologies Group]. It is co-sponsored by the [http://www.americanbible.org/ American Bible Society] and the [http://www.sbl-site.org/ Society of Biblical Literature].
 
 
 
The most recent XML schema is [http://www.bibletechnologies.net/osisCore.2.1.1.xsd OSIS 2.1.1], and a [http://www.bibletechnologies.net/20Manual.dsp manual]  is also available. There are some examples of OSIS files at [http://www.bibletechnologies.net/osistext/ Bibles in OSIS 2.0].
 
 
 
This markup format is recommended by the CrossWire Bible Society and can be used for creating all types of resources for The SWORD Project. Support for OSIS is actively maintained; if a module you are working on needs an element or feature that is not yet supported, you may request it.
 
 
 
[http://www.princexml.com/ Prince XML] is a proprietary software program that converts XML and HTML documents into PDF files by applying Cascading Style Sheets (CSS). It is developed by YesLogic, a small company based in Melbourne, Australia. It can be used to create high quality PDF Bibles from OSIS files[http://www.princexml.com/samples/]. A paper by [http://www.bibletechconference.com/speakers/ Jim Albright] of Wycliffe Bible Translators was presented at [http://www.bibletechconference.com/ BibleTech 2010] on using the open-source GUI companion for Prince XML, called [http://code.google.com/p/princess-2010/ Princess].
 
 
 
===ThML===
 
''Theological Markup Language''
 
 
 
This format is a variant of XML based on TEI and HTML, developed by and for the [http://www.ccel.org/ Christian Classics Ethereal Library]. The specifications for this markup format are available at http://www.ccel.org/ThML/.
 
 
 
This markup format is used in some SWORD resources, but only the creation of free-form "General book" modules based on existing CCEL resources is currently supported. Other works and new works should be created using the OSIS or TEI format.
 
 
 
===GBF===
 
''General Bible Format''
 
 
 
This markup format is intended as an aid to preparing Bible texts (specifically the WEB and WEB:ME) for use with various Bible search programs. The complete specification is at http://www.ebible.org/bible/gbf.htm.
 
 
 
This markup format was previously used for some SWORD modules but is now [http://en.wikipedia.org/wiki/Deprecation deprecated] in favor of OSIS. The rudimentary [http://crosswire.org/ftpmirror/pub/sword/utils/perl/gbf2osis.pl gbf2osis.pl] Perl utility may be used to convert GBF to OSIS for import to SWORD's native format. Adyeth hosts a [http://sites.google.com/site/adyeths/theswordproject/gbf2osis.py?attredirects=0 gbf2osis] Python utility that he wrote to convert the GBF texts from [http://ebible.org/ ebible.org] to OSIS. See [http://sites.google.com/site/adyeths/theswordproject].
 
 
 
===VPL===
 
''Verse-Per-Line''
 
 
 
This plain-text format is used by SWORD for import of Bibles. It consists of one verse per line, with an optional verse reference at the beginning. The [[#VPL_Tools|vpl2mod]] utility may be used for import. VPL is deprecated in favor of the IMP format, which is more widely useful. The [[#VPL_Tools|mod2vpl]] utility may be used for export to VPL; it has a command line switch to prepend the verse reference to each line.
 
 
 
===IMP===
 
''Import Format''
 
 
 
This proprietary file format is used by SWORD for import of all types of modules. The three utilities '''imp2vs''' (for Bibles and verse-indexed commentaries), '''imp2ld''' (for lexicons, dictionaries, and daily-devotionals), and '''imp2gbs''' (for all other types of books) can be used to import IMP files to SWORD's native formats.
 
 
 
An IMP file consists of any number of entries. Each entry consists of a key line and any number of content lines. The key line begins with "$$$", followed by the key of the entry. For example, "$$$Gen 1:1" would be the key line for the Genesis 1:1 entry of a Bible or commentary module.
 
 
 
The content lines of an entry may consist of any text (provided that the first three characters of the line are not "$$$"). The internal markup of the content may be in any format supported by SWORD, namely OSIS for any module type or ThML for freeform books from CCEL.
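The entry structure described above can be sketched in a few lines of Python. This is only an illustration of the key-line/content-line layout, not the logic of the imp2vs, imp2ld or imp2gbs utilities; the function name <code>parse_imp</code> and the decision to ignore any lines before the first key line are assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

KEY_PREFIX = "$$$"  # a line starting with this opens a new entry


def parse_imp(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (key, content) pairs from the lines of an IMP file."""
    key: Optional[str] = None
    content: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(KEY_PREFIX):
            # A new key line closes the previous entry, if there was one.
            if key is not None:
                yield key, "\n".join(content)
            key = line[len(KEY_PREFIX):].strip()
            content = []
        elif key is not None:
            content.append(line)
        # Assumption: lines before the first key line are ignored.
    if key is not None:
        yield key, "\n".join(content)


sample = [
    "$$$Gen 1:1\n",
    "In the beginning God created the heaven and the earth.\n",
    "$$$Gen 1:2\n",
    "And the earth was without form, and void;\n",
    "and darkness was upon the face of the deep.\n",
]
entries = list(parse_imp(sample))
```

Each entry keeps its content lines verbatim, so any OSIS or ThML markup inside them passes through untouched.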
 
 
 
There is a CrossWire tool called [http://crosswire.org/ftpmirror/pub/sword/utils/perl/imp2osis.pl imp2osis.pl], which will convert IMP to OSIS fairly well (except a few 'corner cases'). Whenever CrossWire receives an IMP submission, this is the first thing that is run, allowing CrossWire to do validation and other OSIS sanity checks. Some editing is usually necessary after converting an IMP file to an OSIS XML file. For example, the attribute '''canonical''' is omitted from the osisText and all &lt;div> elements, and the language attribute '''xml:lang''' defaults to "en".
 
  
 
==The SWORD Project Utilities==
 
Precompiled versions of many of these programs are available in most Linux distributions, using the distribution's package installer. For Windows, they can be found [http://crosswire.org/ftpmirror/pub/sword/utils/win32 here].<ref>If you have Xiphos installed in Windows, the Sword utilities are available in the Xiphos\bin folder.</ref>  
+
Precompiled versions of many of these programs are available in most '''Linux''' distributions, using the distribution's package installer.<BR>For '''Windows''', they can be found [https://github.com/devroles/mingw_sword_package here].<ref>If you have '''Xiphos''' installed in Windows, the Sword utilities are available in the Xiphos\bin folder.</ref><ref>The latest binaries may be found [https://github.com/devroles/mingw_sword_package/releases/tag/1.9.0a here], though currently without cipherraw.exe</ref>
  
 
===Module Creation Tools===
 
 
 
It is recommended that Unicode text files used for module creation be [[Encoding|encoded]] as UTF-8.<ref>[http://en.wikipedia.org/wiki/Newline EOLs] should be either Unix style (LF) or Windows style (CRLF). Text files with Mac style EOLs (CR) may give rise to errors or other unexpected behaviour.</ref>
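As a hedged illustration of the note on line endings, the following Python sketch rewrites Mac style (CR) and Windows style (CRLF) line endings as Unix style (LF). The function name <code>normalize_eols</code> is hypothetical, and converting everything to LF is merely one reasonable choice, since CRLF is also acceptable. It assumes the data is UTF-8 (or ASCII), where the byte 0x0D can only be a carriage return.

```python
def normalize_eols(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF.

    Assumes UTF-8 or ASCII input, where byte 0x0D is always a carriage return.
    """
    # Replace CRLF first so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


converted = normalize_eols(b"line one\rline two\r\nline three\n")
```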
* imp2gbs - imports free-form General books in IMP format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* imp2gbs &ndash; imports free-form General books in IMP format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* imp2ld - imports lexicons, dictionaries, and daily devotionals in IMP format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* imp2ld &ndash; imports lexicons, dictionaries, and daily devotionals in IMP format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* imp2vs - imports Bibles and commentaries in IMP format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* imp2vs &ndash; imports Bibles and commentaries in IMP format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* vpl2mod - imports Bibles and commentaries in Verse-Per-Line format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* vpl2mod &ndash; imports Bibles and commentaries in Verse-Per-Line format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* [[osis2mod]] - imports Bibles and commentaries in OSIS format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* [[osis2mod]] &ndash; imports Bibles and commentaries in OSIS format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* xml2gbs - imports free-form General books in OSIS or ThML format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* tei2mod &ndash; imports lexicons, dictionaries in TEI format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
 +
* xml2gbs &ndash; imports free-form General books in OSIS or ThML format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
  
 
===Diagnostic Tools===
 
* mod2imp - creates an IMP file<ref>The IMP file may contain a residue of XML markup</ref> from an installed module [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* mod2imp &ndash; creates an IMP file<ref>The IMP file may contain a residue of XML markup</ref> from an installed module [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* stepdump - dumps the contents of a STEP book [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* emptyvss &ndash; exports a list of verses missing from the module (useful for testing modules during development) [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* emptyvss - exports a list of verses missing from the module (useful for testing modules during development) [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
 
  
===Conversion Tools===
+
===Legacy format conversion Tools===
* gbf2osis.pl - a Perl utility for converting GBF to OSIS [http://crosswire.org/ftpmirror/pub/sword/utils/perl/gbf2osis.pl &dagger;]
+
* gbf2osis.pl &ndash; a Perl utility for converting GBF to OSIS [http://crosswire.org/ftpmirror/pub/sword/utils/perl/gbf2osis.pl &dagger;]
* step2vpl - export a STEP book in Verse-Per-Line (VPL) format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* step2vpl &ndash; export a STEP book in Verse-Per-Line (VPL) format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
 
* [[DevTools:Misc#thml2osis|thml2osis]] - converts ThML to OSIS format.
 
* zef2osis.pl - a Perl utility for converting Zefania XML to OSIS [http://crosswire.org/ftpmirror/pub/sword/utils/perl/zef2osis.pl &dagger;]
 
  
 
===OSIS Utilities===
 
* [[mod2osis]] - creates an OSIS file from an installed module [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* vs2osisref &ndash; returns the osisRef of a given (text form) verse reference [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* vs2osisref - returns the osisRef of a given (text form) verse reference [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* xml2gbs &ndash; imports free-form General books in OSIS or ThML format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* xml2gbs - imports free-form General books in OSIS or ThML format to SWORD format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
 
  
 
===Miscellaneous===
 
* cipherraw - used to encipher SWORD modules [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* cipherraw &ndash; used to encipher SWORD modules [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* diatheke - a basic CLI SWORD front-end [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* [[Frontends:Diatheke|diatheke]] &ndash; a basic CLI SWORD front-end [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* mkfastmod - creates a search index for a module [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* [[mkfastmod]] &ndash; creates a search index for a module<ref>Aside: To create a list of installed modules with descriptions, enter the following command, optionally redirecting stderr to a log file.<pre>mkfastmod /? 2>mkfastmod.log</pre></ref> [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* [[mod2zmod]] - creates a compressed module from an installed module [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
+
* [[mod2zmod]] &ndash; creates a compressed module from an installed module [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
* mod2vpl - exports the module to VPL format<ref>The VPL file may contain a residue of XML markup</ref> [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
 
* modwrite - outputs the module contents in VPL format [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
 
* treeidxutil - ''needs a description'' [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
 
* genbookutil - ''needs a description'' [http://crosswire.org/ftpmirror/pub/sword/utils/win32 &dagger;]
 
 
 
 
==== Notes on SWORD Tools ====
 
  
Line 92: Line 46:
  
 
===Recommended Non-SWORD Utilities===
 
* uconv - a utility from [http://icu-project.org/ ICU] for converting between character encodings, performing normalization, transliterating text, etc. (It is similar to iconv, but considerably more powerful.) uconv.exe is part of the [http://crosswire.org/ftpmirror/pub/sword/utils/win32 sword utilities]
+
* uconv &ndash; a utility from [http://icu-project.org/ ICU] for converting between character encodings, performing normalization, transliterating text, etc. (It is similar to iconv, but considerably more powerful.) uconv.exe is part of the [http://crosswire.org/ftpmirror/pub/sword/utils/win32 sword utilities]
* xmllint - a utility (part of the [http://xmlsoft.org/ libxml2] distribution) for validating XML documents [http://crosswire.org/ftpmirror/pub/sword/utils/win32 *]
+
* xmllint &ndash; a utility (part of the [http://xmlsoft.org/ libxml2] distribution) for validating XML documents [http://crosswire.org/ftpmirror/pub/sword/utils/win32 *]
  
==Other Formats==
+
==Formats for which CrossWire maintains converters==
The SWORD Project will utilize primary source e-texts. These e-texts may come in any number of formats. Here is a listing of formats in which Biblical e-texts have been found. <br>Note: the mention of a format does not indicate that The SWORD Project will create a module from that format.
+
The SWORD Project uses primary source e-texts. These texts come in numerous formats. CrossWire maintains converters for a number of formats, described below. The converters may target other markup formats, e.g. TEI or OSIS, or may simply export binary data to text, as is the case with our STEP exporter. Specific discussion of each of the available converters is found elsewhere on this page.
  
===HTML===
+
===USFM===
''Hyper Text Markup Language''
+
[http://paratext.org/usfm ''Unified Standard Format Markers'']
 
 
This is the basic markup language of the World Wide Web. Most SWORD front-ends, such as [http://www.bibletime.info/ BibleTime], [http://gnomesword.sourceforge.net/ GnomeSword], [http://www.bpbible.com BPBible], [http://www.crosswire.org/bibledesktop/ Bible Desktop], [http://www.kiyut.com/products/alkitab/ Alkitab] and [http://thegoan.com/firebible/ FireBible] use HTML for presentation.
 
 
 
===XHTML===
 
''Extensible Hyper Text Markup Language''
 
 
 
[http://en.wikipedia.org/wiki/XHTML XHTML] (Extensible Hypertext Markup Language) is a family of XML markup languages that mirror or extend versions of the widely used Hypertext Markup Language (HTML), the language in which web pages  are written.
 
 
 
Of particular interest is '''XHTML_TE''' which is the extension developed by SIL for their [http://www.sil.org/computing/fieldworks/TE/ FieldWorks Translation Editor].
 
 
 
Of more recent interest is '''Scripture XHTML''' being developed by SIL as part of the [http://pathway.sil.org/ Pathway] project. See [http://pathway.sil.org/features/standards/scripture-xhtml-proposed-standard/ Scripture XHTML Proposed Standard] by Jim Albright. See also [http://pathway.sil.org/features/standards/dictionary-xhtml-proposed-standard/] for dictionaries.
 
 
 
===LitML===
 
''Liturgical Markup Language''
 
 
 
This markup format is a descendant of, and complement to ThML, described [http://hildormen.org/docs/LitML/Guidelines-LitML10-1.0.html here].
 
 
 
The markup reflects its orientation towards liturgy and hymns.
 
 
 
===PDF===
 
''Portable Document Format''
 
 
 
This is an ISO track file format for platform independent rendering of documents. It is derived from Postscript and is maintained by Adobe. As such, it is designed to be substantially a "read only" format. Documents may be text, images, or scanned images of text. Many textual documents cannot reasonably be expected to allow plain-text export.  Even so, the open-source tool called [http://www.mobipocket.com/dev/pdf2xml/ PDF2XML] may turn out to be useful.
 
 
 
===RTF===
 
''[http://en.wikipedia.org/wiki/Rich_Text_Format Rich Text Format]''
 
 
 
This is a markup format designed and maintained by Microsoft for the encoding of formatted text and graphics and easy transfer between applications. It is used as the markup language for presentation in The SWORD Project for Windows. It is also the internal markup format used within STEP books (see below). The format is of limited use as an archival format and there are no plans for SWORD to support it beyond its current use for presentation. On Windows systems, RTF files can be saved as Unicode files using the [http://en.wikipedia.org/wiki/WordPad Wordpad] program, the resulting text file being encoded as UTF-16 (LE) with BOM.
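Since UTF-8 is the encoding recommended for module creation, a text file saved from WordPad as UTF-16 (LE) with BOM will normally need converting first. The following is a minimal Python sketch of that step; the function name <code>utf16_to_utf8</code> is hypothetical, and tools such as uconv can do the same job.

```python
import codecs


def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 text (with a byte order mark) as UTF-8 without a BOM."""
    # The "utf-16" codec reads the BOM to choose the byte order and drops it.
    return data.decode("utf-16").encode("utf-8")


# Build a small UTF-16 LE sample with BOM, as WordPad would save it.
utf16_sample = codecs.BOM_UTF16_LE + "Gen 1:1 \u05d1\u05e8\u05d0\u05e9\u05d9\u05ea".encode("utf-16-le")
utf8_result = utf16_to_utf8(utf16_sample)
```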
 
 
 
The RTF specification is updated with each release of Microsoft Word (to keep it in parity with Word's native serialized data format). The latest version, [http://www.microsoft.com/downloads/details.aspx?FamilyId=DD422B8D-FF06-4207-B476-6B5396A18A2B&displaylang=en 1.9.1 (Word 2007)], is available from Microsoft as a Word document. More easily searched HTML versions of the specification include [http://msdn2.microsoft.com/en-us/library/aa140277(office.10).aspx 1.6 (Word 2000) from Microsoft] and [http://www.biblioscape.com/rtf15_spec.htm 1.5 (Word 97) from Biblioscape]. For SWORD, it is improbable that any features of specifications later than RTF 1.5 will be necessary.
 
 
 
[[User:David Haslam|David Haslam]] has developed [http://www.autoitscript.com/site/autoit/ AutoIt] scripts for MS Word and WordPad to perform the following conversions on multiple files:
 
* '''MSWord to RTF''' &ndash; Convert multiple MS Word documents to Rich Text Format by means of MS Word.
 
* '''RTF to RTF''' &ndash; simplifies the RTF markup for files saved from MS Word, with significant reductions in file size.
 
* '''RTF to Unicode''' text &ndash; removes all formatting and converts the file to UTF-16 LE encoding.
 
See [[Projects:Go Bible#AutoIt_Scripts]].
 
 
 
===ODF===
 
''[http://en.wikipedia.org/wiki/OpenDocument OpenDocument] format''
 
 
 
The Open Document Format for Office Applications (also known as OpenDocument or ODF) is an XML-based file format for representing electronic documents such as spreadsheets, charts, presentations and word processing documents.
 
 
 
'''.odt''' is the file extension for OpenDocument text format, used in word processing documents.
 
 
 
An ODF file can be opened with a compressed archive manager (such as 7-Zip), and the file '''content.xml''' may be extracted. Some XML programs such as '''XML Copy Editor''' have an XML option called '''Pretty-print''' which can convert the linearized XML to indented format, which is easier to read and perhaps more suitable for further processing on a line by line basis.
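The same extraction and pretty-printing can be sketched with the Python standard library alone. An ODF file is a ZIP archive in which the document body is stored in the member named <code>content.xml</code>; the function name <code>pretty_content_xml</code> is hypothetical, and the two-space indent is an arbitrary choice.

```python
import zipfile
from xml.dom import minidom


def pretty_content_xml(odf_path: str) -> str:
    """Return the content.xml member of an ODF file as indented XML text."""
    with zipfile.ZipFile(odf_path) as archive:
        raw = archive.read("content.xml")
    # minidom re-serializes the linearized XML with one element per line.
    return minidom.parseString(raw).toprettyxml(indent="  ")
```

For example, <code>print(pretty_content_xml("document.odt"))</code> would display the indented markup of a file named document.odt (a placeholder name).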
 
  
===ABW===
+
This plain-text format is a common internal-use format within Bible translation agencies and Bible societies. It is the native format of [http://paratext.org/ ParaTExt]. Paratext is used by more than 60% of all Bible translators world-wide. The current release is [https://pt8.paratext.org/ ParaTExt 8.0].
  
''[http://www.abisource.com/ AbiWord] format''
+
Though '''USFM 2.4''' suffices for most Bibles, [https://ubsicap.github.io/usfm/ USFM 3.0] is now available and has several new features. The standard is open source and is maintained at [https://github.com/ubsicap/usfm ubsicap/usfm].
  
'''AbiWord''' is a free, open source word processor. More information about AbiWord is available at [http://www.abisource.com/]. ABW files are a form of XML.
+
CrossWire now has a Python script called usfm2osis.py<ref>This replaces our earlier Perl script [http://crosswire.org/ftpmirror/pub/sword/utils/perl/usfm2osis.pl usfm2osis.pl].</ref> which converts USFM to OSIS for subsequent import to SWORD's native format. See [[Converting SFM Bibles to OSIS]].
  
''Peter has had some success in converting ABW files to OSIS using scripting tools''.
+
USFM uses a separate file for each Bible book. USFM is also supported by the open-source program called [http://bibledit.org/ Bibledit]. There are examples of Bibles in USFM format available for download at [http://ebible.org/]. These include the [http://ebible.org/bible/kjv/kjvsf.zip KJV], [http://ebible.org/bible/asv/asvsf.zip ASV], and [http://ebible.org/bible/web/websf.zip WEB] Bibles.
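To give a feel for the plain-text nature of USFM, here is a rough Python sketch that indexes the verses of one book file by chapter and verse. It recognizes only the <code>\id</code>, <code>\c</code> and <code>\v</code> markers and assumes that every verse starts on its own line, which is common but not required by the standard. It is no substitute for usfm2osis.py, and the function name <code>index_usfm</code> is hypothetical.

```python
import re
from typing import Dict, Tuple

# "\v", the verse number (kept as text, since ranges such as 1-2 occur), then the text.
VERSE_RE = re.compile(r"\\v\s+(\S+)\s*(.*)")


def index_usfm(text: str) -> Tuple[str, Dict[Tuple[int, str], str]]:
    """Return the book code and a {(chapter, verse): text} mapping."""
    book = ""
    chapter = 0
    verses: Dict[Tuple[int, str], str] = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("\\id "):
            book = line.split()[1]
        elif line.startswith("\\c "):
            chapter = int(line.split()[1])
        else:
            match = VERSE_RE.match(line)
            if match:
                verses[(chapter, match.group(1))] = match.group(2)
            # Assumption: all other markers (\p, \s, ...) are simply skipped.
    return book, verses


sample = (
    "\\id GEN\n"
    "\\c 1\n"
    "\\p\n"
    "\\v 1 In the beginning God created the heaven and the earth.\n"
    "\\v 2 And the earth was without form, and void.\n"
)
book, verses = index_usfm(sample)
```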
 
 
===LaTeX===
 
 
 
[http://en.wikipedia.org/wiki/LaTeX LaTeX] is a document markup language and document preparation system for the TeX typesetting program. Some third party source texts for Bible related content made available in PDF format may have been typeset using LaTeX. Sometimes it may be worthwhile asking the owner if the source text might be made available in LaTeX format, especially if there is no other alternative suitable as a starting point for conversion towards making a SWORD module. There are currently no plans for SWORD to support it.
 
 
 
The [http://www.myanmarbible.com/bible/ Myanmar Bible Society] has a utility called bibleTec2osis.pl for converting from TeX into OSIS. Observation: In OSIS files generated by this script, many XML attributes are wrapped between 'apostrophes' rather than "quotation marks".
 
 
 
===STEP===
 
''[http://en.wikipedia.org/wiki/STEP_Library Standard Template for Electronic Publishing]''
 
 
 
This file format was formerly used by [http://www.quickverse.com/ QuickVerse] and [http://www.wordsearchbible.com/ WORDsearch], and is currently used for some [http://www.e-sword.net/ e-Sword] books.
 
 
 
While not an open standard, the publicly released documentation and specifications for this format can be found partially mirrored at http://www.crosswire.org/bsisg/. Some utilities for working with this format are listed below. It is unlikely that the SWORD Project will support this format in the future as it is largely dead.
 
 
 
''Not to be confused with STEP (Scripture Tools for Every Pastor) &ndash; and the new front-end application ([[Frontends:TyndaleStep|Tyndale STEP]]) being developed by Tyndale House, Cambridge in collaboration with CrossWire''.
 
 
 
===Unbound Bible Format===
 
''Unbound Bible Format''
 
 
 
[http://unbound.biola.edu/ BIOLA's Unbound Bible] offers many of its resources for download in a proprietary, but relatively simple [http://en.wikipedia.org/wiki/Tab_delimited tab-delimited] plain-text format (TDT). There are usually two variants, one with versification mapping to the [http://en.wikipedia.org/wiki/American_Standard_Version ASV], and the other without verse mapping.
 
 
 
There is no widespread use of this format, but the rudimentary [http://crosswire.org/ftpmirror/pub/sword/utils/perl/unb2osis.pl unb2osis.pl] utility may be used to convert Unbound Bible format to OSIS for import to SWORD's native format.
 
 
 
It is a relatively simple task to create a script or filter to convert TDT format to [http://en.wikipedia.org/wiki/Comma-separated_values CSV] format and/or ''vice versa''.
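A minimal sketch of such a filter in Python, using the standard <code>csv</code> module, might look as follows. The function name <code>tdt_to_csv</code> is hypothetical and the column values in the sample are made up for illustration; they do not reproduce the actual column layout of the Unbound Bible files. Quote characters in the tab-delimited input are treated as ordinary text, and the CSV writer adds quoting wherever a field contains a comma or a quotation mark.

```python
import csv
import io


def tdt_to_csv(tdt_text: str) -> str:
    """Convert tab-delimited text to comma-separated text."""
    reader = csv.reader(
        io.StringIO(tdt_text, newline=""),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # quotes in TDT fields are plain characters
    )
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Made-up sample row: book, chapter, verse, text.
csv_text = tdt_to_csv('Gen\t1\t3\tAnd God said, "Let there be light"\n')
```

The reverse direction is the same loop with the default reader and a writer created with <code>delimiter="\t"</code>.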
 
 
 
===USFM===
 
[http://confluence.ubs-icap.org/display/USFM/Home ''Unified Standard Format Markers'']
 
 
 
This plain-text format is a common internal-use format within Bible translation agencies and Bible societies. It is the native format of [http://paratext.ubs-translations.org/ Paratext]. Paratext is used by more than 60% of all Bible translators world-wide. The current release is Paratext 7.2. Our own Perl script [http://crosswire.org/ftpmirror/pub/sword/utils/perl/usfm2osis.pl usfm2osis.pl] may be used to convert USFM to OSIS for import to SWORD's native format. See [[Converting SFM Bibles to OSIS]]. USFM uses a separate file for each Bible book. USFM is also supported by the open-source program called [http://sites.google.com/site/bibledit/ Bibledit]. There are examples of Bibles in USFM format available for download at [http://ebible.org/]. These include the [http://ebible.org/bible/kjv/kjvsf.zip KJV], [http://ebible.org/bible/asv/asvsf.zip ASV], [http://ebible.org/bible/web/websf.zip WEB], [http://ebible.org/bible/hnv/hnvsf.zip HNV] and [http://ebible.org/pdg/tokpisinsf.zip PNG] Bibles.
 
  
 
USFM is one of the formats that can be used by [[Projects:Go Bible/Go Bible Creator|Go Bible Creator]].
 
  
===USX===
+
'''Note:'''
 
+
<references />
'''USX''' is an XML schema that will be the underlying data structure in the next release of '''UBS Paratext''', which is in beta now. SIL's Language Software Development team is working along with UBS on this. This version of Paratext can take in USFM projects and export USX files.
 
 
 
USX was defined to support the [http://everytribeeverynation.org/ Every Tribe Every Nation] Digital Bible Library alliance. The alliance brings together the United Bible Society, SIL/WBT, American Bible Society and other Bible Agencies.
 
 
 
The [https://bitbucket.org/paratext/dblvalidation USX schema] is available in both .xsd and .rnc formats. The latter denotes the compact [http://relaxng.org/ Relax NG] Schema language.
 
 
 
===USFX===
 
''Unified Scripture Format XML''
 
 
 
This XML file format is designed to provide clean conversions from Scripture to USFM compliant file formats. A more comprehensive description can be found at [http://ebible.org/usfx/]. Despite the similar names, this USFX is not the same as USX. There is no widespread use of this format and there are no plans for SWORD to support it in any way.
 
 
 
===LIFT===
 
''[http://code.google.com/p/lift-standard/ Lexicon Interchange FormaT]''
 
 
 
'''LIFT''' is an XML format for storing lexical information, as used in the creation of dictionaries. It's not necessarily the format for your lexicon. That can be tied to whatever program you're using. But LIFT allows you to move that data between programs (hence the term 'interchange'). Programs that support LIFT include [http://www.wesay.org/ WeSay], [http://www.sil.org/computing/fieldworks/flex/ FieldWorks Language Explorer (FLEx)] and [http://lexiquepro.com/ Lexique Pro].
 
 
 
===XSEM===
 
''XML Scripture Encoding Model''
 
 
 
This XML format was proposed by SIL. A comprehensive description of the markup language can be found [http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=XSEM&_sc=1 here].
 
 
 
The formal specifications can be downloaded as a ZIP file [http://scripts.sil.org/cms/scripts/render_download.php?site_id=nrsi&format=file&media_id=XSEM_Source&filename=XSEM_Source.zip here].
 
 
 
The designers of this markup language were instrumental in the writing of the OSIS Specification and it has largely been [http://en.wikipedia.org/wiki/Deprecation deprecated] in favor of using OSIS. There is no widespread use of this format and there are no plans for SWORD to support it in any way.
 
 
 
=== OXES ===
 
''Open XML for Editing Scripture''
 
 
 
This is a new markup language related to OSIS. The project is administered by Michael Cochran of SIL International. The draft schema is maintained by Jim Albright of JAARS, with contributions from several SIL personnel and others. OXES was developed to add back translations, translator notes, consultant notes, and status of translation. OSIS is highly ''extensible''. OXES is ''restrictive''. All options are explicitly named. OSIS focuses on the ''finished'' translation. OXES includes ''process information'' so in the future translators will know why a passage was translated the way it is.
 
 
 
===DTX===
 
DTX is a local format, probably used only in Japanese Bible study software such as '''JBible''' and '''Seino no Tatujin'''. It is a very simple format in which each line has the form "bbcccvvv \t CONTENT", where bb is the book id, ccc the chapter and vvv the verse number (e.g. 01001001 for Genesis 1:1).
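The line layout described above is simple enough to parse in a few lines of Python. This sketch is unrelated to the GUI conversion utility mentioned below; the names <code>DtxVerse</code> and <code>parse_dtx_line</code> are hypothetical, and tolerating spaces around the tab is an assumption.

```python
from typing import NamedTuple


class DtxVerse(NamedTuple):
    book: int
    chapter: int
    verse: int
    text: str


def parse_dtx_line(line: str) -> DtxVerse:
    """Split one "bbcccvvv<TAB>CONTENT" line into its parts."""
    ref, _, text = line.rstrip("\r\n").partition("\t")
    ref = ref.strip()  # assumption: stray spaces around the tab are tolerated
    if len(ref) != 8 or not ref.isdigit():
        raise ValueError(f"not a bbcccvvv reference: {ref!r}")
    return DtxVerse(int(ref[:2]), int(ref[2:5]), int(ref[5:8]), text.strip())


verse = parse_dtx_line("01001001\tIn the beginning God created the heaven and the earth.")
```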
 
 
 
Kunio has developed a GUI conversion utility for Windows to convert DTX format to OSIS XML and thence to Sword module.
 
 
 
Download from
 
* http://openlp.4j4u.net/jbible2osis/ &ndash; in Japanese, or direct download from
 
* http://openlp.4j4u.net/jbible2osis_download/setup.exe
 
 
 
So far, it has been tested successfully for the following translations:
 
* Shinkaiyaku (New Revised Japanese?) 4th edition
 
* Kougoyaku (Colloquial Japanese) 4th edition
 
 
 
'''Notes:'''
 
# The conversion utility has been placed in the public domain, with the software provided under the GNU GPL.
 
# It probably works for other Bible editions too, but this has not been tested for lack of data.
 
# For the Shin kyodoyaku translation, he could not get it to work because the verse order is too inconsistent.
 
# The terms and conditions for use of the data in these programs indicate that it is OK to convert the data for personal use, as long as the user has purchased the software.
 
 
 
===XML===
 
''eXtensible Markup Language''
 
 
 
This is a generic family of markup formats. Links to a number of XML specifications can be found [http://xml.coverpages.org/xmlApplications.html here]. Each flavor has its own specifications. SWORD supports markup in the XML formats OSIS and ThML internally.
 
 
 
===Zefania XML===
 
Zefania is an XML format for Bible markup with only the most simple structural tags for book/chapter/verse, notes, etc.
 
The project is now hosted on [http://sourceforge.net/projects/zefania-sharp/ SourceForge].
 
The [http://sourceforge.net/projects/zefbiblereader/ Zefania Bible Reader] may be used to display Zefania XML Bibles through XSL transformation in browsers.
 
See also the related [http://zefania.blogspot.com/ Bible Resources Archive].
 
 
 
The CrossWire utility [http://crosswire.org/ftpmirror/pub/sword/utils/perl/zef2osis.pl zef2osis.pl] may be used to convert Zefania XML to OSIS for import to SWORD's native format.
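To give an idea of how simple the book/chapter/verse structure is, here is a minimal Python sketch that lists the verses in a Zefania file. The element names <code>BIBLEBOOK</code>, <code>CHAPTER</code> and <code>VERS</code> and the attributes <code>bnumber</code>, <code>cnumber</code> and <code>vnumber</code> are assumptions based on commonly seen Zefania files and should be checked against the schema; the function name is hypothetical. This is not how zef2osis.pl works, only an illustration of the structure.

```python
import xml.etree.ElementTree as ET


def iter_zefania_verses(xml_text):
    """Yield (book, chapter, verse, text) tuples from Zefania XML text.

    Assumed structure: BIBLEBOOK[@bnumber] / CHAPTER[@cnumber] / VERS[@vnumber].
    """
    root = ET.fromstring(xml_text)
    for book in root.iter("BIBLEBOOK"):
        for chapter in book.iter("CHAPTER"):
            for verse in chapter.iter("VERS"):
                # itertext() also collects text inside inline child elements
                text = "".join(verse.itertext()).strip()
                yield (int(book.get("bnumber")), int(chapter.get("cnumber")),
                       int(verse.get("vnumber")), text)
```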
 
 
 
===Go Bible===
 
Following an agreement made in July 2008 with the program's author Jolon Faichney, [[Projects:Go Bible|Go Bible]] was adopted by CrossWire as its Java ME software project.
 
 
 
To achieve the navigation speed and general ease of use on even the simplest of Java mobile phones, Go Bible data is fully indexed, as well as being compressed (as are all JAR files).  The format is described in [http://code.google.com/p/gobible/wiki/GoBibleDataFormat Go Bible data format]. Go Bible data is structured as Book | Chapter | Verse text and does not support notes, headings, cross-references, etc. The developer kit [http://gobible.jolon.org/developer/welcome.html Go Bible Creator] can take USFM, ThML or OSIS as the source text format, but the source usually has to be specially prepared. For example, OSIS files produced by Snowfall Software's SFMToOSIS script are structured differently from what Go Bible Creator expects. Work has begun on an [http://en.wikipedia.org/wiki/XSL_Transformations XSLT] script to convert such OSIS XML files to the format suitable for Go Bible. [[Projects:Go Bible/Go Bible Creator|Go Bible Creator]] version 2.3.2 and onwards can take a folder of USFM files as the source text format.
 
 
 
Go Bible source code is now available [https://crosswire.org/svn/gobible/ here] on the CrossWire Repository. ''To access this you will need to have an account''.
 
 
 
GoBibleDataFormat is being extended in the [[Projects:Go Bible/SymScroll|SymScroll]] branch.
 
 
 
=== MediaWiki ===
 
[http://www.dsmedia.org/ Distant Shores Media] is pioneering the use of the [http://www.mediawiki.org/ MediaWiki] format for encoding Bible translations via its [http://door43.org/ Door43] portal. See [http://www.dsmedia.org/blog/publishing-usfm-encoded-bible-translations-mobile-phones-instantly Publishing USFM-encoded Bible translations for mobile phones. Instantly]. One of their programmers has developed an extension to the MediaWiki server (which powers Door43) called '''USFMtag''' that implements this concept. USFM-encoded Bible translations can be copied-and-pasted into any page on Door43 and the raw text is rendered in the browser as formatted text.
 
 
 
=== PMD ===
 
 
 
''Adobe PageMaker Document''
 
 
 
PMD is the document format of Adobe PageMaker, a desktop publishing (DTP) program. See [http://en.wikipedia.org/wiki/Adobe_PageMaker] for history & description. PageMaker has been superseded by Adobe [http://en.wikipedia.org/wiki/Adobe_InDesign InDesign], but some Bible Societies or translators may still be using it. ''We hope to post details of how to extract usable text from a PMD file''.
 
 
 
=== EPUB ===
 
 
 
''Electronic Publication'' [http://en.wikipedia.org/wiki/Epub]
 
 
 
EPUB (electronic publication) is an e-book standard, by the [http://www.idpf.org/ International Digital Publishing Forum] (IDPF), which consists of three file format standards (files have the extension .epub). It supersedes the Open eBook standard. This format can be read by a number of desktop OS readers (e.g. Adobe Digital Editions and FBReader) as well as some e-book readers (e.g. Sony Reader and the iPad).
 
 
 
=== ABW ===
 
 
 
''AbiWord format''
 
 
 
[http://www.abisource.com/ AbiWord] is a multi-platform open-source word processor. Converting MS Word documents to AbiWord format produces a flat XML file that is readily accessible to subsequent processing by Perl scripts or XSLT. AbiWord can also open and save many other file formats.
 
 
 
In a bash shell on a *nix desktop you can do a batch convert like this:
 
 
 
<code>$ for i in *.doc; do abiword --to=abw "$i"; done</code>
 
 
 
As with any MS Word texts, the output will vary depending on how cleanly people have used styles or not. Style names and info are transferred into AbiWord's format. But even if no proper styling was used the information on fonts, etc can likely be used for transformation purposes.
 
 
 
=== SGML ===
 
''Standard Generalized Markup Language''
 
 
 
SGML is an ISO-standard technology for defining generalized markup languages for documents.[http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language]
 
 
 
Generalized markup is based on two postulates:
 
* Markup should describe a document's structure and other attributes, rather than specify the processing to be performed on it, as descriptive markup need be done only once, and will suffice for future processing.
 
* Markup should be rigorous so that the techniques available for processing rigorously-defined objects like programs and data bases, can be used for processing documents as well.
 
 
 
For example, some Bible study resources use [http://www.rocketsoftware.com/section/views Folio Views]. Folio can export data as "FFF files" (Folio Flat Format), which are loosely in SGML format.
 
 
 
A Google search finds several converters from SGML to XML. The following program may prove useful:
 
 
 
* [http://www.jclark.com/sp/ SP] &ndash; developed by James Clark. "An open-source SGML parser written in C++. I wrote this from scratch to overcome the limitations of sgmls. This is now used in numerous SGML products and is widely regarded as the best SGML parser."
 
 
 
=== SMIL ===
 
''Synchronized Multimedia Integration Language''
 
 
 
'''SMIL''', the [http://en.wikipedia.org/wiki/Synchronized_Multimedia_Integration_Language Synchronized Multimedia Integration Language], is a W3C recommended XML markup language for describing multimedia presentations. It defines markup for timing, layout, animations, visual transitions, and media embedding, among other things. SMIL allows the presentation of media items such as text, images, video, and audio, as well as links to other SMIL presentations, and files from multiple web servers.
 
 
 
Examples of integrating synchronized audio recordings with Bible text (achieved using SMIL) may be found in this Japanese [http://bible.salterrae.net/kougo/daisy/ audio Bibles] site.
 
 
 
SMIL is supported by third-party software such as that from [http://www.daisy.org/ DAISY].
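As a rough illustration of how SMIL can pair text with audio, the following Python sketch builds a sequence of <code>par</code> elements, each playing one text fragment in parallel with one clip of an audio file. The file names, fragment ids and the helper <code>build_smil</code> are hypothetical; the <code>clipBegin</code> and <code>clipEnd</code> attribute names follow SMIL 2.0 and later. A real document would also declare the SMIL namespace and a <code>head</code> section, which are omitted here for brevity.

```python
import xml.etree.ElementTree as ET


def build_smil(text_src, audio_src, clips):
    """Return a minimal SMIL document as a string.

    clips is a list of (fragment_id, begin_seconds, end_seconds) tuples.
    """
    smil = ET.Element("smil")
    body = ET.SubElement(smil, "body")
    seq = ET.SubElement(body, "seq")  # children play one after another
    for fragment_id, begin, end in clips:
        par = ET.SubElement(seq, "par")  # children play at the same time
        ET.SubElement(par, "text", src=f"{text_src}#{fragment_id}")
        ET.SubElement(par, "audio", src=audio_src,
                      clipBegin=f"{begin}s", clipEnd=f"{end}s")
    return ET.tostring(smil, encoding="unicode")
```

Calling <code>build_smil("gen1.html", "gen1.mp3", [("v1", 0, 4.5), ("v2", 4.5, 9)])</code> produces two <code>par</code> blocks, one per verse.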
 
  
 
==Other Utilities==
 
  
 
* [http://gbcpreprocessor.codeplex.com/ Go Bible Creator USFM Preprocessor] &ndash; This is a tool to parse through and identify, correct and publish USFM file formats into a file format that can easily be put into the Go Bible mobile phone program.
 
 
===GBF Tools===
 
* gbfconvertor, including gbf2osis, gbf2xsem, & gbf2sf - utilities for converting GBF to OSIS, XSEM, and SFM [http://ebible.org/translation/gbf.html]
 
* gbfsrc - utilities for converting GBF to "HTML, RTF, TeX, plain ASCII text, a format readable by BibleWorks 5 or later, and a couple of less useful formats" [http://ebible.org/translation/gbf.html]
 
 
===STEP Utilities===
 
* step2rtf - extracts the internal RTF text from STEP books  [http://www.customconsulting.us/step2rtf.zip]
 
* stepr - a rudimentary STEP reader [http://www.customconsulting.us/stepr-0.3.1.tgz]
 
  
 
===ThML Utilities===
 
 
* CCEL Desktop - a program for viewing and developing CCEL books [http://ccel-desktop.sourceforge.net/]
 
 
===Zefania Utilities===
 
* Zefania_2_sword_win32 - ''sed based scripts maintained by JensG'' [http://www.grabner-online.de/download/zefania_2_sword_win32.zip]
 
 
== Optical Character Recognition ==
 
[[Non-CrossWire Text-Development Projects|Text development]] activities may be greatly assisted by using [http://en.wikipedia.org/wiki/Optical_character_recognition OCR] software. This section will list OCR programs that CrossWire volunteers have found useful. Proprietary programs should not be listed here, the preference at CrossWire being to use free and/or open-source software.
 
 
=== Tesseract ===
 
 
* [http://en.wikipedia.org/wiki/Tesseract_(software) Tesseract] is a free optical character recognition engine. It was originally developed at Hewlett-Packard from 1985 until 1995. After ten years with no development, Hewlett-Packard and UNLV released it as open source in 2005. Tesseract is currently developed by Google and released under the Apache License, Version 2.0. Tesseract is considered one of the most accurate free software OCR engines currently available.
 
 
* [http://vietocr.sourceforge.net/ VietOCR] &ndash; A Java/.NET GUI frontend for Tesseract OCR engine. Supports optical character recognition for Vietnamese language. Tip: Visit [http://www.moheb.de/ocr.html] to read how Moheb Mekhaiel adapted VietOCR to scan Coptic documents.
 
  
 
== See also ==

* [[DevTools:IMP Format|IMP Format]] &ndash; general import format used for various module types
* [[DevTools:GBF|General Bible Format (GBF)]] &ndash; legacy format now deprecated
* [[DevTools:ThML|Theological Markup Language (ThML)]] &ndash; legacy format now deprecated
* [[Frontends:Bookmarks Standard]]
* [[File Formats Cruft]]

[[Category:Development tools]]
[[Category:File formats]]
[[Category:OSIS]]
[[Category:ThML]]

Latest revision as of 12:11, 20 February 2021

This page lists some of the more common file formats relevant to The SWORD Project, associated utilities, and other CrossWire projects.

CrossWire Bible Society respects copyright. As such, conversion of material that is under copyright without permission from the copyright holders is not supported by The SWORD Project.

SWORD modules

Other than the source code for the SWORD API, there is no documentation for the file format of a SWORD module. The intention is that the SWORD API (or the JSword implementation) is used directly or via other language bindings.

Our module file format is proprietary in the sense that we see no need to document it and certainly no need to stick to it. We change it when we need to. We therefore do not encourage direct interaction with it, but firmly recommend use of the API (either C++ or Java). This is the place where we seek stability and consistency.

The SWORD Project currently and actively supports the following markup for module creation: OSIS, TEI, ThML and plain text.

The SWORD Project Utilities

Precompiled versions of many of these programs are available in most Linux distributions, using the distribution's package installer.
For Windows, they can be found here.[1][2]

Module Creation Tools

It is recommended that Unicode text files used for module creation be encoded as UTF-8.[3]

  • imp2gbs – imports free-form General books in IMP format to SWORD format
  • imp2ld – imports lexicons, dictionaries, and daily devotionals in IMP format to SWORD format
  • imp2vs – imports Bibles and commentaries in IMP format to SWORD format
  • vpl2mod – imports Bibles and commentaries in Verse-Per-Line format to SWORD format
  • osis2mod – imports Bibles and commentaries in OSIS format to SWORD format
  • tei2mod – imports lexicons, dictionaries in TEI format to SWORD format
  • xml2gbs – imports free-form General books in OSIS or ThML format to SWORD format

Diagnostic Tools

  • mod2imp – creates an IMP file[4] from an installed module
  • emptyvss – exports a list of verses missing from the module (useful for testing modules during development)

Legacy format conversion Tools

  • gbf2osis.pl – a Perl utility for converting GBF to OSIS
  • step2vpl – export a STEP book in Verse-Per-Line (VPL) format
  • thml2osis - converts ThML to OSIS format.

OSIS Utilities

  • vs2osisref – returns the osisRef of a given (text form) verse reference
  • xml2gbs – imports free-form General books in OSIS or ThML format to SWORD format

Miscellaneous

  • cipherraw – used to encipher SWORD modules
  • diatheke – a basic CLI SWORD front-end
  • mkfastmod – creates a search index for a module[5]
  • mod2zmod – creates a compressed module from an installed module

Notes on SWORD Tools

  1. If you have Xiphos installed in Windows, the Sword utilities are available in the Xiphos\bin folder.
  2. The latest binaries may be found here, though currently without cipherraw.exe
  3. EOLs should be either Unix style (LF) or Windows style (CRLF). Text files with Mac style EOLs (CR) may give rise to errors or other unexpected behaviour.
  4. The IMP file may contain a residue of XML markup
  5. Aside: To create a list of installed modules with descriptions, enter the following command, optionally redirecting stderr to a log file.
    mkfastmod /? 2>mkfastmod.log
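
The UTF-8 recommendation and note 3 on line endings can be checked before running any of the import tools. The following Python sketch is one possible way to do so; the function names are hypothetical and are not part of The SWORD Project utilities.

```python
from typing import List


def check_module_source(data: bytes) -> List[str]:
    """Return a list of problems found in the raw bytes of a source text file."""
    problems = []
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8 at byte {exc.start}")
    # After removing CRLF pairs, any CR left over is a bare (old Mac style) EOL.
    if b"\r" in data.replace(b"\r\n", b""):
        problems.append("contains bare CR line endings")
    return problems


def normalize_eols(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to Unix style LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

An empty list from check_module_source means the file is valid UTF-8 and uses only LF or CRLF line endings.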

Recommended Non-SWORD Utilities

  • uconv – a utility from ICU for converting between various character encodings, performing normalization, transliterating texts, etc. (It's similar to iconv, but much, much more powerful.) uconv.exe is part of the SWORD utilities.
  • xmllint – a utility (part of the libxml2 distribution) for validating XML documents

Formats for which CrossWire maintains converters

The SWORD Project uses primary source e-texts. These texts come in numerous formats. CrossWire maintains converters for a number of formats, described below. The converters may target other markup formats, e.g. TEI or OSIS, or may simply export binary data to text, as is the case with our STEP exporter. Specific discussion of each of the available converters is found elsewhere on this page.

USFM

Unified Standard Format Markers

This plain-text format is a common internal-use format within Bible translation agencies and Bible societies. It is the native format of ParaTExt. Paratext is used by more than 60% of all Bible translators world-wide. The current release is ParaTExt 8.0.

Though USFM 2.4 suffices for most Bibles, USFM 3.0 is now available and has several new features. The standard is open source and is maintained at ubsicap/usfm.

CrossWire now has a Python script called usfm2osis.py[1] which converts USFM to OSIS for subsequent import to SWORD's native format. See Converting SFM Bibles to OSIS.

USFM uses a separate file for each Bible book. USFM is also supported by the open-source program called Bibledit. There are examples of Bibles in USFM format available for download at [1]. These include the KJV, ASV, and WEB Bibles.

USFM is one of the formats that can be used by Go Bible Creator.

Note:

  1. This replaces our earlier Perl script usfm2osis.pl.

Other Utilities

These are not part of The SWORD Project, but may be useful. A link is given for each.

Go Bible utilities

  • Go Bible Creator - a Java SE program for converting ThML, OSIS, or USFM to Go Bible. It is being enhanced by SIL to be capable of converting source text in XHTML-TE format.
  • Go Bible Creator USFM Preprocessor – This is a tool to parse through and identify, correct and publish USFM file formats into a file format that can easily be put into the Go Bible mobile phone program.

ThML Utilities

  • CCEL Desktop - a program for viewing and developing CCEL books [2]

See also