Difference between revisions of "File Formats Cruft"
(storing cruft on an out-of the way, unlinked page, to be deleted at a later date, when everyone realizes this is just noise)
Revision as of 21:47, 5 April 2012
This page is for cruft from the File Formats page. Specifically, formats that are not and never will be employed or supported by CrossWire or Sword in any meaningful way can be described on this page. Likewise, discussion of completely obvious stuff, like "What is HTML/XML" can live on this page. Thus, the File Formats page can be pared down to useful information, sans the cruft.
The SWORD Project will utilize primary source e-texts. These e-texts may come in any number of formats. Here is a listing of formats in which Biblical e-texts have been found.
Note: the mention of a format does not indicate that The SWORD Project will create a module from that format.
Hyper Text Markup Language
Extensible Hyper Text Markup Language
XHTML (Extensible Hypertext Markup Language) is a family of XML markup languages that mirror or extend versions of the widely used Hypertext Markup Language (HTML), the language in which web pages are written.
Of particular interest is XHTML_TE which is the extension developed by SIL for their FieldWorks Translation Editor.
Liturgical Markup Language
This markup format is a descendant of, and complement to ThML, described here.
The markup reflects its orientation towards liturgy and hymns.
Portable Document Format
This is an ISO-standardized file format (ISO 32000) for platform-independent rendering of documents. It is derived from PostScript and was created by Adobe. Because it targets rendering, it is designed to be substantially a "read only" format. Documents may be text, images, or scanned images of text. Many textual documents cannot reasonably be expected to allow plain-text export. Even so, the open-source tool called PDF2XML may turn out to be useful.
Rich Text Format (RTF) is a markup format designed and maintained by Microsoft for encoding formatted text and graphics and for easy transfer between applications. It is used as the markup language for presentation in The SWORD Project for Windows. It is also the internal markup format used within STEP books (see below). The format is of limited use as an archival format and there are no plans for SWORD to support it beyond its current use for presentation. On Windows systems, RTF files can be saved as Unicode text files using the WordPad program; the resulting text file is encoded as UTF-16 (LE) with BOM.
The RTF specification is updated with each release of Microsoft Word (to keep it in parity with Word's native serialized data format). The latest version, 1.9.1 (Word 2007), is available from Microsoft as a Word document. More easily searched HTML versions of the specification include 1.6 (Word 2000) from Microsoft and 1.5 (Word 97) from Biblioscape. For SWORD, it is improbable that any features of specifications later than RTF 1.5 will be necessary.
- MSWord to RTF – Convert multiple MS Word documents to Rich Text Format by means of MS Word.
- RTF to RTF – simplifies the RTF markup for files saved from MS Word, with significant reductions in file size.
- RTF to Unicode text – removes all formatting and converts the file to UTF-16 LE encoding.
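Text exported as UTF-16 (LE) with BOM, as described above, usually needs re-encoding before other tools can process it. The following is a minimal sketch in Python, not part of any SWORD tooling; the function names `utf16_to_utf8` and `convert_file` are illustrative assumptions, and the target encoding UTF-8 is assumed because that is what most XML tools expect.

```python
from pathlib import Path


def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 text (with BOM) as UTF-8 without a BOM."""
    # The generic "utf-16" codec reads the BOM to pick the byte order
    # and drops it from the decoded text.
    return data.decode("utf-16").encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Convert one exported text file; both paths are supplied by the caller."""
    Path(dst).write_bytes(utf16_to_utf8(Path(src).read_bytes()))
```

Line endings are passed through unchanged, so a file saved on Windows keeps its CRLF line breaks.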
The Open Document Format for Office Applications (also known as OpenDocument or ODF) is an XML-based file format for representing electronic documents such as spreadsheets, charts, presentations and word processing documents.
.odt is the file extension for OpenDocument text format, used in word processing documents.
An ODF file is a ZIP archive, so it can be opened with a compressed archive manager (such as 7-Zip) and the file content.xml extracted. Some XML programs such as XML Copy Editor have an XML option called Pretty-print which can convert the linearized XML to indented format, which is easier to read and perhaps more suitable for further processing on a line-by-line basis.
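The extract-and-indent steps just described can also be scripted. This is a minimal sketch using only the Python standard library; the function name `pretty_content_xml` is an illustrative assumption. Note that indenting inserts whitespace, which can matter where text and inline elements are mixed, so the result is best treated as a reading aid rather than a faithful copy.

```python
import zipfile
from xml.dom import minidom


def pretty_content_xml(odf_path: str) -> str:
    """Return an indented copy of the content.xml inside an ODF file."""
    # An ODF file is a ZIP archive; the document body lives in content.xml.
    with zipfile.ZipFile(odf_path) as archive:
        raw = archive.read("content.xml")
    # toprettyxml() re-serializes the tree with one indent step per level.
    return minidom.parseString(raw).toprettyxml(indent="  ")
```

Calling `pretty_content_xml("example.odt")` (a hypothetical file name) returns the indented XML as a string that can be written to disk or searched line by line.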
AbiWord is a free, open source word processor. More information about AbiWord is available at . ABW files are a form of XML.
Peter has had some success in converting ABW files to OSIS using scripting tools.
LaTeX is a document markup language and document preparation system for the TeX typesetting program. Some third party source texts for Bible related content made available in PDF format may have been typeset using LaTeX. Sometimes it may be worthwhile asking the owner if the source text might be made available in LaTeX format, especially if there is no other alternative suitable as a starting point for conversion towards making a SWORD module. There are currently no plans for SWORD to support it.
The Myanmar Bible Society has a utility called bibleTec2osis.pl for converting from TeX into OSIS. Observation: In OSIS files generated by this script, many XML attributes are wrapped between 'apostrophes' rather than "quotation marks".
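Both quoting styles are valid XML, but line-based tools that search for a pattern such as `osisID="` will miss the apostrophe form. One way to normalize the quoting, sketched here under the assumption that the file is well-formed, is to parse it and serialize it again. The function name `normalize_attribute_quotes` is illustrative, and this sketch returns only the root element, so any XML declaration, DOCTYPE, or comments outside the root are dropped.

```python
from xml.dom import minidom


def normalize_attribute_quotes(xml_text: str) -> str:
    """Re-serialize XML so every attribute value uses double quotes."""
    # minidom always writes attribute values between double quotes,
    # escaping any literal double quote inside a value as &quot;.
    return minidom.parseString(xml_text).documentElement.toxml()
```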
Unified Scripture Format XML
This XML file format is designed to provide clean conversions from Scripture to USFM compliant file formats. A more comprehensive description can be found at . Despite the similar names, this USFX is not the same as USX. There is no widespread use of this format and there are no plans for SWORD to support it in any way.
LIFT (Lexicon Interchange FormaT) is an XML format for storing lexical information, as used in the creation of dictionaries. It is not necessarily the native format of your lexicon, which can be tied to whatever program you are using, but LIFT allows you to move that data between programs (hence the term 'interchange'). Programs that support LIFT include WeSay, FieldWorks Language Explorer (FLEx) and Lexique Pro.
XML Scripture Encoding Model
This XML format was proposed by SIL. A comprehensive description of the markup language can be found here.
The formal specifications can be downloaded as a ZIP file here.
The designers of this markup language were instrumental in the writing of the OSIS Specification and it has largely been deprecated in favor of using OSIS. There is no widespread use of this format and there are no plans for SWORD to support it in any way.
Open XML for Editing Scripture
This is a new markup language related to OSIS. The project is administered by Michael Cochran of SIL International. The draft schema is maintained by Jim Albright of JAARS, with contributions from several SIL personnel and others. OXES was developed to add back translations, translator notes, consultant notes, and translation status. Whereas OSIS is highly extensible, OXES is restrictive: all options are explicitly named. And whereas OSIS focuses on the finished translation, OXES includes process information, so that future translators will know why a passage was translated the way it was.
DTX is a local format, probably used only in Japanese Bible study software such as JBible and Seino no Tatujin. It is a very simple format in which each line has the form "bbcccvvv \t CONTENT", where bb is the book id, ccc the chapter, and vvv the verse number (e.g. 01001001 for Genesis 1:1).
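The line layout described above is simple enough to parse in a few lines. The following Python sketch is not Kunio's converter; the names `Verse` and `parse_dtx_line` are illustrative, and it assumes that the separator is a single tab and that any spaces around it are not significant.

```python
from typing import NamedTuple


class Verse(NamedTuple):
    book: int
    chapter: int
    verse: int
    text: str


def parse_dtx_line(line: str) -> Verse:
    """Split one DTX line of the form 'bbcccvvv<TAB>CONTENT'."""
    ref, _, content = line.rstrip("\r\n").partition("\t")
    ref = ref.strip()
    # The reference must be exactly eight ASCII digits: bb + ccc + vvv.
    if len(ref) != 8 or not (ref.isascii() and ref.isdigit()):
        raise ValueError(f"not a DTX reference: {ref!r}")
    return Verse(int(ref[:2]), int(ref[2:5]), int(ref[5:8]), content.strip())
```

For example, `parse_dtx_line("01001001\tIn the beginning")` yields book 1, chapter 1, verse 1 with the verse text, which maps directly onto an OSIS reference such as Gen.1.1 once a book-number table is supplied.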
Kunio has developed a GUI conversion utility for Windows to convert DTX format to OSIS XML and thence to Sword module.
- http://openlp.4j4u.net/jbible2osis/ – in Japanese, or direct download from
So far, it has been tested successfully for the following translations:
- Shinkaiyaku (New Revised Japanese?) 4th edition
- Kougoyaku (Colloquial Japanese) 4th edition
- The conversion utility has been placed in the public domain, with the software provided under the GNU GPL license.
- It probably works for other Bible editions too, but this has not been tested owing to lack of data.
- For the Shin kyodoyaku translation, he could not get it to work because the verse order is too inconsistent.
- The terms and conditions for use of the data in these programs indicate that it is OK to convert the data for personal use, as long as the user has purchased the software.
eXtensible Markup Language
This is a generic family of markup formats. Links to a number of XML specifications can be found here. Each flavor has its own specifications. SWORD supports markup in the XML formats OSIS and ThML internally.
Distant Shores Media is pioneering the use of the MediaWiki format for encoding Bible translations via its Door43 portal. See Publishing USFM-encoded Bible translations for mobile phones. Instantly. One of their programmers has developed an extension to the MediaWiki server (which powers Door43) called USFMtag that implements this concept. USFM-encoded Bible translations can be copied-and-pasted into any page on Door43 and the raw text is rendered in the browser as formatted text.
Adobe PageMaker Document
This is a DeskTop Publishing (DTP) program. See  for history & description. It has been superseded by Adobe InDesign, but some Bible Societies or translators may still be using it. We hope to post details of how to extract usable text from a PMD file.
Electronic Publication 
EPUB (electronic publication) is an e-book standard, by the International Digital Publishing Forum (IDPF), which consists of three file format standards (files have the extension .epub). It supersedes the Open eBook standard. This format can be read by a number of desktop OS readers (e.g. Adobe Digital Editions and FBReader) as well as some e-book readers (e.g. Sony Reader and the iPad).
AbiWord is a multi-platform open-source word-processor. Transformation from MS Word into AbiWord format can render documents very nicely into a flat XML file which appears to be more than accessible to subsequent processing by Perl scripts or XSLT. Many other open file and save file formats are supported.
In a bash shell on a *nix desktop you can do a batch convert like this:
$ for i in *.doc; do abiword --to=abw "$i"; done
As with any MS Word texts, the output will vary depending on how cleanly people have used styles or not. Style names and info are transferred into AbiWord's format. But even if no proper styling was used the information on fonts, etc can likely be used for transformation purposes.
Standard Generalized Markup Language
SGML is an ISO-standard technology for defining generalized markup languages for documents.
Generalized markup is based on two postulates:
- Markup should describe a document's structure and other attributes, rather than specify the processing to be performed on it, as descriptive markup need be done only once, and will suffice for future processing.
- Markup should be rigorous so that the techniques available for processing rigorously-defined objects like programs and data bases, can be used for processing documents as well.
For example, some Bible study resources use Folio Views. Folio can export data as "FFF files", which are loosely in SGML format. FFF denotes Folio Flat Format.
A Google search finds several SGML-to-XML converters. The following program may prove useful:
- SP – developed by James Clark. "An open-source SGML parser written in C++. I wrote this from scratch to overcome the limitations of sgmls. This is now used in numerous SGML products and is widely regarded as the best SGML parser."
Synchronized Multimedia Integration Language
SMIL, the Synchronized Multimedia Integration Language, is a W3C recommended XML markup language for describing multimedia presentations. It defines markup for timing, layout, animations, visual transitions, and media embedding, among other things. SMIL allows the presentation of media items such as text, images, video, and audio, as well as links to other SMIL presentations, and files from multiple web servers.
Examples of integrating synchronized audio recordings with Bible text (achieved using SMIL) may be found on this Japanese audio Bible site.
SMIL is supported by third-party software such as that from DAISY.