File Formats Cruft

From CrossWire Bible Society
Revision as of 22:16, 8 January 2018 by Refdoc (Talk | contribs) (Removed a whole bunch of spurious or irrelevant formats.)

This page holds cruft removed from the File Formats page. Specifically, formats that are not and never will be employed or supported by CrossWire or SWORD in any meaningful way can be described here. Likewise, discussion of the completely obvious, such as "What is HTML/XML?", can live on this page. The File Formats page can thus be pared down to useful information.


Other Formats

The SWORD Project will utilize primary source e-texts. These e-texts may come in any number of formats. Here is a listing of formats in which Biblical e-texts have been found.
Note: the mention of a format does not indicate that The SWORD Project will create a module from that format.


PDF

Portable Document Format

This is an ISO-standardized file format (ISO 32000) for platform-independent rendering of documents. It is derived from PostScript and was originally developed by Adobe. As such, it is designed to be substantially a "read only" format. Documents may be text, images, or scanned images of text. Many textual documents cannot reasonably be expected to allow plain-text export. Even so, the open-source tool called PDF2XML may turn out to be useful.

RTF

Rich Text Format

This is a markup format designed and maintained by Microsoft for encoding formatted text and graphics and for easy transfer between applications. It is used as the markup language for presentation in The SWORD Project for Windows. It is also the internal markup format used within STEP books (see below). The format is of limited use as an archival format and there are no plans for SWORD to support it beyond its current use for presentation. On Windows systems, WordPad can save an RTF file as a Unicode text document; the resulting text file is encoded as UTF-16 LE with a BOM.
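
As a hedged illustration of handling such a WordPad export on a *nix system, the sketch below converts UTF-16 text with a BOM to UTF-8 using iconv. The function name utf16_to_utf8 is made up for this example, and removing carriage returns assumes the file has Windows (CRLF) line endings that are not wanted downstream.

```shell
# Sketch: convert UTF-16 text (with BOM) read from stdin to UTF-8.
# utf16_to_utf8 is a hypothetical helper name, not part of SWORD.
utf16_to_utf8() {
    # "UTF-16" without an LE/BE suffix lets iconv use the BOM to pick
    # the byte order and leave the BOM out of the output.
    # tr removes carriage returns, assuming CRLF line endings.
    iconv -f UTF-16 -t UTF-8 | tr -d '\r'
}
```

It could be used as `utf16_to_utf8 < exported.txt > exported.utf8.txt`, where both file names are placeholders.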

The RTF specification is updated with each release of Microsoft Word (to keep it in parity with Word's native serialized data format). The latest version, 1.9.1 (Word 2007), is available from Microsoft as a Word document. More easily searched HTML versions of the specification include 1.6 (Word 2000) from Microsoft and 1.5 (Word 97) from Biblioscape. For SWORD, it is improbable that any features of specifications later than RTF 1.5 will be necessary.

David Haslam has developed AutoIt scripts for MS Word and WordPad to perform the following conversions on multiple files:

  • MSWord to RTF – converts multiple MS Word documents to Rich Text Format by means of MS Word.
  • RTF to RTF – simplifies the RTF markup for files saved from MS Word, with significant reductions in file size.
  • RTF to Unicode text – removes all formatting and converts the file to UTF-16 LE encoding.

See Projects:Go Bible#AutoIt_Scripts.


ABW

AbiWord format

AbiWord is a free, open source word processor. More information about AbiWord is available at [1]. ABW files are a form of XML.

Peter has had some success in converting ABW files to OSIS using scripting tools.

USFX

Unified Scripture Format XML

This XML file format is designed to provide clean conversions of Scripture to USFM-compliant file formats. A more comprehensive description can be found at [2]. Despite the similar names, USFX is not the same as USX. The format is not in widespread use and there are no plans for SWORD to support it in any way.

LIFT

Lexicon Interchange FormaT

LIFT is an XML format for storing lexical information, as used in the creation of dictionaries. It is not necessarily the working format of your lexicon, which may be tied to whatever program you are using; rather, LIFT allows you to move that data between programs (hence the term 'interchange'). Programs that support LIFT include WeSay, FieldWorks Language Explorer (FLEx) and Lexique Pro.


OXES

Open XML for Editing Scripture

This is a new markup language related to OSIS. The project is administered by Michael Cochran of SIL International. The draft schema is maintained by Jim Albright of JAARS, with contributions from several SIL personnel and others. OXES was developed to add back translations, translator notes, consultant notes, and translation status. Whereas OSIS is highly extensible, OXES is restrictive: all options are explicitly named. OSIS focuses on the finished translation; OXES also records process information, so that future translators will know why a passage was translated the way it was.

Jim Albright has already developed an XSLT-based utility for converting USX to OXES.

DTX

DTX is a local format, probably used only in Japanese Bible study software such as JBible and Seino no Tatujin. It is a very simple format in which each line has the form "bbcccvvv \t CONTENT", where bb is the book ID, ccc the chapter, and vvv the verse number (e.g. 01001001 for Genesis 1:1).
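
The line layout described above can be sketched with a small awk filter. This is only an illustration under stated assumptions, not the conversion utility mentioned below: the function name dtx_to_tsv is hypothetical, the reference and content are assumed to be separated by a single tab (stray spaces around it are tolerated), and lines without an eight-digit reference are simply skipped. Mapping book numbers to book names is left out.

```shell
# Sketch: split DTX lines ("bbcccvvv<TAB>CONTENT") into tab-separated
# book, chapter, verse and text columns.
# dtx_to_tsv is a hypothetical helper name.
dtx_to_tsv() {
    awk '
    {
        # Split at the first tab only, so tabs inside the verse text survive.
        tab = index($0, "\t")
        if (tab == 0) next
        ref  = substr($0, 1, tab - 1)
        text = substr($0, tab + 1)
        # Tolerate stray spaces around the tab.
        gsub(/^ +| +$/, "", ref)
        sub(/^ +/, "", text)
        # Skip anything that is not an eight-digit reference (assumption).
        if (ref !~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/) next
        # Adding 0 turns the zero-padded fields into plain numbers.
        book    = substr(ref, 1, 2) + 0
        chapter = substr(ref, 3, 3) + 0
        verse   = substr(ref, 6, 3) + 0
        printf "%d\t%d\t%d\t%s\n", book, chapter, verse, text
    }'
}
```

Run as `dtx_to_tsv < bible.dtx` (a placeholder file name), a line beginning with 01001001 would come out with 1, 1 and 1 in the first three columns, followed by the verse text.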

Kunio has developed a GUI conversion utility for Windows that converts DTX format to OSIS XML and thence to a SWORD module.

Download from

So far, it has been tested successfully for the following translations:

  • Shinkaiyaku (New Revised Japanese?) 4th edition
  • Kougoyaku (Colloquial Japanese) 4th edition

Notes:

  1. The conversion utility has been released publicly, with the software licensed under the GNU General Public License (GPL).
  2. It probably works for other Bible editions too, but this has never been tested for lack of data.
  3. For the Shin kyodoyaku translation, Kunio could not get it to work because the verse order is too inconsistent.
  4. The terms and conditions for use of the data in these programs indicate that it is OK to convert the data for personal use, as long as the user has purchased the software.


MediaWiki

Distant Shores Media is pioneering the use of the MediaWiki format for encoding Bible translations via its Door43 portal. See "Publishing USFM-encoded Bible translations for mobile phones. Instantly." One of their programmers has developed an extension called USFMtag for the MediaWiki server (which powers Door43) that implements this concept. USFM-encoded Bible translations can be copied and pasted into any page on Door43, and the raw text is rendered in the browser as formatted text.

PMD

Adobe PageMaker Document

This is the document format of Adobe PageMaker, a desktop publishing (DTP) program. See [3] for history and description. PageMaker has been superseded by Adobe InDesign, but some Bible Societies or translators may still be using it. We hope to post details of how to extract usable text from a PMD file.

EPUB

Electronic Publication [4]

EPUB (electronic publication) is an e-book standard from the International Digital Publishing Forum (IDPF). It consists of three file format standards, and its files have the extension .epub. It supersedes the Open eBook standard. The format can be read by a number of desktop applications (e.g. Adobe Digital Editions and FBReader) as well as by some e-book devices (e.g. Sony Reader and the iPad).

ABW

AbiWord format

AbiWord is a multi-platform open-source word processor. Transformation from MS Word into AbiWord format can render documents very nicely into a flat XML file that appears to be readily accessible to subsequent processing by Perl scripts or XSLT. AbiWord also supports opening and saving many other file formats.

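In a bash shell on a *nix desktop you can batch-convert like this (globbing directly instead of parsing ls output, and quoting "$i", keeps file names containing spaces intact):

$ for i in *.doc; do abiword --to=abw "$i"; done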

As with any MS Word text, the output will vary depending on how cleanly styles were used. Style names and information are carried over into AbiWord's format. Even where no proper styling was used, the information on fonts, etc. can likely be used for transformation purposes.