Difference between revisions of "File Formats Cruft"

From CrossWire Bible Society
(SMIL)
(EPUB)
 
(2 intermediate revisions by the same user not shown)
Line 5: Line 5:
 
The SWORD Project will utilize primary source e-texts. These e-texts may come in any number of formats. Here is a listing of formats in which Biblical e-texts have been found. <br>Note: the mention of a format does not indicate that The SWORD Project will create a module from that format.
 
  
===HTML===
 
''Hyper Text Markup Language''
 
  
This is the basic markup language of the World Wide Web. Most SWORD front-ends, such as [http://www.bibletime.info/ BibleTime], [http://gnomesword.sourceforge.net/ GnomeSword], [http://www.bpbible.com BPBible], [http://www.crosswire.org/bibledesktop/ Bible Desktop], [http://www.kiyut.com/products/alkitab/ Alkitab] and [http://thegoan.com/firebible/ FireBible] use HTML for presentation.
 
 
===XHTML===
 
''Extensible Hyper Text Markup Language''
 
 
[http://en.wikipedia.org/wiki/XHTML XHTML] (Extensible Hypertext Markup Language) is a family of XML markup languages that mirror or extend versions of the widely used Hypertext Markup Language (HTML), the language in which web pages  are written.
 
 
Of particular interest is '''XHTML_TE''' which is the extension developed by SIL for their [http://www.sil.org/computing/fieldworks/TE/ FieldWorks Translation Editor].
 
 
Of more recent interest is '''Scripture XHTML''' being developed by SIL as part of the [http://pathway.sil.org/ Pathway] project. See [http://pathway.sil.org/features/standards/scripture-xhtml-proposed-standard/ Scripture XHTML Proposed Standard] by Jim Albright. See also [http://pathway.sil.org/features/standards/dictionary-xhtml-proposed-standard/] for dictionaries.
 
 
===LitML===
 
''Liturgical Markup Language''
 
 
This markup format is a descendant of, and complement to ThML, described [http://hildormen.org/docs/LitML/Guidelines-LitML10-1.0.html here].
 
 
The markup reflects its orientation towards liturgy and hymns.
 
 
===PDF===
 
 
''Portable Document Format''
 
  
Line 44: Line 23:
 
See [[Projects:Go Bible#AutoIt_Scripts]].
 
  
===ODF===
 
''[http://en.wikipedia.org/wiki/OpenDocument OpenDocument] format''
 
 
The Open Document Format for Office Applications (also known as OpenDocument or ODF) is an XML-based file format for representing electronic documents such as spreadsheets, charts, presentations and word processing documents.
 
 
'''.odt''' is the file extension for OpenDocument text format, used in word processing documents.
 
 
An ODF file can be opened with a compressed archive manager (such as 7-Zip), and the file '''content.xml''' may be extracted. Some XML programs such as '''XML Copy Editor''' have an XML option called '''Pretty-print''' which can convert the linearized XML to an indented format, which is easier to read and perhaps more suitable for further processing on a line-by-line basis.
 
  
 
===ABW===
 
Line 60: Line 31:
  
 
''Peter has had some success in converting ABW files to OSIS using scripting tools''.
 
 
===LaTeX===
 
 
[http://en.wikipedia.org/wiki/LaTeX LaTeX] is a document markup language and document preparation system for the TeX typesetting program. Some third party source texts for Bible related content made available in PDF format may have been typeset using LaTeX. Sometimes it may be worthwhile asking the owner if the source text might be made available in LaTeX format, especially if there is no other alternative suitable as a starting point for conversion towards making a SWORD module. There are currently no plans for SWORD to support it.
 
 
The [http://www.myanmarbible.com/bible/ Myanmar Bible Society] has a utility called bibleTec2osis.pl for converting from TeX into OSIS. Observation: In OSIS files generated by this script, many XML attributes are wrapped between 'apostrophes' rather than "quotation marks".
 
  
 
===USFX===
 
Line 77: Line 42:
 
'''LIFT''' is an XML format for storing lexical information, as used in the creation of dictionaries. It's not necessarily the format for your lexicon. That can be tied to whatever program you're using. But LIFT allows you to move that data between programs (hence the term 'interchange'). Programs that support LIFT include [http://www.wesay.org/ WeSay], [http://www.sil.org/computing/fieldworks/flex/ FieldWorks Language Explorer (FLEx)] and [http://lexiquepro.com/ Lexique Pro].
 
  
===XSEM===
 
''XML Scripture Encoding Model''
 
 
This XML format was proposed by SIL. A comprehensive description of the markup language can be found
 
[http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=XSEM&_sc=1 here].
 
 
The formal specifications can be downloaded as a ZIP file
 
[http://scripts.sil.org/cms/scripts/render_download.php?site_id=nrsi&format=file&media_id=XSEM_Source&filename=XSEM_Source.zip here].
 
 
The designers of this markup language were instrumental in the writing of the OSIS Specification and it has largely been [http://en.wikipedia.org/wiki/Deprecation deprecated] in favor of using OSIS. There is no widespread use of this format and there are no plans for SWORD to support it in any way.
 
  
 
=== OXES ===
 
Line 114: Line 69:
 
# The terms of conditions for use of the data in these programs indicate that it's OK to convert the data for personal use, as long as the user has purchased their software.
 
  
===XML===
 
''eXtensible Markup Language''
 
 
This is a generic family of markup formats. Links to a number of XML specifications can be found [http://xml.coverpages.org/xmlApplications.html here]. Each flavor has its own specifications. SWORD supports markup in the XML formats OSIS and ThML internally.
 
  
 
=== MediaWiki ===
 
 
[http://www.dsmedia.org/ Distant Shores Media] is pioneering the use of the [http://www.mediawiki.org/ MediaWiki] format for encoding Bible translations via its [http://door43.org/ Door43] portal. See [http://www.dsmedia.org/blog/publishing-usfm-encoded-bible-translations-mobile-phones-instantly Publishing USFM-encoded Bible translations for mobile phones. Instantly]. One of their programmers has developed an extension to the MediaWiki server (which powers Door43) called '''USFMtag''' that implements this concept. USFM-encoded Bible translations can be copied-and-pasted into any page on Door43 and the raw text is rendered in the browser as formatted text.
 
 
=== PMD ===
 
 
''Adobe PageMaker Document''
 
 
This is a DeskTop Publishing (DTP) program. See [http://en.wikipedia.org/wiki/Adobe_PageMaker] for history & description. It has been superseded by Adobe [http://en.wikipedia.org/wiki/Adobe_InDesign InDesign], but some Bible Societies or translators may still be using it. ''We hope to post details of how to extract usable text from a PMD file''.
 
 
=== EPUB ===
 
 
''Electronic Publication'' [http://en.wikipedia.org/wiki/Epub]
 
 
EPUB (electronic publication) is an e-book standard, by the [http://www.idpf.org/ International Digital Publishing Forum] (IDPF), which consists of three file format standards (files have the extension .epub). It supersedes the Open eBook standard. This format can be read by a number of desktop OS readers (e.g. Adobe Digital Editions and FBReader) as well as some e-book readers (e.g. Sony Reader and the iPad).
 
  
 
=== ABW ===
 
Line 146: Line 85:
 
As with any MS Word texts, the output will vary depending on how cleanly people have used styles or not. Style names and info are transferred into AbiWord's format. But even if no proper styling was used the information on fonts, etc can likely be used for transformation purposes.
 
  
=== SGML ===
 
''Standard Generalized Markup Language''
 
 
SGML is an ISO-standard technology for defining generalized markup languages for documents.[http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language]
 
 
Generalized markup is based on two postulates:
 
* Markup should describe a document's structure and other attributes, rather than specify the processing to be performed on it, as descriptive markup need be done only once, and will suffice for future processing.
 
* Markup should be rigorous so that the techniques available for processing rigorously-defined objects like programs and data bases, can be used for processing documents as well.
 
 
e.g. Some Bible study resources use [http://www.rocketsoftware.com/section/views Folio Views]. Folio can export data as "FFF files" which are loosely in SGML format. FFF denotes Folio Flat Format.
 
 
A Google search finds several converters for SGML to XML. This program may prove useful.
 
 
* [http://www.jclark.com/sp/ SP] &ndash; developed by James Clark. "An open-source SGML parser written in C++. I wrote this from scratch to overcome the limitations of sgmls. This is now used in numerous SGML products and is widely regarded as the best SGML parser."
 
  
 
[[Category:File formats]]
 

Latest revision as of 10:06, 11 January 2018

This page is for cruft from the File Formats page. Specifically, formats that are not and never will be employed or supported by CrossWire or Sword in any meaningful way can be described on this page. Likewise, discussion of completely obvious stuff, like "What is HTML/XML" can live on this page. Thus, the File Formats page can be pared down to useful information, sans the cruft.


Other Formats

The SWORD Project will utilize primary source e-texts. These e-texts may come in any number of formats. Here is a listing of formats in which Biblical e-texts have been found.
Note: the mention of a format does not indicate that The SWORD Project will create a module from that format.


Portable Document Format

This is an ISO track file format for platform-independent rendering of documents. It is derived from PostScript and is maintained by Adobe. As such, it is designed to be substantially a "read only" format. Documents may be text, images, or scanned images of text. Many textual documents cannot reasonably be expected to allow plain-text export. Even so, the open-source tool called PDF2XML may turn out to be useful.

RTF

Rich Text Format

This is a markup format designed and maintained by Microsoft for encoding formatted text and graphics and for easy transfer between applications. It is used as the markup language for presentation in The SWORD Project for Windows. It is also the internal markup format used within STEP books (see below). The format is of limited use as an archival format and there are no plans for SWORD to support it beyond its current use for presentation. On Windows systems, RTF files can be saved as Unicode files using the WordPad program, the resulting text file being encoded as UTF-16 (LE) with BOM.
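
A text file saved this way begins with the two-byte UTF-16 LE byte order mark (FF FE). The following is a minimal Python sketch, not part of any SWORD tool, that re-encodes such a file as UTF-8. It assumes that UTF-8 is the encoding wanted for further processing; the function name `utf16le_bom_to_utf8` is hypothetical.

```python
import codecs
from pathlib import Path


def utf16le_bom_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a UTF-16 LE file that starts with a BOM as UTF-8 (no BOM)."""
    raw = src.read_bytes()
    # Refuse files that do not carry the expected FF FE byte order mark.
    if not raw.startswith(codecs.BOM_UTF16_LE):
        raise ValueError(f"{src} does not start with a UTF-16 LE BOM")
    # Drop the BOM, then decode the remaining bytes as little-endian UTF-16.
    text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # Write bytes directly so that line endings are preserved unchanged.
    dst.write_bytes(text.encode("utf-8"))
```

Writing bytes rather than text keeps the Windows line endings exactly as WordPad saved them, so any normalization can be a separate, deliberate step.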

The RTF specification is updated with each release of Microsoft Word (to keep it in parity with Word's native serialized data format). The latest version, 1.9.1 (Word 2007), is available from Microsoft as a Word document. More easily searched HTML versions of the specification include 1.6 (Word 2000) from Microsoft and 1.5 (Word 97) from Biblioscape. For SWORD, it is improbable that any features of specifications later than RTF 1.5 will be necessary.

David Haslam has developed AutoIt scripts for MS Word and WordPad to perform the following conversions on multiple files:

  • MSWord to RTF – Convert multiple MS Word documents to Rich Text Format by means of MS Word.
  • RTF to RTF – simplifies the RTF markup for files saved from MS Word, with significant reductions in file size.
  • RTF to Unicode text – removes all formatting and converts the file to UTF-16 LE encoding.

See Projects:Go Bible#AutoIt_Scripts.


ABW

AbiWord format

AbiWord is a free, open source word processor. More information about AbiWord is available at [1]. ABW files are a form of XML.

Peter has had some success in converting ABW files to OSIS using scripting tools.

USFX

Unified Scripture Format XML

This XML file format is designed to provide clean conversions from Scripture to USFM compliant file formats. A more comprehensive description can be found at [2]. Despite the similar names, this USFX is not the same as USX. There is no widespread use of this format and there are no plans for SWORD to support it in any way.

LIFT

Lexicon Interchange FormaT

LIFT is an XML format for storing lexical information, as used in the creation of dictionaries. It's not necessarily the format for your lexicon. That can be tied to whatever program you're using. But LIFT allows you to move that data between programs (hence the term 'interchange'). Programs that support LIFT include WeSay, FieldWorks Language Explorer (FLEx) and Lexique Pro.


OXES

Open XML for Editing Scripture

This is a new markup language related to OSIS. The project is administered by Michael Cochran of SIL International. The draft schema is maintained by Jim Albright of JAARS, with contributions from several SIL personnel and others. OXES was developed to add back translations, translator notes, consultant notes, and status of translation. OSIS is highly extensible. OXES is restrictive. All options are explicitly named. OSIS focuses on the finished translation. OXES includes process information so in the future translators will know why a passage was translated the way it is.

Jim Albright of SIL has already developed a utility (using XSLT) for converting USX to OXES (Open XML for Editing Scripture).

DTX

DTX is a local format, probably used only in Japanese Bible study software such as JBible and Seino no Tatujin. It is a very simple format in which each line has the form "bbcccvvv \t CONTENT", where bb is the book ID, ccc the chapter, and vvv the verse number (e.g. 01001001 for Genesis 1:1).
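
The line layout described above can be illustrated with a short Python sketch. This is not Kunio's utility. It only assumes what is stated here: an eight-digit reference, a tab, then the verse text. The names `DtxVerse` and `parse_dtx_line` are hypothetical.

```python
from typing import NamedTuple


class DtxVerse(NamedTuple):
    book: int
    chapter: int
    verse: int
    text: str


def parse_dtx_line(line: str) -> DtxVerse:
    """Split one 'bbcccvvv<TAB>CONTENT' line into its parts."""
    # Everything before the first tab is the reference, the rest is the text.
    ref, _, text = line.rstrip("\r\n").partition("\t")
    ref = ref.strip()
    if len(ref) != 8 or not ref.isdigit():
        raise ValueError(f"unexpected DTX reference: {ref!r}")
    # bb = book ID, ccc = chapter, vvv = verse number.
    return DtxVerse(int(ref[0:2]), int(ref[2:5]), int(ref[5:8]), text.strip())
```

For example, `parse_dtx_line("01001001\tIn the beginning")` yields book 1, chapter 1, verse 1. Mapping the numeric book ID to an OSIS book name would need a lookup table that is not specified here.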

Kunio has developed a GUI conversion utility for Windows to convert DTX format to OSIS XML and thence to Sword module.

Download from

So far, it has been tested successfully for the following translations:

  • Shinkaiyaku (New Revised Japanese?) 4th edition
  • Kougoyaku (Colloquial Japanese) 4th edition

Notes:

  1. The conversion utility has been placed in the public domain, with the software provided under the GNU General Public License (GPL).
  2. It probably works for other Bible editions too, but this has not been tested for lack of data.
  3. For the Shin kyodoyaku translation, he could not get it to work because the verse order is too inconsistent.
  4. The terms and conditions for use of the data in these programs indicate that it is OK to convert the data for personal use, as long as the user has purchased the software.


MediaWiki

Distant Shores Media is pioneering the use of the MediaWiki format for encoding Bible translations via its Door43 portal. See Publishing USFM-encoded Bible translations for mobile phones. Instantly. One of their programmers has developed an extension to the MediaWiki server (which powers Door43) called USFMtag that implements this concept. USFM-encoded Bible translations can be copied-and-pasted into any page on Door43 and the raw text is rendered in the browser as formatted text.

ABW

AbiWord format

AbiWord is a multi-platform open-source word-processor. Transformation from MS Word into AbiWord format can render documents very nicely into a flat XML file which appears to be more than accessible to subsequent processing by Perl scripts or XSLT. Many other open file and save file formats are supported.

In a bash shell on a *nix desktop you can do a batch convert like this:

$ for i in *.doc; do abiword --to=abw "$i"; done

As with any MS Word texts, the output will vary depending on how cleanly people have used styles or not. Style names and info are transferred into AbiWord's format. But even if no proper styling was used the information on fonts, etc can likely be used for transformation purposes.
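
As a rough illustration of how the style names might be used, the Python sketch below lists each paragraph of an ABW document together with its style. It rests on assumptions: that paragraphs are `p` elements carrying a `style` attribute, and that inline formatting sits in `c` elements with a `props` attribute. The sample document is hand-made for this example, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-made sample; a real ABW file carries more metadata and styles.
SAMPLE = (
    '<abiword xmlns="http://www.abisource.com/awml.dtd">'
    "<section>"
    '<p style="Heading 1">Genesis</p>'
    '<p style="Normal"><c props="font-weight:bold">1</c> In the beginning</p>'
    "</section></abiword>"
)


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def abw_paragraphs(xml_text: str) -> list[tuple[str, str]]:
    """Return (style name, plain text) for every paragraph element."""
    root = ET.fromstring(xml_text)
    result = []
    for elem in root.iter():
        if local_name(elem.tag) == "p":
            # itertext() also collects text nested in inline elements.
            text = "".join(elem.itertext()).strip()
            result.append((elem.get("style", ""), text))
    return result
```

Calling `abw_paragraphs(SAMPLE)` gives one pair per paragraph, which a conversion script could then map to OSIS elements according to the style name. Where no proper styling was used, the `props` values on inline elements would have to be inspected instead.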