Difference between revisions of "File Formats"

From CrossWire Bible Society
m (IMP: utilities (iso redlink))
m (Reverted edits by David Haslam (Talk) to last version by Osk)
Revision as of 22:40, 23 July 2009

Bible study programs use a plethora of markup formats. Even more have been suggested for use in creating Bibles and other religious material.

The SWORD Project respects copyright. As such, conversion of material that is under copyright is not supported by The SWORD Project.

This page lists some of the more common file formats relevant to The SWORD Project and associated utilities.

SWORD Input formats

The SWORD Project supports the following markup: OSIS, ThML, GBF and plain text.

OSIS

Open Scripture Information Standard

The Open Scripture Information Standard (OSIS) is "a common format for many visions." It is an XML format for marking up scripture and related text, part of an initiative composed of translators, publishers, scholars, software manufacturers, and technical experts, coordinated by the Bible Technologies Group. It is co-sponsored by the American Bible Society and the Society of Biblical Literature.

The most recent XML schema is OSIS 2.1.1, and a manual is also available.

This markup format is recommended by the CrossWire Bible Society and can be used for creating all types of resources for The SWORD Project. Support for OSIS is actively maintained, and you may request support for any currently unsupported elements or features that a module you are working on needs.

ThML

Theological Markup Language

This format is a variant of XML based on TEI and HTML, developed by and for the Christian Classics Ethereal Library. The specifications for this markup format are available at http://www.ccel.org/ThML/.

This markup format is used in some SWORD resources, but only the creation of free-form "General book" modules based on existing CCEL resources is currently supported. Other works and new works should be created using the OSIS format.

GBF

General Bible Format

This markup format is intended as an aid to preparing Bible texts (specifically the WEB and WEB:ME) for use with various Bible search programs. The complete specification is at http://www.ebible.org/bible/gbf.htm.

This markup format was previously used for some SWORD modules but is now deprecated in favor of OSIS. The rudimentary gbf2osis.pl utility may be used to convert GBF to OSIS for import to SWORD's native format.

VPL

Verse-Per-Line

This plain-text format is used by SWORD for import of Bibles. It consists of one verse per line, with an optional verse reference at the beginning. The vpl2mod utility may be used for import. VPL is deprecated in favor of the IMP format, which is more widely useful.
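
As an illustration of the one-verse-per-line layout, the following Python sketch splits a VPL line into its optional reference and its verse text. The reference pattern (a single-word book name, optionally preceded by a digit, followed by chapter:verse) is an assumption made for this example only; it does not reproduce how vpl2mod itself parses references, and the function name split_vpl_line is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed reference shape: "Gen 1:1" or "1 John 2:3".
# Multi-word book names are deliberately not handled in this sketch.
REF_PATTERN = re.compile(r"^((?:[1-3]\s)?[A-Za-z]+\.?\s\d+:\d+)\s+(.*)$")


def split_vpl_line(line: str) -> Tuple[Optional[str], str]:
    """Return (reference, text); reference is None when the line has none."""
    line = line.rstrip("\r\n")
    match = REF_PATTERN.match(line)
    if match:
        return match.group(1), match.group(2)
    # No leading reference: the whole line is verse text.
    return None, line
```

A line without a recognizable reference is returned unchanged as verse text, which matches the description of the reference as optional.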

IMP

Import Format

This proprietary file format is used by SWORD for import of all types of modules. The three utilities imp2vs (for Bibles and verse-indexed commentaries), imp2ld (for lexicons, dictionaries, and daily-devotionals), and imp2gbs (for all other types of books) can be used to import IMP files to SWORD's native formats.

An IMP file consists of any number of entries. Each entry consists of a key line and any number of content lines. The key line consists of a line beginning with "$$$". For example, "$$$Gen 1:1" would be the key line for the Genesis 1:1 entry of a Bible or commentary module.

The content lines of an entry may consist of any text (provided that the first three characters of the line are not "$$$"). The internal markup of the content may be in any format supported by SWORD, namely OSIS for any module type or ThML for freeform books from CCEL.
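
The entry structure described above can be sketched in Python as follows. This is a minimal reader written for illustration, not the parser used by imp2vs, imp2ld or imp2gbs; the function name parse_imp is hypothetical, and ignoring any lines that appear before the first key line is an assumption of this sketch.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

KEY_PREFIX = "$$$"


def parse_imp(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (key, content) pairs from the lines of an IMP file."""
    key: Optional[str] = None
    content: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(KEY_PREFIX):
            # A new key line closes the previous entry, if there was one.
            if key is not None:
                yield key, "\n".join(content)
            key = line[len(KEY_PREFIX):].strip()
            content = []
        elif key is not None:
            content.append(line)
        # Assumption: lines before the first key line are ignored.
    if key is not None:
        yield key, "\n".join(content)


sample = [
    "$$$Gen 1:1\n",
    "In the beginning God created the heaven and the earth.\n",
    "$$$Gen 1:2\n",
    "And the earth was without form, and void;\n",
    "and darkness was upon the face of the deep.\n",
]
entries = list(parse_imp(sample))
```

Here entries holds two pairs, keyed "Gen 1:1" and "Gen 1:2"; the second entry keeps its two content lines joined by a newline.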

The SWORD Project Utilities

Precompiled versions of many of these programs are available in most Linux distributions, using the distribution's package installer. For Windows, they can be found here.

Module Creation Tools

  • imp2gbs - imports free-form General books in IMP format to SWORD format
  • imp2ld - imports lexicons, dictionaries, and daily devotionals in IMP format to SWORD format
  • imp2vs - imports Bibles and commentaries in IMP format to SWORD format
  • vpl2mod - imports Bibles and commentaries in Verse-Per-Line format to SWORD format
  • osis2mod - imports Bibles and commentaries in OSIS format to SWORD format
  • xml2gbs - imports free-form General books in OSIS or ThML format to SWORD format

Diagnostic Tools

  • mod2imp - creates an IMP file from an installed module
  • stepdump - dumps the contents of a STEP book

Conversion Tools

  • gbf2osis.pl - a Perl utility for converting GBF to OSIS
  • step2vpl - export a STEP book in Verse-Per-Line (VPL) format
  • thml2osis - converts ThML to OSIS format.
  • zef2osis.pl - a Perl utility for converting Zefania XML to OSIS

OSIS Utilities

  • mod2osis - creates an OSIS file from an installed module
  • vs2osisref - returns the osisRef of a given (text form) verse reference
  • xml2gbs - imports free-form General books in OSIS or ThML format to SWORD format

Miscellaneous

  • cipherraw - used to encipher SWORD modules
  • diatheke - a basic CLI SWORD frontend
  • mkfstmod - creates a search index for a module
  • mod2zmod - creates a compressed module from an installed module

Recommended Non-SWORD Utilities

  • uconv - a utility from ICU for converting between character encodings, performing normalization, transliterating text, etc. (It is similar to iconv, but much more powerful.) uconv.exe is included in the SWORD utilities ZIP files.
  • xmllint - a utility (part of the libxml2 distribution) for validating XML documents

Other Formats

The SWORD Project will utilize primary source e-texts. These e-texts may come in any number of formats. Here is a listing of formats in which Biblical e-texts have been found.
Note: the mention of a format does not indicate that The SWORD Project will create a module from that format.

HTML

Hyper Text Markup Language

This is the basic markup language of the World Wide Web. Some SWORD front-ends, such as BibleTime, GnomeSword, and Bible Desktop, use HTML for presentation.

LitML

Liturgical Markup Language

This markup format is a descendant of, and complement to, ThML, described here.

The markup reflects its orientation towards liturgy and hymns.

PDF

Portable Document Format

This is an ISO track file format for platform-independent rendering of documents. It is derived from PostScript and is maintained by Adobe. Documents may be text, images, or scanned images of text. Even textual documents cannot reasonably be expected to allow plain-text export. As such, it is designed to be a "read only" format.

RTF

Rich Text Format

This is a markup format designed by Microsoft. It is used as the markup language for presentation in The SWORD Project for Windows. It is also the internal markup format used within STEP books (see below). The format is of limited use as an archival format and there are no plans for SWORD to support it beyond its current use for presentation. On Windows systems, RTF files can be saved as Unicode text files using the WordPad program; the resulting text file is encoded as UTF-16 with BOM.

LaTeX

LaTeX is a document markup language and document preparation system for the TeX typesetting program. Some third party source texts for Bible related content made available in PDF format may have been typeset using LaTeX. Sometimes it may be worthwhile asking the owner if the source text might be made available in LaTeX format, especially if there is no other alternative suitable as a starting point for conversion towards making a SWORD module. There are currently no plans for SWORD to support it.

The Myanmar Bible Society has a utility called bibleTec2osis.pl for converting from TeX into OSIS.

STEP

Standard Template for Electronic Publishing

This file format was formerly used by QuickVerse and WORDsearch, and is currently used for some e-Sword books.

While not an open standard, the publicly released documentation and specifications for this format can be found partially mirrored at http://www.crosswire.org/bsisg/. Some utilities for working with this format are listed below. It is unlikely that The SWORD Project will support this format in the future, as it is largely dead.

Unbound Bible Format

Unbound Bible Format

BIOLA's Unbound Bible offers many of its resources for download in a proprietary, but relatively simple, tab-delimited plain-text format (TDT). There are usually two variants, one with versification mapping to the ASV, and the other without verse mapping.

There is no widespread use of this format, but the rudimentary unb2osis.pl utility may be used to convert Unbound Bible format to OSIS for import to SWORD's native format.

It is a relatively simple task to create a script or filter to convert TDT format to CSV format and/or vice versa.
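
A conversion of this kind can be sketched with Python's standard csv module. The sketch assumes only that fields are separated by tabs; it makes no assumption about which columns an Unbound Bible file contains, and the function name tdt_to_csv is hypothetical.

```python
import csv
import io


def tdt_to_csv(tdt_text: str) -> str:
    """Convert tab-delimited text to comma-separated text."""
    # QUOTE_NONE on the reader: quotation marks in verse text are ordinary characters.
    reader = csv.reader(io.StringIO(tdt_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    # The writer quotes any field that contains a comma or a quotation mark.
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

Going the other way is symmetrical: read with the default comma dialect and write with delimiter="\t". Because verse text often contains commas and quotation marks, letting the csv module handle quoting is safer than replacing tabs with commas directly.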

USFM

Unified Standard Format Markers

This plain-text format is a common internal-use format within Bible translation agencies and Bible societies. It is the native format of Paratext. The rudimentary usfm2osis.pl utility may be used to convert USFM to OSIS for import to SWORD's native format. USFM uses a separate file for each Bible book.

See also: Converting SFM Bibles to OSIS

USFX

Unified Scripture Format XML

This XML file format is designed to provide clean conversions from Scripture to USFM-compliant file formats. A more comprehensive description can be found at http://ebt.cx/usfx/. There is no widespread use of this format and there are no plans for SWORD to support it in any way.

XSEM

XML Scripture Encoding Model

This XML format was proposed by SIL. A comprehensive description of the markup language can be found here.

The formal specifications can be downloaded as a ZIP file here.

The designers of this markup language were instrumental in the writing of the OSIS Specification and it has largely been deprecated in favor of using OSIS. There is no widespread use of this format and there are no plans for SWORD to support it in any way.

XML

eXtensible Markup Language

This is a generic family of markup formats. Links to a number of XML specifications can be found here. Each flavor has its own specifications. SWORD supports markup in the XML formats OSIS and ThML internally.

Zefania XML

Zefania was an XML format for Bible markup with only the most simple structural tags for book/chapter/verse, notes, etc. The zef2osis.pl utility may be used to convert Zefania XML to OSIS for import to SWORD's native format.

Go Bible

To achieve navigation speed and general ease of use on even the simplest of Java mobile phones, Go Bible data is fully indexed, as well as being compressed (as are all JAR files). The format is described in Go Bible data format. Go Bible data is structured as Book | Chapter | Verse text and does not support notes, headings, cross-references, etc. The developer kit Go Bible Creator can take either ThML or OSIS as the source text format, but source files usually have to be specially adapted. For example, OSIS files produced by Snowfall Software's USFM2OSIS script are not structured the way Go Bible Creator expects. Work has begun on an XSLT script to convert such OSIS XML files to the format suitable for Go Bible. Go Bible Creator version 2.3.2 and onwards can also take a folder of USFM files as the source text format.

Following an agreement made in July 2008 with the program's author Jolon Faichney, Go Bible is being adopted by CrossWire as its Java ME software project. See here for preliminary information. Volunteers wanted.

Go Bible source code is now available here on the CrossWire Repository. To access this you will need to have an account.

PMD

Adobe PageMaker Document

This is the document format of Adobe PageMaker, a desktop publishing (DTP) program. See [1] for history & description. PageMaker has been superseded by Adobe InDesign, but some Bible Societies or translators may still be using it. We hope to post details of how to extract usable text from a PMD file.

Other Utilities

These are not part of The SWORD Project, but may be useful. A link is given for each.

Go Bible utilities

  • Go Bible Creator - a Java SE program for converting either ThML or OSIS or USFM to Go Bible. Go Bible Creator version 2.3.2 may be downloaded here, while still unavailable at the main Go Bible website.
  • Go Bible Creator USFM Preprocessor - a tool that parses USFM files, identifies and corrects problems, and publishes them in a file format that can easily be put into the Go Bible mobile phone program.

GBF Tools

  • gbfconvertor, including gbf2osis, gbf2xsem, & gbf2sf - utilities for converting GBF to OSIS, XSEM, and SFM [2]
  • gbfsrc - utilities for converting GBF to "HTML, RTF, TeX, plain ASCII text, a format readable by BibleWorks 5 or later, and a couple of less useful formats" [3]

STEP Utilities

  • step2rtf - extracts the internal RTF text from STEP books [4]
  • stepr - a rudimentary STEP reader [5]

ThML Utilities

  • CCEL Desktop - a program for viewing and developing CCEL books [6]

Zefania Utilities

  • Zefania_2_sword_win32 - sed based scripts maintained by JensG [7]

Optical Character Recognition

Text development activities may be greatly assisted by using OCR software. This section will list OCR programs that CrossWire volunteers have found useful. Proprietary programs should not be listed here; the preference at CrossWire is to use free and/or open-source software.

Tesseract

  • Tesseract is a free optical character recognition engine. It was originally developed at Hewlett-Packard from 1985 until 1995. After ten years with no development, Hewlett-Packard and UNLV released it as open source in 2005. Tesseract is currently developed by Google and released under the Apache License, Version 2.0. Tesseract is considered one of the most accurate free software OCR engines currently available.