Difference between revisions of "Validate OSIS or TEI text"
Latest revision as of 15:01, 1 March 2021
Syntax Check and Validation of OSIS/TEI files
An OSIS or TEI text is an XML document that must be:
- Well formed: its syntax must conform to the XML specification. An XML file that is not well formed is not an XML file.
- Valid: a valid XML document is well formed and conforms to the formal definition provided in a schema (or DTD). A document cannot have elements, attributes, or entities not defined in the schema. A schema can also define how elements may be nested, the possible values of attributes, etc.
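The difference between the two checks can be illustrated with a minimal sketch that uses only the Python standard library. It tests well-formedness only; it does not look at any schema, so a file that passes here may still be invalid. The function name is_well_formed is our own choice for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser refuses anything that breaks the XML syntax rules.
        return False
    return True


# A missing closing tag breaks well-formedness.
print(is_well_formed("<div><p>In the beginning</p></div>"))  # True
print(is_well_formed("<div><p>In the beginning</div>"))      # False
```

Validity is a separate, stricter check that needs one of the schema-aware tools described in the rest of this page.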
There are online facilities for XML validation, and many programs capable of schema validation exist. Most XML editors (XML Copy Editor, Oxygen, XMLSpy, Topologi, Notepad++ with the XMLTools plugin, etc.) support some form of XML schema validation.
Bible Technologies Group
The BTG, which sponsored the OSIS committee and hosted the OSIS schema, no longer exists. The schema location must therefore point to a local copy on your computer or to a copy hosted by CrossWire or elsewhere.
For more up-to-date details, see OSIS 211 CR, which includes CrossWire's own updated schema.
Schema
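Before validating XML files, you first need to download a schema from CrossWire.
- For OSIS encoded source files: osisCore.2.1.1-cw-latest.xsd: https://www.crosswire.org/~dmsmith/osis/osisCore.2.1.1-cw-latest.xsd
- For TEI encoded source files: teiP5osis.2.5.0.xsd: http://www.crosswire.org/OSIS/teiP5osis.2.5.0.xsd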
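If you prefer to script the download, the following is a minimal sketch using only the Python standard library. The two URLs are the CrossWire schema locations given on this page; the function names and the default target directory "schemas" are assumptions made for this example.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse

# Schema locations as published by CrossWire (see the list above).
SCHEMA_URLS = [
    "https://www.crosswire.org/~dmsmith/osis/osisCore.2.1.1-cw-latest.xsd",
    "http://www.crosswire.org/OSIS/teiP5osis.2.5.0.xsd",
]


def schema_filename(url: str) -> str:
    """Return the file name part of a schema URL."""
    return posixpath.basename(urlparse(url).path)


def download_schemas(target_dir: str = "schemas") -> list:
    """Download every schema into target_dir and return the saved paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in SCHEMA_URLS:
        destination = os.path.join(target_dir, schema_filename(url))
        urllib.request.urlretrieve(url, destination)
        saved.append(destination)
    return saved
```

Calling download_schemas() once gives you a local copy of both schemas, which is what the command-line tools below expect.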
Online validators
The first and simplest option for checking an XML file is to use an online validator. These check whether your XML is both well formed and valid.
Here are two websites; you'll find others on the Internet.
- Core Filing XML Schema Validator: https://www.corefiling.com/opensource/schemaValidate/ (accepts large files; tested with a 5.5 MB file)
- FreeFormatter Validator: https://www.freeformatter.com/xml-validator-xsd.html (the maximum upload size is 2 MB)
With these validators, you have to upload both the XML file and the schema (.xsd) file to the website before validating.
We do not recommend online validation: it may raise privacy concerns for copyrighted texts, and although it may be fine for a one-off validation task, it soon becomes tedious when you're creating and editing a text and want to validate your work periodically.
CLI Validators
When you're editing a text, one of the fastest options for checking your XML is to use a command-line (CLI) tool.
xmllint
The simplest way is to use the xmllint program included with libxml2. Mac and Linux users likely already have xmllint installed. Windows users who want to try xmllint will find instructions here: https://techrina.net/2019/01/25/using-xmllint-program-for-windows-7/
To validate an OSIS XML file, enter:
xmllint --noout --schema osisCore.2.1.1-cw-latest.xsd test.osis.xml
To validate a TEI xml file enter:
xmllint --noout --schema teiP5osis.2.5.0.xsd test.tei.xml
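When you validate often, it can be convenient to wrap the xmllint call in a small script. The sketch below builds exactly the command line shown above and runs it with Python's subprocess module. It assumes xmllint is on your PATH and relies on xmllint returning a non-zero exit status when validation fails; the function names are our own.

```python
import shutil
import subprocess


def build_xmllint_command(schema_path: str, xml_path: str) -> list:
    """Return the xmllint argument list used on this page."""
    return ["xmllint", "--noout", "--schema", schema_path, xml_path]


def validate_with_xmllint(schema_path: str, xml_path: str) -> bool:
    """Run xmllint and return True when it reports success."""
    if shutil.which("xmllint") is None:
        raise FileNotFoundError("xmllint was not found on PATH")
    completed = subprocess.run(
        build_xmllint_command(schema_path, xml_path),
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0:
        # xmllint writes its diagnostics to standard error.
        print(completed.stderr.strip())
    return completed.returncode == 0
```

For example, validate_with_xmllint("osisCore.2.1.1-cw-latest.xsd", "test.osis.xml") performs the same check as the first command above.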
xmlstarlet
XMLStarlet is an open source XML toolkit that you can use on Linux, Mac or Windows. It is linked statically to both libxml2 and libxslt, so generally all you need to process XML documents is one executable file. This may make it a better option for Windows users.
On Linux, xmlstarlet is available as a regular package.
For Mac or Windows, the download page is at: http://xmlstar.sourceforge.net/download.php
To validate a TEI XML file enter:
xmlstarlet val --xsd ../../schemas/teiP5osis.2.5.0.xsd test.tei.xml
Xerces
Xerces is Apache's collection of software libraries for parsing, validating, serializing and manipulating XML. The implementation is available in the Java, C++ and Perl programming languages, the Java version having the most features.
xerces-c
On Ubuntu/Debian, you can install the xerces-c tools:
apt install libxerces-c-samples
To validate an OSIS XML file, enter:
StdInParse -v=always -n -s < test.osis.xml
You'll find the full syntax here: https://xerces.apache.org/xerces-c/stdinparse-3.html
xsd-validator by Adrian Mouat
There isn’t a simple way to immediately run the Xerces validator in Java from the command line. For that reason, Adrian Mouat wrote a Java program to solve this issue.
It's called 'xsd-validator'. To get it, either clone the git repository at https://github.com/amouat/xsd-validator.git or download the xsd-validator zip from https://github.com/amouat/xsd-validator/releases/download/v1.0/xsdv-1.0.zip
To validate an OSIS XML file enter:
cd xsd-validator
./xsdv.sh osisCore.2.1.1-cw-latest.xsd test.osis.xml
There is also a cmd file that you can use to run xsdv from a Windows Command Prompt.
To install xsd-validator, run:
sudo ant install
Editors Supporting Validation
The final choice is to use an editor with validation on the fly. If you’re doing a lot of XML editing and validation it may well be worth looking into one of the editors listed below.
NOTE: If for any reason they do not find a schema, many editors silently fall back to checking only whether the file is well formed, which may produce false-positive results (an invalid file is reported as fine). To be sure, run the Solomon test:
Add the tag <solomonTest /> to your text.
This tag conforms to the XML specification but is not part of our schemas, so the editor must report an error.
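The same probe can be applied to command-line validators. The following minimal sketch, using only the Python standard library, adds a solomonTest element to a copy of a document so that you can feed the result to your validator and confirm that it is rejected. The function name is our own; note that ElementTree drops comments and the XML declaration, so use the output for probing only, never as a replacement for your source file.

```python
import xml.etree.ElementTree as ET


def add_solomon_test(xml_text: str) -> str:
    """Return a copy of xml_text with a solomonTest element appended to the root."""
    root = ET.fromstring(xml_text)
    # Reuse the root element's namespace (if any), so the probe behaves like
    # a tag typed into a document that declares a default namespace.
    namespace = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""
    tag = f"{{{namespace}}}solomonTest" if namespace else "solomonTest"
    ET.SubElement(root, tag)
    return ET.tostring(root, encoding="unicode")


sample = (
    '<osis xmlns="http://www.bibletechnologies.net/2003/OSIS/namespace">'
    "<osisText/></osis>"
)
probed = add_solomon_test(sample)
```

If a validator accepts the probed text, it is probably not using the schema at all.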
Notepad++
With the XML Tools plugin, Notepad++ lets you clean up unformatted files, check the syntax of an existing XML file for errors, or use Enable Auto Validation to validate your code automatically as you write it, among other features.
Go to the “Plugins” menu, then to “Plugin Manager”, then “Show Plugin Manager”. Look for XML Tools in the opened window, set the checkbox, and click the button “Install”.
You must restart Notepad++ after installation.
Sublime Text
Validate XML files on the fly with this Sublime Text 3 plugin:
https://packagecontrol.io/packages/Exalt
Emacs
It's a little bit tricky, but you can configure Emacs to provide the following features:
- Easy navigation
- Validation on the fly
- Auto completion
Use nxml-mode for editing XML
The first thing to do is to force Emacs to use nxml mode instead of xml mode when editing XML files. nxml-mode uses the nXML extension to provide automatic validation and lots of other helpful functions for editing XML files.
Add the following lines to your ~/.emacs file:
(setq auto-mode-alist (cons '("\\.xml$" . nxml-mode) auto-mode-alist))
(setq auto-mode-alist (cons '("\\.xsl$" . nxml-mode) auto-mode-alist))
(setq auto-mode-alist (cons '("\\.xhtml$" . nxml-mode) auto-mode-alist))
(setq auto-mode-alist (cons '("\\.page$" . nxml-mode) auto-mode-alist))

(autoload 'xml-mode "nxml" "XML editing mode" t)

(eval-after-load 'rng-loc
  '(add-to-list 'rng-schema-locating-files "~/.schema/schemas.xml"))
If you are using Emacs 24 or higher, you will also need this line that will give you auto-completion:
(global-set-key [C-return] 'completion-at-point)
Set-up Crosswire Schemas
nxml-mode validates XML files using schemas in RELAX NG compact format (.rnc), so we have to convert our schemas from .xsd to .rnc.
Converting XSD is hard because the XSD specification is complex, and the available command-line tools that convert from XSD (XML Schema) to RNG (RELAX NG) all seem to have problems of some sort.
We use the Sun RELAX NG Converter, nowadays bundled with the (Sun) 'Multi-Schema Validator', to convert .xsd to .rng; then we use trang to convert .rng to .rnc.
Install on Fedora:
sudo dnf install msv-rngconv trang
Convert from .xsd to .rng:
rngconv osisCore.2.1.1-cw-latest.xsd > osisCore.2.1.1-cw-latest.rng
Convert from .rng to .rnc
trang -I rng -O rnc osisCore.2.1.1-cw-latest.rng osisCore.2.1.1-cw-latest.rnc
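The two conversion steps can be chained in a short script, which is handy because both the OSIS and the TEI schema need converting. This is a minimal sketch that assumes rngconv and trang are installed and on your PATH, and that uses exactly the command lines shown above; the function names are our own.

```python
import subprocess
from pathlib import Path


def conversion_commands(xsd_path: str):
    """Return (rngconv command, intermediate .rng path, trang command)."""
    xsd = Path(xsd_path)
    rng = xsd.with_suffix(".rng")
    rnc = xsd.with_suffix(".rnc")
    rngconv_cmd = ["rngconv", str(xsd)]
    trang_cmd = ["trang", "-I", "rng", "-O", "rnc", str(rng), str(rnc)]
    return rngconv_cmd, rng, trang_cmd


def convert_xsd_to_rnc(xsd_path: str) -> Path:
    """Convert an .xsd schema to .rnc via an intermediate .rng file."""
    rngconv_cmd, rng, trang_cmd = conversion_commands(xsd_path)
    # rngconv writes the RELAX NG schema to standard output.
    with open(rng, "wb") as rng_file:
        subprocess.run(rngconv_cmd, stdout=rng_file, check=True)
    subprocess.run(trang_cmd, check=True)
    return Path(trang_cmd[-1])
```

Running convert_xsd_to_rnc once for osisCore.2.1.1-cw-latest.xsd and once for teiP5osis.2.5.0.xsd should produce the two .rnc files used in the next step.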
Tell nxml where to find our schemas
We have already (see above) set the variable rng-schema-locating-files to "~/.schema/schemas.xml".
Now we have to copy our new .rnc schemas into the ~/.schema directory:
mkdir -p ~/.schema
cp osisCore.2.1.1-cw-latest.rnc teiP5osis.2.5.0.rnc ~/.schema
and create ~/.schema/schemas.xml:
<locatingRules xmlns="http://thaiopensource.com/ns/locating-rules/1.0">
  <namespace ns="http://www.crosswire.org/2013/TEIOSIS/namespace" uri="teiP5osis.2.5.0.rnc"/>
  <namespace ns="http://www.bibletechnologies.net/2003/OSIS/namespace" uri="osisCore.2.1.1-cw-latest.rnc"/>
</locatingRules>
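A typo in schemas.xml or a missing .rnc file is easy to overlook, because Emacs may then simply stop validating against the schema. The following minimal sketch, using only the Python standard library, lists the namespace rules in a locating file and reports schema files that do not exist. It assumes that relative uri values are resolved against the directory containing schemas.xml; the function names are our own.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

LOCATING_NS = "http://thaiopensource.com/ns/locating-rules/1.0"


def read_locating_rules(schemas_xml_text: str) -> dict:
    """Map each document namespace to the schema uri declared for it."""
    root = ET.fromstring(schemas_xml_text)
    return {
        rule.get("ns"): rule.get("uri")
        for rule in root.iter(f"{{{LOCATING_NS}}}namespace")
    }


def missing_schemas(schemas_xml_path: str) -> list:
    """Return the declared schema files that are not present on disk."""
    path = Path(schemas_xml_path).expanduser()
    rules = read_locating_rules(path.read_text(encoding="utf-8"))
    return [uri for uri in rules.values() if not (path.parent / uri).exists()]
```

With the setup described here, missing_schemas("~/.schema/schemas.xml") should return an empty list.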
Auto-completion
Type a < character and hit Ctrl+Enter for a list of valid tags. You can type a few letters and hit Tab to use auto-completion. Hit Enter to insert the given tag. This also works with attributes: simply add a space after the tag, and hit Ctrl+Enter for attribute auto-completion.
Links
- https://fedoraproject.org/wiki/How_to_use_Emacs_for_XML_editing
- https://lgfang.github.io/mynotes/emacs/emacs-xml.html#sec-9
- https://www.emacswiki.org/emacs/NxmlMode
Validating from Windows Explorer
Here is a simple application for validating XML files from within Windows Explorer.
https://www.codeproject.com/Articles/8431/A-Simple-XML-Validator
Python
It's relatively straightforward to validate a file with Python, using the lxml library.
Let's create a minimal validator.py:
from lxml import etree


def validate(xml_path: str, xsd_path: str) -> bool:
    # Load the XSD file and compile it into a schema object.
    xmlschema_doc = etree.parse(xsd_path)
    xmlschema = etree.XMLSchema(xmlschema_doc)

    # Parse the document (this fails if it is not well formed).
    xml_doc = etree.parse(xml_path)
    result = xmlschema.validate(xml_doc)

    return result
Then write and run main.py:
from validator import validate

if validate("path/to/file.xml", "path/to/schema.xsd"):
    print('Valid! :)')
else:
    print('Not valid! :(')
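The validator above only answers yes or no. When a file is not valid, you usually want to know where the problems are. The sketch below is one possible extension, not part of the original script: it reads the schema object's error_log, which lxml fills in after a failed validation. The function names are our own, and lxml is imported inside the function so that the module can be loaded even where lxml is not installed.

```python
def format_error(path: str, line: int, column: int, message: str) -> str:
    """Format one validation error as path:line:column: message."""
    return f"{path}:{line}:{column}: {message}"


def validation_errors(xml_path: str, xsd_path: str) -> list:
    """Return a list of readable error messages; empty when the file is valid."""
    # lxml is a third-party package (pip install lxml).
    from lxml import etree

    schema = etree.XMLSchema(etree.parse(xsd_path))
    document = etree.parse(xml_path)
    if schema.validate(document):
        return []
    return [
        format_error(xml_path, entry.line, entry.column, entry.message)
        for entry in schema.error_log
    ]
```

Printing each entry of validation_errors("path/to/file.xml", "path/to/schema.xsd") gives a report similar in spirit to what xmllint prints.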