Xml_Checker

Parser version: V48.0
Generator version: V19.3
Dtd_checker version: V4.0
Dtd_generator version: V3.1

Xml_checker is a XML parser that checks the syntax (versus DTD) and formats the output.

1. Syntax

xml_checker[ { <option> } ] [ { <file> } ]
 <option> ::= <silent> | <progress> | <dump> | <raw> | <width> | <one> | <split>
            | <expand> | <keep> | <namespace> | <canonical> | <normalize>
            | <compat> | <check_dtd> | <no_dtd> | <validate> | <tree>
            | <update_mix> | <copytree> | <warnings>
            | <help> | <version>
 <help>        ::= -h | --help      : Put this help
 <version>     ::= -v | --version   : Put versions
 <canonical>   ::= -C | --canonical : Canonicalize xml
 <compat>      ::= -c | --compat    : Make text compatible ('>' -> "&gt;")
 <copytree>    ::= -c | --copytree  : Copy tree then dump the copy
 <dump>        ::= -D | --dump      : Dump expanded Xml tree
 <check_dtd>   ::= -d <Dtd> | --dtd=<Dtd>
                                    : Use a specific external dtd file
 <expand>      ::= -E | --expand    : Expand general entities
                                    :  and attributes with default
 <keep>        ::= -k <c|d|n|a> | --keep=<comments|cdata|none|all>
                                    : Keep comments and Cdata
                                    : Keep CDATA sections unchanged
                                    : Keep none (remove comments and CDATA
                                    :  markers)
                                    : Keep all (default)
 <update_mix>  ::= -m | --mixed     : Update Mixed tag in tree
 <no_dtd>      ::= -N | --no_dtd    : Do not use dtd
 <namespace>   ::= -n | --namespace : Put Namespace^Suffix
 <normalize>   ::= --no-normalize   : Do not normalize attributes and text
 <one>         ::= -1 | --one       : Put one attribute per line
 <progress>    ::= -p | --progress  : Only show a progress bar
 <raw>         ::= -r | --raw       : Put all on one line
 <silent>      ::= -s | --silent    : No output, only exit code
 <split>       ::= --split-nochild  : Element without child is put on 2 lines
 <tree>        ::= -t | --tree      : Build tree then dump it
 <validate>    ::= -V | --validate  : Process for validation (-E -k n)
 <width>       ::= -W <Width> | --width=<Width>
                                    : Put attributes up to Width
 <warnings>    ::= -w | --warnings  : Check for warnings

All options except expand, keep, dtd, warnings, namespace, tree and copytree are exclusive.

1.1. Output format

-s | --silent:

for no XML output (so only error and warning messages). The exit code indicates success or failure.

-p | --progress:

for only output a progress bar like this:

|--------------------------------------------------|
|=================================

then exit.

Note that the '=' bar reflects the progression of the parsing of the lines. After the last '=' and before the ending '|', xml_checker checks that all the IDREFs are defined as IDs, which may take a significant time.

Note that it is not possible to monitor the progress in "tree" mode, so these options are exclusive.

-D | --dump:

for a dump of the line numbers, of XML structure and "Mixed" information, example:

Prologue:
00000001 - - N xml : version=1.1
00000002 - - -  <!-- Definition of variables for test of program "comp_vars" -->
00000003 - - -  =><=
00000004 - - -  <!-- After Doctype -->
Elements:
00000006 - - N Variables
00000007 - - -  <!-- Below root -->
00000008 M - N  Var : Name=V1 Type=Int
00000008 - M -   =>5<=

A M in first column indicates that the element has Mixed content while in second column it denotes a field that is part of a Mixed content (they are not exclusive). The third column indicates if the element is a "<EmptyElementTag/>" (T), otherwise if it is defined as EMPTY in the Dtd (D), or if it is not empty (N).

Note that this options is incompatible with the expand option and with the keep option. Dump implies to keep all as is.

-r | --raw:

for a XML output, all on one line (no indentation and no line feed).

-W <Width> | --width=<Width>:

for limiting the number of attributes per line to up to <Width> columns. Default format is '-W 80'.

-1 | --one:

for a XML output with at most one attribute per line.

--split-no-child:

put non-empty elements that have no child on two lines instead of <Elt></Elt> (in width and one modes).

1.2. Transformations

-E | --expand:: for expand general entities and expand attributes with their default value (if they have).
-k c|d|n|a | --keep=comments|cdata|none|all:: for keep comments, or keep CDATA sections unchanged, or keep none (remove comments and CDATA markers) or keep all unchanged, which is the default.
-n | --namespace:: for namespace checks and expansion. Add checks that the input flow is namespace-well-formed and namespace-valid (see http://www.w3.org/TR/REC-xml-names) and expand namespaces as <namespace>^<suffix>.
This option should be used together with option Expand in order to obtain the appropriate result.
-C | --canonical:: for canonicalization of the XML (see http://www.w3.org/TR/xml-c14n). This option only allows options dtd, warnings and keep-comments.
It expands, removes CDATA markers, does not normalize, does not expand namespaces and by default removes comments. It is not supported in tree mode.
--no-normalize:: for no normalization of attributes and text. By default the text is normalized and, if the expand option is set, the attributes are normalized as well.
This option forces no normalization of the text and of the expanded attributes.
-c | --compat:: for make text "compatible" by replacing, for example, '>' by ">".
-V | --validate:: shortcut for "-E -kc", usefull for validation of the XML versus DTD.

1.3. Checks

-N | --no_dtd:

Do not check versus the DTD (internal and/or external) the XML content.

-d [ <Dtd> ] | --dtd=[<Dtd>]:

for check with a specific external DTD file. Use this option to set the external DTD if the XML file has no DOCTYPE, or to overwrite the one of DOCTYPE directive (SYSTEM or PUBLIC). The internal DTD (if any) is still used.

-w | --warnings:

for check and report warnings. The warnings that can be reported are:

multiple definition of the same ATTLIST (⇒ definitions are merged),
ATTLIST for unknown element (⇒ not used),
multiple definition of the same attribute for an element (⇒ new definition is discarded),
multiple definition of the same ENTITY (⇒ new definition is discarded),
reference to a child that is not defined (⇒ child is not useable),
inconsistency between the EMPTY definition of an element in DTD and the use of empty element tag (<element/>) in XML.

1.4. Other options

-t | --tree:: By default xml_checker sets a callback that the XML parser invokes for each element that it finds. With this option the XML parser builds the complete tree then xml_checker iterates on the tree and displays its content.

Note that this option is not recommended for big files. It is not compatible with canonical mode or progress bar.

Please also consider increasing the process stack size (ulimit -s) to avoid stack overflow and Storage_Error, especially in this mode.
--copytree:: In tree mode, copy the tree that results from the parsing into a new tree (using the generator) and dump this new tree (usefull to test the generator).
-m | --mixed:: When there is no Dtd, the XML parser tries to guess if an element is mixed by checking if its first child is text. In tree mode, this option forces a global scan of the tree so that this information is updated for each element, taking into account all its children.

Note that such element re-tagged as mixed had its children being parsed without the information that they were part of a mixed element, so line-feeds and indentation had been skipped. Now they will be displayed with the new information that the element is mixed, so line-feeds and indentation will not be generated.
-h | --help:: for put online help.
-v | --version:: for put versions of the Xml_Parser, the Xml_Parser.Generator (see XML parser and generator) and xml_checker.

2. Behavior

2.1. Inputs

Xml_checker takes 0, 1 or several arguments to specify the XML flow to process.

With no argument xml_checker processes stdin. An empty argument ("") also denotes stdin.

When one or several files are provided xml_checker processes these files one after another.

2.2. Output

Xml_checker outputs the dump or the formatted XML text on stdout. In silent mode there is no output at all. In progress mode there is just a progress bar.

It outputs warnings (if activated) and errors on stderr.

2.3. Exit code

Xml_checker exits with 0 if no error is detected and with 1 otherwise (warnings don’t affect the exit code).

3. Use of XML parser and generator

XML parser

Xml_checker uses a standalone package (Xml_Parser) to parse the XML flow.

This parser has the following limitations versus the XML standards 1.0 and 1.1 (see http://www.w3.org/TR/REC-xml and http://www.w3.org/TR/2006/REC-xml11-20060816):

Only the System Id of the DOCTYPE and of external parsed ENTITY is used, Public Id (if any) is skipped.
Only the local file, the "file://" and the "http://" schemes are supported in URIs (otherwise it raises parsing error).
Only (xx-)ASCII, UTF-8, UTF-16 and ISO-8859-1 encodings are natively supported. Some other encodings may be handled by defining the environment variable XML_PARSER_MAP_DIR to where Byte_To_Unicode can find the mapping file named <ENCODING>.xml (in uppercase, ex: ISO-8859-9.xml). This file maps each byte code (from 0 top 255) to a Unicode number and must have the format:
```
<?xml version="1.0" encoding="UTF-8"?>
  <Map>
    <Code Byte="16#00#">16#0000#</Code>
    ...
    <Code Byte="16#FF#">16#00FF#</Code>
 </Map>
```

On option the parser provides partial support for XML namespaces (see http://www.w3.org/TR/2009/REC-xml-names-20091208). Namespaces are not checked for validity of URI references.

XML generator

Xml_checker also uses a XML generator that is a child package of Xml_Parser (Xml_Parser.Generator) to build the formatted XML output of the tree or of each node. The generator is not used in Dump and in Canonical modes.

With copytree option, xml_checker also uses the XML generator to build the copy of the tree.

Default behavior

By default xml_checker invokes the XML parser in callback mode (to allow large files and specific options), and with options that ensure that the output is the same as the input except the formatting (indentation of elements and attributes).
On the other hand the parser by default provides the caller with a tree, where the comments and CDATA markers are removed, where entities are expanded, and where attributes are initialized with their default values. This is the most convenient for an application that wants to process the content of the XML file.
The following table summarizes the defaults of xml_checker (and Generator) and of Xml_Parser. The differences are highlighted:

Option	Xml_Parser default	xml_checker default
File name	None, ("" for stdin)	Stdin
Comments	False	True
Cdata	Remove markers	Keep markers
Expand	True	False
Normalize	True	True
Compatible	False	False
Use Dtd	True	True
Dtd file	""	""
Namespace	False	False
Warnings	False	False
Callback/Tree	Tree	Callback
Format	Fill width	Fill width
Width	80	80
Split	False	False

In order to trigger the default behaviour of the parser, xml_checker shall be called with options:
--tree --keep=none --expand

Check versus Dtd

Here is how xml_checker instructs Xml_Parser to process the DOCTYPE directive, depending on the -d option:

By default, Xml_Parser parses the DOCTYPE directive and the corresponding DTDs (internal and external) if any, and checks the compliance of the XML file versus this DTD.
With option -d <URI>, Xml_Parser parses the internal DTD (if any) and the provided URI, and checks the compliance of the XML file versus this DTD.
With option -d "", Xml_Parser skips the DOCTYPE directive and does not check compliance.

Some XML files cannot be validated without being expanded. For example, this occurs if they contain references to internal entities (&ent;) that expand as sub-elements, attribute contents…
In this case, the validation (with expansion) is possible with options -V [ -d <external_dtd> ], and the formatting (without validation nor expansion) is obtained with options -d "".

Canonicalization

As explained about canonical option, xml_checker supports canonicalization of XML documents. It invokes the XML parser with fixed options: remove CDATA makers, expand, normalize, no namespace and with a specific callback. This callback performs the XML canonicalization, for which the output format is fixed. So xml_checker does not use the generator in this mode.

4. DTD checker

Dtd_Checker is a standalone tool that checks a DTD file for errors and on option for warnings.

dtd_checker [ -w | --warnings ] [ <dtd_file> ]

By default dtd_checker only checks for errors in the DTD. With the option "-w" or "--warnings" it also reports the same warnings as xml_checker (except inconsistencies between EMPTY definition and empty tag).

If no file or an empty argument is provided then dtd_checker parses stdin.

Dtd_checker exits with 0 if no error is detected and with 1 otherwise (warnings don’t affect the exit code).

5. DTD generator

Dtd_generator is a standalone tool that parses one or several XML files and generates a DTD file representing the structure (elements and attributes) of these files.

dtd_generator [ <options> ] [ { <xml_file> } ]
  <options> ::= [ [ <deviation> ] [ <elements> ] [ <enums> ] [ <width> ]
  <deviation> ::= --deviation=<val>    // default 11
  <elements>  ::= --elements=<val>     // default 210
  <enums>     ::= --enums=<val>        // default 21
  <width>     ::= --width=<val>        // min 40

The options tune the behavior as follows (a value of 0 disables the check):

Deviation: The maximum number of changes introduced by a new set of children, after which the precise sequence of children is replaced by a choice ((Elt|Elt|Elt…)*).
Elements: The maximum number of children of an element, after which the sequence or choice of children is replaced by ANY.
Enums: The maximum number of values of an enumerated attribute, after which the list of values is replaced by NMTOKEN.
Width: The maximum width in output flow for the lists of sub-elements and for the lists of attribute values.

If no file is provided then dtd_generator parses stdin.
It puts the generated DTD on stdout.
Dtd_generator exits with 0 if no error is detected and with 1 otherwise.

Note	Dtd_generator uses the tree mode of the XML parser, so it may not be usable with very big files.