Parsing XML with Prolog

From This Prolog Life

Latest revision as of 20:08, 8 July 2020

“I have a hard time arguing that anything in XML is generically useful any more except for the basic syntax, which lets us apply some very handy low-level tools like parsers and XSLT. The rest (XLink, schemas, etc.) has been a pointless trip into complexity.” Simon St.Laurent

“My own experience is that having Prolog, Scheme, and Haskell available it'll take a gun pointed at my head or an extremely large bribe to make me use XSLT for anything.” Richard A. O'Keefe

Background

xml.pl is a module for parsing XML with Prolog, which provides Prolog applications with a simple Document Value Model interface to XML documents. It has been used successfully in a number of applications.

It supports a subset of XML suitable for XML Data and World Wide Web applications, but it is neither as strict nor as comprehensive as the XML 1.0 Specification mandates.

  • It is not as strict because, while the specification must eliminate ambiguities, not all errors need to be regarded as faults, and some reasonable examples of real XML usage would have to be rejected if they were.
  • It is not as comprehensive because, where the XML specification makes provision for more or less complete DTDs to be provided as part of a document, xml.pl acts only on local ENTITY definitions. Other DTD extensions are treated as commentary.

Download the XML Module (xml.pl and plxml)

xml.pl and plxml, a small Windows application which embodies xml.pl, have been placed into the public domain to encourage the use of Prolog with XML. I hope that they will be useful to you, but they are not supported, and they are provided without any warranty of any kind.

Specification

Three predicates are exported by the module: xml_parse/[2,3], xml_subterm/2 and xml_pp/1.

xml_parse( {+Controls,} +?Chars, ?+Document )

parses Chars, a list of character codes, to/from Document, a data structure of the form xml(Attributes, Content), where:

Attributes is a list of Name=CharData attributes from the (possibly implicit) XML signature of the document. Content is a (possibly empty) list comprising occurrences of:

  • pcdata(CharData): text;
  • comment(CharData): an XML comment;
  • namespace(URI, Prefix, Element): a Namespace;
  • element(Tag, Attributes, Content): <Tag>..</Tag> enclosing Content, or <Tag/> if Content is empty ([]);
  • instructions(Name, CharData): a Processing Instruction <?Name CharData?>;
  • cdata(CharData): <![CDATA[CharData]]>;
  • doctype(Tag, DoctypeId): a DTD declaration <!DOCTYPE .. >.
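
As an illustration of how these node types can be consumed, the following is a minimal sketch that collects the text of a document term in document order. The predicate names document_text/2, node_text/3 and nodes_text/3 are hypothetical and are not part of xml.pl; the sketch only assumes the term shapes listed above.

```prolog
:- use_module(library(lists)).   % for append/3

% document_text(+Node, -Codes)
% Hypothetical helper: Codes is the concatenation of all PCDATA and CDATA
% character data found in Node, in document order.
document_text(Node, Codes) :-
    node_text(Node, Codes, []).

% node_text(+Node, -Codes, ?Tail)
% Codes-Tail is a difference list holding the text contributed by Node.
node_text(xml(_, Content), Codes, Tail) :-
    nodes_text(Content, Codes, Tail).
node_text(namespace(_, _, Element), Codes, Tail) :-
    node_text(Element, Codes, Tail).
node_text(element(_, _, Content), Codes, Tail) :-
    nodes_text(Content, Codes, Tail).
node_text(pcdata(Chars), Codes, Tail) :-
    append(Chars, Tail, Codes).
node_text(cdata(Chars), Codes, Tail) :-
    append(Chars, Tail, Codes).
node_text(comment(_), Codes, Codes).          % comments carry no text
node_text(instructions(_, _), Codes, Codes).  % nor do processing instructions
node_text(doctype(_, _), Codes, Codes).       % nor does the DOCTYPE

% nodes_text(+Nodes, -Codes, ?Tail): as node_text/3, for a Content list.
nodes_text([], Codes, Codes).
nodes_text([Node|Nodes], Codes, Tail) :-
    node_text(Node, Codes, Rest),
    nodes_text(Nodes, Rest, Tail).
```

Because each node type has its own functor, one clause per type is enough, and the traversal needs no type tests on the data itself.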

The conversions are not completely symmetrical in that weaker XML is accepted than can be generated. Specifically, in-bound (Chars -> Document) parsing does not require strictly well-formed XML. If Chars does not represent well-formed XML, Document is instantiated to the term malformed(Attributes, Content).

In addition to the parsed-term types, the Content of a malformed/2 structure can include:

  • unparsed( CharData ): text which has not been parsed;
  • out_of_context(Tag): <Tag> is not closed.
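
A caller will often want to know why a document came back as malformed/2. The sketch below gathers the unparsed/1 and out_of_context/1 terms from a parsed document. The name document_problems/2 and its helpers are hypothetical, and the sketch assumes (without confirmation from the specification above) that these terms may appear at any depth of the Content.

```prolog
% document_problems(+Document, -Problems)
% Hypothetical helper: Problems is [] for a well-formed xml/2 term, and the
% list of unparsed/1 and out_of_context/1 terms for a malformed/2 term.
document_problems(xml(_, _), []).
document_problems(malformed(_, Content), Problems) :-
    content_problems(Content, Problems, []).

% content_problems(+Content, -Problems, ?Tail): difference-list traversal.
content_problems([], Problems, Problems).
content_problems([Node|Nodes], Problems, Tail) :-
    node_problems(Node, Problems, Rest),
    content_problems(Nodes, Rest, Tail).

% node_problems(+Node, -Problems, ?Tail)
node_problems(unparsed(Chars), [unparsed(Chars)|Tail], Tail).
node_problems(out_of_context(Tag), [out_of_context(Tag)|Tail], Tail).
node_problems(element(_, _, Content), Problems, Tail) :-
    content_problems(Content, Problems, Tail).
node_problems(namespace(_, _, Element), Problems, Tail) :-
    node_problems(Element, Problems, Tail).
node_problems(pcdata(_), Tail, Tail).
node_problems(cdata(_), Tail, Tail).
node_problems(comment(_), Tail, Tail).
node_problems(instructions(_, _), Tail, Tail).
node_problems(doctype(_, _), Tail, Tail).
```

An application could call document_problems/2 on the result of xml_parse/2 and reject the input, or report the offending fragments, whenever the list is non-empty.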

Out-bound (Document -> Chars) parsing does require that Document defines well-formed XML. If an error is detected, a 'domain' exception is raised. The domain exception will attempt to identify the particular sub-term in error, and will list the ancestor elements of the sub-term in error as Tag{(Id)} terms - where Id is the value of any attribute named id.

The Controls applying to in-bound (Chars -> Document) parsing are:

  • extended_characters(Boolean): use the extended character entities for XHTML (default true).
  • format(Boolean): remove layouts when no non-layout character data appears between elements (default true).
  • remove_attribute_prefixes(Boolean): remove redundant prefixes from attributes, i.e. prefixes denoting the namespace of the parent element (default false).
  • allow_ampersand(Boolean): allow unescaped ampersand characters (&) to occur in PCDATA (default false).

For out-bound (Document -> Chars) parsing, the only available option is:

  • format(Boolean): indent the element content (default true).
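
The Controls list is passed as the first argument of xml_parse/3. The following sketch shows one way an application might name a few combinations of the in-bound controls. The mode names (web, keep_layout) and the predicates parse_controls/2 and parse_with_mode/3 are hypothetical; only the control terms themselves come from the list above, and the call to xml_parse/3 assumes that xml.pl has already been loaded.

```prolog
% parse_controls(+Mode, -Controls)
% Hypothetical table of control lists for in-bound (Chars -> Document) parsing.
parse_controls(web,         [extended_characters(true), allow_ampersand(true)]).
parse_controls(keep_layout, [format(false)]).
parse_controls(default,     []).

% parse_with_mode(+Mode, +Codes, -Document)
% Parses Codes, a list of character codes, using the controls named by Mode.
% Assumes xml_parse/3 from xml.pl is available.
parse_with_mode(Mode, Codes, Document) :-
    parse_controls(Mode, Controls),
    xml_parse(Controls, Codes, Document).
```

Keeping the control lists in one table means that the choice of leniency is made in a single place rather than at every call site.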

Types

  • Tag: an atom naming an element
  • Name: an atom, not naming an element
  • URI: an atom giving the URI of a Namespace
  • CharData: a "string", i.e. a list of character codes
  • DoctypeId: one of public(CharData, CharData {, DTDLiterals}), system(CharData {, DTDLiterals}) or local{(DTDLiterals)}
  • DTDLiterals: a non-empty list of dtd_literal(CharData) terms, e.g. attribute-list declarations
  • Boolean: one of true or false

xml_subterm( +XMLTerm, ?Subterm )

unifies Subterm with a sub-term of XMLTerm. This can be especially useful when trying to test or retrieve a deeply-nested subterm from a document, as demonstrated in the XML Query Use Cases with xml.pl examples. Note that XMLTerm is a sub-term of itself.
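
For instance, a partially instantiated Subterm can act as a pattern. The sketch below is a minimal illustration, assuming xml.pl is loaded so that xml_subterm/2 is available; the predicates element_named/3 and attribute_value/4 are hypothetical names introduced here.

```prolog
:- use_module(library(lists)).   % for member/2

% element_named(+Document, +Tag, -Element)
% Enumerates, on backtracking, each element of Document whose tag is Tag.
% Assumes xml_subterm/2 from xml.pl is available.
element_named(Document, Tag, element(Tag, Attributes, Content)) :-
    xml_subterm(Document, element(Tag, Attributes, Content)).

% attribute_value(+Document, +Tag, +Name, -Value)
% Value is the CharData of attribute Name on some element tagged Tag.
attribute_value(Document, Tag, Name, Value) :-
    element_named(Document, Tag, element(_, Attributes, _)),
    member(Name=Value, Attributes).
```

With the SVG document shown under "Same Structure", a query such as attribute_value(Document, circle, cx, Value) would be expected to retrieve the value of the cx attribute without the caller spelling out the enclosing namespace and svg nodes.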

xml_pp( +XMLDocument )

"pretty prints" XMLDocument on the current output stream.

Availability

On this site, you can download the XML Module.

The module is also supplied as a library with the following Prologs:

  • It has been adapted for the Logtalk Open source object-oriented extension to Prolog by Paulo Moura. (See the folder "contributions/xml_parser" from release 2.29.1);
  • It is available in the ECLiPSe Constraint Programming System, as a third-party library;
  • It has been ported to B-Prolog by Neng-Fa Zhou;
  • It has been adapted for SICStus Prolog by Mats Carlsson;
  • It is included in Quintus Prolog Release 3.5.

XML Query Use Cases with xml.pl provides examples of the ways that the code can be used.

Features of xml.pl

The xml/2 data structure has some useful properties.

Reusability

Using a native Prolog representation of XML, in which terms represent document 'nodes', makes the parser reusable for any XML application. In effect, xml.pl encapsulates the application-independent tasks of document parsing and generation, which is essential where documents have components from more than one Namespace.

Same Structure

The Prolog term representing a document has the same structure as the document itself, which makes the correspondence between the literal representation of the Prolog term and the XML source readily apparent. For example, this simple SVG image:

<?xml version="1.0" standalone="no"?>
 <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN" "http://www.w3.org/.../svg10.dtd"
     [
     <!ENTITY redblue "fill: red; stroke: blue; stroke-width: 1">
     ]>
 <svg xmlns="http://www.w3.org/2000/svg" width="500" height="500">
  <circle cx=" 25 " cy=" 25 " r=" 24 " style="&redblue;"/>
 </svg>

... translates into this Prolog term:

xml( [version="1.0", standalone="no"],
     [
     doctype( svg, public( "-//W3C//DTD SVG 1.0//EN", "http://www.w3.org/.../svg10.dtd" ) ),
     namespace( 'http://www.w3.org/2000/svg', "",
         element( svg,
             [width="500", height="500"],
             [
             element( circle,
                 [cx="25", cy="25", r="24", style="fill: red; stroke: blue; stroke-width: 1"],
                 [] )
             ] )
         )
     ] ).

Efficient Manipulation

Each type of node in an XML document is represented by a different Prolog functor, while data (PCDATA, CDATA and Attribute Values) are left as "strings" (lists of character codes). The use of distinct functors for mark-up structures enables the efficient recursive traversal of a document, while leaving the data as strings facilitates the application-specific parsing of data content (also known as Micro-parsing). For example, to turn every CDATA node into a PCDATA node with tabs expanded into spaces:

cdata_to_pcdata( cdata(CharsWithTabs), pcdata(CharsWithSpaces) ) :-
    tab_expansion( CharsWithTabs, CharsWithSpaces ).
cdata_to_pcdata( xml(Attributes, Content1), xml(Attributes, Content2) ) :-
    cdata_to_pcdata( Content1, Content2 ).
cdata_to_pcdata( namespace(URI,Pfx,Content1), namespace(URI,Pfx,Content2) ) :-
    cdata_to_pcdata( Content1, Content2 ).
cdata_to_pcdata( element(Name,Atts,Content1), element(Name,Atts,Content2) ) :-
    cdata_to_pcdata( Content1, Content2 ).
cdata_to_pcdata( [], [] ).
cdata_to_pcdata( [H1|T1], [H2|T2] ) :-
    cdata_to_pcdata( H1, H2 ),
    cdata_to_pcdata( T1, T2 ).
cdata_to_pcdata( pcdata(Chars), pcdata(Chars) ).
cdata_to_pcdata( comment(Chars), comment(Chars) ).
cdata_to_pcdata( instructions(Name, Chars), instructions(Name, Chars) ).
cdata_to_pcdata( doctype(Tag, DoctypeId), doctype(Tag, DoctypeId) ).

The above uses no 'cuts', but will not create any choice points with ground input.
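
The example relies on a tab_expansion/2 predicate that is not defined on this page. A minimal sketch of one possible definition follows. It is an assumption made for illustration: it replaces each tab with a fixed run of four spaces rather than aligning to tab stops, and expand_code/3 is a hypothetical helper.

```prolog
% tab_expansion(+CharsWithTabs, -CharsWithSpaces)
% Hypothetical definition: every tab (character code 9) is replaced by four
% spaces (character code 32); all other codes are copied unchanged.
tab_expansion([], []).
tab_expansion([Code|Codes], Expanded) :-
    expand_code(Code, Expanded, Rest),
    tab_expansion(Codes, Rest).

% expand_code(+Code, -Expanded, ?Rest)
% Expanded-Rest is a difference list holding the expansion of one code.
expand_code(Code, Expanded, Rest) :-
    (   Code =:= 9
    ->  Expanded = [32, 32, 32, 32|Rest]
    ;   Expanded = [Code|Rest]
    ).
```

Like cdata_to_pcdata/2, this definition uses an if-then-else rather than a cut, and with a ground list of codes as input it should leave no choice points in a Prolog that indexes on the first argument.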

Elegance

The resolution of entity references and the decomposition of the document into distinct nodes means that the calling application is not concerned with the occasionally messy syntax of XML documents. For example, the clean separation of namespace nodes means that Namespaces, which are useful in combining specifications developed separately, have similar usefulness in combining applications developed separately.