Talsever

Entity Definition Markup Language (Second Draft)

A mix-in language for defining entities in schemas and instances

Informal Note April 2004

This version:
http://www.talsever.org/xml/edml.html
Editors:
Amelia A. Lewis, Talsever <amyzing@talsever.org>
Bob Foster, Bocaloco Software LLC <bob@objfac.com>

Abstract

EDML defines an XML syntax for declaring internal and external general parsed entities. It makes no provision for unparsed entities or for parameter entities. The scope provides a means for definition of boilerplate (text alone or text with markup), mnemonic aliases for numeric character references, and the traditional SGML inclusion mechanism. EDML may be incorporated into an XML schema language, incorporated into or referenced from a particular schema, or referenced from an XML instance. In order for it to be truly useful in instances and schemas, parsers would have to add support for it. It is possible that a preprocessor could be defined which would have largely equivalent effects.

Status of this Document

This is an initial draft. Comments solicited. Future status (submission to standards bodies and the like) is not clear.

This draft has been submitted for discussion to the xml-dev mailing list.


Short Table of Contents

1. Introduction
2. EDML Namespace
3. Syntax
4. Usage
5. Security Considerations
A. Examples (Non-Normative)
B. Schema for Entity Definition Markup Language (Non-Normative)
C. Changelog (Non-Normative)


Table of Contents

1. Introduction
2. EDML Namespace
3. Syntax
    3.1 The entities Element
        3.1.1 Internal Collections and Compilations
            3.1.1.1 The uri attribute
            3.1.1.2 The canonical attribute
            3.1.1.3 The version attribute
            3.1.1.4 Children of the entities element
        3.1.2 References to Entity Collections
            3.1.2.1 The system attribute
            3.1.2.2 The public attribute
            3.1.2.3 Additional requirements for an entities element as reference
    3.2 The entity Element
        3.2.1 Internal Parsed General Entities
            3.2.1.1 The name attribute
            3.2.1.2 Children of the entity element
            3.2.1.3 XML 1.0 implications
        3.2.2 External Parsed General Entities
            3.2.2.1 The name attribute
            3.2.2.2 The system attribute
            3.2.2.3 The public attribute
            3.2.2.4 Additional requirements for an entity element as reference
            3.2.2.5 XML 1.0 implications
    3.3 Alternative Syntax
4. Usage
    4.1 Extending Schema Languages with EDML
    4.2 The EDML Processing Instruction
    4.3 Other Forms of EDML Embedding
        4.3.1 Document-Scope Embedding
        4.3.2 Element-Scope Embedding
    4.4 Priority of Definition
5. Security Considerations

Appendices

A. Examples (Non-Normative)
    A.1 Example: Standalone Entity Definitions Collections
    A.2 Example: Boilerplate or Subdocument Definition
    A.3 Example: Overriding Imported Definitions
    A.4 Example: Including Entity Definitions in a Schema
    A.5 Example: Including Entity Definitions Using Processing Instructions
B. Schema for Entity Definition Markup Language (Non-Normative)
C. Changelog (Non-Normative)


1. Introduction

The Entity Definition Markup Language began life as a subset of the doctype markup language, an XML transformation of DTDs. In working on the latter, it soon became clear that the task, while perhaps important, was extremely large, and the temptation to add things and leave things out was difficult to resist. On the other hand the subset language for defining entities proved quite tractable, and by its nature (because of its use of XML syntax), seemed to the author elegant and useful.

EDML defines XML 1.0 entities (applicability to XML 1.1 is left for those who need it). Because it uses XML syntax, it inherently enforces certain well-formedness and validity constraints that must be tested by other means when entities are defined in a DTD. For the same reason, it can be incorporated relatively easily into existing XML schema languages, such as W3C XML Schema Definition Language or RELAX NG (normal form).

EDML does not permit definition of all the entity types defined in XML 1.0. Specifically, EDML provides a means of defining internal and external parsed entities. Unparsed entities are out of scope (together with notations). Parameter entities, used only in DTDs, are out of scope. Effectively, EDML provides a means of defining that subset of entities which might be considered "macros" in another language--a means of substituting for a commonly repeated sequence, or difficult-to-type sequence, a simple text representation which the parser can replace with the defined substitution text or tree. Also, EDML is stronger on external parsed entities than internal ones; it cannot provide a replacement for the internal DTD subset without abandoning its advantages or drastically modifying XML.

2. EDML Namespace

The official URI for EDML is http://www.talsever.org/namespaces/edml

This URL is subject to change in future revisions.

3. Syntax

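There are three formats in which entities may appear:

  1. Collected in an entities element (section 3.1)

  2. Singly, in an entity element (section 3.2)

  3. In an alternative, non-XML syntax (section 3.3)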

Both of the first two types may be embedded into other XML dialects (typically schema languages). The third type does not reap the benefit of XML containment that enforces well-formedness for the other two types.

3.1 The entities Element

The container for all entity definitions is the entities element. This container element may directly contain entity elements, and may contain entities elements that point to external definitions (but not internal ones). An entities definition either defines an internal collection of entities, defines a compilation of external entities with internal entities, or references an external collection of entity definitions.

3.1.1 Internal Collections and Compilations

An entities element that defines an internal collection or compilation MAY contain up to three optional attributes, and MUST contain at least one entity or entities child element.

3.1.1.1 The uri attribute

The optional uri attribute provides an identifier for this entity collection. If present, it MUST NOT be empty, and MUST NOT be relative (that is, it MUST be an absolute URI). It may also be a Formal Public Identifier, particularly if the collection is a conversion from a DTD definitions collection (see examples, below).

3.1.1.2 The canonical attribute

The optional canonical attribute provides a URL (it's pointless to use a URN) that gives the canonical location of the latest version of this collection of entities. If present, it MUST NOT be empty, and MUST be absolute.

3.1.1.3 The version attribute

The optional version attribute is a string identifying the version. No further information is provided.

3.1.1.4 Children of the entities element

If any of the uri, canonical, or version attributes exist, the element MUST contain children, and MUST NOT contain the system or public attributes.

An internal entities element MUST contain at least one child, which must be an external entities element, or an internal or external entity element. It may contain any number of each of these types of element, which are to be processed in strict order. Note that it is fairly pointless for the element to contain only a single external entities element, but the grammar permits this.

If an entities element contains any entity or entities children, it MUST NOT contain the system or public attributes.
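
As an illustration, a compilation that satisfies the rules above might look like the following sketch. The example.org URIs, the version string, and the entity names org and disclaimer are hypothetical, chosen only for this example; the Latin 1 location is the one used in the examples appendix.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Hypothetical compilation: every example.org URI is illustrative only. -->
<entities xmlns="http://www.talsever.org/namespaces/edml"
          uri="http://example.org/entities/house-style"
          canonical="http://example.org/entities/house-style.edml"
          version="1.0">
  <!-- an internal entity; children are processed in strict order -->
  <entity name="org">Example Organization</entity>
  <!-- an external entity, naming a well-formed fragment -->
  <entity name="disclaimer" system="http://example.org/boilerplate/disclaimer.xml"/>
  <!-- an external collection; the reference is empty and carries no
       uri, canonical, or version attribute -->
  <entities system="http://www.talsever.org/entities/latin1.edml"/>
</entities>
```

Because the outer element has children, it carries neither a system nor a public attribute.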

3.1.2 References to Entity Collections

An entities element that points at an external collection of entity definitions contains a single, required attribute, system, and may contain an optional attribute, public.

3.1.2.1 The system attribute

The system attribute MUST be a URI, MUST NOT be the empty URI (the empty string), and MUST resolve to an XML document or fragment with a root element of entities. The referenced entities element MUST NOT be a reference to an external collection (that is, it must be an internal collection, although it may contain additional external references).

3.1.2.2 The public attribute

The public attribute, if it exists, MUST conform to the rules for a Formal Public Identifier. Cataloging systems may make use of FPIs for resolution.

3.1.2.3 Additional requirements for an entities element as reference

An entities element that contains a system attribute MUST be empty.

A processor MUST recognize URIs that it has seen before, and refuse to load them again.
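
A minimal sketch of references, assuming the Latin 1 collection lives at the location given in the examples appendix and reusing its Formal Public Identifier as the public value. The second reference is included only to illustrate the rule about previously-seen URIs.

```xml
<?xml version="1.0" encoding="utf-8"?>
<entities xmlns="http://www.talsever.org/namespaces/edml">
  <!-- a reference: system is required, public is optional, content is empty -->
  <entities system="http://www.talsever.org/entities/latin1.edml"
            public="ISO 8879:1986//ENTITIES Added Latin 1//EN//XML"/>
  <!-- the same URI again: a conforming processor has already seen it
       and does not load it a second time -->
  <entities system="http://www.talsever.org/entities/latin1.edml"/>
</entities>
```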

3.2 The entity Element

Every entity is defined using an entity element. Entities may be referenced (from inside an entities element or some other XML dialect that permits it) independently.

3.2.1 Internal Parsed General Entities

The basic building block of an entity definition is the internal parsed general entity. It is defined using the entity element, which contains a single required attribute, name, and may contain any child content (the replacement content).

3.2.1.1 The name attribute

The name attribute MUST conform to the rules for an entity name. This name may be composed with the & character and the ; character to make an entity reference in documents. When it appears so in a document, the content of the entity element replaces the entity reference.

3.2.1.2 Children of the entity element

The entity element, if it has only a name attribute, MUST have content, which may be elements or text. If included elements are in a namespace, the namespace MUST be declared. Note that namespace declarations MAY be part of an entity definition. Any entities referenced in the content of this entity MUST appear earlier in the document (directly or by inclusion) or the replacement text will contain a skipped entity. Any content, including text and mixed content, is permitted, so long as it is well-formed (the XML sine qua non).

3.2.1.3 XML 1.0 implications

Note: because the content of an entity element is XML, all internal entities defined in this manner are by definition well-formed per XML 1.0 section 4.3.2. Because an entity is not defined until its closing tag has been read, recursion is implicitly disallowed (both direct and indirect), satisfying the No Recursion WFC in section 4.1. This specification provides an alternative means of declaring an entity, to satisfy the Entity Declared WFC and VC.
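
The following sketch shows two internal definitions, one with text content and one with element content. The entity names nbsp and reserved, and the wording of the notice, are hypothetical. The second definition declares the XHTML namespace on its own content, which the rules above permit.

```xml
<?xml version="1.0" encoding="utf-8"?>
<entities xmlns="http://www.talsever.org/namespaces/edml">
  <!-- text-only replacement content: a mnemonic for NO-BREAK SPACE -->
  <entity name="nbsp">&#x00A0;</entity>
  <!-- element content; the namespace declaration is part of the definition,
       so the p element does not fall into the EDML namespace -->
  <entity name="reserved"><p xmlns="http://www.w3.org/1999/xhtml">All rights
    reserved.</p></entity>
</entities>
```

Since each definition is ordinary XML content, a malformed replacement (an unclosed p, for instance) would make the collection itself fail to parse.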

3.2.2 External Parsed General Entities

A single entity definition may be referenced externally. In this case, the entity element contains one to three attributes. The name attribute is optional. The system attribute, a URI (corresponding to the SYSTEM pseudo-attribute of a doctype declaration) is required. The public attribute (which contains an FPI) is optional.

3.2.2.1 The name attribute

The optional name attribute MUST conform to the rules for an entity name. If it is not present, the URL of the system attribute MUST resolve to an XML document or fragment with entity as its root, and conforming to the content model for internal general parsed entities (references may not be chained). If the name attribute is present, the content of the resolved document or fragment may be any well-formed XML. However, if it is an entity element, then the name supplied in the importing entity definition overrides the name of the imported definition (that is, this provides a renaming and copying mechanism).

3.2.2.2 The system attribute

The system attribute MUST conform to the rules for URIs, and additionally MUST NOT be the empty URI (the empty string). Relative URIs are to be resolved relative to the document, unless it has an XML Base URI defined, in which case relative URIs are resolved relative to the base URI. Cataloging systems may supplant the usual resolution rules, of course. The resolved entity MUST point at an XML document or fragment. The content MUST be well-formed. If the target is an entity element, then parsing should verify it; it is permitted (if the name attribute is present) for the target to be any well-formed XML fragment.

3.2.2.3 The public attribute

The public attribute, if it exists, MUST conform to the rules for a Formal Public Identifier. Cataloging systems may make use of FPIs for resolution.

3.2.2.4 Additional requirements for an entity element as reference

A processor MUST recognize URIs that it has seen before, and refuse to load them again.

The entity element, when it contains a system attribute, MUST be empty.

3.2.2.5 XML 1.0 implications

Note: external general parsed entities must satisfy the constraints of XML 1.0 section 4.3.2 (production 78). If the entity is defined using the entity element, this is enforced by usual XML syntax rules (see above, for internal general parsed entities).
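
A sketch of the two kinds of external reference, reusing the file locations asserted in Appendix A.3: Aacute.xml is a bare fragment, and Oacute.edml has an entity element as its root. This is one reading of the rules above, not a normative example.

```xml
<?xml version="1.0" encoding="utf-8"?>
<entities xmlns="http://www.talsever.org/namespaces/edml">
  <!-- name present: the target may be any well-formed fragment -->
  <entity name="Aacute" system="http://www.talsever.org/entities/Aacute.xml"/>
  <!-- name absent: the target must have an entity element as its root,
       and the definition keeps the name given there (Oacute) -->
  <entity system="http://www.talsever.org/entities/Oacute.edml"/>
</entities>
```

Had the second reference carried a name attribute of its own, that name would have replaced Oacute, which is the renaming and copying mechanism described in 3.2.2.1.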

3.3 Alternative Syntax

There is already an alternative syntax for EDML; it's called DTD, and it's more widely supported than EDML is or is likely to be. If a non-XML syntax is to be used, better to use that one than to invent another. DTD has the advantage of providing the internal subset. EDML is designed to integrate smoothly with existing DTD definitions of entities, but not to provide a substitute for the internal subset.

That is, while it is possible (in theory, if not yet in practice) to attach entity definitions to a schema, to attach them to an instance, and to override them in an instance, all of these techniques require external files, unlike the internal subset, which is defined inside the instance document.

4. Usage

For EDML to be useful, it must be possible to refer to EDML definitions in existing documents. There are four possible ways to do this:

  1. Embedding in the schema for a class of documents

  2. Inclusion of files in the prologue of an instance document

  3. Embedding in an instance document

  4. Embedding in the prologue of an instance document

Embedding in a schema is feasible, either directly or by import. Inclusion into an instance document in the prologue is also feasible, although it uses a technique (processing instructions) which many XML gurus find distasteful. It is similar, however, to the use of PIs for stylesheets and the like. Embedding in an instance document is problematic, primarily due to the scoping issues that it raises and the limitations that it is necessarily under. Embedding in the prologue of an instance document requires either that XML documents be redefined to permit multiple roots, or that an alternative, non-XML syntax for EDML be defined (thereby losing many of the benefits of EDML). We have already dismissed an alternative syntax for EDML (see above).

4.1 Extending Schema Languages with EDML

For entity definition collections that provide character replacement (such as the LATIN-1 entity definition collection), it is easy and sensible to incorporate the entity definition collection into the schema for the language (XHTML, for instance). This is easily enough accomplished by placing the entity definitions early in the schema document for the dialect being defined. Both internal and external definitions are possible.

The drawback to doing this, at present, is that it doesn't work. XML parsers must be enhanced to include support for EDML embedded in a schema. It may be possible to define, for instance, a resolver in SAX that can expand the entity definitions in some fashion, but it is not clear that this is even feasible (more study is needed).

4.2 The EDML Processing Instruction

It is not uncommonly necessary to repeat blocks of markup in a single document, or in a number of related instance documents produced by a single organization. This calls for the ability to include a reference to the entity definition collection in the prologue of an instance document.

This technique is also useful when creating large documents. The document can be broken into sections, each representing a smaller subtree. In this case, the usual inclusion mechanism uses the style of import which references a single entity, effectively supplying the entity name for the referenced file. This permits each subtree to be independently developed and validated.

Note: it might be necessary to define what happens if an imported entity of this type contains other entity imports, or a doctype declaration, or an internal subset.

A processing instruction is defined for this purpose. Actually, there are two. One is called 'entities' and the other is 'entity', and both contain, as their content, a single URI. The entity processing instruction may also contain an entity name. These correspond directly with the external entities and entity elements defined above. That is, an entities processing instruction imports a document containing, as its root, an internal entities element. An entity processing instruction which lacks a name imports a document containing, as its root, an internal entity element. An entity processing instruction which has a name MAY import a document containing, as its root, an internal entity element, in which case the name in the processing instruction replaces the name found in the name attribute. An entity processing instruction that has a name may also import a document or fragment which is merely well-formed XML; the name provided is the name of the entity, which has the content of the document or fragment as its content.

As with inclusion in schema languages, the major problem with this technique is that it doesn't currently work with any shipping XML parser. In this case, however, the "macro-replacement" character of EDML is an advantage; by placing a filter into a SAX processing stream, the XML can be modified by an intelligent EDML filter. This filter would need to receive processing instruction notifications, would have to resolve these PIs, and would then perform replacement of entities as they are encountered in the stream before the rest of the parser saw them. The drawback of this technique is that it does not integrate with the use of internal DTD subsets; it would potentially expand entities wrongly, ignoring the overrides in the internal subset.
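
A sketch of the unnamed forms, which by the description above contain a single URI and nothing else. The doc element is hypothetical. The layout of the named form of the entity processing instruction is not spelled out in this section, so it is not shown here.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- imports a single definition whose root is an internal entity element -->
<?entity http://www.talsever.org/entities/Oacute.edml?>
<!-- imports a collection whose root is an internal entities element -->
<?entities http://www.talsever.org/entities/latin1.edml?>
<doc>&Oacute; caf&eacute;</doc>
```

A parser without EDML support would report both references as undeclared. An EDML-aware filter, as described above, would resolve the two processing instructions in order, so the single Oacute definition is seen before the Latin 1 collection.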

4.3 Other Forms of EDML Embedding

It is possible to imagine a schema that includes the schema for EDML. This would, potentially, allow instance documents to define a sort of internal subset inside the document element.

A minor drawback of this technique, shared with including EDML in the schema and using processing instructions, is that it doesn't work. A more significant drawback is that it probably shouldn't work.

4.3.1 Document-Scope Embedding

The largest question that arises when embedding inside the document element of an instance document is what the scope of the entity definitions ought to be. One solution is to state that entity definitions are valid from the point of definition to the end of the document.

This scoping mechanism has two drawbacks. First, if the principle of "first defined has priority" is maintained, then it will surprise users accustomed to the internal DTD subset, because their attempted overrides won't work. Second, for large documents, this scoping mechanism breaks expectations. Customized parsers for large documents may read only a portion of a document; permitting document scope for entity definitions breaks this technique irretrievably.

4.3.2 Element-Scope Embedding

Another possibility for scoping entities defined inside the document element is to re-use the pattern of namespace declarations (or the xml:lang attribute). Sort of. Unfortunately, entities aren't declared in attributes. So it goes. Anyway. The scope of the definition would extend through the scope of the containing element.

This scoping mechanism has drawbacks, as well. First, it naturally follows that overrides happen in narrowest scope. This is the reverse of the normal principle ("first seen") for entities, and would be very likely to unreasonably complicate implementations. Second, it is highly counter-intuitive to those who currently use entities--it is easy to imagine the frustration of a user defining entities in /html/head who can't understand why they don't work inside /html/body, for instance. Mostly, it is likely to add enormously to the weight of the machinery necessary for processing entities; instead of knowing what the entities are when the document element is encountered, the parser must be ready for additional definitions, and even redefinitions.
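
To make the /html/head surprise concrete, here is a hypothetical sketch of element-scope embedding. Nothing in this specification defines such an embedding; the e prefix, the entity name org, and the scoping behavior described in the comments are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:e="http://www.talsever.org/namespaces/edml">
  <head>
    <e:entities>
      <e:entity name="org">Talsever</e:entity>
    </e:entities>
    <!-- inside head: the definition is in scope -->
    <title>About &org;</title>
  </head>
  <body>
    <!-- inside body: head has closed, so org is no longer defined -->
    <p>Welcome to &org;.</p>
  </body>
</html>
```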

4.4 Priority of Definition

Several techniques for incorporating EDML definitions into an instance document (directly or indirectly) have been outlined above. Given these various techniques, it is reasonable to expect them to be combined. What happens, then, when a particular entity is defined multiple times?

The rule for DTDs is simple: the first definition rules. Since the internal subset is fully processed before the external subset is loaded (even though the external subset is identified before the internal subset is complete), this allows an instance to override external definitions, a powerful feature.

The same rule is applied, as a principle, to EDML definitions. When a document contains an internal subset, its entity definitions (if any) are processed first. If processing instructions exist in the document pointing to external general parsed entity definitions (see above), they are next processed. Any previously-defined entities which may be redefined are instead ignored. If the schema language includes a facility for identifying the schema within the instance, it is logically processed next, on encountering the document element. This rule is extended to those languages (such as RELAX NG) that do not provide a facility to identify the governing schema inside an instance document: effectively, this means that entity definitions defined by the internal subset, then entity definitions defined using the entity inclusion processing instructions, override the entities defined in a schema for a particular XML dialect, regardless of the schema language used to define the dialect.

Entities defined using document-scope or element-scope rules cannot override existing definitions. As already noted, this may make these scopes less than useful in practice. They can only be used to define previously-undefined entities. It is recommended, therefore, that neither document-scope nor element-scope rules be deployed.

Note that schema languages that support EDML may override imports, by using an internal entities element which contains first the overrides, then the imports. This will not resolve issues for very large, overlapping, conflicting entity definition collections, but it does help a bit.
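
The ordering can be sketched with an instance that combines an internal subset with an entities processing instruction. The doc element is hypothetical, the cuties collection is the one defined in Appendix A.3, and the example assumes a processor that supports EDML.

```xml
<?xml version="1.0" encoding="utf-8"?>
<?entities http://www.talsever.org/boilerplate/cuties.xml?>
<!DOCTYPE doc [
  <!-- internal subset: processed first, so this definition wins -->
  <!ENTITY Aacute "&#x00C1;">
]>
<doc>&Aacute; &Oacute; &eacute;</doc>
```

Under the rule above, &Aacute; takes its value from the internal subset, and the later definition in the cuties collection is ignored. &Oacute; becomes "Oh, a cutie!", because the collection defines it before importing Latin 1. &eacute; comes from the imported Latin 1 set, since nothing earlier defines it.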

5. Security Considerations

The Billion Laughs attack (exponential expansion of nested entity references) is the most obvious threat. This specification requires that previously-seen URIs be ignored, which ameliorates the attack, but not by much.
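
A deliberately small sketch of the pattern, assuming a processor that supports EDML; the entity names are hypothetical. Each definition references only entities that appear earlier in the document, so it breaks no rule in this specification.

```xml
<?xml version="1.0" encoding="utf-8"?>
<entities xmlns="http://www.talsever.org/namespaces/edml">
  <entity name="lol0">lol</entity>
  <!-- each level references the previous level ten times -->
  <entity name="lol1">&lol0;&lol0;&lol0;&lol0;&lol0;&lol0;&lol0;&lol0;&lol0;&lol0;</entity>
  <entity name="lol2">&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;</entity>
</entities>
```

Here &lol2; expands to one hundred copies of "lol", and every further level multiplies that by ten, so continuing the pattern through lol9 yields a billion. No URI is loaded twice (none is loaded at all), which is why the rule about previously-seen URIs offers little protection.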

URIs in submitted documents may be designed to produce particular errors, in order to allow an attacker to probe the network topology of a target. The tradeoff is between enhanced security and impoverished error messages.

A. Examples (Non-Normative)

This section contains some examples of entity definition with EDML, from fairly simple character replacement, to boilerplate and subdocument inclusion, to overrides, incorporating entity definitions into a schema, and importing entity definitions into a document using processing instructions.

A.1 Example: Standalone Entity Definitions Collections

For our first trick, ladies and gentlemen, the Latin-1 entity definitions, converted to EDML.


<?xml version="1.0" encoding="utf-8"?>
<entities xmlns="http://www.talsever.org/namespaces/edml"
          uri="ISO 8879:1986//ENTITIES Added Latin 1//EN//XML"
          canonical="http://www.talsever.org/entities/latin1.edml"
          version="0.3">
<!-- This version converted from:
     Copyright (C) 2001, 2002 Organization for the Advancement of Structured
     Information Standards (OASIS).

     Permission to use, copy, modify and distribute this entity set
     and its accompanying documentation for any purpose and without
     fee is hereby granted in perpetuity, provided that the above
     copyright notice and this paragraph appear in all copies. The
     copyright holders make no representation about the suitability of
     the entities for any purpose. It is provided "as is" without
     expressed or implied warranty.
-->
  <entity name="aacute">&#x00E1;</entity><!-- LATIN SMALL LETTER A WITH ACUTE -->
  <entity name="Aacute">&#x00C1;</entity><!-- LATIN CAPITAL LETTER A WITH ACUTE -->
  <entity name="acirc">&#x00E2;</entity><!-- LATIN SMALL LETTER A WITH CIRCUMFLEX -->
  <entity name="Acirc">&#x00C2;</entity><!-- LATIN CAPITAL LETTER A WITH CIRCUMFLEX -->
  <entity name="agrave">&#x00E0;</entity><!-- LATIN SMALL LETTER A WITH GRAVE -->
  <entity name="Agrave">&#x00C0;</entity><!-- LATIN CAPITAL LETTER A WITH GRAVE -->
  <entity name="aring">&#x00E5;</entity><!-- LATIN SMALL LETTER A WITH RING ABOVE -->
  <entity name="Aring">&#x00C5;</entity><!-- LATIN CAPITAL LETTER A WITH RING ABOVE -->
  <entity name="atilde">&#x00E3;</entity><!-- LATIN SMALL LETTER A WITH TILDE -->
  <entity name="Atilde">&#x00C3;</entity><!-- LATIN CAPITAL LETTER A WITH TILDE -->
  <entity name="auml">&#x00E4;</entity><!-- LATIN SMALL LETTER A WITH DIAERESIS -->
  <entity name="Auml">&#x00C4;</entity><!-- LATIN CAPITAL LETTER A WITH DIAERESIS -->
  <entity name="aelig">&#x00E6;</entity><!-- LATIN SMALL LETTER AE -->
  <entity name="AElig">&#x00C6;</entity><!-- LATIN CAPITAL LETTER AE -->
  <entity name="ccedil">&#x00E7;</entity><!-- LATIN SMALL LETTER C WITH CEDILLA -->
  <entity name="Ccedil">&#x00C7;</entity><!-- LATIN CAPITAL LETTER C WITH CEDILLA -->
  <entity name="eth">&#x00F0;</entity><!-- LATIN SMALL LETTER ETH -->
  <entity name="ETH">&#x00D0;</entity><!-- LATIN CAPITAL LETTER ETH -->
  <entity name="eacute">&#x00E9;</entity><!-- LATIN SMALL LETTER E WITH ACUTE -->
  <entity name="Eacute">&#x00C9;</entity><!-- LATIN CAPITAL LETTER E WITH ACUTE -->
  <entity name="ecirc">&#x00EA;</entity><!-- LATIN SMALL LETTER E WITH CIRCUMFLEX -->
  <entity name="Ecirc">&#x00CA;</entity><!-- LATIN CAPITAL LETTER E WITH CIRCUMFLEX -->
  <entity name="egrave">&#x00E8;</entity><!-- LATIN SMALL LETTER E WITH GRAVE -->
  <entity name="Egrave">&#x00C8;</entity><!-- LATIN CAPITAL LETTER E WITH GRAVE -->
  <entity name="euml">&#x00EB;</entity><!-- LATIN SMALL LETTER E WITH DIAERESIS -->
  <entity name="Euml">&#x00CB;</entity><!-- LATIN CAPITAL LETTER E WITH DIAERESIS -->
  <entity name="iacute">&#x00ED;</entity><!-- LATIN SMALL LETTER I WITH ACUTE -->
  <entity name="Iacute">&#x00CD;</entity><!-- LATIN CAPITAL LETTER I WITH ACUTE -->
  <entity name="icirc">&#x00EE;</entity><!-- LATIN SMALL LETTER I WITH CIRCUMFLEX -->
  <entity name="Icirc">&#x00CE;</entity><!-- LATIN CAPITAL LETTER I WITH CIRCUMFLEX -->
  <entity name="igrave">&#x00EC;</entity><!-- LATIN SMALL LETTER I WITH GRAVE -->
  <entity name="Igrave">&#x00CC;</entity><!-- LATIN CAPITAL LETTER I WITH GRAVE -->
  <entity name="iuml">&#x00EF;</entity><!-- LATIN SMALL LETTER I WITH DIAERESIS -->
  <entity name="Iuml">&#x00CF;</entity><!-- LATIN CAPITAL LETTER I WITH DIAERESIS -->
  <entity name="ntilde">&#x00F1;</entity><!-- LATIN SMALL LETTER N WITH TILDE -->
  <entity name="Ntilde">&#x00D1;</entity><!-- LATIN CAPITAL LETTER N WITH TILDE -->
  <entity name="oacute">&#x00F3;</entity><!-- LATIN SMALL LETTER O WITH ACUTE -->
  <entity name="Oacute">&#x00D3;</entity><!-- LATIN CAPITAL LETTER O WITH ACUTE -->
  <entity name="ocirc">&#x00F4;</entity><!-- LATIN SMALL LETTER O WITH CIRCUMFLEX -->
  <entity name="Ocirc">&#x00D4;</entity><!-- LATIN CAPITAL LETTER O WITH CIRCUMFLEX -->
  <entity name="ograve">&#x00F2;</entity><!-- LATIN SMALL LETTER O WITH GRAVE -->
  <entity name="Ograve">&#x00D2;</entity><!-- LATIN CAPITAL LETTER O WITH GRAVE -->
  <entity name="oslash">&#x00F8;</entity><!-- LATIN SMALL LETTER O WITH STROKE -->
  <entity name="Oslash">&#x00D8;</entity><!-- LATIN CAPITAL LETTER O WITH STROKE -->
  <entity name="otilde">&#x00F5;</entity><!-- LATIN SMALL LETTER O WITH TILDE -->
  <entity name="Otilde">&#x00D5;</entity><!-- LATIN CAPITAL LETTER O WITH TILDE -->
  <entity name="ouml">&#x00F6;</entity><!-- LATIN SMALL LETTER O WITH DIAERESIS -->
  <entity name="Ouml">&#x00D6;</entity><!-- LATIN CAPITAL LETTER O WITH DIAERESIS -->
  <entity name="szlig">&#x00DF;</entity><!-- LATIN SMALL LETTER SHARP S -->
  <entity name="thorn">&#x00FE;</entity><!-- LATIN SMALL LETTER THORN -->
  <entity name="THORN">&#x00DE;</entity><!-- LATIN CAPITAL LETTER THORN -->
  <entity name="uacute">&#x00FA;</entity><!-- LATIN SMALL LETTER U WITH ACUTE -->
  <entity name="Uacute">&#x00DA;</entity><!-- LATIN CAPITAL LETTER U WITH ACUTE -->
  <entity name="ucirc">&#x00FB;</entity><!-- LATIN SMALL LETTER U WITH CIRCUMFLEX -->
  <entity name="Ucirc">&#x00DB;</entity><!-- LATIN CAPITAL LETTER U WITH CIRCUMFLEX -->
  <entity name="ugrave">&#x00F9;</entity><!-- LATIN SMALL LETTER U WITH GRAVE -->
  <entity name="Ugrave">&#x00D9;</entity><!-- LATIN CAPITAL LETTER U WITH GRAVE -->
  <entity name="uuml">&#x00FC;</entity><!-- LATIN SMALL LETTER U WITH DIAERESIS -->
  <entity name="Uuml">&#x00DC;</entity><!-- LATIN CAPITAL LETTER U WITH DIAERESIS -->
  <entity name="yacute">&#x00FD;</entity><!-- LATIN SMALL LETTER Y WITH ACUTE -->
  <entity name="Yacute">&#x00DD;</entity><!-- LATIN CAPITAL LETTER Y WITH ACUTE -->
  <entity name="yuml">&#x00FF;</entity><!-- LATIN SMALL LETTER Y WITH DIAERESIS -->
</entities>

Note that this conversion was largely mechanical (specification authors are due to be replaced by a small shell script ...).

A.2 Example: Boilerplate or Subdocument Definition

A common requirement for organizations producing XML documents is a standard copyright notice, disclaimer, or other weasel-wording. Local guidelines probably suggest where this must appear, but the actual content is probably defined externally, once. Here's a simple copyright example.


<?xml version="1.0" encoding="utf-8"?>
<entities xmlns="http://www.talsever.org/namespaces/edml"
          uri="http://www.talsever.org/boilerplate/copyright"
          canonical="http://www.talsever.org/boilerplate/copyright.xml">

  <entity name="copyright"><p>Copyright &#x00A9;
    <a href="http://www.talsever.org/">Talsever</a>, 
    All Rights Reserved.</p>
  </entity>

</entities>

The above form is perfectly standalone, but may not be the best solution. Here's an alternative. First, define the XML content in a file by itself.


<?xml encoding="utf-8"?>
<p>Copyright &#x00A9; <a href="http://www.talsever.org/">Talsever</a>, 
All Rights Reserved.</p>

Then import it, using the schema technique or the processing instruction technique. We will demonstrate this, below. For purposes of demonstration, we assert that the above fragment may be found at http://www.talsever.org/boilerplate/copyright.xhtml.

Any XML entity (well-formed fragment) may be assigned an entity name in this fashion. For purposes of discussion later on, imagine that we have also written three sections of a specification, which are each located at http://www.talsever.org/xml/edml/, in the files header.xml, normative.xml, and informative.xml.

A.3 Example: Overriding Imported Definitions

When vocabularies containing entities are mixed, it is not uncommonly necessary to manually override certain definitions. For instance, it may be that a set of Greek language entity definitions used by classical document authors could collide with a set of symbol entity definitions used in markup of mathematics (in a treatise on classical Greek geometry, perhaps?). As entities are generally given short, memorable names, in a global namespace, increasing use of entities leads to increasing likelihood of collisions. For another instance, imagine that the previous example had called the entity "copy" rather than "copyright", and an instance document also imported the ISO Numeric entities. For this example, though, we'll be frivolous.


<?xml version="1.0" encoding="utf-8"?>
<entities xmlns="http://www.talsever.org/namespaces/edml"
          uri="http://www.talsever.org/boilerplate/cuties"
          canonical="http://www.talsever.org/boilerplate/cuties.xml">

  <entity name="Aacute">Ah, a cutie!</entity>
  <entity name="Oacute">Oh, a cutie!</entity>
  <entities system="http://www.talsever.org/entities/latin1.edml" />

</entities>

Because of the priority rules, an import of this collection into a schema or instance document results in the definition of the &Aacute; and &Oacute; overrides, plus the rest of the Latin 1 set.
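The effect of the priority rules on this collection can be sketched in a few lines of Python. This is only an illustration, not part of EDML: it assumes that the first definition of a name wins (as it does for entity declarations in XML 1.0), it handles text-only entities, and it replaces network retrieval with a hypothetical in-memory table (COLLECTIONS) holding a three-entry stand-in for the Latin 1 set.

```python
import xml.etree.ElementTree as ET

EDML_NS = "http://www.talsever.org/namespaces/edml"
ENTITIES = f"{{{EDML_NS}}}entities"
ENTITY = f"{{{EDML_NS}}}entity"

# Hypothetical stand-in for the real Latin 1 collection (three entries only).
LATIN1 = f"""<entities xmlns="{EDML_NS}">
  <entity name="Aacute">&#xC1;</entity>
  <entity name="Oacute">&#xD3;</entity>
  <entity name="eacute">&#xE9;</entity>
</entities>"""

CUTIES = f"""<entities xmlns="{EDML_NS}">
  <entity name="Aacute">Ah, a cutie!</entity>
  <entity name="Oacute">Oh, a cutie!</entity>
  <entities system="http://www.talsever.org/entities/latin1.edml" />
</entities>"""

# Assumption: referenced collections are looked up here instead of fetched.
COLLECTIONS = {"http://www.talsever.org/entities/latin1.edml": LATIN1}


def collect(element, table):
    """Walk a collection in document order; the first definition of a name wins."""
    for child in element:
        if child.tag == ENTITY:
            # setdefault keeps an earlier definition and ignores later ones.
            table.setdefault(child.get("name"), child.text or "")
        elif child.tag == ENTITIES:
            system = child.get("system")
            nested = ET.fromstring(COLLECTIONS[system]) if system else child
            collect(nested, table)
    return table


resolved = collect(ET.fromstring(CUTIES), {})
```

Because the two overrides precede the reference to the Latin 1 collection, they are already in the table when the imported definitions of the same names arrive, and the imported ones are ignored.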

For purposes of demonstration later, imagine that the following is located in a file at http://www.talsever.org/entities/Aacute.xml:


<?xml encoding="UTF-8"?>
Ah, a cutie!

And the following is asserted to be located in a file at http://www.talsever.org/entities/Oacute.edml:


<?xml version="1.0" encoding="UTF-8"?>
<entity xmlns="http://www.talsever.org/namespaces/edml"
        name="Oacute">Oh, a cutie!</entity>

A.4 Example: Including Entity Definitions in a Schema

On to actually using the defined entities. One of the proposed methods is to incorporate the entities into a schema for a particular XML dialect. Since we started with the Latin 1 entities, which XHTML uses, and since the XHTML working group is currently engaged in creating RELAX NG schemas for XHTML 2.0, here is how to add EDML entity definitions to such a schema.

First, add a section to the end of the driver (http://www.w3.org/2002/06/xhtml2):


  <div>
    <x:h2>Entities module</x:h2>
    <include href="entities.rng"/>
  </div>

Next, define the entities grammar:


<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:x="http://www.w3.org/1999/xhtml"
         xmlns:e="http://www.talsever.org/namespaces/edml">

  <x:h1>Entity Definitions Collections Module</x:h1>

  <div>
    <x:h2>Core Entity Definitions Collection</x:h2>

    <e:entities system="ISOamsa.edml" />
    <e:entities system="ISOamsb.edml" />
    <e:entities system="ISOamsc.edml" />
    <e:entities system="ISOamsn.edml" />
    <e:entities system="ISOamso.edml" />
    <e:entities system="ISOamsr.edml" />
    <e:entities system="ISObox.edml" />
    <e:entities system="ISOcyr1.edml" />
    <e:entities system="ISOcyr2.edml" />
    <e:entities system="ISOdia.edml" />
    <e:entities system="ISOgrk1.edml" />
    <e:entities system="ISOgrk2.edml" />
    <e:entities system="ISOgrk3.edml" />
    <e:entities system="ISOgrk4.edml" />
    <e:entities system="ISOlat1.edml" />
    <e:entities system="ISOlat2.edml" />
    <e:entities system="ISOnum.edml" />
    <e:entities system="ISOpub.edml" />
    <e:entities system="ISOtech.edml" />

  </div>

  <!-- other collections might be defined as well -->

</grammar>

Finally, convert existing entity collections (ISOxyzzy.ent) to EDML format (ISOxyzzy.edml).
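The conversion step is mechanical for the common case. The following Python sketch is one possible approach, not a defined tool: the function names are our own, and it handles only simple internal declarations of the form <!ENTITY name "value">. Parameter entities, external declarations, and doubly escaped values such as "&#38;#38;" would need additional handling.

```python
import re
import xml.etree.ElementTree as ET

EDML_NS = "http://www.talsever.org/namespaces/edml"

# Matches simple internal declarations such as <!ENTITY eacute "&#xE9;">.
# Parameter entities and SYSTEM/PUBLIC declarations are deliberately not matched.
ENTITY_DECL = re.compile(r'<!ENTITY\s+([^\s%"\']+)\s+(["\'])(.*?)\2\s*>', re.S)
CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def expand_char_refs(value):
    """Replace numeric character references with the characters they name."""
    def to_char(match):
        ref = match.group(1)
        return chr(int(ref[1:], 16) if ref.startswith("x") else int(ref))
    return CHAR_REF.sub(to_char, value)


def ent_to_edml(ent_text):
    """Build an EDML entities document from the text of a .ent file."""
    ET.register_namespace("", EDML_NS)  # serialize EDML as the default namespace
    root = ET.Element(f"{{{EDML_NS}}}entities")
    for name, _quote, value in ENTITY_DECL.findall(ent_text):
        entity = ET.SubElement(root, f"{{{EDML_NS}}}entity", name=name)
        entity.text = expand_char_refs(value)
    return ET.tostring(root, encoding="unicode")


sample = (
    '<!ENTITY Aacute "&#193;"><!-- capital A, acute accent -->\n'
    '<!ENTITY eacute "&#xE9;">'
)
edml = ent_to_edml(sample)
```

The result is a single entities element in the EDML namespace with one entity child per declaration, each holding the character itself rather than a reference to it.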

Note that the relative ease of defining things in this fashion is one of the attractions of EDML. The drawback, as previously mentioned, is that no current RELAX NG or XML processor understands that these are entity definitions, so for the moment they are purely decorative.

A.5 Example: Including Entity Definitions Using Processing Instructions

And, for our final act, an actual, albeit trivial, document that uses processing instructions to import previously defined examples, with some silly overrides.


<?xml version="1.0" ?>
<!DOCTYPE spec PUBLIC "-//W3C//DTD Specification V2.2//EN" "http://www.w3.org/2002/xmlspec/dtd/2.2/xmlspec.dtd">
<!-- redefine Aacute -->
<?entity Aacute http://www.talsever.org/entities/Aacute.xml ?>
<!-- redefine Oacute -->
<?entity http://www.talsever.org/entities/Oacute.edml ?>
<!-- redefine oacute using the content of the Oacute entity -->
<?entity oacute http://www.talsever.org/entities/Oacute.edml ?>
<!-- import all of latin 1 *except* the already defined cuties -->
<?entities http://www.talsever.org/entities/latin1.edml ?>
<!-- import the three parts of the document -->
<?entity header http://www.talsever.org/edml/header.xml ?>
<?entity normative http://www.talsever.org/edml/normative.xml ?>
<?entity informative http://www.talsever.org/edml/informative.xml ?>
<spec w3c-doctype="other" other-doctype="random-noise" status="int-review">
&header;
&normative;
&informative;
</spec>

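Supposing that the document base URI is http://www.talsever.org/edml/ (note the trailing slash, without which header.xml would resolve to http://www.talsever.org/header.xml), then the following variant is possible: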


<?xml version="1.0" ?>
<!DOCTYPE spec PUBLIC "-//W3C//DTD Specification V2.2//EN" "http://www.w3.org/2002/xmlspec/dtd/2.2/xmlspec.dtd">
<?entity Aacute /entities/Aacute.xml ?>
<?entity /entities/Oacute.edml ?>
<?entity oacute /entities/Oacute.edml ?>
<?entities /entities/latin1.edml ?>
<?entity header header.xml ?>
<?entity normative normative.xml ?>
<?entity informative informative.xml ?>
<spec w3c-doctype="other" other-doctype="random-noise" status="int-review">
&header;
&normative;
&informative;
</spec>

Note, in the foregoing examples, that the character entities are not actually used in the document. Que sera, sera.
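A preprocessor would need to read these processing instructions and resolve their references against the base URI. The Python sketch below shows one way that might look. It is an assumption-laden illustration: the function name is our own, the processing instruction layout (an optional name followed by a URI reference) is inferred from the examples above, and the instructions are found with a regular expression rather than a real XML parser.

```python
import re
from urllib.parse import urljoin

# Finds <?entity ...?> and <?entities ...?>; other instructions are ignored.
PI = re.compile(r"<\?(entity|entities)\s+(.*?)\s*\?>", re.S)


def read_imports(document, base):
    """Return (target, name, absolute URI) for each EDML processing instruction."""
    imports = []
    for target, data in PI.findall(document):
        parts = data.split()
        if not parts:
            continue  # nothing to import
        # <?entity name uri?> names (or renames) the entity on import;
        # <?entity uri?> and <?entities uri?> carry only a location.
        name, ref = (parts[0], parts[1]) if len(parts) == 2 else (None, parts[0])
        imports.append((target, name, urljoin(base, ref)))
    return imports


prolog = """<?xml version="1.0" ?>
<?entity Aacute /entities/Aacute.xml ?>
<?entity /entities/Oacute.edml ?>
<?entities /entities/latin1.edml ?>
<?entity header header.xml ?>
"""

imports = read_imports(prolog, "http://www.talsever.org/edml/")
```

With a base URI of http://www.talsever.org/edml/, the reference header.xml resolves to http://www.talsever.org/edml/header.xml, while /entities/Aacute.xml resolves against the server root. Under standard URI resolution, a base without the trailing slash would instead place header.xml directly under the server root.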

B. Schema for Entity Definition Markup Language (Non-Normative)

The following schema parses (via trang), but could contain errors. If the text differs from the schema, then the text takes precedence.


<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.talsever.org/namespaces/edml"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

  <start>
    <choice>
      <element name="entities">
        <ref name="entities-internal" />
      </element>
      <element name="entity">
        <ref name="entity-internal" />
      </element>
    </choice>
  </start>

  <define name="entities-content">
    <choice>
      <ref name="location-attributes" />
      <ref name="entities-internal" />
    </choice>
  </define>

  <define name="entities-internal">
    <ref name="entities-attributes" />
    <oneOrMore>
      <choice>
        <element name="entities">
          <ref name="entities-content" />
        </element>
        <element name="entity">
          <ref name="entity-content" />
        </element>
      </choice>
    </oneOrMore>
  </define>

  <define name="entity-content">
    <choice>
      <ref name="entity-internal" />
      <group>
        <ref name="location-attributes" />
        <optional>
          <ref name="name-attribute" />
        </optional>
      </group>
    </choice>
  </define>

  <define name="entity-internal">
    <ref name="name-attribute" />
    <ref name="any-mixed" />
  </define>

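  <!-- text alone, or text mixed with non-EDML elements and their attributes -->
  <define name="any-mixed">
    <mixed>
      <zeroOrMore>
        <element>
          <anyName>
            <except>
              <nsName ns="http://www.talsever.org/namespaces/edml" />
            </except>
          </anyName>
          <zeroOrMore>
            <attribute>
              <anyName />
            </attribute>
          </zeroOrMore>
          <ref name="any-mixed" />
        </element>
      </zeroOrMore>
    </mixed>
  </define>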

  <define name="location-attributes">
    <attribute name="system">
      <data type="anyURI" />
    </attribute>
    <optional>
      <attribute name="public" />
    </optional>
    <!-- always empty if this attribute set is present -->
    <empty />
  </define>

  <define name="entities-attributes">
    <optional>
      <attribute name="uri">
        <data type="anyURI" />
      </attribute>
    </optional>
    <optional>
      <attribute name="canonical">
        <data type="anyURI" />
      </attribute>
    </optional>
    <optional>
      <attribute name="version" />
    </optional>
  </define>

  <define name="name-attribute">
    <attribute name="name">
      <data type="NCName" />
    </attribute>
  </define>

</grammar>

Same schema, different syntax:


default namespace edml = "http://www.talsever.org/namespaces/edml"

start =
  element entities { entities-internal }
  | element entity { entity-internal }

entities-content = location-attributes | entities-internal

entities-internal =
  entities-attributes,
  (element entities { entities-content }
   | element entity { entity-content })+

entity-content =
  entity-internal | (location-attributes, name-attribute?)

entity-internal = name-attribute, any-mixed

# text alone, or text mixed with non-EDML elements and their attributes
any-mixed =
  mixed {
    element * - edml:* {
      attribute * { text }*,
      any-mixed
    }*
  }

location-attributes =
  attribute system { xsd:anyURI },
  attribute public { text }?,
  # always empty if this attribute set is present
  empty

entities-attributes =
  attribute uri { xsd:anyURI }?,
  attribute canonical { xsd:anyURI }?,
  attribute version { text }?

name-attribute = attribute name { xsd:NCName }
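For readers who prefer code to grammars, the either/or structure of the schema (an element is either a reference carrying location attributes, or an internal definition) can be approximated in Python. This is a rough sketch with function names of our own choosing. It checks only structure and attribute names, and it does not check datatypes (anyURI, NCName) or the content of internal entity elements, so it is no substitute for a RELAX NG validator.

```python
import xml.etree.ElementTree as ET

EDML_NS = "http://www.talsever.org/namespaces/edml"
ENTITIES = f"{{{EDML_NS}}}entities"
ENTITY = f"{{{EDML_NS}}}entity"


def is_reference(element):
    """True if the element carries location attributes and is otherwise empty."""
    allowed = {"system", "public"}
    if element.tag == ENTITY:
        allowed.add("name")  # an entity reference may also carry a name
    return (
        "system" in element.attrib
        and set(element.attrib) <= allowed
        and len(element) == 0
        and not (element.text or "").strip()
    )


def check_entities(element):
    """Check an internal entities element: allowed attributes, one or more children."""
    if not set(element.attrib) <= {"uri", "canonical", "version"}:
        return False
    children = list(element)
    if not children:
        return False
    for child in children:
        if child.tag == ENTITIES:
            if not (is_reference(child) or check_entities(child)):
                return False
        elif child.tag == ENTITY:
            # Internal entities carry exactly the name attribute.
            if not (is_reference(child) or set(child.attrib) == {"name"}):
                return False
        else:
            return False
    return True


sample = ET.fromstring(
    f'<entities xmlns="{EDML_NS}" uri="http://www.talsever.org/boilerplate/cuties">'
    '<entity name="Aacute">Ah, a cutie!</entity>'
    '<entities system="http://www.talsever.org/entities/latin1.edml" />'
    "</entities>"
)
sample_ok = check_entities(sample)
```

The sample mirrors the collection from A.3: an internal entities element containing one internal entity and one reference to another collection.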

C. Changelog (Non-Normative)

2004 Apr 25: Modified the semantics of the entity element and processing instruction, so that entity may be used standalone. Enabled renaming of entities on import. Revised the schemas (again).

2004 Apr 25: Added more divisions in the syntax section, to break up long blocks of text and identify each piece. Improved (well, arguably so, anyway, everyone's a critic!) the examples.