W3C

XML Pipeline Definition Language Version 1.0

W3C Note 28 February 2002

This version:
http://www.w3.org/TR/2002/NOTE-xml-pipeline-20020228/ (in HTML and XML with separate provision of the Pipeline schema)
Latest version:
http://www.w3.org/TR/xml-pipeline/
Editors:
Norman Walsh, Sun Microsystems, Inc. <Norman.Walsh@Sun.COM>
Eve Maler, Sun Microsystems, Inc. <Eve.Maler@Sun.COM>

Abstract

This Note describes the features and syntax for XML Pipeline Definition Language. Pipeline is an XML vocabulary for describing the processing relationships between XML resources. A pipeline document specifies the inputs and outputs to XML processes and a pipeline controller uses this document to figure out the chain of processing that must be executed in order to get a particular result.

Status of this Document

This document is a submission to the World Wide Web Consortium (see Submission Request, W3C Staff Comment) that defines a way to increase interoperability of multi-process XML applications. Comments to the authors are welcome.

This Note is made available by W3C for discussion only. Publication of this Note by the W3C indicates no endorsement by W3C or the W3C Team, or any W3C Members. The W3C has had no editorial control over the preparation of this Note. This document is a work in progress and may be updated, replaced, or rendered obsolete by other documents at any time.

For a full list of acknowledged submissions, see Acknowledged Submissions to W3C. A list of current W3C technical documents can be found at the Technical Reports page.

Table of Contents

1 Introduction
    1.1 Terminology
    1.2 Motivation
    1.3 Scope
    1.4 Process Classification
2 XML Pipeline Definition Language
    2.1 Resources and URIs in the Pipeline Language
    2.2 Information Sets and Pipeline Controller Behavior
    2.3 Example
    2.4 Definition of the Pipeline Language
        2.4.1 XML Schema for the Pipeline Language
        2.4.2 Pipeline Document Constructs
            2.4.2.1 The pipeline element
            2.4.2.2 The processdef element
            2.4.2.3 The process element
            2.4.2.4 The input element
            2.4.2.5 The output element
            2.4.2.6 The error element
            2.4.2.7 The param element
            2.4.2.8 The document element

Appendices

A Use Cases
    A.1 Simple Business Transactions
    A.2 Transforming to a New Schema
    A.3 Document Publishing
    A.4 Business Transaction Hub
    A.5 Web Service Implementation
B A Complex Example
C References
D Open Issues


1 Introduction

There is a large and growing set of specifications that describe processes operating on [XML] documents. Considering how these specifications interact raises many issues. This specification will address the issues related to interoperability of applications that involve multiple processes operating on documents.

This specification is not generally concerned with the XML parsing process. XML documents are here considered to be operated on as [XML Infoset] information sets. Parsing, with some appropriate level of support for [XML Namespaces] and [XML Base], is assumed to be a well-defined process that is a necessary precursor to any operation over an XML document.

The processes of interest in this specification are those that construct, inspect, augment, or extract from information sets. A process begins with zero or more information sets and produces zero or more information sets (it may also produce ancillary information, such as whether it succeeded or failed).

Although some, perhaps most, applications work with concrete object models that are not identical to the Infoset, it is still useful to describe the processing model in terms of the Infoset. In practice, applications will use SAX event streams or DOM object models, or even use XML representations of infosets, to pass information back and forth.

1.1 Terminology

[Definition: The key words must, must not, required, shall, shall not, should, should not, recommended, may, and optional in this specification are to be interpreted as described in [RFC 2119].]

1.2 Motivation

Applications, for example many web services, can be implemented by integrating business processing and standard XML processing. Business processes vary greatly among different applications and environments, but XML processing is often mostly a matter of composing the individual operations described by separate specifications (validation, transformation, and so on) in a useful order.

Appendix A Use Cases provides use cases that demonstrate that no fixed order of processing can be imposed. Although it would in many ways be easier to build interoperable systems if a fixed order could be defined, any such order would impede some classes of applications.

However, the order of processing actually used in any one application is an important aspect of the semantics of the application, in that it might be imperative that some operations precede others; if some other order is applied, the results might be incorrect. It follows that the ability to identify an underlying processing model and describe how processing is to proceed is an imperative precondition to developing interoperable applications.

Note that for any given application, the processes might require a partial order, and not necessarily a total order. In other words:

  1. There are dependency relationships among the processes and any processing order that violates these dependencies is erroneous.

  2. But any order that does not violate the dependency relationships is equally correct.

The satisfaction of dependencies between processes is already a well-understood problem in software development; it forms the heart of systems like make and [Ant].

This specification describes a declarative XML vocabulary that addresses this need in a way that is tuned for XML processing.

1.3 Scope

The focus of this specification is intentionally quite narrow, in order to provide a working solution for the most pressing needs first. It is clear that a number of useful extensions could be made to support processing meta-models, wildcards in process inputs and outputs, and so on. These features might be developed over time, but they are not critical to solving the most immediate problems.

It is clear that addressing the whole problem will ultimately require APIs that allow a process manager to initiate individual processes as necessary. This specification does not attempt to address this issue. However, the software development community has already begun to do so with the development of SAX, TRaX, and other APIs.

The identification of a processing model itself could be performed entirely at the programming API level, but this approach seems to set the bar too high. True interoperability will be achieved only when it is possible for end-users working with stock XML tools to build processing models for themselves. Thus, this specification proposes an XML vocabulary rather than an API.

1.4 Process Classification

The following process classification provides a framework in which to discuss the operations of applications in general terms:

  1. Constructive processes. These are processes that build new information sets. Processes which produce a new information set or add information items or property values defined in [XML Infoset] (such as elements and attributes) to an existing information set are described as constructive; the latter scenario is an example of transformation, which is a subclass of construction. An [XInclude] process is constructive, as are [XSLT] and [XQuery].

  2. Augmenting processes. Processes, like [XML Schema] validation or [CSS] styling, that add new types of information items or properties to an existing information set are performing some sort of augmentation. A schema that adds default element or attribute values, as well as adding datatype information, is constructive as well as augmenting.

  3. Inspection processes. Processes that inspect but do not modify a document, such as [Schematron] or [RELAX NG] validation, fall into this category. Verifying a digital signature might also leave the document unchanged, indicating somehow that the process passed or failed.

  4. Extraction processes. Some processes reach into an existing information set and remove or copy parts of it for further processing. Processes that use [XPointer] (such as [XLink] and XInclude) fall into this category by virtue of their ability to address specific parts of a document. XML Schema processes are extractive in that they operate on namespace-specific portions of information sets. Some processes might decrypt only a region of encrypted content. We note that such processes might have different characteristics than more global operations, especially with respect to implementation efficiency.

  5. Packaging processes. Distributed or federated web applications will need to package a collection of resources to transmit to another location or service. This packaging could be performed using SOAP with attachments [SwA], for example. We note that the whole issue of packaging resources and providing a useful manifest is not adequately addressed.

It is important to note that the classes are not mutually exclusive, as shown, for example, by the fact that some schemas can cause a process to perform construction, augmentation, and extraction. In addition, processes can be hierarchical, with a constructive process performing a bit of validation or extraction, for example.

The fact that documents can be augmented or transformed raises another set of issues with respect to addressing into augmented or transformed results. [Linking/Style] discusses this at some length.

2 XML Pipeline Definition Language

[Definition: The XML Pipeline Definition Language ("the Pipeline language") is a vocabulary for describing the processing relationships between XML resources.] This section describes Pipeline language concepts, defines the language, and defines the required behavior of applications that take the language as input.

[Definition: A document that is an instance of the XML Pipeline Definition Language is a pipeline document.] [Definition: A pipeline controller is an application that builds a resource identified in the pipeline document by determining and then executing the processes necessary to produce it.]

The high-level requirements of the Pipeline language are as follows:

  1. It shall be expressed in XML.

    It should be possible to author and manipulate pipeline documents using standard XML tools.

  2. It shall be as declarative as possible.

    Declarative languages are more amenable to optimization and other compilation techniques. It should be relatively easy to implement a conformant pipeline controller, but it should also be possible to build a sophisticated controller that can perform parallel operations, lazy or greedy processing, and other optimizations.

  3. It shall be neutral with respect to implementation language.

    Just as there is no single language that can process XML exclusively, there should be no single language that can implement the Pipeline language exclusively. It should be possible to interoperably exchange pipeline documents across computing platforms.

2.1 Resources and URIs in the Pipeline Language

An individual process produces one or more result resources from some set of input resources. The Pipeline language uses URIs to identify resources (both inputs and results). A pipeline controller uses these URIs to keep track of the resources that it has available. In the context of a pipeline, every resource is identified by exactly one URI. Two resources are the same if they have the same URI. Two URIs are the same if they are lexically equivalent.

To indicate that an inspection process (a process that does not modify an input infoset) has succeeded, the Pipeline language requires the process to produce a result with a new URI. The resulting information set will be identical to the input information set, but its new label allows the pipeline controller to keep track of its status.

Note that the processing of URIs in pipeline documents depends on [XML Base] processing.
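
The following non-normative sketch illustrates how a controller might resolve a relative label against an in-scope base URI and then compare resources. It assumes that "lexically equivalent" means plain string equality after resolution; this specification does not define any further normalization. The function names are hypothetical.

```python
from urllib.parse import urljoin


def resolve_label(base_uri, label):
    """Resolve a possibly relative label against the in-scope base URI."""
    return urljoin(base_uri, label)


def same_resource(uri_a, uri_b):
    """Assumption: lexical equivalence is modeled as exact string equality."""
    return uri_a == uri_b


base = "http://example.org/"
target = resolve_label(base, "result")
print(target)  # http://example.org/result
print(same_resource(target, "http://example.org/result"))  # True
print(same_resource(target, "http://example.org/Result"))  # False
```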

2.2 Information Sets and Pipeline Controller Behavior

The resources operated on by a pipeline controller are considered to consist of XML information sets.

Note:

In practice, some information set properties introduce dependencies between information items that make composition non-trivial:

  • XML Schema validity properties introduce dependencies on descendants.

  • The in-scope namespaces property can introduce dependencies (at least implicitly) on ancestors.

  • Markers, as have existed in earlier drafts of [XML Infoset], introduce dependencies among siblings.

These dependencies have to be addressed by the specification of each process that operates on information sets.

Processing begins with an initial set of zero or more information sets, the name of some target that is the desired result, and a pipeline document that describes how documents are related to each other. Every relationship has three parts: a set of input information sets, a process, and a set of result information sets. For example, the result R.xml might be related to a set of inputs, S.xml and T.xsl, by an XSLT transformation. Informally, we say that the process identified by a relationship can produce the results from its inputs.

Given the pipeline document and the URI of the desired target, the controller proceeds as follows:

  1. If the target is up to date, no processing is required and the target is returned. [Definition: Given two information sets A and B, A is up to date with respect to B if A exists and B is not known to be more recent than A. A target is up to date if and only if it is up to date with respect to all of the resources it depends on.]

  2. If the target is not up to date, the controller identifies the process from the pipeline document that can produce the target. It does this by examining the output elements of each process, searching for one that has a label whose value matches the target.

    It is an error if there is not exactly one such process in the pipeline document. If there is no such process, the target cannot be produced; if there is more than one, the processing is ambiguous.

  3. Assuming that exactly one process is found, the controller considers each of the information sets that are inputs to that process as intermediate targets. These intermediate targets are resolved in exactly the same manner as the principal target.

    It is an error for processes to depend directly or indirectly on their outputs. In other words, if Process 1 produces C from B and Process 2 produces B from A, it is an error for the pipeline document to contain a process that produces A from B or C (or any intermediate target produced from B or C).

  4. When all of the input documents are up to date, and no errors have occurred, the process is executed to produce the output result(s) and the target is returned.

  5. If an error occurs, processing terminates. Each process can selectively ignore errors and return appropriate error documents in place of its normal outputs.

Note that the order of the processes specified in the pipeline document is insignificant. Users need not figure out the right order for the whole pipeline; they need only declare the dependencies. Naturally, dependencies can be made linear, forcing a fixed order if that is what is desired.
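
The resolution steps described in this section can be sketched as a recursive function. This is a non-normative illustration, not a conformant controller: it assumes that every resource in the available set is up to date (so no timestamps are compared), it ignores error documents, and the names Process, PipelineError, and build are hypothetical.

```python
from dataclasses import dataclass


class PipelineError(Exception):
    """Raised when a target cannot be produced unambiguously."""


@dataclass
class Process:
    pid: str
    inputs: list   # labels the process depends on
    outputs: list  # labels the process produces


def build(target, processes, available, executed=None, active=()):
    """Return the ids of the processes executed, in order, to produce target."""
    if executed is None:
        executed = []
    if target in available:              # step 1: target assumed up to date
        return executed
    if target in active:                 # a process depends on its own output
        raise PipelineError(f"circular dependency on {target}")
    producers = [p for p in processes if target in p.outputs]
    if len(producers) != 1:              # step 2: exactly one producer required
        raise PipelineError(f"{len(producers)} processes produce {target}")
    proc = producers[0]
    for label in proc.inputs:            # step 3: inputs are intermediate targets
        build(label, processes, available, executed, active + (target,))
    executed.append(proc.pid)            # step 4: execute and record the outputs
    available.update(proc.outputs)
    return executed


# Process P1 produces C from B, and Process P2 produces B from A.
processes = [Process("P1", ["B"], ["C"]), Process("P2", ["A"], ["B"])]
print(build("C", processes, {"A"}))  # ['P2', 'P1']
```

The order in which the processes are listed does not affect the result; only the declared inputs and outputs do.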

2.3 Example

We consider a simple application of [XInclude], [XML Schema] validation, and [XSLT] in which the following order is required:

  1. Begin with a source document, myfile.xml.

  2. Expand any XInclude directives that it contains.

  3. Make sure that it is schema-valid with respect to someschema.xsd.

  4. If it is, transform it with mystyle.xsl, and return the result. Otherwise return an error document.

The following example shows what the corresponding pipeline document might look like.

Example: Simple Pipeline Document
<pipeline xmlns="http://www.w3.org/2002/02/xml-pipeline"
          xml:base="http://example.org/">

  <param name="target" select="'result'"/>

  <processdef name="xinclude.p" definition="org.example.xml.Xinclude"/>
  <processdef name="validate.p" definition="org.example.xml.XmlSchema"/>
  <processdef name="transform.p" definition="org.example.xml.XSLT"/>

  <process id="p3" type="transform.p">
    <input name="stylesheet" label="mystyle.xsl"/>
    <input name="document" label="valid"/>
    <output name="result" label="result"/>
    <param name="chunk">0</param>
  </process>

  <process id="p2" type="validate.p">
    <input name="document" label="xresult"/>
    <input name="schema" label="someschema.xsd"/>
    <output name="result" label="valid"/>
    <error name="invalid" label="#invalidDocument"/>
  </process>

  <process id="p1" type="xinclude.p">
    <input name="document" label="myfile.xml"/>
    <output name="result" label="xresult"/>
  </process>

  <document name="invalidDocument">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>Failure!</title>
      </head>
      <body>
        <h1>Your job failed because the document is invalid.</h1>
      </body>
    </html>
  </document>
</pipeline>

Assume that initially, the pipeline controller is handed this pipeline document along with myfile.xml, mystyle.xsl, and someschema.xsd (all with the base URI http://example.org/). For the purpose of this example, assume that no other documents exist when processing begins. The pipeline controller proceeds along the following lines:

  1. The target information set, http://example.org/result, will be inferred from the default value of the target parameter and the xml:base setting (unless some other target was specified through a command line or other option).

  2. The process "p3" has an output parameter with the label http://example.org/result, so it has to be run to produce the target.

  3. The "p3" process depends on mystyle.xsl, which the controller already has, and valid.

  4. The process "p2" has an output parameter with the label valid, so it has to be run to produce the intermediate target.

  5. The "p2" process in turn depends on xresult in addition to the schema document already available.

  6. The process "p1" has an output parameter with the label xresult, so it has to be run to produce the intermediate target.

  7. The only input to the "p1" process already exists, so the "xinclude.p" process is executed according to whatever definition was provided.

  8. Assuming that "p1" runs without error, it produces the xresult that we need to run "p2".

  9. With all of the inputs to "p2" available, it can be executed, producing the necessary valid.

  10. Finally, "p3" can be executed, producing the result. The controller succeeds.

  11. If at any point an error occurs, the controller returns either a specified error document or a built-in error document and fails.

In this example, there is only one order of processing that can satisfy all of the dependencies: "xinclude.p" then "validate.p" then "transform.p". In a more complex pipeline, multiple or even parallel orders are possible. For example, in Appendix B A Complex Example, all of the *.ex targets are independent.
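
As a non-normative illustration, the single valid order for this example can be derived mechanically by treating each process as a node that depends on the producers of its inputs and sorting the resulting graph topologically. The sketch below uses the graphlib module from the Python standard library (Python 3.9 or later); the dictionary layout is an assumption made for the illustration.

```python
from graphlib import TopologicalSorter

# process id -> (input labels, output labels), copied from the example pipeline
processes = {
    "p3": (["mystyle.xsl", "valid"], ["result"]),
    "p2": (["xresult", "someschema.xsd"], ["valid"]),
    "p1": (["myfile.xml"], ["xresult"]),
}

# Map every output label to the process that produces it.
producer = {label: pid for pid, (_, outs) in processes.items() for label in outs}

# A process depends on the producers of its inputs; inputs without a
# producer (myfile.xml, mystyle.xsl, someschema.xsd) are source documents.
graph = {
    pid: {producer[label] for label in ins if label in producer}
    for pid, (ins, _) in processes.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['p1', 'p2', 'p3']
```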

2.4 Definition of the Pipeline Language

The following sections define the rules for pipeline documents and the required semantics of elements in the Pipeline language.

2.4.1 XML Schema for the Pipeline Language

The following XML Schema describes the Pipeline language. The p: namespace prefix is used here and in examples in the following sections to denote the Pipeline namespace, http://www.w3.org/2002/02/xml-pipeline. A few co-occurrence constraints are described normatively in the text of the following sections.

Example: Pipeline XML Schema
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
           xmlns:p='http://www.w3.org/2002/02/xml-pipeline'
           targetNamespace='http://www.w3.org/2002/02/xml-pipeline'
           elementFormDefault='qualified'>

  <!-- $Id: Overview.html,v 1.3 2002/02/28 09:37:29 dom Exp $ -->

  <xs:complexType name='pipeline'>
    <xs:choice minOccurs='1' maxOccurs='unbounded'>
      <xs:element ref='p:processdef'/>
      <xs:element ref='p:param'/>
      <xs:element ref='p:process'/>
      <xs:element ref='p:document'/>
      <xs:any namespace='##other' processContents='skip'/>
    </xs:choice>
    <xs:attribute name='id' type='xs:ID'/>
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>

  <xs:complexType name='processdef'>
    <xs:choice minOccurs='0' maxOccurs='unbounded'>
      <xs:any namespace='##other' processContents='skip'/>
    </xs:choice>
    <xs:attribute name='id' type='xs:ID'/>
    <xs:attribute name="name" type="xs:ID" use="required"/>
    <xs:attribute name="definition" type="xs:string"/>
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>

  <xs:complexType name='process'>
    <xs:choice minOccurs='0' maxOccurs='unbounded'>
      <xs:element ref='p:input'/>
      <xs:element ref='p:output'/>
      <xs:element ref='p:error'/>
      <xs:element ref='p:param'/>
      <xs:any namespace='##other' processContents='skip'/>
    </xs:choice>
    <xs:attribute name='id' type='xs:ID'/>
    <xs:attribute name='type' type='xs:IDREF' use="required"/>
    <xs:attribute name='ignore-errors' type='xs:boolean'/>
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>

  <xs:complexType name='input' mixed="true">
    <xs:choice minOccurs='0' maxOccurs='unbounded'>
      <xs:any namespace='##other' processContents='skip'/>
    </xs:choice>
    <xs:attribute name='id' type='xs:ID'/>
    <xs:attribute name='name' type='xs:string'/>
    <xs:attribute name='label' type='xs:anyURI'/>
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>

  <xs:complexType name='output'>
    <xs:choice minOccurs='0' maxOccurs='unbounded'>
      <xs:any namespace='##other' processContents='skip'/>
    </xs:choice>
    <xs:attribute name='id' type='xs:ID'/>
    <xs:attribute name='name' type='xs:string'/>
    <xs:attribute name='label' type='xs:anyURI' use="required"/>
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>

  <xs:complexType name='error'>
    <xs:choice minOccurs='0' maxOccurs='unbounded'>
      <xs:any namespace='##other' processContents='skip'/>
    </xs:choice>
    <xs:attribute name='id' type='xs:ID'/>
    <xs:attribute name='name' type='xs:string'/>
    <xs:attribute name='label' type='xs:anyURI'/>
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>

  <xs:complexType name='document'>
    <xs:choice minOccurs='0' maxOccurs='unbounded'>
      <xs:any namespace='##other' processContents='skip'/>
    </xs:choice>
    <xs:attribute name='id' type='xs:ID'/>
    <xs:attribute name='label' type='xs:anyURI' use="required"/>
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>

  <xs:complexType name='param' mixed="true">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:any namespace='##other' processContents='skip'/>
    </xs:choice>
    <xs:attribute name='id' type='xs:ID'/>
    <xs:attribute name='name' type='xs:string' use="required"/>
    <xs:attribute name='select' type='xs:string'/>
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>

  <xs:element name="pipeline" type="p:pipeline"/>
  <xs:element name="processdef" type="p:processdef"/>
  <xs:element name="process" type="p:process"/>
  <xs:element name="input" type="p:input"/>
  <xs:element name="output" type="p:output"/>
  <xs:element name="error" type="p:error"/>
  <xs:element name="document" type="p:document"/>
  <xs:element name="param" type="p:param"/>
</xs:schema>

2.4.2 Pipeline Document Constructs

A pipeline document primarily contains elements from the pipeline namespace. The processing pipeline begins with the root element, pipeline.

In certain locations, the pipeline document may contain any element not from the pipeline namespace, provided that the expanded-name of the element has a non-null namespace URI. The presence of such foreign content must not change the behavior of pipeline elements and functions defined in this specification. Thus, a pipeline controller is always free to ignore such foreign content, and must ignore a foreign element or attribute (other than xml:base) without giving an error if it does not recognize the namespace URI.

It is an error for foreign elements to contain elements from the pipeline namespace.
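
The two constraints on foreign elements can be checked along the following lines. This is a non-normative sketch using the Python standard library; it inspects only the direct children of the element it is given, the helper names are hypothetical, and the namespace http://example.org/ext is an invented example.

```python
import xml.etree.ElementTree as ET

PIPELINE_NS = "http://www.w3.org/2002/02/xml-pipeline"


def namespace_of(element):
    """Return the namespace URI of an ElementTree element, or None."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else None


def check_foreign_content(parent):
    """Return the foreign children of parent, raising on the two error cases."""
    foreign = []
    for child in parent:
        ns = namespace_of(child)
        if ns == PIPELINE_NS:
            continue                      # ordinary pipeline element
        if ns is None:
            raise ValueError(f"foreign element {child.tag!r} has no namespace")
        for descendant in child.iter():
            if namespace_of(descendant) == PIPELINE_NS:
                raise ValueError("pipeline element inside foreign content")
        foreign.append(child)             # a controller is free to ignore these
    return foreign


doc = ET.fromstring(
    '<pipeline xmlns="http://www.w3.org/2002/02/xml-pipeline" '
    'xmlns:ext="http://example.org/ext">'
    "<ext:note>ignored by the controller</ext:note>"
    '<process type="t"/>'
    "</pipeline>"
)
print([namespace_of(e) for e in check_foreign_content(doc)])
# ['http://example.org/ext']
```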

2.4.2.1 The pipeline element

<p:pipeline
  id = xs:ID>
  <!-- Content: (p:processdef | p:param | p:process | p:document | foreign-content)+ -->
</p:pipeline>

The pipeline element must be the root of a pipeline document.

The pipeline element contains a mixture of one or more process definitions (processdef), top-level parameters (param), processes (process), other labeled documents (document), and foreign elements. It may contain foreign attributes.

2.4.2.2 The processdef element

<p:processdef
  id = xs:ID
  name = xs:ID
  definition = xs:string>
  <!-- Content: (foreign-content )* -->
</p:processdef>

The processdef element associates a name with an external process definition. Subsequent process elements must identify the process that they wish to perform by reference to a named process definition.

The value of the definition attribute is implementation-defined. Process names may have multiple definitions (there may be multiple processdef elements with the same name). If a controller understands more than one definition, it should use the first definition (in pipeline document order) that it understands.
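
The selection rule can be sketched as follows. The sketch is non-normative: it assumes that the processdef elements have already been read into a list of (name, definition) pairs in document order, and that the controller knows the set of definition strings it understands. The definition com.example.FastXSLT is invented for the illustration.

```python
def choose_definition(processdefs, name, understood):
    """Return the first definition of name that the controller understands."""
    for def_name, definition in processdefs:      # pipeline document order
        if def_name == name and definition in understood:
            return definition
    return None                                   # no usable definition


processdefs = [
    ("transform.p", "com.example.FastXSLT"),      # hypothetical definition
    ("transform.p", "org.example.xml.XSLT"),
    ("validate.p", "org.example.xml.XmlSchema"),
]
understood = {"org.example.xml.XSLT", "org.example.xml.XmlSchema"}
print(choose_definition(processdefs, "transform.p", understood))
# org.example.xml.XSLT
```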

The processdef element may contain foreign elements and attributes.

2.4.2.3 The process element

<p:process
  id = xs:ID
  type = xs:IDREF
  ignore-errors = xs:boolean>
  <!-- Content: ( p:input | p:output | p:error | p:param | foreign-content )* -->
</p:process>

Each process element describes one "step" in the pipeline. A process must have a type attribute which refers to the value of the name attribute on one of the processdef elements in this pipeline document. The process element may have an id attribute.

A process may contain inputs, outputs, params, and errors as well as foreign content.

The inputs to a process are the information sets on which it depends. The outputs of a process are the information sets that it produces. If an error occurs, a pipeline controller may produce error information sets instead of its normal outputs. Parameters, if there are any, will be passed to the process if it executes.

A process is executed if and only if at least one of its outputs is not up to date with respect to at least one of its inputs. If a process is executed, it is the responsibility of the pipeline controller to marshal the arguments appropriately for the process.
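
The execution rule, combined with the definition of "up to date", might be modeled as shown below. This is a non-normative sketch under stated assumptions: recency is represented by a dictionary from label to modification time, a missing key means that the resource does not exist, and a value of None means that its recency is unknown. How a real controller determines recency is not specified here.

```python
def up_to_date(a, b, mtimes):
    """A is up to date with respect to B if A exists and B is not known to be newer."""
    if a not in mtimes:                 # A does not exist
        return False
    time_a, time_b = mtimes[a], mtimes.get(b)
    if time_a is None or time_b is None:
        return True                     # B is not *known* to be more recent
    return not time_b > time_a


def must_execute(inputs, outputs, mtimes):
    """True if some output is not up to date with respect to some input."""
    return any(not up_to_date(out, inp, mtimes)
               for out in outputs for inp in inputs)


mtimes = {"myfile.xml": 200.0, "xresult": 100.0}
print(must_execute(["myfile.xml"], ["xresult"], mtimes))  # True
mtimes["xresult"] = 300.0
print(must_execute(["myfile.xml"], ["xresult"], mtimes))  # False
```

Read literally, the rule never fires for a process that declares no inputs; the sketch preserves that behavior rather than guessing at an alternative.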

The dependency evaluation process is recursive. If Process 2 depends on information set A that is produced by Process 1, the dependencies on A described by Process 1 must be evaluated, and information set A may be updated by Process 1, before the controller can determine if the outputs of Process 2 are up to date with respect to A.

If the process has unnamed inputs, they cannot be passed to the process, although they are treated as dependencies and can cause the process to execute.

It is an error for more than one process to produce the same information set.

The pipeline controller is responsible for storing and tracking output documents. Arguments are passed to processes by name, not position, so the order of the children of the process element is irrelevant.

When a process is executed, it either succeeds or fails. If it succeeds, it produces its named output information sets and returns an indication of success.

If it fails, processing either terminates or proceeds. A process may proceed only if both of the following conditions are met:

  1. The ignore-errors attribute on this process is set to true.

  2. For each output information set there is a corresponding error information set. A corresponding information set is one within the same process that has the same name.

If the process is to proceed, the error information sets are used to produce its named output information sets and an indication of success is returned.

If the process fails, no information sets are produced and an indication of failure is returned. The failure of any executed process causes the pipeline controller to abandon the task of building the target.
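
The success and failure rules can be summarized in a short non-normative sketch. It assumes that outputs and error documents are held in dictionaries keyed by name, and that information sets are represented by strings; the function names are hypothetical.

```python
def may_proceed(ignore_errors, output_names, error_names):
    """Both conditions: ignore-errors is true and every output has an error set."""
    return ignore_errors and all(name in error_names for name in output_names)


def finish(succeeded, outputs, ignore_errors, error_docs):
    """Return (ok, produced) for an executed process.

    outputs:    name -> information set produced on success
    error_docs: name -> substitute information set from an error element
    """
    if succeeded:
        return True, dict(outputs)
    if may_proceed(ignore_errors, outputs.keys(), error_docs.keys()):
        # Error information sets stand in for the named outputs.
        return True, {name: error_docs[name] for name in outputs}
    return False, {}                    # the controller abandons the target


print(finish(False, {"result": "<ok/>"}, True, {"result": "<failure/>"}))
# (True, {'result': '<failure/>'})
print(finish(False, {"result": "<ok/>"}, False, {"result": "<failure/>"}))
# (False, {})
```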

2.4.2.4 The input element

<p:input
  id = xs:ID
  name = xs:string
  label = xs:anyURI>
  <!-- Content: (foreign-content)* -->
</p:input>

An input element identifies an information set. An input may be named or anonymous (indicated by the absence of a value for the label attribute). Names must be unique within a process. The content of the input information set comes from one of three locations:

  1. If the input has no label, the content of the element itself is considered to be the information set.

  2. If the label attribute matches the label of some output element somewhere in the pipeline, that output document is the information set.

  3. Otherwise, the resource is retrieved from the URI in the label attribute.

It is an error for the label of an input element to match the label of an output of the same process. The label is a URI; if it is relative, it is interpreted with respect to the current base URI as defined by [XML Base].
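
The three possible sources of an input information set can be distinguished as in the following non-normative sketch. It assumes that labels have already been resolved to absolute URIs and that the controller holds the set of all output labels in the pipeline; the function name and the returned tags are hypothetical.

```python
def locate_input(label, inline_content, output_labels):
    """Decide where the content of an input information set comes from."""
    if label is None:
        return ("inline", inline_content)   # 1. content of the element itself
    if label in output_labels:
        return ("output", label)            # 2. produced by some process
    return ("retrieve", label)              # 3. fetched from the URI


outputs = {"http://example.org/xresult", "http://example.org/valid"}
print(locate_input(None, "<foo/>", outputs))
# ('inline', '<foo/>')
print(locate_input("http://example.org/valid", None, outputs))
# ('output', 'http://example.org/valid')
print(locate_input("http://example.org/mystyle.xsl", None, outputs))
# ('retrieve', 'http://example.org/mystyle.xsl')
```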

2.4.2.5 The output element

<p:output
  id = xs:ID
  name = xs:string
  label = xs:anyURI>
  <!-- Content: (foreign-content)* -->
</p:output>

An output element associates a label with an information set produced by a process. At most one result information set can be anonymous; all the others must be named. Names must be unique within a process.

When a process is executed, it is the responsibility of the pipeline controller to collect the result information sets produced by the process, associate them with the appropriate output statements, and store them for use in evaluating other processes.

The output must also have a label which must be unique within a pipeline. The label is a URI; if it is relative, it is interpreted with respect to the current base URI as defined by [XML Base].
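
A controller might verify the uniqueness constraint before any processing begins. The following non-normative sketch resolves each output label against a single base URI and reports those that occur more than once; it assumes one base URI for the whole pipeline document and compares resolved URIs as plain strings.

```python
from collections import Counter
from urllib.parse import urljoin


def duplicate_output_labels(base_uri, labels):
    """Return the resolved output labels that appear more than once."""
    resolved = [urljoin(base_uri, label) for label in labels]
    counts = Counter(resolved)
    return sorted(uri for uri, n in counts.items() if n > 1)


labels = ["result", "valid", "xresult", "http://example.org/result"]
print(duplicate_output_labels("http://example.org/", labels))
# ['http://example.org/result']
```

Comparing the resolved forms matters: the relative label "result" and the absolute label "http://example.org/result" look different but identify the same resource under this base URI.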

2.4.2.6 The error element

<p:error
  id = xs:ID
  name = xs:string
  label = xs:anyURI>
  <!-- Content: (foreign-content)* -->
</p:error>

If a process fails, the error element may be used to provide an alternate result for the normal outputs of a process.

The content of the error information set comes from one of two locations:

  1. If the error has no label, the content of the element itself is considered to be the information set.

  2. Otherwise, the resource is retrieved from the URI in the label attribute.

2.4.2.7 The param element

<p:param
  id = xs:ID
  name = xs:string
  select = xs:string>
  <!-- Content: (foreign-content)* -->
</p:param>

Additional parameters may be passed to a process with the param element.

Pipeline parameters are named. It is the responsibility of the pipeline controller to marshal them appropriately for the actual process used.

If the select attribute is specified, its content is the value of the parameter; otherwise, the content of the param element is the value.
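
The rule for determining a parameter's value is sketched below using the Python standard library. The sketch is non-normative: it returns the select attribute verbatim (no expression evaluation is attempted) and treats only the text content of the element as its value, which are simplifying assumptions.

```python
import xml.etree.ElementTree as ET


def param_value(param):
    """Return the select attribute if present, else the element's text content."""
    select = param.get("select")
    if select is not None:
        return select
    return param.text or ""


with_select = ET.fromstring("<param name=\"target\" select=\"'result'\"/>")
with_content = ET.fromstring('<param name="chunk">0</param>')
print(param_value(with_select))   # 'result'
print(param_value(with_content))  # 0
```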

2.4.2.8 The document element

<p:document
  id = xs:ID
  label = xs:anyURI>
  <!-- Content: (foreign-content )* -->
</p:document>

The document element associates labels with information sets that may be used as the input to processes (or errors) in the pipeline.

The document element is shorthand for a common process construct. Given the following document element:

<p:document label="someUri">
  <foo/>
</p:document>

and a process definition for the "identity" transformation, the document element has precisely the same effect as the following process definition:

<p:process type="identity">
  <p:input>
    <foo/>
  </p:input>
  <p:output label="someUri"/>
</p:process>
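
The equivalence above can be sketched as a rewrite that a controller might apply before evaluation. This is a non-normative illustration: the function name is hypothetical, and the sketch assumes that a process definition named "identity" exists and that only element children of the document element need to be carried over.

```python
import copy
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/2002/02/xml-pipeline"


def document_to_process(document_elem):
    """Rewrite a document element as the equivalent identity process."""
    label = document_elem.get("label")
    if label is None:
        raise ValueError("document element has no label")
    process = ET.Element(f"{{{NS}}}process", {"type": "identity"})
    # The content of the document element becomes the content of the input.
    input_elem = ET.SubElement(process, f"{{{NS}}}input")
    for child in document_elem:
        input_elem.append(copy.deepcopy(child))
    # The label of the document element becomes the label of the output.
    ET.SubElement(process, f"{{{NS}}}output", {"label": label})
    return process


doc = ET.fromstring(
    f'<p:document xmlns:p="{NS}" label="someUri"><foo/></p:document>'
)
proc = document_to_process(doc)
```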

A Use Cases

The need for a pipeline definition language is motivated by a selection of use cases.

A.1 Simple Business Transactions

Two parties agree to conduct business electronically. They will exchange purchase orders, invoices, and other business documents using some appropriate transport protocol. Before responding to a request, each party wishes to validate the request against a known schema so that errors do not result in mismanaged funds.

This use case depends on approximately the following processing order:

  1. Documents must be parsed and verified as well-formed with respect to [XML], [XML Namespaces], and [XML Base].

  2. Any XInclude elements must be expanded.

  3. Validation must be performed against a specific schema, ignoring any schema location information in the document.

A.2 Transforming to a New Schema

A company has a collection of documents in "XYZ Schema V1.0". A new release, "XYZ Schema V1.1", contains a small number of backwards-incompatible (but programmatically correctable) changes and a few new features. The company needs to update its existing documents to the new schema in order to exploit the new features.

This use case depends on approximately the following processing order:

  1. Documents must be parsed and verified as well-formed with respect to XML 1.0, XML Namespaces, and XML Base.

  2. Validation must be performed against a specific "XYZ Schema V1.0" schema, ignoring any schema location information in the document (in order to be sure that local extensions will not interfere with the automated conversion process).

  3. Documents must be transformed with a specific stylesheet, ignoring any style information in the document.

  4. The transformation must preserve all existing markup (it must be an identity transformation) except for the specific changes required to convert to V1.1. (In particular, it must preserve existing XInclude elements.)

  5. Schema location information in the document must be updated.

  6. The converted document must be validated against the V1.1 schema, making use of any local schema information that is present, in order to assure that the transformation did not introduce errors.

A.3 Document Publishing

A company has a collection of documents in XML. It wishes to publish them on a periodic basis.

This use case depends on approximately the following processing order:

  1. Documents must be parsed and verified as well-formed with respect to XML 1.0, XML Namespaces, and XML Base.

  2. The document must be schema validated using any local schema information that is present.

  3. The document must be transformed and published either with the stylesheet information in the document or with a specific set of stylesheets.

A.4 Business Transaction Hub

A company has built some sort of a hub for doing business transaction processing. It accepts documents in a variety of schemas, transforms them to some internal schema, operates on those documents, and returns results in the schema appropriate for the requestor.

This use case depends on approximately the following processing order:

  1. Documents must be parsed and verified as well-formed with respect to XML 1.0, XML Namespaces, and XML Base.

  2. XInclude elements must be expanded.

  3. The document must be schema validated using either the schema identified by the document or an out-of-band schema.

  4. The document must be transformed to the "hub schema". This new document may use XInclude elements to refer to standard boilerplate or other constant information.

  5. XInclude elements must be expanded again.

  6. Validate again, to make sure the transformation did not introduce errors.

  7. Perform whatever processing is required.

  8. Transform the result into an appropriate outbound schema.

  9. Perhaps expand XIncludes again.

A.5 Web Service Implementation

Consider a web service that is part of some larger service chain. It might need to operate only on portions of a document (because other portions are encrypted, for example, or simply because it only deals with a certain namespace). It might perform validation on only some elements, for example, or expand only certain XIncludes.

This use case depends on approximately the following processing order:

  1. Documents must be parsed and verified as well-formed with respect to XML 1.0, XML Namespaces, and XML Base.

  2. Selected portions of the document must be schema validated.

  3. XInclude processing must be performed selectively.

  4. The information set may be augmented or transformed as a result of the web services operation.

B A Complex Example

The following pipeline document describes the build process for the HTML version of this specification. This specification consists of three source XML files (pipeline.xml, example1.xml, and xml-pipeline.xsd), and a small set of XML files derived from the XSD document (*.ex). They are all stitched together in various ways by some XSL stylesheets and an XInclude processor.

The pipeline controller in this case is a command-line processor. Each of the processes is defined to be a standard command line that is executed to perform the process. All of the information sets in this case are XML files on disk, but that is not necessary (or even desirable).

Note that an implementation-specific XPath-like notation is used here in the definition attribute for the processdef elements. While this specification does not mandate this notational strategy, it might prove fruitful in building common APIs to achieve added interoperability.

Example: pipeline.xpipe Document
<pipeline xmlns="http://www.w3.org/2002/02/xml-pipeline"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

  <processdef name="xinclude"
   definition="xinclude {$document} {$result}"/>
  <processdef name="validate.dtd"
   definition="validate.dtd {$document} {$result}"/>
  <processdef name="validate.xsd"
   definition="validate.xsd {$document} {$schema} {$result}"/>
  <processdef name="transform"
   definition="xslt {$document} {$stylesheet} {$result} {$#param}"/>
  <processdef name="tidy"
   definition="tidy -iq -latin1 -n {$document} &gt; {$result}"/>
  <processdef name="clean"
   definition="rm {$opts} {$glob}"/>
  <processdef name="makediff"
   definition="cvsdiffmk -d {$doctype} -r {$DIFFV1} -r {$DIFFV2}
               -o {$result} {$document}"/>

  <param name="target" select="'pipeline.html'"/>

  <process type="tidy">
    <input name="document" label="/tmp/pipeline.html"/>
    <output name="result" label="pipeline.html"/>
  </process>

  <process type="xinclude">
    <input name="document" label="pipeline.xml"/>
    <output name="result" label="/tmp/pipeline.xml"/>
    <input label="pipeline.ex"/>
    <input label="processdef.ex"/>
    <input label="process.ex"/>
    <input label="input.ex"/>
    <input label="output.ex"/>
    <input label="error.ex"/>
    <input label="document.ex"/>
    <input label="param.ex"/>
    <input label="xml-pipeline.xsd"/>
    <input label="example1.xml"/>
    <input label="pipeline.xpipe"/>
  </process>

  <process type="transform">
    <input name="document" label="/tmp/pipeline.xml"/>
    <input name="stylesheet" label=".stylesheets/xmlspec.xsl"/>
    <output name="result" label="/tmp/pipeline.html"/>
  </process>

  <process type="transform">
    <input name="document" label="xml-pipeline.xsd"/>
    <input name="stylesheet" label="hackxsd.xsl"/>
    <output name="result" label="pipeline.ex"/>
    <param name="element" select="'pipeline'"/>
  </process>

  <process type="transform">
    <input name="document" label="xml-pipeline.xsd"/>
    <input name="stylesheet" label="hackxsd.xsl"/>
    <output name="result" label="processdef.ex"/>
    <param name="element" select="'processdef'"/>
  </process>

  <process type="transform">
    <input name="document" label="xml-pipeline.xsd"/>
    <input name="stylesheet" label="hackxsd.xsl"/>
    <output name="result" label="process.ex"/>
    <param name="element" select="'process'"/>
  </process>

  <process type="transform">
    <input name="document" label="xml-pipeline.xsd"/>
    <input name="stylesheet" label="hackxsd.xsl"/>
    <output name="result" label="input.ex"/>
    <param name="element" select="'input'"/>
  </process>

  <process type="transform">
    <input name="document" label="xml-pipeline.xsd"/>
    <input name="stylesheet" label="hackxsd.xsl"/>
    <output name="result" label="output.ex"/>
    <param name="element" select="'output'"/>
  </process>

  <process type="transform">
    <input name="document" label="xml-pipeline.xsd"/>
    <input name="stylesheet" label="hackxsd.xsl"/>
    <output name="result" label="error.ex"/>
    <param name="element" select="'error'"/>
  </process>

  <process type="transform">
    <input name="document" label="xml-pipeline.xsd"/>
    <input name="stylesheet" label="hackxsd.xsl"/>
    <output name="result" label="document.ex"/>
    <param name="element" select="'document'"/>
  </process>

  <process type="transform">
    <input name="document" label="xml-pipeline.xsd"/>
    <input name="stylesheet" label="hackxsd.xsl"/>
    <output name="result" label="param.ex"/>
    <param name="element" select="'param'"/>
  </process>

  <process type="transform">
    <input name="document" label="diff.xml"/>
    <input name="stylesheet" label=".stylesheets/notediff.xsl"/>
    <output name="result" label="/tmp/diff.html"/>
  </process>

  <process type="tidy">
    <input name="document" label="/tmp/diff.html"/>
    <output name="result" label="diff.html"/>
  </process>

  <process type="makediff">
    <input name="document" label="pipeline.xml"/>
    <output name="result" label="diff.xml"/>
    <param name="doctype" select="'xmlspec'"/>
    <param name="DIFFV1" select="{$diffv1}"/>
    <param name="DIFFV2" select="{$diffv2}"/>
  </process>

  <process type="clean">
    <output label="clean"/>
    <param name="glob" select="*.{aux,fo,log,out,html,pdf}"/>
    <param name="opts" select="-f"/>
  </process>
</pipeline>
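
The following non-normative sketch shows one way a controller might derive a processing order for a target from such a document. It models only a subset of the processes above as plain dictionaries, treats any input label that no process produces as an existing resource, and ignores questions such as whether a result is already up to date. The plan function and the dictionary layout are hypothetical.

```python
def plan(target, processes):
    """Return the processes needed to produce target, dependencies first."""
    # Map each output label to the process that produces it.
    producers = {}
    for proc in processes:
        for label in proc["outputs"]:
            producers[label] = proc

    ordered, done, visiting = [], set(), set()

    def visit(label):
        proc = producers.get(label)
        if proc is None:
            return  # no producer: assumed to be an existing resource
        key = id(proc)
        if key in done:
            return
        if key in visiting:
            raise ValueError(f"cyclic dependency involving {label}")
        visiting.add(key)
        for input_label in proc["inputs"]:
            visit(input_label)
        visiting.discard(key)
        done.add(key)
        ordered.append(proc)

    visit(target)
    return ordered


# A subset of the processes in the pipeline document above.
PROCESSES = [
    {"type": "tidy", "inputs": ["/tmp/pipeline.html"],
     "outputs": ["pipeline.html"]},
    {"type": "xinclude",
     "inputs": ["pipeline.xml", "pipeline.ex", "xml-pipeline.xsd"],
     "outputs": ["/tmp/pipeline.xml"]},
    {"type": "transform",
     "inputs": ["/tmp/pipeline.xml", ".stylesheets/xmlspec.xsl"],
     "outputs": ["/tmp/pipeline.html"]},
    {"type": "transform", "inputs": ["xml-pipeline.xsd", "hackxsd.xsl"],
     "outputs": ["pipeline.ex"]},
]

for step in plan("pipeline.html", PROCESSES):
    print(step["type"], "->", step["outputs"][0])
```

Although the tidy process appears first in the document, it is scheduled last, because its input is the output of a transform that in turn depends on the xinclude process.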

C References

Ant
The Jakarta Project. Ant. The Apache Foundation, 2001. (See http://jakarta.apache.org/ant/.)
CSS
Bert Bos, Håkon Wium Lie, Chris Lilley, et al., editors. Cascading Style Sheets, level 2. World Wide Web Consortium, 1998. (See http://www.w3.org/TR/REC-CSS2/.)
Linking/Style
Norman Walsh, editor. XML Linking and Style. World Wide Web Consortium, 2001. (See http://www.w3.org/TR/xml-link-style/.)
PLH
Philippe Le Hégaret. The XML Processing Model. World Wide Web Consortium, 2001. (See http://www.w3.org/2001/06/ProcessingModel-plh.html.)
RELAX NG
James Clark, editor. OASIS RELAX NG Technical Committee. OASIS. 2001. (See http://www.oasis-open.org/committees/relax-ng/.)
RFC 2119
S. Bradner, editor. Key words for use in RFCs to Indicate Requirement Levels. IETF (Internet Engineering Task Force), March 1997. (See http://www.ietf.org/rfc/rfc2119.txt.)
Schematron
Rick Jelliffe. The Schematron. Academia Sinica Computing Centre. 2001. (See http://xml.ascc.net/xml/resource/schematron/schematron.html.)
SwA
John Barton, Satish Thatte, Henrik Frystyk Nielsen, editors. SOAP Messages with Attachments. World Wide Web Consortium, 2000. (See http://www.w3.org/TR/SOAP-attachments.)
XInclude
Jonathan Marsh and David Orchard, editors. XML Inclusions (XInclude) Version 1.0. World Wide Web Consortium, 2001. (See http://www.w3.org/TR/xinclude/.)
XLink
Steve DeRose, Eve Maler, David Orchard, editors. XML Linking Language (XLink) Version 1.0. World Wide Web Consortium, 2001. (See http://www.w3.org/TR/xlink/.)
XML
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, et al., editors. Extensible Markup Language (XML) 1.0 Second Edition. World Wide Web Consortium, 1998. (See http://www.w3.org/TR/REC-xml.)
XML Base
Jonathan Marsh, editor. XML Base. World Wide Web Consortium, 2000. (See http://www.w3.org/TR/xmlbase/.)
XML Infoset
Richard Tobin and John Cowan, editors. XML Information Set. World Wide Web Consortium, 2001. (See http://www.w3.org/TR/xml-infoset/.)
XML Namespaces
Tim Bray, Dave Hollander, Andrew Layman, editors. Namespaces in XML. World Wide Web Consortium, 1999. (See http://www.w3.org/TR/REC-xml-names/.)
XML Schema
Henry S. Thompson, David Beech, Murray Maloney, et al., editors. XML Schema Part 1: Structures. World Wide Web Consortium, 2000. (See http://www.w3.org/TR/xmlschema-1/.)
XPointer
Steve DeRose, Eve Maler, Ron Daniel Jr., editors. XML Pointer Language (XPointer) Version 1.0. World Wide Web Consortium, 2001. (See http://www.w3.org/TR/xptr/.)
XQuery
Don Chamberlin, James Clark, Daniela Florescu, et al., editors. XQuery 1.0: An XML Query Language. World Wide Web Consortium, 2001. (See http://www.w3.org/TR/xquery/.)
XSLT
James Clark, editor. XSL Transformations (XSLT) Version 1.0. World Wide Web Consortium, 1999. (See http://www.w3.org/TR/xslt.)

D Open Issues

This appendix identifies some open issues.

  1. In 2.1 Resources and URIs in the Pipeline Language, URI equality is defined as lexical. Is this sufficiently specific?

  2. Can a pipeline document be its own target? Can it have XInclude instructions and validity constraints and other anticipated processing?

  3. In order to provide greater flexibility in the Pipeline document, does it make sense to consider allowing some attribute values (for example, the definition attribute of processdef and the select attribute of param) to be XPath expressions?

  4. As currently defined in 2.4.2.2 The processdef element, the processdef element has an implementation-specific attribute, which introduces interoperability problems. What is the right solution for this problem? Is it related to the possibility of introducing XPath expressions?

  5. The schema for Pipeline in 2.4.1 XML Schema for the Pipeline Language should be expanded to define key/keyref constructs for the uniqueness and reference constraints in its elements.

  6. The issue of controlling extractive processes (beyond the control already built into specifications such as XML Schema) has not yet been addressed here. The issue was explored to a certain extent at the XML Processing Model Workshop (see, for example, the paper presented by Philippe Le Hégaret [PLH]) and does need to be accounted for eventually.