Internet-Draft                                             T. Shaneyfelt
                                          Department of Computer Science
                                            University of Hawaii at Hilo
Expires: February 14, 2004


       HTMLX: Simple Well-Formed Format For Legacy HTML Documents
                     draft-shaneyfelt-htmlx-01.txt

Abstract

   Changing some legacy Hypertext Markup Language (HTML) documents to
   XHTML format would require certain tags to be dropped. Not changing
   them to some form of Extensible Markup Language (XML) would prevent
   their use by new tools. This memo documents a method for expressing
   the format and content of legacy HTML documents as XML so that their
   structure and content are accessible to XML parsers. Obsolete tags
   are explicitly allowed, and well-formedness is required.

1. Introduction

   With minimal modification, legacy documents can become accessible to
   XML parsers. The minimal requirement is well-formedness. HTML based
   on Standard Generalized Markup Language (SGML) is generally not
   well-formed XML, unlike newer XHTML documents; at the same time,
   XHTML obsoletes certain tags that were used in prior HTML standards.

   Rather than converting legacy documents directly into XHTML, the
   documents could be tidied into a well-formed representation. That
   representation could then either be accessed with XML parsers
   without losing the original tags, or later be transformed into
   XHTML, if desired, with tools that process XML (such as XSLT).

   Tools that edit legacy documents may offer the user an option to
   save the document in this intermediate format while it is being
   transformed into XHTML. This format is backwards compatible with
   most popular user agents.

2. Security Considerations

   None known.

3. Format

3.1 Well-formedness

   HTMLX documents shall be well-formed. Legacy documents will need to
   be minimally modified to meet the well-formedness requirement.
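
   As an illustration of the well-formedness requirement, the sketch
   below feeds a legacy fragment and a minimally tidied version of it
   to an XML parser. It is not part of this memo's specification: the
   helper name is_well_formed and the sample markup are hypothetical,
   and Python's standard xml.etree.ElementTree module merely stands in
   for "an XML parser".

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Legacy HTML: unquoted attribute value, unclosed <br> and <p>.
legacy = (
    "<html><body><center><font size=2><p>Hello<br>world"
    "</font></center></body></html>"
)

# The same content, minimally tidied. The obsolete <center> and <font>
# elements are kept; only the syntax changes.
tidied = (
    '<html><body><center><font size="2"><p>Hello<br/>world</p>'
    "</font></center></body></html>"
)

print(is_well_formed(legacy))   # False
print(is_well_formed(tidied))   # True

# The obsolete tags remain visible to the XML parser.
print([element.tag for element in ET.fromstring(tidied).iter()])
```

   The tidied version differs from the legacy one only in quoting,
   closing tags, and the empty-element form of br; no element is
   dropped or renamed.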

3.2 Empty elements

   For user agent compatibility, it is suggested that the following
   elements, even when empty, be kept as separate begin/end tags rather
   than being collapsed into a single tag:

      applet, iframe, object, script, textarea, title

3.3 Entities

   Entities other than gt, lt, amp, quot, and apos shall be converted
   to numeric entities unless they are defined by an entity
   declaration. For example, the nbsp entity may be converted to a
   numeric entity wherever it appears, or it may be defined in an
   entity declaration or DTD at the top of the page.

3.4 Declarations

   Namespaces and DTD tags are not required. Documents without any DTD
   are considered to be HTMLX1.

3.4.1 Namespace

   Editing software is not to add a namespace to a document without
   being directed to do so by the user. Indiscriminately inserting a
   namespace would imply conformance to related standards, and should
   not be done until the author is ready to take that step.

3.4.2 Document Type Definition

   Editing software is not to add a DTD to a document without being
   directed to do so by the user. Indiscriminately inserting a DTD
   would imply conformance to related standards, and should not be
   done until the author is ready to take that step.

3.5 Attributes

   All attribute values must be quoted to comply with XML.

4. MIME Type

   HTMLX documents should be sent as "text/html" and treated as HTML,
   according to the intent of the World Wide Web Consortium (W3C) HTML
   Working Group (WG) as expressed in
   http://lists.w3.org/Archives/Public/www-html/2000Sep/0024.html

5. File Name Extensions

   The file name may end with any of the following extensions:

      .html .htm .xml

   A browser will only attempt to format the first two types as HTML,
   whereas the third will typically be processed as an XML data file,
   as current practice and standards dictate.

6. Software

   It is expected that some software will require well-formedness and
   other software will not.
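
   A tool on the strict side might combine the entity rule of Section
   3.3 with a parse check, roughly as sketched below. This is only an
   illustration under stated assumptions: the function names
   to_numeric_entities and check_before_save are hypothetical, the
   document is assumed to carry no entity declarations of its own, and
   the regular expression does not skip comments, CDATA sections, or
   script content.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint
from typing import Optional, Tuple

# The five entities predefined by XML; Section 3.3 leaves these alone.
XML_PREDEFINED = {"gt", "lt", "amp", "quot", "apos"}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_entities(markup: str) -> str:
    """Rewrite named HTML entities such as &nbsp; as numeric ones.

    Assumes the document declares no entities of its own.
    """

    def replace(match: "re.Match") -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave the reference untouched
        return "&#{};".format(name2codepoint[name])

    return ENTITY_REF.sub(replace, markup)


def check_before_save(markup: str) -> Tuple[str, Optional[str]]:
    """Return (converted markup, warning text or None)."""
    converted = to_numeric_entities(markup)
    try:
        ET.fromstring(converted)
    except ET.ParseError as err:
        return converted, "not well-formed: {}".format(err)
    return converted, None


text, warning = check_before_save("<p>caf&eacute;&nbsp;&amp; bar</p>")
print(text)     # <p>caf&#233;&#160;&amp; bar</p>
print(warning)  # None

_, warning = check_before_save("<p>one<br>two</p>")
print(warning)  # a "not well-formed: ..." message from the parser
```

   The sketch reports a problem but still returns the converted text,
   so the decision to save, repair, or cancel stays with the caller.
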
   Software reading the document is not required to verify
   well-formedness, but software saving the document should attempt to
   produce a well-formed document. Where the software does not fully
   automate the conversion, a mechanism for alerting the user to
   ill-formedness upon saving a document is suggested.

Revisions

   00 - Initial version

   01 - Added Security Considerations and Revisions sections, and
        updated the list of empty elements that should be kept as
        separate begin/end tags

Author's Address

   Ted Shaneyfelt
   University of Hawaii at Hilo
   200 W. Lanikaula Street
   Hilo, Hawaii 96720-4091

   For additional contact information, see
   http://cs.uhh.hawaii.edu/cs/people/staff/#ted

   Copyright (C) The Internet Society 2004. This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
   INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."