Common Format and Media Type for Control-Character-Separated Values (CCSV) Files

Internet-Draft	CCSV	September 2024
Rankin	Expires 20 March 2025

Abstract

This document establishes the format used for Control-Character-Separated Values (CCSV) files and registers the associated MIME type "text/ccsv".

About This Document

This note is to be removed before publishing as an RFC.

The latest revision of this draft can be found at https://oldgrognard.github.io/ccsv-id/draft-rankin-ccsv.html. Status information for this document may be found at https://datatracker.ietf.org/doc/draft-rankin-ccsv/.

Source for this draft and an issue tracker can be found at https://github.com/oldgrognard/ccsv-id.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 20 March 2025.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.

1. Introduction

CCSV (Control-Character-Separated Values) is a file format for moving data between spreadsheets, statistical analysis programs, databases, and any other program that works with rectangular data. It is very similar to Comma-Separated Values (CSV) files [RFC4180], Tab-Separated Values (TSV) files, and their derivatives. Unlike those formats, CCSV minimizes ambiguity by using non-printable control characters as delimiters. The two delimiter characters may not appear in the text of any field, which makes it unnecessary to escape certain characters or to wrap certain strings in additional delimiters. This document defines the format of CCSV files and formally registers the "text/ccsv" media type in accordance with [RFC6838].
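To illustrate the point about escaping, the following Python sketch writes the same row once as CSV and once as a CCSV record. The sample row and variable names are illustrative assumptions and are not part of this specification.

```python
import csv
import io

US = "\x1f"  # Unit Separator, placed between fields
RS = "\x1e"  # Record Separator, placed between records

# Illustrative row: contains a comma, double quotes, and a line feed.
row = ["Smith, Jane", 'said "hi"', "line one\nline two"]

# CSV: the writer has to quote these fields and double the embedded quotes.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(row)
csv_text = buffer.getvalue()

# CCSV: the fields are joined unchanged, because none contains US or RS.
ccsv_record = US.join(row)

# Splitting on US recovers the original fields with no unescaping step.
assert ccsv_record.split(US) == row
```

The CSV text needs a quote-aware reader to recover the fields, while the CCSV record is recovered with a plain split.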

2. Conventions and Definitions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2.1. Definition of the CCSV format

In order for a file to be a CCSV, it MUST adhere to the following formatting rules:

2.1.1. Formatting Rules

The file MUST use UTF-8 encoding. Since US-ASCII is a subset of UTF-8, programs may create CCSV files using only US-ASCII characters. A consuming program that handles only US-ASCII may be unable to interpret all characters; a consumer SHOULD support UTF-8 if the source is unknown.
The file MUST NOT begin with a Byte Order Mark (U+FEFF).
A CCSV MUST begin with a header. The header consists of the names of the columns separated by US (U+001F) entities.
A Unit Separator US (U+001F) is used between adjacent fields in a record. Note that carriage returns and line feeds are not delimiters and are valid characters in the body of a field.
A Record Separator RS (U+001E) is used between adjacent records in the file, including between the header and the first record.
A Record Separator RS (U+001E) MAY appear as the last entity of the last record but is NOT RECOMMENDED.
The header and each record, if any, MUST contain the same number of US (U+001F) entities. Consequently, the header and every record will have the same number of fields.
Empty fields are represented by consecutive delimiters.
The US (U+001F) entity and the RS (U+001E) entity MUST NOT appear in the body of a field.
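The rules above can be sketched as a small writer. This is a minimal Python illustration, not a reference implementation; the function name dumps_ccsv is hypothetical, and the handling of a header-only file follows one reading of the ABNF rule for "file" in this section.

```python
US = "\x1f"  # Unit Separator, placed between fields
RS = "\x1e"  # Record Separator, placed between records


def dumps_ccsv(header, records):
    """Serialize a header and records to CCSV bytes (hypothetical helper)."""
    rows = [list(header)] + [list(record) for record in records]
    width = len(rows[0])
    for index, row in enumerate(rows):
        # The header and every record must have the same number of fields.
        if len(row) != width:
            raise ValueError(f"row {index} has {len(row)} fields, expected {width}")
        for field in row:
            # Neither delimiter may appear in the body of a field.
            if US in field or RS in field:
                raise ValueError(f"row {index} contains a delimiter inside a field")
    # RS goes between rows; no trailing RS is written after the last record.
    text = RS.join(US.join(row) for row in rows)
    if len(rows) == 1:
        # Assumption: with no records, keep the RS that the grammar
        # places after the header.
        text += RS
    # Plain "utf-8" encoding does not prepend a byte order mark.
    return text.encode("utf-8")
```

For example, dumps_ccsv(["id", "note"], [["1", "two\nlines"]]) keeps the line feed inside the second field unchanged, since only US and RS act as delimiters.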

The ABNF grammar [STD68] appears as follows:

file = header RS *(record RS) [record]
header = name *( US name )
record = field *( US field )
name = field
field = *VCHAR
VCHAR = %x00-1D / %x20-D7FF / %xE000-10FFFF ; all characters except the designated delimiters and surrogates
RS = %x1E ; record separator
US = %x1F ; unit separator
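A reader that follows this grammar can be sketched in a few lines of Python. The function name loads_ccsv is hypothetical, and the input is assumed to be text that has already been decoded from UTF-8. Dropping one empty trailing chunk is an interpretation of the optional final RS: in a single-column file, this sketch cannot tell a trailing RS apart from a final record whose only field is empty.

```python
US = "\x1f"  # Unit Separator, placed between fields
RS = "\x1e"  # Record Separator, placed between records


def loads_ccsv(text):
    """Parse decoded CCSV text into (header, records) (hypothetical helper)."""
    chunks = text.split(RS)
    # An optional trailing RS leaves one empty chunk at the end; drop it.
    if len(chunks) > 1 and chunks[-1] == "":
        chunks.pop()
    rows = [chunk.split(US) for chunk in chunks]
    header, records = rows[0], rows[1:]
    for number, record in enumerate(records, start=1):
        # Every record must have as many fields as the header.
        if len(record) != len(header):
            raise ValueError(
                f"record {number} has {len(record)} fields, expected {len(header)}"
            )
    return header, records
```

Because the delimiters cannot occur inside a field, two plain splits are enough and no quoting state machine is needed.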

3. Encoding Considerations

CCSV files MUST be encoded using UTF-8 [RFC3629].

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of the file or of network-transmitted text.
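One way an implementation might enforce these two requirements is sketched below in Python. The function names are hypothetical. Rejecting a leading byte order mark on input is a choice made for this sketch; the text above only forbids adding one.

```python
import codecs


def decode_ccsv_bytes(data):
    """Decode CCSV bytes as strict UTF-8, rejecting a leading BOM."""
    # codecs.BOM_UTF8 is the UTF-8 encoding of U+FEFF (bytes EF BB BF).
    if data.startswith(codecs.BOM_UTF8):
        raise ValueError("CCSV data must not begin with a byte order mark")
    # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
    return data.decode("utf-8", errors="strict")


def encode_ccsv_text(text):
    """Encode CCSV text as UTF-8 without a BOM."""
    # The plain "utf-8" codec never writes a BOM, unlike "utf-8-sig".
    return text.encode("utf-8")
```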

3.1. Why UTF-8?

3.1.1. Compatibility

UTF-8 is widely supported across different platforms, operating systems, and languages. This ensures that CCSV files can be opened and read correctly regardless of the environment they are used in.

3.1.2. Internationalization

UTF-8 supports a vast range of characters from various languages, including those that use non-Latin scripts. This is crucial for data that might include international names, addresses, or other text in multiple languages, ensuring that all characters are preserved and displayed correctly.

3.1.3. Efficiency

UTF-8 is a variable-width encoding scheme that uses 1 to 4 bytes for each character. It is efficient for encoding text that is primarily in English, as it uses only one byte for the most common characters, but can still accommodate characters from other languages when needed.
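The variable width can be checked directly. The short Python sketch below encodes four sample characters, chosen here only for illustration, along with the two CCSV delimiters.

```python
# Sample characters: Latin letter A, e with acute accent, euro sign, emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001f600"]

# Number of bytes each character occupies once encoded as UTF-8.
widths = [len(character.encode("utf-8")) for character in samples]
assert widths == [1, 2, 3, 4]

# The CCSV delimiters US (U+001F) and RS (U+001E) are single bytes in UTF-8.
assert "\x1f".encode("utf-8") == b"\x1f"
assert "\x1e".encode("utf-8") == b"\x1e"
```

Since both delimiters fall in the one-byte range, they add one byte per field or record boundary.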

6. IANA Considerations

Type name: text

Subtype name: ccsv

Required parameters: N/A

Optional parameters: N/A

Encoding considerations: See Section 3

Security considerations: See Section 4

Interoperability considerations: See Section 5

Published specification: TBD

Applications that use this media type:

    Databases, spreadsheets, statistical programs, and data conversion utilities

Fragment identifier considerations: N/A

Additional information:

    Deprecated alias names for this type: N/A
    Magic number(s): N/A
    File extension(s): CCSV
    Macintosh file type code(s): TEXT

Person & email address to contact for further information:

    Mike Rankin
    2108 Independence Dr
    Chambersburg, PA  17201
    USA

    mrankin@icf.com

Intended usage: COMMON

Restrictions on usage: N/A

Author/Change controller: Mike Rankin

Provisional registration?

7. Normative References

[RFC2119]: Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC3629]: Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November 2003, <https://www.rfc-editor.org/rfc/rfc3629>.
[RFC4180]: Shafranovich, Y., "Common Format and MIME Type for Comma-Separated Values (CSV) Files", RFC 4180, DOI 10.17487/RFC4180, October 2005, <https://www.rfc-editor.org/rfc/rfc4180>.
[RFC6838]: Freed, N., Klensin, J., and T. Hansen, "Media Type Specifications and Registration Procedures", BCP 13, RFC 6838, DOI 10.17487/RFC6838, January 2013, <https://www.rfc-editor.org/rfc/rfc6838>.
[RFC8174]: Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
[STD68]: Internet Standard 68, <https://www.rfc-editor.org/info/std68>. At the time of writing, this STD comprises the following:
    Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, DOI 10.17487/RFC5234, January 2008, <https://www.rfc-editor.org/info/rfc5234>.
