Unicode Character Database
Extended Character Properties
Revision: 3.2.0
Authors: Mark Davis
Date: 2002-03-22
This Version: http://www.unicode.org/Public/3.2-Update/PropList-3.2.0.html
Previous Version: http://www.unicode.org/Public/3.1-Update1/PropList-3.1.1.html
Latest Version: http://www.unicode.org/Public/UNIDATA/PropList.html
Summary
This document describes the format and content of the PropList.txt data file in the Unicode Character Database (UCD).
Status
This file and the files described herein are part of the Unicode Character Database and are governed by the UCD Terms of Use given below.
For general information on file formats and table formats, and the implications of normative vs informative properties, see UnicodeCharacterDatabase.html.
Warning: the information in this file does not completely describe the use and interpretation of Unicode character properties and behavior. It must be used in conjunction with the data in the other files in the Unicode Character Database, and relies on the notation and definitions supplied in The Unicode Standard. All chapter references are to Version 3.2.0 of the standard unless otherwise indicated.
Introduction
PropList.txt contains extended properties that supplement the General Category property described in UnicodeData.html. Unlike the derived properties, the properties in PropList.txt cannot be derived directly from UnicodeData.txt or other data files of the UCD. These properties are listed in the following table.
All properties in this file are binary.
Note: The properties of the form Other_XXX are used to generate properties in DerivedCoreProperties.txt. They are not intended for general use, such as in APIs that return property values.
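Because every property here is binary, each one can be modeled as a set of code points. The following is a minimal Python sketch of loading such sets. It assumes the semicolon-separated line layout commonly used by UCD property files (a code point or `XXXX..YYYY` range, a semicolon, the property name, and an optional `#` comment); that layout is not specified in this document, so it should be checked against UnicodeCharacterDatabase.html. The function name `parse_prop_list` and the sample lines are illustrative and not part of the UCD.

```python
from collections import defaultdict


def parse_prop_list(lines):
    """Return {property_name: set of code points} from PropList-style lines.

    Assumed line layout (not defined in this document):
        0009..000D    ; White_space # optional comment
        0020          ; White_space
    """
    props = defaultdict(set)
    for raw in lines:
        # Drop the trailing comment, then skip lines that are now empty.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        code_field, name = (field.strip() for field in line.split(";", 1))
        # The first field is a single code point or an inclusive range.
        first, _, last = code_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        props[name].update(range(start, end + 1))
    return dict(props)


# Illustrative sample in the assumed layout; real data comes from PropList.txt.
sample = [
    "# comment line",
    "0009..000D    ; White_space # <control>..<control>",
    "0020          ; White_space # SPACE",
    "0030..0039    ; ASCII_Hex_Digit",
    "",
]
props = parse_prop_list(sample)

# A binary property reduces to a membership test.
print(0x000A in props["White_space"])
```

Expanding ranges into sets keeps the sketch short; an implementation that cares about memory would more likely keep sorted ranges and use binary search.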
| Property Value | N/I | Definition and Usage |
| --- | --- | --- |
| White_space | N | Space characters and those format control characters (such as TAB, CR and LF) which should be treated by programming languages as "white space" for the purpose of parsing elements. Note: ZERO WIDTH SPACE and ZERO WIDTH NO-BREAK SPACE are not included, since their functions are restricted to line-break control. Their names are unfortunately misleading in this respect. Note: There are other senses of "whitespace" that encompass a different set of characters. |
| Bidi_Control | N | Those format control characters which have specific functions in the Bidirectional Algorithm. |
| Join_Control | N | Those format control characters which have specific functions for control of cursive joining and ligation. |
| ASCII_Hex_Digit | N | ASCII characters commonly used for the representation of hexadecimal numbers. |
| Dash | I | Those punctuation characters explicitly called out as dashes in the Unicode Standard, plus compatibility equivalents to those. Most of these have the Pd General Category, but some have the Sm General Category because of their use in mathematics. |
| Hyphen | I | Those dashes used to mark connections between pieces of words, plus the Katakana middle dot. The Katakana middle dot functions like a hyphen, but is shaped like a dot rather than a dash. |
| Quotation_Mark | I | Those punctuation characters that function as quotation marks. |
| Terminal_Punctuation | I | Those punctuation characters that generally mark the end of textual units. |
| Other_Math | I | Used in deriving the Math property. |
| Hex_Digit | I | Characters commonly used for the representation of hexadecimal numbers, plus their compatibility equivalents. |
| Other_Alphabetic | I | Used in deriving the Alphabetic property. |
| Ideographic | I | Characters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese) ideographs. |
| Diacritic | I | Characters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics. |
| Extender | I | Characters whose principal function is to extend the value or shape of a preceding alphabetic character. Typical of these are length and iteration marks. |
| Other_Lowercase | I | Used in deriving the Lowercase property. |
| Other_Uppercase | I | Used in deriving the Uppercase property. |
| Noncharacter_Code_Point | N | Code points that are explicitly defined as illegal for the encoding of characters. See Unicode 3.1 for more information. |
| Other_Grapheme_Extend | N | Used in deriving the Grapheme_Extend property. |
| Grapheme_Link | N | Used in determining default grapheme cluster boundaries. For more information, see UTR #29: Text Boundaries (in proposed draft status at publication of Unicode 3.2). |
| IDS_Binary_Operator, IDS_Trinary_Operator, Radical, Unified_Ideograph | N | For a machine-readable list of Ideographic Description Sequences. For more information, see Unicode 3.2. |
| Other_Default_Ignorable_Code_Point | N | Used in deriving the Default_Ignorable_Code_Point property. |
| Deprecated | N | For a machine-readable list of deprecated characters. No characters will ever be removed from the standard, but the usage of deprecated characters is strongly discouraged. For more information, see Unicode 3.2. |
| Soft_Dotted | N | Characters with a "soft dot", like i or j. An accent placed on these characters causes the dot to disappear. An explicit dot above can be added where required, such as in Lithuanian. For more information, see Unicode 3.0, Chapter 7, Diacritics on i and j. |
| Logical_Order_Exception | N | There are a small number of characters that do not use logical order. These characters require special handling in most processing. For more information, see Unicode 3.2. |
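The Other_XXX properties listed above feed derived properties rather than being used directly. As a rough sketch of that relationship, the Python below assumes the rule Lowercase = General Category Ll + Other_Lowercase. That rule is taken from DerivedCoreProperties.txt, not from this document, so treat it as an assumption. The `other_lowercase` set is a hypothetical stand-in for data loaded from PropList.txt, and the `unicodedata` module reflects whichever Unicode version ships with the Python interpreter, not necessarily 3.2.0.

```python
import unicodedata


def is_lowercase(cp, other_lowercase):
    """Derived Lowercase: General Category Ll, or listed in Other_Lowercase.

    Assumed rule: Lowercase = Ll + Other_Lowercase.
    """
    return unicodedata.category(chr(cp)) == "Ll" or cp in other_lowercase


# Hypothetical stand-in for the Other_Lowercase entries of PropList.txt.
other_lowercase = {0x02B0}  # MODIFIER LETTER SMALL H (General Category Lm)

for cp in (0x0061, 0x0041, 0x02B0):
    print(f"U+{cp:04X}", is_lowercase(cp, other_lowercase))
```

The same pattern would apply to the other derived properties named in the table, each combining General Category values with the matching Other_XXX set; the exact combinations should be read from DerivedCoreProperties.txt.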
UCD Terms of Use
Disclaimer
The Unicode Character Database is provided as is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. If this file has been purchased on magnetic or optical media from Unicode, Inc., the sole remedy for any claim will be exchange of defective media within 90 days of receipt.
This disclaimer is applicable for all other data files accompanying the Unicode Character Database, some of which have been compiled by the Unicode Consortium, and some of which have been supplied by other sources.
Limitations on Rights to Redistribute This Data
Recipient is granted the right to make copies in any form for internal distribution and to freely use the information supplied in the creation of products supporting the Unicode™ Standard. The files in the Unicode Character Database can be redistributed to third parties or other organizations (whether for profit or not) as long as this notice and the disclaimer notice are retained. Information can be extracted from these files and used in documentation or programs, as long as there is an accompanying notice indicating the source.