Unicode Character Database
Derived Character Properties
Revision: 3.2.0
Authors: Mark Davis
Date: 2002-03-22
This Version: http://www.unicode.org/Public/3.2-Update/DerivedProperties-3.2.0.html
Previous Version: http://www.unicode.org/Public/3.1-Update/DerivedProperties-3.1.0.html
Latest Version: http://www.unicode.org/Public/UNIDATA/DerivedProperties.html
Summary
This document describes the format and content of the main derived data files in the Unicode Character Database (UCD).
Status
This file and the files described herein are part of the Unicode Character Database and are governed by the UCD Terms of Use given below.
For general information on file formats and table formats, and the implications of normative vs informative properties, see UnicodeCharacterDatabase.html.
Warning: the information in this file does not completely describe the use and interpretation of Unicode character properties and behavior. It must be used in conjunction with the data in the other files in the Unicode Character Database, and relies on the notation and definitions supplied in The Unicode Standard. All chapter references are to Version 3.2.0 of the standard unless otherwise indicated.
Introduction
This document describes a number of data files in the Unicode Character Database. These are the derived data files: they contain information that can be completely derived from other data files, but that is presented in a different format for ease of use.
The files themselves are informative, although they may contain normative properties. For more information, see UnicodeCharacterDatabase.html.
Unless otherwise noted, all properties in this file are binary.
Derived Core Properties
The following are important derived properties of Unicode characters, and are contained in DerivedCoreProperties.txt.
| Property Name | N/I | Definition and Generation |
|---|---|---|
| Math | I | Characters with the Math property. For more information, see Chapter 4, Character Properties. Generated from: Sm + Other_Math |
| Alphabetic | I | Characters with the Alphabetic property. For more information, see Chapter 4, Character Properties. Generated from: Lu + Ll + Lt + Lm + Lo + Other_Alphabetic |
| Lowercase | I | Characters with the Lowercase property. For more information, see Chapter 4, Character Properties and UAX #21: Case Mappings. Generated from: Ll + Other_Lowercase |
| Uppercase | I | Characters with the Uppercase property. For more information, see Chapter 4, Character Properties and UAX #21: Case Mappings. Generated from: Lu + Other_Uppercase |
| ID_Start | I | Characters that can start an identifier. Generated from: Lu + Ll + Lt + Lm + Lo + Nl |
| ID_Continue | I | Characters that can continue an identifier. See Cf Note. Generated from: ID_Start + Mn + Mc + Nd + Pc |
| XID_Start | I | Same as ID_Start, except for modifications to allow closure under normalization forms NFKC and NFKD. Generated from: ID_Start; see Closure Note |
| XID_Continue | I | Same as ID_Continue, except for modifications to allow closure under normalization forms NFKC and NFKD. Generated from: ID_Continue; see Closure Note and Cf Note. |
| Default_Ignorable_Code_Point | N | For programmatic determination of default-ignorable code points. New characters that should be ignored in processing (unless explicitly supported) will be assigned in these ranges, permitting programs to correctly handle the default behavior of such characters when not otherwise supported. For more information, see UTR #29: Text Boundaries (in proposed draft status at release time for Unicode 3.2). Generated from: Other_Default_Ignorable_Code_Point + Cf + Cc + Cs - White_Space |
| Grapheme_Base | | For programmatic determination of grapheme cluster boundaries. For more information, see UTR #29: Text Boundaries (in proposed draft status at publication of Unicode 3.2). Generated from: [0..10FFFF] - Cc - Cf - Cs - Co - Cn - Zl - Zp - Grapheme_Extend - Grapheme_Link - CGJ, where CGJ = Combining Grapheme Joiner |
| Grapheme_Extend | | For programmatic determination of grapheme cluster boundaries. For more information, see UTR #29: Text Boundaries (in proposed draft status at publication of Unicode 3.2). Generated from: Me + Mn + Mc + Other_Grapheme_Extend - Grapheme_Link - CGJ |
Closure Note: XID_Start and XID_Continue are defined by adding or removing certain special characters as per UAX #15, Annex 7. They do not remove characters that are not in NFKD or NFKC form; if that is desired, it needs to be done with a separate filter. They merely ensure that:
if
isIdentifier(string)
then
isIdentifier(NFKC(string))
and
isIdentifier(NFKD(string))
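The implication above can be spot-checked in Python. This is a minimal sketch, not part of the UCD: it assumes that Python's built-in `str.isidentifier()` (which is based on XID_Start and XID_Continue) is an acceptable stand-in for `isIdentifier`, and that the `unicodedata` module, which ships a later Unicode version than 3.2, is acceptable for illustration. The helper name `closure_holds` and the sample strings are hypothetical.

```python
import unicodedata


def closure_holds(s: str) -> bool:
    """Return False only if s is an identifier whose NFKC or NFKD form is not."""
    if not s.isidentifier():
        # The "if" part of the implication is false, so nothing is required.
        return True
    return all(
        unicodedata.normalize(form, s).isidentifier()
        for form in ("NFKC", "NFKD")
    )


# Hypothetical samples: a precomposed letter, a compatibility ligature,
# a digraph letter, and a string that is not an identifier at all.
samples = ["caf\u00e9", "\ufb01le", "\u01c4emal", "1abc"]

for s in samples:
    print(repr(s), closure_holds(s))
```

A check like this only samples the property; it does not replace the closure computation used to generate the data file.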
Cf Note: Characters with general category Cf are not included in ID_Continue or in XID_Continue; they should be allowed to continue identifiers, but should be filtered out of the result.
For more information on identifiers, see Chapter 5, Implementation Guidelines, and UAX #15, Annex 7.
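The "Generated from" formulas for ID_Start and ID_Continue, together with the Cf Note, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the generation code used for the UCD: it relies on `unicodedata.category()`, whose data comes from a later Unicode version than 3.2, and the function names `is_id_start`, `is_id_continue`, and `strip_cf` are hypothetical.

```python
import unicodedata

# General categories taken from the "Generated from" formulas in this document.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch: str) -> bool:
    """ID_Start = Lu + Ll + Lt + Lm + Lo + Nl."""
    return unicodedata.category(ch) in ID_START_CATEGORIES


def is_id_continue(ch: str) -> bool:
    """ID_Continue = ID_Start + Mn + Mc + Nd + Pc."""
    return is_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


def strip_cf(text: str) -> str:
    """Filter general category Cf characters out of a result, per the Cf Note."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


print(is_id_start("A"), is_id_start("1"), is_id_continue("1"))
print(strip_cf("a\u200db"))  # U+200D ZERO WIDTH JOINER has category Cf
```

Keeping the Cf filter as a separate step mirrors the note: the characters are tolerated while scanning, then dropped from the stored identifier.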
Derived Extracted Properties
The following files contain other properties of the UCD that are simply separated out and listed in range format. These files are provided purely as a reformatting of existing data, with certain exceptions listed below. They are all contained in a subdirectory called extracted.

| ".txt" Files | N/I | Definition and Generation |
|---|---|---|
| DerivedBidiClass | N | From UnicodeData.txt, field 4 |
| DerivedBinaryProperties | N | From UnicodeData.txt, field 9. See Bidi Note. |
| DerivedCombiningClass | N | From UnicodeData.txt, field 3 |
| DerivedDecompositionType | * | From the <tag> in UnicodeData.txt, field 5. For characters with canonical decomposition mappings (no tag), the value "canonical" is used. * The value "canonical" is normative; the others are informative. |
| DerivedEastAsianWidth | I | From EastAsianWidth.txt, field 1 |
| DerivedGeneralCategory | N | From UnicodeData.txt, field 2 |
| DerivedJoiningGroup | N | From ArabicShaping.txt, field 2 |
| DerivedJoiningType | N | From ArabicShaping.txt, field 1 |
| DerivedLineBreak | * | From LineBreak.txt, field 1. * Some values are normative; some are informative. See UAX #14: Line Breaking Properties for more information. |
| DerivedNumericType | N | The property value is based on which of fields 6 through 8 of UnicodeData.txt are non-empty. decimal: fields 6, 7, and 8; digit: fields 7 and 8; numeric: field 8 |
| DerivedNumericValues | N | Non-binary Property. From UnicodeData.txt, field 8 |
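The DerivedNumericType rule can be expressed as a short function. This is a minimal sketch: the function name `numeric_type` is hypothetical, field numbers are zero-based as in this document, and the sample records are illustrative lines in the semicolon-separated layout of UnicodeData.txt.

```python
from typing import Optional


def numeric_type(record: str) -> Optional[str]:
    """Classify one UnicodeData.txt record by which of fields 6-8 are non-empty."""
    fields = record.split(";")
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if decimal and digit and numeric:
        return "decimal"
    if digit and numeric:
        return "digit"
    if numeric:
        return "numeric"
    return None  # the character has no numeric type


# Illustrative records only.
records = [
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;",
    "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;",
    "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
]

for record in records:
    print(record.split(";")[0], numeric_type(record))
```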
Bidi Note: The BidiMirrored property and the BidiMirroring property are different. The former is a normative property that indicates whether characters are mirrored in a right-to-left context in the Unicode Bidirectional Algorithm. The latter is an informative mapping of BidiMirrored characters, where possible, to characters that normally have the corresponding mirrored glyph.
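The range format used by the extracted files in this section can be read with a small parser. The sketch below assumes each data line has the shape `code point or range ; value # comment`, with ranges written as `first..last` in hexadecimal; the function name `parse_ranges` and the sample text are hypothetical and shown only to illustrate the layout.

```python
from typing import Iterator, Tuple


def parse_ranges(text: str) -> Iterator[Tuple[int, int, str]]:
    """Yield (first, last, value) for each data line of a range-format file."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # blank line or comment-only line
        fields = [field.strip() for field in line.split(";")]
        code_range, value = fields[0], fields[1]
        if ".." in code_range:
            first, last = code_range.split("..")
        else:
            first = last = code_range  # a single code point
        yield int(first, 16), int(last, 16), value


# Illustrative sample, not copied from a specific release of the data files.
SAMPLE = """\
# comment-only line
0041..005A    ; Alphabetic # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Alphabetic # FEMININE ORDINAL INDICATOR
"""

for first, last, value in parse_ranges(SAMPLE):
    print(f"{first:04X}..{last:04X} {value}")
```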
Derived Normalization Properties
The properties in DerivedNormalizationProperties.txt are useful in dealing with normalization forms. In the following table, NF* refers to one of NFD, NFC, NFKC, or NFKD.
| Property Name | N/I | Definition and Generation |
|---|---|---|
| FNC | N | Non-binary Property. Characters that require extra mappings for closure under Case Folding plus Normalization Form KC. Characters marked with this property have a third field with the mapping in it. Generated with the following: b = NFKC(Fold(a)); c = NFKC(Fold(b)); if (c != b) add mapping from a to c |
| Comp_Ex | N | Characters that are excluded from composition: those explicitly in CompositionExclusions.txt, plus: (3) Singleton Decompositions; (4) Non-Starter Decompositions |
| NFD_QuickCheck, NFKD_QuickCheck, NFC_QuickCheck, NFKC_QuickCheck | N | Non-binary Property. The values are listed in the table that follows. For more information, see UAX #15 Annex 8. |
| NF*_Expands | N | Characters that expand to more than one character in the specified normalization form. |

Values of the NF*_QuickCheck properties:

| Value | File Text | Description |
|---|---|---|
| No | NF*_No | Characters that cannot ever occur in the respective normalization form. See QuickCheck Note. |
| Maybe | NF*_Maybe | Characters that may occur in the respective normalization form, depending on the context. See QuickCheck Note. |
| Yes | n/a | All other characters. This is the default value, and is not explicitly listed in the file. |
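The FNC generation rule and the NF*_Expands definition can be sketched in Python. This is an illustration under assumptions, not the code used to build the data file: `str.casefold()` is used as a stand-in for Fold, the `unicodedata` module reflects a later Unicode version than 3.2, and the names `fold_nfkc`, `fnc_mapping`, and `expands` are hypothetical.

```python
import unicodedata
from typing import Optional


def fold_nfkc(s: str) -> str:
    """NFKC(Fold(s)), with str.casefold() standing in for Fold."""
    return unicodedata.normalize("NFKC", s.casefold())


def fnc_mapping(a: str) -> Optional[str]:
    """Follow the rule: b = NFKC(Fold(a)); c = NFKC(Fold(b)); map a to c if c != b."""
    b = fold_nfkc(a)
    c = fold_nfkc(b)
    return c if c != b else None


def expands(form: str, ch: str) -> bool:
    """True if ch becomes more than one character in the given normalization form."""
    return len(unicodedata.normalize(form, ch)) > 1


print(fnc_mapping("A"))        # no extra mapping is needed for a plain letter
print(expands("NFD", "\u00e9"), expands("NFC", "\u00e9"))
```

Applying the fold-and-normalize step twice is what detects the characters for which a single pass is not yet stable.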
UCD Terms of Use
Disclaimer
The Unicode Character Database is provided as is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. If this file has been purchased on magnetic or optical media from Unicode, Inc., the sole remedy for any claim will be exchange of defective media within 90 days of receipt.
This disclaimer is applicable for all other data files accompanying the Unicode Character Database, some of which have been compiled by the Unicode Consortium, and some of which have been supplied by other sources.
Limitations on Rights to Redistribute This Data
Recipient is granted the right to make copies in any form for internal distribution and to freely use the information supplied in the creation of products supporting the Unicode™ Standard. The files in the Unicode Character Database can be redistributed to third parties or other organizations (whether for profit or not) as long as this notice and the disclaimer notice are retained. Information can be extracted from these files and used in documentation or programs, as long as there is an accompanying notice indicating the source.