Unicode character categories
Shaping models are typically specified with respect to how
scripts are defined in the Unicode standard.
Every codepoint in the Unicode Character Database (UCD) is
assigned a Unicode General Category (UGC),
which provides the most fundamental information about the
codepoint: whether the codepoint represents a
Letter, a Mark, a
Number, Punctuation, a
Symbol, a Separator,
or something else (Other).
These seven values are the "Major" categories. Each codepoint is
further assigned to a "minor" category within its Major
category, such as "Letter, uppercase" (Lu) or
"Letter, modifier" (Lm).
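As a rough illustration of Major and minor categories, the sketch below looks up the General Category with Python's standard unicodedata module. The MAJOR_NAMES table and the describe helper are names made up for this example; the first letter of the two-letter code identifies the Major category.

```python
import unicodedata

# Illustrative lookup table (an assumption of this sketch): the first letter
# of the two-letter General Category code names the Major category.
MAJOR_NAMES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def describe(ch):
    """Return (two-letter category code, Major category name) for one character."""
    code = unicodedata.category(ch)
    return code, MAJOR_NAMES[code[0]]


if __name__ == "__main__":
    # LATIN CAPITAL LETTER A, MODIFIER LETTER SMALL H, DEVANAGARI LETTER KA,
    # DIGIT FIVE, SPACE
    for ch in ["A", "\u02B0", "\u0915", "5", " "]:
        code, major = describe(ch)
        print(f"U+{ord(ch):04X}  {code}  {major}")
```

The results depend on the version of the Unicode Character Database bundled with the Python interpreter, although the categories of long-established characters such as these are stable.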
Shaping models are concerned primarily with Letter and Mark
codepoints. The minor categories of Mark codepoints are
particularly important for shaping. Marks can be nonspacing
(Mn), spacing combining (Mc), or enclosing (Me).
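To make the three Mark categories concrete, the following sketch lists the Mark codepoints found in a string together with their minor category. The marks_in function is a hypothetical helper written for this example, not part of any shaping library.

```python
import unicodedata


def marks_in(text):
    """Return (codepoint label, category) for every Mark codepoint in text."""
    found = []
    for ch in text:
        code = unicodedata.category(ch)
        if code.startswith("M"):  # Mn, Mc, or Me
            found.append((f"U+{ord(ch):04X}", code))
    return found


if __name__ == "__main__":
    # DEVANAGARI LETTER KA, VOWEL SIGN I (spacing combining),
    # SIGN ANUSVARA (nonspacing), COMBINING ENCLOSING CIRCLE (enclosing)
    sample = "\u0915\u093F\u0902\u20DD"
    for label, code in marks_in(sample):
        print(label, code)
```

The base letter U+0915 is skipped because its category (Lo) is a Letter category, not a Mark category.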
In addition to the UGC property, codepoints in the Indic and
Southeast Asian scripts are also assigned
Unicode Indic Syllabic Category (UISC) and
Unicode Indic Positional Category (UIPC)
properties that provide more detailed information needed for
shaping.
The UISC property sub-categorizes Letters and Marks according to
common script-shaping behaviors. For example, UISC distinguishes
between consonant letters, vowel letters, and vowel marks. The
UIPC property sub-categorizes Mark codepoints by the relative visual
position that they occupy (above, below, right, left, or in
multiple positions).
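Python's standard library does not expose the UISC or UIPC properties, so a program that needs them typically reads the UCD data files directly. The sketch below parses text in the usual UCD property-file layout (a codepoint or range, a semicolon, a value, and an optional "#" comment). The SAMPLE string is hand-written for illustration and is not a verbatim excerpt of the UCD; parse_property_lines is a hypothetical helper name.

```python
# Hand-written sample in the UCD property-file layout. This is an assumption
# of the sketch, not a verbatim excerpt; consult the real UCD files for
# authoritative values.
SAMPLE = """\
# codepoint(s) ; property value
0915..0939    ; Consonant
093E..094C    ; Vowel_Dependent
094D          ; Virama
"""


def parse_property_lines(text):
    """Map each codepoint (int) to its property value (str)."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_field, value = (part.strip() for part in line.split(";", 1))
        first, _, last = code_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # a single codepoint has no ".."
        for cp in range(start, end + 1):
            table[cp] = value
    return table


if __name__ == "__main__":
    table = parse_property_lines(SAMPLE)
    for cp in (0x0915, 0x093F, 0x094D):
        print(f"U+{cp:04X}", table[cp])
```

Expanding every range into a dictionary keeps the lookup simple; a production implementation would more likely keep the ranges and use binary search.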
Some complex scripts require that the text run be split into
syllables. What constitutes a valid syllable in these
scripts is specified by regular expressions over the
Letter and Mark codepoints; these expressions take the UISC and UIPC
properties into account.
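The general technique can be sketched as follows: map each codepoint to a one-letter class, then run a regular expression over the resulting class string. Everything in this example is a deliberately simplified assumption. The class assignments cover only a few Devanagari ranges, and the toy grammar ignores nuktas, joiners, modifiers, and every other detail that a real shaping model specifies.

```python
import re

# Toy class assignments (assumptions of this sketch, loosely following UISC):
#   C = consonant, V = independent vowel, M = dependent vowel sign,
#   H = virama (halant), X = anything else
CLASS_OF = {}
for cp in range(0x0915, 0x093A):
    CLASS_OF[cp] = "C"
for cp in range(0x0905, 0x0915):
    CLASS_OF[cp] = "V"
for cp in range(0x093E, 0x094D):
    CLASS_OF[cp] = "M"
CLASS_OF[0x094D] = "H"

# Toy syllable grammar: zero or more consonant+virama pairs, a consonant,
# then an optional virama or vowel sign; or a lone independent vowel;
# or any single leftover codepoint.
SYLLABLE = re.compile(r"(?:CH)*C(?:H|M)?|V|.")


def split_syllables(text):
    """Split text into toy syllables using the class string and SYLLABLE."""
    classes = "".join(CLASS_OF.get(ord(ch), "X") for ch in text)
    # One class letter per codepoint, so match offsets index text directly.
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


if __name__ == "__main__":
    word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
    for syllable in split_syllables(word):
        print(" ".join(f"U+{ord(ch):04X}" for ch in syllable))
```

For the sample word, the class string is CHCCHCMC, which this grammar splits into three clusters: KA + VIRAMA + SSA, then TA + VIRAMA + RA + VOWEL SIGN I, then YA.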