In text shaping, a cluster is a sequence of
characters that needs to be treated as a single, indivisible
unit. A single letter or symbol can be a cluster of its
own. Other clusters correspond to longer subsequences of the
input code points — such as a ligature or conjunct form
— and require the shaper to ensure that the cluster is not
broken during the shaping process.
A cluster is distinct from a grapheme,
which is the smallest functionally distinct unit of a writing
system or script.
The definitions of the two terms are similar. However, clusters
are only relevant for script shaping and glyph layout. In
contrast, graphemes are a property of the underlying script, and
are of interest when client programs implement orthographic
or linguistic functionality.
For example, two individual letters are often two separate
graphemes. When two letters form a ligature, however, they
combine into a single glyph. They are then part of the same
cluster and are treated as a unit by the shaping engine —
even though the two original, underlying letters remain separate
graphemes.
HarfBuzz is concerned with clusters, not
with graphemes, although client programs using HarfBuzz
may still care about graphemes for other reasons.
During the shaping process, there are several shaping operations
that may merge adjacent characters (for example, when two code
points form a ligature or a conjunct form and are replaced by a
single glyph) or split one character into several (for example,
when decomposing a code point through the ccmp feature).
Operations like these alter
clusters; HarfBuzz tracks the changes to ensure that no clusters
get lost or broken during shaping.
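The bookkeeping behind merging and splitting can be sketched in a few lines of C. This is a minimal illustration, not HarfBuzz's implementation: the item_t type and the merge_clusters and split_item helpers are hypothetical names invented for this example. It assumes that each item starts out with a cluster value equal to the index of its source character, that a merge gives every item in the range the smallest cluster value found there, and that a split lets every resulting item inherit the cluster value of the original.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one shaped item; this is NOT a HarfBuzz type. */
typedef struct {
    uint32_t glyph;    /* glyph ID or code point, for illustration only */
    uint32_t cluster;  /* cluster value, e.g. index of the source character */
} item_t;

/* Merge items[start..end) into one cluster by assigning every item in
 * the range the smallest cluster value found in that range. */
static void merge_clusters(item_t *items, size_t start, size_t end)
{
    assert(items != NULL && start <= end);
    if (end - start < 2)
        return;  /* zero or one item: nothing to merge */

    uint32_t min = items[start].cluster;
    for (size_t i = start + 1; i < end; i++)
        if (items[i].cluster < min)
            min = items[i].cluster;

    for (size_t i = start; i < end; i++)
        items[i].cluster = min;
}

/* Split one item into n parts. Every part inherits the cluster value of
 * the source item, so the parts still map back to the same character. */
static void split_item(item_t src, const uint32_t *parts, size_t n,
                       item_t *out)
{
    assert(parts != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i].glyph = parts[i];
        out[i].cluster = src.cluster;
    }
}
```

For instance, merging the first two items of a three-item run with cluster values 0, 1, 2 leaves the values 0, 0, 2, and splitting the last item into two parts yields two items that both carry cluster value 2. The sketch models only the cluster values; a real ligature substitution would also replace the merged items with a single glyph.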
HarfBuzz records cluster information independently from how
shaping operations affect the individual glyphs returned in an
output buffer. Consequently, a client program using HarfBuzz can
utilize the cluster information to implement features such as:
- Correctly positioning the cursor within a shaped text run,
  even when characters have formed ligatures, composed or
  decomposed, reordered, or undergone other shaping operations.
- Correctly highlighting a text selection that includes some,
  but not all, of the characters in a word.
- Applying text attributes (such as color or underlining) to
  part, but not all, of a word.
- Generating output document formats (such as PDF) with
  embedded text that can be fully extracted.
- Determining the mapping between input characters and output
  glyphs, such as which glyphs are ligatures.
- Performing line-breaking, justification, and other
  line-level or paragraph-level operations that must be done
  after shaping is complete, but which require examining
  character-level properties.
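Several of the features listed above reduce to one question: which input characters does a given group of output glyphs cover? The following sketch shows one way a client might answer it. It is an illustration under stated assumptions rather than a HarfBuzz API: cluster_run_t and find_cluster_runs are hypothetical names, the cluster values are assumed to be character indices, and the glyphs are assumed to be in left-to-right order so that cluster values never decrease. In a real program the values would be read from the cluster field of each hb_glyph_info_t in the output buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: one run of glyphs sharing a cluster value,
 * together with the range of input characters that the run covers. */
typedef struct {
    size_t   glyph_start, glyph_end;  /* glyph indices, [start, end) */
    uint32_t char_start, char_end;    /* character indices, [start, end) */
} cluster_run_t;

/* Group glyphs by cluster value. clusters[] holds one value per output
 * glyph and must be non-decreasing (left-to-right output is assumed).
 * out[] must have room for n_glyphs entries. Returns the number of runs. */
static size_t find_cluster_runs(const uint32_t *clusters, size_t n_glyphs,
                                uint32_t text_len, cluster_run_t *out)
{
    size_t n_runs = 0;
    size_t i = 0;

    while (i < n_glyphs) {
        /* Extend the run while the cluster value stays the same. */
        size_t j = i + 1;
        while (j < n_glyphs && clusters[j] == clusters[i])
            j++;
        assert(j == n_glyphs || clusters[j] > clusters[i]);

        out[n_runs].glyph_start = i;
        out[n_runs].glyph_end = j;
        out[n_runs].char_start = clusters[i];
        /* A run ends where the next cluster begins, or at end of text. */
        out[n_runs].char_end = (j < n_glyphs) ? clusters[j] : text_len;
        n_runs++;
        i = j;
    }
    return n_runs;
}
```

With a three-character text shaped into two glyphs whose cluster values are 0 and 2, the first run covers characters 0 and 1 with a single glyph, which is the signature of a ligature. A cursor or selection boundary can safely be placed at the edges of a run, but not inside it without further information such as ligature caret positions.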