Chinese character description languages

Chinese character description languages are several proposed languages for describing Chinese (or CJK) characters in terms of information such as their components, their strokes (basic and complex), their order, and the position of each of them within the empty square that frames the character. They are designed to overcome the inherent lack of structural information in a bitmap description. This richer information can be used to distinguish variants of characters that are unified into one code point by Unicode and ISO/IEC 10646, and to provide an alternative representation for rare characters that do not yet have a standardized encoding in Unicode or ISO/IEC 10646. Many of these languages aim to work for both the Kaishu and Song styles, and to expose a character's internal structure, which makes look-up easier by allowing characters to be indexed by their internal make-up and cross-referenced with similar characters.

CDL

Character Description Language (CDL) is an XML-based declarative language co-created by Tom Bishop and Richard Cook for the Wenlin Institute. It defines characters by the arrangement of components, which are not required to reflect the semantic or etymological history of the character. Each component has to fit into the portion of the whole character's square that is allotted to it. A set of fewer than 50 strokes allows one to construct approximately 1,000 components, which may in turn describe tens of thousands of characters.[1]
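The idea of fitting a component into its allotted part of the square can be illustrated with a simple coordinate mapping. The sketch below is generic and is not taken from CDL itself: the 128-unit design square, the `(left, top, right, bottom)` box convention, and the name `fit_into_box` are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), hypothetical convention


def fit_into_box(points: List[Point], box: Box, size: float = 128.0) -> List[Point]:
    """Map points drawn on a full size-by-size square into the given sub-box.

    Each axis is scaled independently, so a component may be squeezed
    horizontally or vertically to fit its allotted region.
    """
    left, top, right, bottom = box
    sx = (right - left) / size
    sy = (bottom - top) / size
    return [(left + x * sx, top + y * sy) for x, y in points]


# A horizontal stroke across the middle of a hypothetical component square.
stroke = [(0.0, 64.0), (128.0, 64.0)]

# Place the component in the left half of the whole character's square.
left_half = fit_into_box(stroke, (0.0, 0.0, 64.0, 128.0))
print(left_half)  # [(0.0, 64.0), (64.0, 64.0)]
```

Because the same component definition can be mapped into any box, a small inventory of strokes and components can be reused across many characters.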

Ideographic Description Sequences

Chapter 12 of the Unicode specification[2] defines the "Ideographic Description Sequences" (IDS) syntax used to describe characters in featural terms, by arrangements of components with code points. Sixteen special characters in the range U+2FF0 to U+2FFF act as prefix operators to combine other characters or sequences to form larger characters.

Ideographic Description Characters in Unicode
Character    Unicode Character Number    Full Unicode Name
⿰    U+2FF0    Ideographic description character left to right
⿱    U+2FF1    Ideographic description character above to below
⿲    U+2FF2    Ideographic description character left to middle and right
⿳    U+2FF3    Ideographic description character above to middle and below
⿴    U+2FF4    Ideographic description character full surround
⿵    U+2FF5    Ideographic description character surround from above
⿶    U+2FF6    Ideographic description character surround from below
⿷    U+2FF7    Ideographic description character surround from left
⿼    U+2FFC    Ideographic description character surround from right
⿸    U+2FF8    Ideographic description character surround from upper left
⿹    U+2FF9    Ideographic description character surround from upper right
⿺    U+2FFA    Ideographic description character surround from lower left
⿽    U+2FFD    Ideographic description character surround from lower right
⿻    U+2FFB    Ideographic description character overlaid
⿾    U+2FFE    Ideographic description character horizontal reflection
⿿    U+2FFF    Ideographic description character rotation
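
Because the description characters are prefix operators, a sequence can be read with a short recursive parser. The following is a minimal sketch rather than a conforming implementation: the arities in `ARITY` (three operands for U+2FF2 and U+2FF3, one for U+2FFE and U+2FFF, two for the rest) reflect one reading of the Unicode grammar and should be checked against the standard, and the names `Node` and `parse_ids` are hypothetical. It covers only the sixteen characters in the table above and treats every other character as a leaf component.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Assumed operator arities for U+2FF0..U+2FFF (verify against the standard).
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x3000)}
ARITY["\u2FF2"] = 3  # left to middle and right
ARITY["\u2FF3"] = 3  # above to middle and below
ARITY["\u2FFE"] = 1  # horizontal reflection
ARITY["\u2FFF"] = 1  # rotation


@dataclass
class Node:
    op: str          # the ideographic description character
    children: list   # operands: single characters or nested Node objects


Tree = Union[str, Node]


def _parse(s: str, pos: int) -> Tuple[Tree, int]:
    """Parse one operand starting at index pos; return it and the next index."""
    if pos >= len(s):
        raise ValueError("sequence ended while an operator still needed operands")
    ch = s[pos]
    pos += 1
    if ch not in ARITY:
        return ch, pos  # a plain component character is a leaf
    children: List[Tree] = []
    for _ in range(ARITY[ch]):
        child, pos = _parse(s, pos)
        children.append(child)
    return Node(ch, children), pos


def parse_ids(s: str) -> Tree:
    """Parse a whole sequence and reject leftover characters."""
    tree, pos = _parse(s, 0)
    if pos != len(s):
        raise ValueError(f"unexpected trailing text at index {pos}")
    return tree


print(parse_ids("⿰書史"))
print(parse_ids("⿱木⿰木木"))
```

The second call shows nesting: the lower operand of ⿱ is itself a left-to-right sequence.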

Two related characters are located in other Unicode blocks. U+31EF IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION is encoded in the CJK Strokes block. U+303E IDEOGRAPHIC VARIATION INDICATOR, in the CJK Symbols and Punctuation block, is not officially an ideographic description character, but is sometimes used in ideographic description sequences.

Other related Ideographic Description Characters in Unicode
Character    Unicode Character Number    Block    Full Unicode Name
〾    U+303E    CJK Symbols and Punctuation    Ideographic variation indicator
㇯    U+31EF    CJK Strokes    Ideographic description character subtraction

These sequences are useful for describing to the reader a character that is not directly printable, either because it is absent from a given font or because it is absent from the Unicode standard altogether. For example, the Sawndip character encoded in CJK Unified Ideographs Extension F as U+2DA21 𭨡 can be described as ⿰書史. Another use is dictionary lookup, where a sequence serves as a rough input method for queries.
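
The dictionary-lookup use can be sketched as a search over a table of sequences. The table below is a tiny hand-written sample chosen for illustration, not an extract from any real database, and the function names are hypothetical. Components are expanded recursively, so a query for 木 also finds 森 through 林.

```python
from typing import Dict, List, Set

# Illustrative sample only: character -> ideographic description sequence.
IDS_TABLE: Dict[str, str] = {
    "好": "⿰女子",
    "明": "⿰日月",
    "林": "⿰木木",
    "森": "⿱木林",
    "休": "⿰亻木",
}

# The sixteen description characters U+2FF0..U+2FFF are operators, not components.
OPERATORS: Set[str] = {chr(cp) for cp in range(0x2FF0, 0x3000)}


def all_components(char: str, table: Dict[str, str]) -> Set[str]:
    """Return every component of char, expanding nested entries recursively."""
    found: Set[str] = set()
    for part in table.get(char, ""):
        if part in OPERATORS or part == char:
            continue
        found.add(part)
        found |= all_components(part, table)
    return found


def find_by_component(component: str, table: Dict[str, str]) -> List[str]:
    """List the characters in the table that contain the given component."""
    return sorted(c for c in table if component in all_components(c, table))


print(find_by_component("木", IDS_TABLE))  # ['休', '林', '森']
print(find_by_component("林", IDS_TABLE))  # ['森']
```

A real lookup tool would also match on the operators, so that a query could require a component in a particular position.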

These sequences can be rendered either by keeping the individual characters separate or by parsing the Ideographic Description Sequence and drawing the ideograph so described.[3] They do not, by themselves, provide unambiguous rendering for all characters. For instance, the sequence ⿱十一 represents both 土 ('EARTH'), in which the upper bar (the crossbar of 十) is the narrower of the two horizontal strokes, and 士 ('SCHOLAR'), in which it is the wider.
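
One practical consequence of this ambiguity is that a mapping from sequences back to characters has to be one-to-many. A minimal sketch, assuming a hypothetical list of (character, sequence) pairs:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_reverse_index(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group characters by their description sequence, keeping every match."""
    index: Dict[str, List[str]] = defaultdict(list)
    for char, ids in entries:
        index[ids].append(char)
    return dict(index)


# Illustrative sample; both 土 and 士 share the same sequence.
ENTRIES = [("土", "⿱十一"), ("士", "⿱十一"), ("明", "⿰日月")]

index = build_reverse_index(ENTRIES)
print(index["⿱十一"])  # ['土', '士']
print(index["⿰日月"])  # ['明']
```

Storing a single character per sequence would silently drop one of the two readings, so the list is kept and the choice is left to the user or to additional information.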

Unicode's specification for these sequences is based on the characters and syntax of the earlier GBK standard. Additional symbols were later encoded to fill in the missing combinations.

The IDSgrep free software package by Matthew Skala[4][5] extends Unicode's IDS syntax with additional features for dictionary lookup. It can convert KanjiVG's database to its own extended IDS (EIDS) format, and it can search EIDS files generated by the related Tsukurimashou font family.

Notes

  1. Bishop & Cook (2013-12-31), pp. 2, 9.
  2. "Ideographic Description Characters" (PDF). The Unicode Standard, Version 6.0. Mountain View, CA: The Unicode Consortium. February 2011. pp. 409–412. Archived (PDF) from the original on 2024-01-18.
  3. "The Unicode® Standard – Version 12.0 – Core Specification" (PDF). Unicode Consortium. March 2019. p. 26. Archived (PDF) from the original on 2023-06-02.
  4. "IDSgrep". Tsukurimashou Project. 2024-01-31. Archived from the original on 2024-02-07.
  5. Skala, Matthew (2015). "A Structural Query System for Han Characters" (PDF). International Journal of Asian Language Processing. 23 (2): 127–159. arXiv:1404.5585. Archived from the original (PDF) on 2016-03-04. Retrieved 2016-01-13.

External links

Wenlin CDL