Unicode Mapping
At the moment we can only retrieve glyphs from the font using their index, which is not very helpful. In this section, we're going to implement Unicode mapping, which will let us retrieve glyphs by their Unicode code point. We're only going to implement support for two-byte encodings. All relevant data is stored in the `cmap` table.
Table | Why it's needed |
---|---|
cmap | Maps unicode code points to glyphs. |
Let's add a `codePoint` field to each glyph. We're not going to use this code point for anything in particular, but it's useful for debugging.
```cpp
struct Glyph {
    vector< Edge > edges;
    Point min;
    Point max;
    bool isFlipped;
    u32 index;
    u16 codePoint; // NEW

    inline Glyph() {
        index = 0;
        codePoint = 0;
        isFlipped = false;
    }

    // Methods stay unchanged
};
```
Add an `unordered_map< u16, u32 >` to the font class; this dictionary maps code points to glyph indices. Let's also declare a new function to retrieve a glyph by its code point, `GetGlyphByCodePoint`.
```cpp
class Font {
protected:
    vector< Glyph > allGlyphs;
    unordered_map< u16, u32 > unicodeMap; // NEW
protected:
    Glyph ParseSimpleGlyph(u16 index, u8* glyf, const vector< u32 >& loca);
    Glyph ParseCompoundGlyph(u16 index, u8* glyf, const vector< u32 >& loca);
public:
    Font(const char* file, bool flip = true);
    u32 NumGlyphs();
    const Glyph& GetGlyphByIndex(u32 index);
    const Glyph& GetGlyphByCodePoint(u16 codePoint); // NEW
};
```
A ttf file can contain multiple character maps. Find the `cmap` table; it contains a `u16` for its version and a `u16` for the number of character encoding tables in the file. The character encoding tables follow the `cmap` table directly. Each one contains a `u16` for the platform id, a `u16` for a platform-specific identifier, and a `u32` that's an offset to the actual mapping table.
The platform id of each cmap subtable will be one of the following:
- 0: Unicode, this is the only one we will support
- 1: Mac specific
- 2: Reserved, don't use
- 3: Microsoft
Let's add some code at the end of the constructor. Find the `cmap` table and read the number of subtables, then loop through all `cmap` subtables, searching for the Unicode subtable.
```cpp
Font::Font(const char* file, bool flip) {
    // ... Existing constructor code remains unchanged

    u32 cmapTag = ('c' << 24) | ('m' << 16) | ('a' << 8) | 'p';
    u8* cmap = 0;
    tableTag = data + sizeofOffsetTable;
    for (u16 table = 0; table < numTables; ++table) {
        u32 thisTag = (tableTag[0] << 24) | (tableTag[1] << 16) |
                      (tableTag[2] << 8) | tableTag[3];
        u8* pOffset = tableTag + 8;
        u32 offset = (pOffset[0] << 24) | (pOffset[1] << 16) |
                     (pOffset[2] << 8) | pOffset[3];
        if (thisTag == cmapTag) {
            cmap = data + offset;
        }
        tableTag += tableDirectoryStride;
    }

    p = cmap + 2; // Skip version
    u16 numCMapSubTables = read_u16(&p);

    u8* unicodeTable = 0;
    for (u16 i = 0; i < numCMapSubTables; ++i) {
        u16 platformID = read_u16(&p);
        p += 2; // Skip platformSpecificID
        u32 offset = read_u32(&p);

        if (platformID == 0) { // Found UNICODE platform
            p = cmap + offset;
            u16 unicodeTableFormat = read_u16(&p);
            if (unicodeTableFormat == 4) {
                unicodeTable = cmap + offset;
                break;
            }
        }
    }
```
The Unicode platform is the only one we will cover in this blog. When the platform id of a `cmap` subtable is 0, the platform-specific id will be one of the following:
- 0: Version 1.0 semantics
- 1: Version 1.1 semantics
- 2: Deprecated
- 3: Unicode 2.0 or later, BMP only
- 4: Unicode 2.0 or later, this is the only one we will be implementing
- 5: Unicode variation sequence
- 6: Last resort
This blog only covers format 4, which is a two-byte encoding. It's the most common mapping table; most fonts will have one. The format 4 table structure is described here. If a Unicode table is found, its first two bytes are a `u16` containing the format of the table.
This format contains blocks of Unicode ranges. The table is made up of four parallel arrays describing each range, plus an array of glyph indices. The `startCode` and `endCode` arrays contain the first and last character code of each range. The other two arrays, `idDelta` and `idRangeOffset`, are used to map a code point to a glyph index.
```cpp
    if (unicodeTable != 0) {
        std::vector< u16 > endCode;
        std::vector< u16 > startCode;
        std::vector< u16 > idDelta;
        std::vector< u16 > idRangeOffset;
        u16 segCount = 0;

        p = unicodeTable;
        u16 unicodeTableFormat = read_u16(&p);
        if (unicodeTableFormat == 4) {
```
The next `u16` is the length of the whole subtable, including the header data. This is followed by a language code that we will not use. The next `u16` is double the number of Unicode segments that this table contains; divide it by two to get the number of segments.
```cpp
            u16 unicodeTableLength = read_u16(&p);
            u16 language = read_u16(&p);
            u16 segCountX2 = read_u16(&p);
            segCount = segCountX2 / 2;
            p += 2 * 3; // Skip searchRange, entrySelector, rangeShift

            endCode.resize(segCount);
            startCode.resize(segCount);
            idDelta.resize(segCount);
            idRangeOffset.resize(segCount);
```
The `endCode`, `startCode`, `idDelta`, and `idRangeOffset` arrays follow. They are `u16` arrays laid out immediately after each other, except for two bytes of padding between `endCode` and `startCode`, which should have a value of 0.
```cpp
            for (u16 i = 0; i < segCount; ++i) {
                endCode[i] = read_u16(&p);
            }
            p += 2; // Skip padding
            for (u16 i = 0; i < segCount; ++i) {
                startCode[i] = read_u16(&p);
            }
            for (u16 i = 0; i < segCount; ++i) {
                idDelta[i] = read_u16(&p);
            }
            for (u16 i = 0; i < segCount; ++i) {
                idRangeOffset[i] = read_u16(&p);
            }
```
The next two arrays are `idRangeOffset`, which has a size of `segCount`, and `glyphIndexArray`, which has a variable length. The length of `glyphIndexArray` isn't stored; it's just however many bytes are left (divided by two, since it's an array of `u16`). There is an indexing trick used to retrieve glyph indices that relies on `idRangeOffset` and `glyphIndexArray` being contiguous in memory. To accommodate this, we're going to read the data for both into the `idRangeOffset` array.
```cpp
            // Next is the glyphIndexArray, but idRangeOffset is used to "index" into this data.
            // Since the indexing scheme relies on this data being laid out contiguously in
            // memory, let's read the contents of glyphIndexArray into idRangeOffset.
            u32 bytesLeft = unicodeTableLength - (p - unicodeTable);
            idRangeOffset.resize(segCount + bytesLeft / 2);
            for (u32 i = 0; i < bytesLeft / 2; ++i) {
                idRangeOffset[segCount + i] = read_u16(&p);
            }
        }
```
Next, we're going to loop through every segment, then through each code point in those segments. For each code point, we will find the glyph index and record the pair in the `unicodeMap` dictionary. There are two ways a code point can map to a glyph index. If the value of `idRangeOffset` for the current segment is `0`, the `idDelta` array is used. If it's not `0`, the value in `idRangeOffset` is used.
If the `idRangeOffset` value for the segment is zero, add the character's code point to the `idDelta` value for the segment; the result is taken modulo 65536. If the `idRangeOffset` value for the segment is not zero, add the character's offset from `startCode` (`c - start`) to half the `idRangeOffset` value (`idRangeOffset[s] / 2`). This sum is an offset in `u16` units from `&idRangeOffset[s]`; because `idRangeOffset` and `glyphIndexArray` are contiguous in memory (which they are), the offset lands inside `glyphIndexArray`. The spec adds one more step for this path: if the value retrieved is not zero, `idDelta` is added to it as well, modulo 65536.
```cpp
        for (u16 s = 0; s < segCount; ++s) {
            u16 start = startCode[s];
            u16 end = endCode[s];

            for (u16 c = start; c <= end; ++c) {
                i32 index = 0;
                if (idRangeOffset[s] != 0) {
                    index = *(&idRangeOffset[s] + (idRangeOffset[s] / 2) + (c - start));
                    if (index != 0) { // Per the spec, idDelta applies here too
                        index = i32(u16(index + idDelta[s]));
                    }
                } else {
                    index = i32(idDelta[s] + c) % 65536;
                }

                if (index != 0) {
                    allGlyphs[index].codePoint = c;
                }
                unicodeMap[c] = index;

                // The last segment is always 0xFFFF-0xFFFF; breaking here keeps
                // the u16 counter from wrapping around past 0xFFFF.
                if (start == end) {
                    break;
                }
            }
        }
    } // End if (unicodeTable != 0)

    // ... rest of constructor unchanged
    free(data);
}
```
Implementing the `GetGlyphByCodePoint` function is trivial: given a code point, find its index in the `unicodeMap` dictionary and return the glyph at that index. If an invalid code point is provided, indexing the `unicodeMap` inserts and returns a default value of 0, which points to the missing glyph.
```cpp
const Glyph& Font::GetGlyphByCodePoint(u16 codePoint) {
    u32 index = unicodeMap[codePoint];
    return allGlyphs[index];
}
```
Consider overloading the `operator[]` operator for convenience.
```cpp
const Glyph& Font::operator[](u16 codePoint);
const Glyph& Font::operator[](u16 codePoint) const;
```