Unicode Mapping
At the moment we can only retrieve glyphs from the font using their index, which is not very helpful. In this section, we're going to implement Unicode mapping, which will let us retrieve glyphs by their Unicode code point. We're only going to implement support for two-byte encodings. All relevant data is stored in the `cmap` table.
Table | Why it's needed |
---|---|
cmap | Maps unicode code points to glyphs. |
Let's add a `codePoint` field to each glyph. We're not going to use this code point for anything in particular, but it's useful for debugging.
```cpp
struct Glyph {
    vector< Edge > edges;
    Point min;
    Point max;
    bool isFlipped;
    u32 index;
    u16 codePoint; // NEW

    inline Glyph() {
        index = 0;
        codePoint = 0;
        isFlipped = false;
    }

    // Methods stay unchanged
};
```
Add an `unordered_map< u16, u32 >` to the font class; this dictionary maps code points to glyph indices. Let's also declare a new function to retrieve a glyph by its code point, `GetGlyphByCodePoint`.
```cpp
class Font {
protected:
    vector< Glyph > allGlyphs;
    unordered_map< u16, u32 > unicodeMap; // NEW
protected:
    Glyph ParseSimpleGlyph(u16 index, u8* glyf, const vector< u32 >& loca);
    Glyph ParseCompoundGlyph(u16 index, u8* glyf, const vector< u32 >& loca);
public:
    Font(const char* file, bool flip = true);
    u32 NumGlyphs();
    const Glyph& GetGlyphByIndex(u32 index);
    const Glyph& GetGlyphByCodePoint(u16 codePoint); // NEW
};
```
A ttf file can contain multiple character maps. Find the `cmap` table; it contains a `u16` for its version and a `u16` for the number of character encoding tables in the file. The character encoding tables follow the `cmap` table directly. Each one contains a `u16` for the platform id, a `u16` for a platform-specific identifier, and a `u32` that's an offset to the actual mapping table.
The platform id of each cmap subtable will be one of the following:
- 0: Unicode, this is the only one we will support
- 1: Mac specific
- 2: Reserved, don't use
- 3: Microsoft
Let's add some code at the end of the constructor. Find the `cmap` table and read the number of subtables, then loop through all `cmap` subtables, searching for the Unicode subtable.
```cpp
Font::Font(const char* file, bool flip) {
    // ... Existing constructor code remains unchanged

    u32 cmapTag = ('c' << 24) | ('m' << 16) | ('a' << 8) | 'p';
    u8* cmap = 0;
    tableTag = data + sizeofOffsetTable;
    for (u16 table = 0; table < numTables; ++table) {
        u32 thisTag = (tableTag[0] << 24) | (tableTag[1] << 16) |
                      (tableTag[2] << 8) | tableTag[3];
        u8* pOffset = tableTag + 8;
        u32 offset = (pOffset[0] << 24) | (pOffset[1] << 16) |
                     (pOffset[2] << 8) | pOffset[3];
        if (thisTag == cmapTag) {
            cmap = data + offset;
        }
        tableTag += tableDirectoryStride;
    }

    p = cmap + 2; // Skip version
    u16 numCMapSubTables = read_u16(&p);

    u8* unicodeTable = 0;
    for (u16 i = 0; i < numCMapSubTables; ++i) {
        u16 platformID = read_u16(&p);
        p += 2; // Skip platformSpecificID
        u32 offset = read_u32(&p);

        if (platformID == 0) { // Found UNICODE platform
            p = cmap + offset;
            u16 unicodeTableFormat = read_u16(&p);
            if (unicodeTableFormat == 4) {
                unicodeTable = cmap + offset;
                break;
            }
        }
    }
```
The Unicode platform is the only one we will cover in this blog. When the platform id of a `cmap` subtable is 0, the platform-specific id will be one of the following:
- 0: Version 1.0 semantics
- 1: Version 1.1 semantics
- 2: Deprecated
- 3: Unicode 2.0 or later, BMP only
- 4: Unicode 2.0 or later, this is the only one we will be implementing
- 5: Unicode variation sequence
- 6: Last resort
This blog only covers format 4, which is a two-byte encoding. It's the most common mapping table; most fonts will have one. The format 4 table structure is described here. If a Unicode table is found, its first two bytes are a `u16` containing the format of the table.
This format contains blocks of Unicode ranges. The table is made up of four parallel arrays describing each range, plus an array of glyph indices. The `startCode` and `endCode` arrays contain the first and last character code of each range. The other two arrays, `idDelta` and `idRangeOffset`, are used to map a code point to a glyph index.
```cpp
    if (unicodeTable != 0) {
        std::vector< u16 > endCode;
        std::vector< u16 > startCode;
        std::vector< u16 > idDelta;
        std::vector< u16 > idRangeOffset;
        u16 segCount = 0;

        p = unicodeTable;
        u16 unicodeTableFormat = read_u16(&p);
        if (unicodeTableFormat == 4) {
```
The next `u16` is the length of the whole subtable, including the header data. This is followed by a language code that we will not use. The next `u16` is double the number of Unicode segments that this table contains; divide it by two to get the number of segments.
```cpp
            u16 unicodeTableLength = read_u16(&p);
            u16 language = read_u16(&p);
            u16 segCountX2 = read_u16(&p);
            segCount = segCountX2 / 2;
            p += 2 * 3; // Skip searchRange, entrySelector, rangeShift

            endCode.resize(segCount);
            startCode.resize(segCount);
            idDelta.resize(segCount);
            idRangeOffset.resize(segCount);
```
The `endCode`, `startCode`, `idDelta`, and `idRangeOffset` arrays follow. They are `u16` arrays laid out immediately after each other, except for two bytes of padding between `endCode` and `startCode`, which should have a value of 0.
```cpp
            for (u16 i = 0; i < segCount; ++i) {
                endCode[i] = read_u16(&p);
            }
            p += 2; // Skip padding
            for (u16 i = 0; i < segCount; ++i) {
                startCode[i] = read_u16(&p);
            }
            for (u16 i = 0; i < segCount; ++i) {
                idDelta[i] = read_u16(&p);
            }
            for (u16 i = 0; i < segCount; ++i) {
                idRangeOffset[i] = read_u16(&p);
            }
```
The next two arrays are `idRangeOffset`, which has a size of `segCount`, and `glyphIndexArray`, which has a variable length. The length of `glyphIndexArray` isn't stored; it's just however many bytes are left (divided by two, since it's an array of `u16`). There is an indexing trick used to retrieve glyph indices that relies on `idRangeOffset` and `glyphIndexArray` being contiguous in memory. To accommodate this, we're going to read the data for both into the `idRangeOffset` array.
```cpp
            // Next is the glyphIndexArray, but idRangeOffset is used to "index" into this data.
            // Since the indexing scheme relies on this data being laid out contiguously in
            // memory, let's read the contents of glyphIndexArray into idRangeOffset.
            u32 bytesLeft = unicodeTableLength - (p - unicodeTable);
            idRangeOffset.resize(segCount + bytesLeft / 2);
            for (u32 i = 0; i < bytesLeft / 2; ++i) {
                idRangeOffset[segCount + i] = read_u16(&p);
            }
        }
```
Next, we're going to loop through every segment, then through each code point in those segments. For each code point, we will find the glyph index and record the pair in the `unicodeMap` dictionary. There are two ways a code point can map to a glyph index. If the value of `idRangeOffset` for the current segment is `0`, the `idDelta` array is used. If it's not `0`, the value in `idRangeOffset` is used.
If the `idRangeOffset` value for the segment is zero, add the character's code point to the `idDelta` value for the segment; the result is taken modulo 65536. If the `idRangeOffset` value for the segment is not zero, add the character's offset from `startCode` (`c - start`) to half the `idRangeOffset` value (`idRangeOffset[s] / 2`). This sum is an offset in `u16` units from `&idRangeOffset[s]`; because `idRangeOffset` and `glyphIndexArray` are contiguous in memory (which they are), the offset lands inside `glyphIndexArray`. The spec adds one more step for this path: if the value retrieved is not zero, `idDelta` is added to it as well, modulo 65536.
```cpp
        for (u16 s = 0; s < segCount; ++s) {
            u16 start = startCode[s];
            u16 end = endCode[s];

            for (u16 c = start; c <= end; ++c) {
                i32 index = 0;
                if (idRangeOffset[s] != 0) {
                    index = *(&idRangeOffset[s] + (idRangeOffset[s] / 2) + (c - start));
                    if (index != 0) { // Per the spec, idDelta applies here too
                        index = i32(u16(index + idDelta[s]));
                    }
                } else {
                    index = i32(idDelta[s] + c) % 65536;
                }

                if (index != 0) {
                    allGlyphs[index].codePoint = c;
                }
                unicodeMap[c] = index;

                // The last segment is always 0xFFFF-0xFFFF; breaking here keeps
                // the u16 counter from wrapping around past 0xFFFF.
                if (start == end) {
                    break;
                }
            }
        }
    } // End if (unicodeTable != 0)

    // ... rest of constructor unchanged
    free(data);
}
```
Implementing the `GetGlyphByCodePoint` function is trivial: given a code point, find its index in the `unicodeMap` dictionary and return the glyph at that index. If an invalid code point is provided, indexing the `unicodeMap` inserts and returns a default value of 0, which points to the missing glyph.
```cpp
const Glyph& Font::GetGlyphByCodePoint(u16 codePoint) {
    u32 index = unicodeMap[codePoint];
    return allGlyphs[index];
}
```
Consider overloading the `operator[]` operator for convenience.
```cpp
const Glyph& Font::operator[](u16 codePoint);
const Glyph& Font::operator[](u16 codePoint) const;
```