Character annotations

Training character-based classifiers requires a lot of training data. Currently, annotations are available at the word level, but not at the character level. Each team will be given an equal number of pages, each as an image and as an XML file in the format specified on the recognizer page, and will have to label the individual characters.

Character coordinate specification

The character annotations need to be in a specific format to make sure all teams can use the annotations. The format is an extension of the word coordinate specification. The general format is:

<?xml version="1.0" encoding = "UTF-8"?>
<Image name="IMAGE_NAME">
    <TextLine no="LINEID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45">
        <Word no="WORDID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45" text="TEXT">
            <Character no="CHARID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45" text="TEXT" />
            ...
        </Word>
        ...
    </TextLine>
    ...
</Image>

The attributes of the Image, TextLine and Word tags are already filled in. You need to add the Character tags and fill in their attributes. Character annotations may overlap if necessary. If you use the supplied Python scripts for reading and writing the XML files, any top, bottom and shear attributes that you leave empty will be filled in with the values inherited from the word. Usually you should not touch the shear attribute, and you can usually leave the bottom and top attributes empty as well.

The coordinates are the absolute coordinates in the page, not relative to the word or text line.
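The inheritance rule described above can be sketched as follows. This is not the supplied course script, only a minimal illustration using Python's standard xml.etree.ElementTree module; the function name fill_inherited and the sample page values are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Attributes a Character may leave empty; they are then taken from its Word.
INHERITED = ("top", "bottom", "shear")


def fill_inherited(root):
    """Copy top, bottom and shear from each Word to Characters that omit them.

    Hypothetical helper: the supplied course scripts may behave differently.
    """
    for word in root.iter("Word"):
        for char in word.findall("Character"):
            for attr in INHERITED:
                if not char.get(attr):  # attribute missing or empty string
                    char.set(attr, word.get(attr, ""))
    return root


# Made-up sample in the format specified above (absolute page coordinates).
SAMPLE = """<Image name="example.jpg">
    <TextLine no="1" top="100" bottom="160" left="50" right="900" shear="45">
        <Word no="1" top="100" bottom="160" left="50" right="130" shear="45" text="in">
            <Character no="1" top="" bottom="" left="50" right="75" shear="" text="i" />
            <Character no="2" top="105" bottom="" left="75" right="130" shear="" text="n" />
        </Word>
    </TextLine>
</Image>"""

root = fill_inherited(ET.fromstring(SAMPLE))
for char in root.iter("Character"):
    print(char.attrib)
```

Note that the left and right attributes are never inherited in this sketch: they are what distinguishes one character from the next, so they always have to be given explicitly.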

Approaches

To get your preliminary character segmentation hypothesis, you can use one of the following approaches:

1. EM approach, starting from a linear regression model similar to the one used in “Semi-automatic determination of allograph duration and position in on-line handwritten words based on the expected number of strokes”, but using the pixel width of each character instead of the number of strokes.
2. A brute-force approach, where character mining is done by using a sliding window to find similar characters, working from the most frequent to the least frequent characters (i.e., pick the most common character in the dataset, make a model of that character, find similar characters, and bootstrap your labelling system like that).
3. Knowledge-based approach: define heuristics for segmentation points using local kernels, finding minima, maxima, crossings, etc.
4. Similar to approach 2, but page-wise: label the characters on a single page and then find characters similar to those found on that first page.
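As an illustration of what a very first segmentation hypothesis could look like, the sketch below divides the horizontal extent of a word over its characters in proportion to an expected width per character. It is only a possible starting point for the EM approach above, not the method from the cited paper; the function name split_word and the relative widths in the usage example are assumptions, and in practice the widths would have to be estimated from the data.

```python
def split_word(left, right, text, rel_width=None, default=1.0):
    """Split the span [left, right] of a word into one box per character.

    rel_width maps a character to its expected relative width; characters
    that are not listed get the default. Returns a list of
    (character, left, right) tuples in absolute page coordinates.
    Hypothetical helper, for illustration only.
    """
    rel_width = rel_width or {}
    weights = [rel_width.get(ch, default) for ch in text]
    total = sum(weights)
    boxes = []
    cumulative = 0.0
    for ch, weight in zip(text, weights):
        # Each boundary is computed from the running sum, so neighbouring
        # boxes share an edge and the last box ends exactly at `right`.
        char_left = left + round((right - left) * cumulative / total)
        cumulative += weight
        char_right = left + round((right - left) * cumulative / total)
        boxes.append((ch, char_left, char_right))
    return boxes


# Equal widths: the word box is cut into equal parts.
print(split_word(100, 200, "ab"))

# Made-up relative widths: an "m" is assumed to be twice as wide as an "i".
print(split_word(0, 90, "mi", {"m": 2.0, "i": 1.0}))
```

The resulting left and right values can be written directly into the Character tags; top, bottom and shear can be left empty so that they are inherited from the word.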

Exchanging annotations

On May 14, there will be an opportunity to exchange annotations. You can offer a certain number of annotations from a couple of pages for annotations of other pages to further train your models. The more you offer, the more you get in return from the other teams. However, don’t wait until the end to build your complete recognizer! Build the entire “pipeline” as soon as possible, and improve results with more character annotations.

You can start with the pages specified below to increase the likelihood that you have annotations that other teams do not yet have.

Team	Pages
Team 1. Arjen, Gerben, Aliene, Julien	Stanford-CCCC_0002.jpg Stanford-CCCC_0004.jpg Stanford-CCCC_0006.jpg Stanford-CCCC_0008.jpg Stanford-CCCC_0010.jpg Stanford-CCCC_0012.jpg Stanford-CCCC_0014.jpg Stanford-CCCC_0016.jpg Stanford-CCCC_0018.jpg Stanford-CCCC_0020.jpg Stanford-CCCC_0022.jpg Stanford-CCCC_0024.jpg Stanford-CCCC_0026.jpg Stanford-CCCC_0028.jpg Stanford-CCCC_0030.jpg Stanford-CCCC_0032.jpg Stanford-CCCC_0034.jpg Stanford-CCCC_0036.jpg Stanford-CCCC_0038.jpg Stanford-CCCC_0040.jpg Stanford-CCCC_0042.jpg Stanford-CCCC_0044.jpg Stanford-CCCC_0046.jpg Stanford-CCCC_0048.jpg
Team 2. Arryon, Pieter, Marten, Damian	Stanford-CCCC_0050.jpg Stanford-CCCC_0052.jpg Stanford-CCCC_0054.jpg Stanford-CCCC_0056.jpg Stanford-CCCC_0058.jpg Stanford-CCCC_0060.jpg Stanford-CCCC_0062.jpg Stanford-CCCC_0064.jpg Stanford-CCCC_0066.jpg Stanford-CCCC_0068.jpg Stanford-CCCC_0070.jpg Stanford-CCCC_0072.jpg Stanford-CCCC_0074.jpg Stanford-CCCC_0076.jpg Stanford-CCCC_0078.jpg Stanford-CCCC_0080.jpg Stanford-CCCC_0082.jpg Stanford-CCCC_0084.jpg Stanford-CCCC_0086.jpg Stanford-CCCC_0088.jpg Stanford-CCCC_0090.jpg Stanford-CCCC_0092.jpg KNMP-VIII_F_69______2C2O_0002.jpg KNMP-VIII_F_69______2C2O_0004.jpg
Team 3. Taylor, Francesco, Tom, Erik	KNMP-VIII_F_69______2C2O_0006.jpg KNMP-VIII_F_69______2C2O_0008.jpg KNMP-VIII_F_69______2C2O_0010.jpg KNMP-VIII_F_69______2C2O_0012.jpg KNMP-VIII_F_69______2C2O_0014.jpg KNMP-VIII_F_69______2C2O_0016.jpg KNMP-VIII_F_69______2C2O_0018.jpg KNMP-VIII_F_69______2C2O_0020.jpg KNMP-VIII_F_69______2C2O_0022.jpg KNMP-VIII_F_69______2C2O_0024.jpg KNMP-VIII_F_69______2C2O_0026.jpg KNMP-VIII_F_69______2C2O_0028.jpg KNMP-VIII_F_69______2C2O_0030.jpg KNMP-VIII_F_69______2C2O_0032.jpg KNMP-VIII_F_69______2C2O_0034.jpg KNMP-VIII_F_69______2C2O_0036.jpg KNMP-VIII_F_69______2C2O_0038.jpg KNMP-VIII_F_69______2C2O_0040.jpg KNMP-VIII_F_69______2C2O_0042.jpg KNMP-VIII_F_69______2C2O_0044.jpg KNMP-VIII_F_69______2C2O_0046.jpg KNMP-VIII_F_69______2C2O_0048.jpg KNMP-VIII_F_69______2C2O_0050.jpg KNMP-VIII_F_69______2C2O_0052.jpg
Team 4. Anton, Ciarán, Rick, Rik	KNMP-VIII_F_69______2C2O_0054.jpg KNMP-VIII_F_69______2C2O_0056.jpg KNMP-VIII_F_69______2C2O_0058.jpg KNMP-VIII_F_69______2C2O_0060.jpg KNMP-VIII_F_69______2C2O_0062.jpg KNMP-VIII_F_69______2C2O_0064.jpg KNMP-VIII_F_69______2C2O_0066.jpg KNMP-VIII_F_69______2C2O_0068.jpg KNMP-VIII_F_69______2C2O_0070.jpg KNMP-VIII_F_69______2C2O_0072.jpg KNMP-VIII_F_69______2C2O_0074.jpg KNMP-VIII_F_69______2C2O_0076.jpg KNMP-VIII_F_69______2C2O_0078.jpg KNMP-VIII_F_69______2C2O_0080.jpg KNMP-VIII_F_69______2C2O_0082.jpg KNMP-VIII_F_69______2C2O_0084.jpg KNMP-VIII_F_69______2C2O_0086.jpg KNMP-VIII_F_69______2C2O_0088.jpg KNMP-VIII_F_69______2C2O_0090.jpg KNMP-VIII_F_69______2C2O_0092.jpg KNMP-VIII_F_69______2C2O_0094.jpg KNMP-VIII_F_69______2C2O_0096.jpg KNMP-VIII_F_69______2C2O_0098.jpg KNMP-VIII_F_69______2C2O_0100.jpg
Team 5. Rik, Petr Jankovsky, Arnoud, Ben	KNMP-VIII_F_69______2C2O_0102.jpg KNMP-VIII_F_69______2C2O_0104.jpg KNMP-VIII_F_69______2C2O_0106.jpg KNMP-VIII_F_69______2C2O_0108.jpg KNMP-VIII_F_69______2C2O_0110.jpg KNMP-VIII_F_69______2C2O_0112.jpg KNMP-VIII_F_69______2C2O_0114.jpg KNMP-VIII_F_69______2C2O_0116.jpg KNMP-VIII_F_69______2C2O_0118.jpg KNMP-VIII_F_69______2C2O_0120.jpg KNMP-VIII_F_69______2C2O_0122.jpg KNMP-VIII_F_69______2C2O_0124.jpg KNMP-VIII_F_69______2C2O_0126.jpg KNMP-VIII_F_69______2C2O_0128.jpg KNMP-VIII_F_69______2C2O_0130.jpg KNMP-VIII_F_69______2C2O_0132.jpg KNMP-VIII_F_69______2C2O_0134.jpg KNMP-VIII_F_69______2C2O_0136.jpg KNMP-VIII_F_69______2C2O_0138.jpg KNMP-VIII_F_69______2C2O_0140.jpg KNMP-VIII_F_69______2C2O_0142.jpg KNMP-VIII_F_69______2C2O_0144.jpg KNMP-VIII_F_69______2C2O_0146.jpg KNMP-VIII_F_69______2C2O_0148.jpg

Last modified: April 30, 2014, by Jean-Paul van Oosten
Part of the HWR course