Character annotations

Training character-based classifiers requires a lot of training data. Currently, annotations are available at the word level, but not at the character level. Each team will be given an equal number of pages, each as an image and as an XML file in the format specified on the recognizer page, and will have to label the individual characters.

Character coordinate specification

The character annotations need to be in a specific format to make sure all teams can use the annotations. The format is an extension of the word coordinate specification. The general format is:

<?xml version="1.0" encoding = "UTF-8"?>
<Image name="IMAGE_NAME">
    <TextLine no="LINEID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45">
        <Word no="WORDID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45" text="TEXT">
            <Character no="CHARID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45" text="TEXT" />
            ...
        </Word>
        ...
    </TextLine>
    ...
</Image>

The attributes of the Image, TextLine and Word tags are already filled in. You need to add the Character tags and fill in their attributes. Character annotations may overlap if necessary. If you leave the top, bottom and shear attributes empty, the supplied Python scripts for reading and writing the XML files will fill them in with the values inherited from the word. You usually do not need to touch the shear attribute, and you can usually leave the bottom and top attributes empty as well.
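As an illustration of the format, the following minimal sketch adds two Character tags to an existing Word using Python's standard xml.etree.ElementTree module. It is not the supplied reading and writing script: the helper add_character, the sample coordinates and the way missing attributes are copied from the word are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# A tiny page in the format above; all values are made up for illustration.
PAGE_XML = """<Image name="IMAGE_NAME">
    <TextLine no="1" top="100" bottom="160" left="50" right="400" shear="45">
        <Word no="1" top="100" bottom="160" left="50" right="110" shear="45" text="de">
        </Word>
    </TextLine>
</Image>"""


def add_character(word, no, left, right, text, top=None, bottom=None, shear=None):
    """Append a Character tag to a Word element (hypothetical helper).

    Attributes left as None are copied from the word, mimicking the
    inheritance described above. The supplied scripts may do this differently.
    """
    def inherit(name, value):
        return word.get(name) if value is None else str(value)

    return ET.SubElement(word, "Character", {
        "no": str(no),
        "top": inherit("top", top),
        "bottom": inherit("bottom", bottom),
        "left": str(left),
        "right": str(right),
        "shear": inherit("shear", shear),
        "text": text,
    })


root = ET.fromstring(PAGE_XML)
word = root.find("./TextLine/Word")
add_character(word, 1, left=50, right=82, text="d")
add_character(word, 2, left=78, right=110, text="e")  # overlapping boxes are allowed
print(ET.tostring(root, encoding="unicode"))
```

Only left, right and text are given per character here; top, bottom and shear are taken from the enclosing Word.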

The coordinates are the absolute coordinates in the page, not relative to the word or text line.
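If you segment characters inside a cropped word image, the resulting column positions are relative to the crop and have to be shifted before they are written to the XML file. A minimal sketch, assuming the crop starts exactly at the word's left attribute (the function name and the numbers are hypothetical):

```python
def character_box_on_page(word_left, x_start, x_end):
    """Map a column range inside a word crop to absolute page coordinates."""
    return {"left": word_left + x_start, "right": word_left + x_end}


word_left = 50  # value of the 'left' attribute of the Word tag
# Column ranges found inside the cropped word image (made-up values).
ranges_in_crop = [(0, 32), (28, 60)]
boxes = [character_box_on_page(word_left, s, e) for s, e in ranges_in_crop]
print(boxes)
```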

Approaches

Each team must choose one of the following four approaches:

1. EM-approach, starting from a linear regression model similar to what has been done in "Semi-automatic determination of allograph duration and position in on-line handwritten words based on the expected number of strokes", using the pixel-width of each character instead of the number of strokes.
2. A brute-force approach, where character mining is done by using a sliding window to find similar characters, working from most frequent to least frequent characters (i.e., pick the most common character in the dataset, make a model of that character, find similar characters, and bootstrap your labelling system like that).
3. Knowledge-based approach: define heuristics for segmentation points using local kernels, finding minima, maxima, crossings, etc.
4. Similar to approach 2, but using a page-wise approach: label characters from a single page and find similar characters to those found on the first page.
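To make the first approach more concrete, the sketch below shows one possible initialisation step only: the word's bounding box is divided among its characters in proportion to an expected pixel-width per character. This is not the method of the cited paper and not a full EM procedure; the function initial_boundaries, the default width and the width table are hypothetical. In an EM-style loop, the expected widths would then be re-estimated from the resulting spans and the split repeated.

```python
def initial_boundaries(word_left, word_right, text, expected_width, default_width=1.0):
    """Split [word_left, word_right] into one span per character of `text`.

    Each character gets a share of the word width proportional to its
    expected pixel-width; characters missing from the table fall back
    to `default_width`.
    """
    if not text:
        return []
    widths = [expected_width.get(ch, default_width) for ch in text]
    scale = (word_right - word_left) / sum(widths)
    spans = []
    position = float(word_left)
    for ch, width in zip(text, widths):
        end = position + width * scale
        spans.append((ch, round(position), round(end)))
        position = end
    return spans


# Hypothetical expected widths in pixels; real values would be estimated from data.
EXPECTED_WIDTH = {"i": 10, "m": 30, "n": 20}
spans = initial_boundaries(100, 160, "min", EXPECTED_WIDTH)
print(spans)
```

The resulting left and right values are absolute page coordinates as long as word_left and word_right are taken directly from the Word tag.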

Exchanging annotations

On May 16, there will be an opportunity to exchange annotations. You can offer annotations from some of your pages in exchange for annotations of other pages, which you can use to further train your models. The more you offer, the more you get in return from the other teams.

You can start with the pages specified below to increase the likelihood that you have annotations that other teams do not yet have.

Team	Pages
Team 1.	NL_HaNa_H2_7823_0055 NL_HaNa_H2_7823_0057 NL_HaNa_H2_7823_0059 NL_HaNa_H2_7823_0061 NL_HaNa_H2_7823_0063 NL_HaNa_H2_7823_0065 NL_HaNa_H2_7823_0067 NL_HaNa_H2_7823_0069 NL_HaNa_H2_7823_0071 NL_HaNa_H2_7823_0073 NL_HaNa_H2_7823_0075 NL_HaNa_H2_7823_0077 NL_HaNa_H2_7823_0079 NL_HaNa_H2_7823_0081 NL_HaNa_H2_7823_0083 NL_HaNa_H2_7823_0085 NL_HaNa_H2_7823_0087
Team 2.	NL_HaNa_H2_7823_0089 NL_HaNa_H2_7823_0091 NL_HaNa_H2_7823_0093 NL_HaNa_H2_7823_0095 NL_HaNa_H2_7823_0097 NL_HaNa_H2_7823_0105 NL_HaNa_H2_7823_0107 NL_HaNa_H2_7823_0109 NL_HaNa_H2_7823_0121 NL_HaNa_H2_7823_0123 NL_HaNa_H2_7823_0125 NL_HaNa_H2_7823_0139 NL_HaNa_H2_7823_0141 NL_HaNa_H2_7823_0143 NL_HaNa_H2_7823_0145 NL_HaNa_H2_7823_0147 NL_HaNa_H2_7823_0149
Team 3.	NL_HaNa_H2_7823_0151 NL_HaNa_H2_7823_0153 NL_HaNa_H2_7823_0155 NL_HaNa_H2_7823_0157 NL_HaNa_H2_7823_0159 NL_HaNa_H2_7823_0163 NL_HaNa_H2_7823_0165 NL_HaNa_H2_7823_0167 NL_HaNa_H2_7823_0169 NL_HaNa_H2_7823_0171 NL_HaNa_H2_7823_0173 NL_HaNa_H2_7823_0175 NL_HaNa_H2_7823_0177 NL_HaNa_H2_7823_0179 NL_HaNa_H2_7823_0181 NL_HaNa_H2_7823_0183 NL_HaNa_H2_7823_0185
Team 4.	NL_HaNa_H2_7823_0187 NL_HaNa_H2_7823_0189 NL_HaNa_H2_7823_0191 NL_HaNa_H2_7823_0193 NL_HaNa_H2_7823_0195 NL_HaNa_H2_7823_0197 NL_HaNa_H2_7823_0199 NL_HaNa_H2_7823_0205 NL_HaNa_H2_7823_0207 NL_HaNa_H2_7823_0209 NL_HaNa_H2_7823_0211 NL_HaNa_H2_7823_0221 NL_HaNa_H2_7823_0223 NL_HaNa_H2_7823_0235 NL_HaNa_H2_7823_0237 NL_HaNa_H2_7823_0239 NL_HaNa_H2_7823_0241

Last modified: April 25, 2013, by Jean-Paul van Oosten
Part of the HWR course