Character annotations

Training character-based classifiers requires a large amount of training data. Currently, annotations are available at the word level but not at the character level. Each team will be given an equal number of pages, each as an image and as an XML file in the format specified on the recognizer page, and will have to label the individual characters.

Character coordinate specification

The character annotations must follow a specific format so that all teams can use them. The format is an extension of the word coordinate specification. The general format is:

<?xml version="1.0" encoding="UTF-8"?>
<Image name="IMAGE_NAME">
    <TextLine no="LINEID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45">
        <Word no="WORDID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45" text="TEXT">
            <Character no="CHARID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45" text="TEXT" />
            ...
        </Word>
        ...
    </TextLine>
    ...
</Image>

The attributes of the Image, TextLine and Word tags are already filled in. You need to add the Character tags and fill in their attributes. Character annotations may overlap if necessary. If you use the supplied Python scripts for reading and writing the XML files and leave the top, bottom and shear attributes empty, they will be filled in with the values inherited from the word. You will usually not need to touch the shear attribute, and you can usually leave the top and bottom attributes empty as well.
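
The inheritance rule can be illustrated with a short sketch. The supplied scripts are not reproduced here, so this only approximates the behaviour described above using Python's standard xml.etree.ElementTree module. The function name inherit_from_word and all sample values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Attributes that a Character may leave empty and take over from its Word.
INHERITED = ("top", "bottom", "shear")

def inherit_from_word(root):
    """Fill empty or missing top/bottom/shear on each Character from its Word."""
    for word in root.iter("Word"):
        for char in word.findall("Character"):
            for name in INHERITED:
                if not char.get(name):  # attribute missing or empty string
                    char.set(name, word.get(name, ""))
    return root

# Hypothetical sample page; all names and numbers are made up.
SAMPLE = """<Image name="example">
  <TextLine no="1" top="100" bottom="160" left="50" right="900" shear="45">
    <Word no="1" top="105" bottom="155" left="60" right="140" shear="45" text="de">
      <Character no="1" top="" bottom="" left="60" right="98" shear="" text="d" />
      <Character no="2" left="95" right="140" text="e" />
    </Word>
  </TextLine>
</Image>"""

root = inherit_from_word(ET.fromstring(SAMPLE))
for char in root.iter("Character"):
    print(char.get("text"), char.get("top"), char.get("bottom"), char.get("shear"))
```

Note that the two sample characters overlap horizontally (the second starts at 95, the first ends at 98), which the format allows.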

The coordinates are the absolute coordinates in the page, not relative to the word or text line.
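
If your segmentation runs on a cropped word image, its x-positions are relative to that crop and must be shifted by the word's left coordinate before they are written out. A minimal sketch, assuming the crop starts exactly at the Word's left attribute; the helper name to_page_coordinates and the numbers are hypothetical:

```python
def to_page_coordinates(word_left, cuts, word_width):
    """Turn cut positions relative to a word crop into absolute (left, right) pairs.

    word_left  -- the Word's `left` attribute (absolute page x-coordinate)
    cuts       -- sorted x-offsets inside the crop where one character ends
                  and the next begins
    word_width -- width of the crop in pixels (Word `right` minus `left`)
    """
    edges = [0] + list(cuts) + [word_width]
    # Pair up consecutive edges and shift them into page coordinates.
    return [(word_left + a, word_left + b) for a, b in zip(edges, edges[1:])]

# Invented numbers: a word spanning x = 60..140 on the page, cut at offsets 38 and 61.
boxes = to_page_coordinates(60, [38, 61], 80)
print(boxes)  # [(60, 98), (98, 121), (121, 140)]
```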

Approaches

Each team must choose one of the following four approaches:

  1. EM approach: start from a linear regression model similar to the one used in "Semi-automatic determination of allograph duration and position in on-line handwritten words based on the expected number of strokes", but use the pixel width of each character instead of the number of strokes.
  2. Brute-force approach: mine characters by using a sliding window to find similar characters, working from the most frequent to the least frequent character (i.e., pick the most common character in the dataset, build a model of that character, find similar characters, and bootstrap your labelling system from there).
  3. Knowledge-based approach: define heuristics for segmentation points using local kernels, finding minima, maxima, crossings, etc.
  4. Similar to approach 2, but page-wise: label the characters on a single page, then find characters similar to those found on that first page.
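
As an illustration of the kind of heuristic meant in approach 3, the sketch below proposes segmentation points at local minima of the vertical ink projection of a binarized word image. This is only one possible heuristic, not a prescribed method. The function names, the max_ink threshold and the toy image are assumptions made for the example, and real handwriting will likely need smoothing and additional cues such as crossings.

```python
def ink_profile(image):
    """Count ink pixels (value 1) in every column of a binary image (list of rows)."""
    return [sum(column) for column in zip(*image)]

def candidate_cuts(profile, max_ink=1):
    """Return column indices that are local minima with at most `max_ink` ink pixels.

    For a plateau of equal values, the centre column of the plateau is reported.
    """
    cuts = []
    i, n = 1, len(profile)
    while i < n - 1:
        j = i
        while j + 1 < n and profile[j + 1] == profile[i]:
            j += 1  # extend over a plateau of equal values
        left_higher = profile[i - 1] > profile[i]
        right_higher = j + 1 < n and profile[j + 1] > profile[j]
        if left_higher and right_higher and profile[i] <= max_ink:
            cuts.append((i + j) // 2)
        i = j + 1
    return cuts

# Toy binary word image (1 = ink); invented for illustration.
WORD = [
    [1, 1, 0, 0, 1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0, 1],
]

profile = ink_profile(WORD)
print(profile)                  # [3, 3, 1, 1, 3, 2, 3, 0, 3]
print(candidate_cuts(profile))  # [2, 7]
```

The cut offsets are relative to the word image; they still have to be converted to absolute page coordinates before being written to the XML file.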

Exchanging annotations

On May 16, there will be an opportunity to exchange annotations. You can offer annotations of some of your pages in exchange for annotations of other pages, which you can use to further train your models. The more you offer, the more you get in return from the other teams.

You can start with the pages specified below to increase the likelihood that you have annotations that other teams do not yet have.

Team    Pages
Team 1  NL_HaNa_H2_7823_0055
        NL_HaNa_H2_7823_0057
        NL_HaNa_H2_7823_0059
        NL_HaNa_H2_7823_0061
        NL_HaNa_H2_7823_0063
        NL_HaNa_H2_7823_0065
        NL_HaNa_H2_7823_0067
        NL_HaNa_H2_7823_0069
        NL_HaNa_H2_7823_0071
        NL_HaNa_H2_7823_0073
        NL_HaNa_H2_7823_0075
        NL_HaNa_H2_7823_0077
        NL_HaNa_H2_7823_0079
        NL_HaNa_H2_7823_0081
        NL_HaNa_H2_7823_0083
        NL_HaNa_H2_7823_0085
        NL_HaNa_H2_7823_0087
Team 2  NL_HaNa_H2_7823_0089
        NL_HaNa_H2_7823_0091
        NL_HaNa_H2_7823_0093
        NL_HaNa_H2_7823_0095
        NL_HaNa_H2_7823_0097
        NL_HaNa_H2_7823_0105
        NL_HaNa_H2_7823_0107
        NL_HaNa_H2_7823_0109
        NL_HaNa_H2_7823_0121
        NL_HaNa_H2_7823_0123
        NL_HaNa_H2_7823_0125
        NL_HaNa_H2_7823_0139
        NL_HaNa_H2_7823_0141
        NL_HaNa_H2_7823_0143
        NL_HaNa_H2_7823_0145
        NL_HaNa_H2_7823_0147
        NL_HaNa_H2_7823_0149
Team 3  NL_HaNa_H2_7823_0151
        NL_HaNa_H2_7823_0153
        NL_HaNa_H2_7823_0155
        NL_HaNa_H2_7823_0157
        NL_HaNa_H2_7823_0159
        NL_HaNa_H2_7823_0163
        NL_HaNa_H2_7823_0165
        NL_HaNa_H2_7823_0167
        NL_HaNa_H2_7823_0169
        NL_HaNa_H2_7823_0171
        NL_HaNa_H2_7823_0173
        NL_HaNa_H2_7823_0175
        NL_HaNa_H2_7823_0177
        NL_HaNa_H2_7823_0179
        NL_HaNa_H2_7823_0181
        NL_HaNa_H2_7823_0183
        NL_HaNa_H2_7823_0185
Team 4  NL_HaNa_H2_7823_0187
        NL_HaNa_H2_7823_0189
        NL_HaNa_H2_7823_0191
        NL_HaNa_H2_7823_0193
        NL_HaNa_H2_7823_0195
        NL_HaNa_H2_7823_0197
        NL_HaNa_H2_7823_0199
        NL_HaNa_H2_7823_0205
        NL_HaNa_H2_7823_0207
        NL_HaNa_H2_7823_0209
        NL_HaNa_H2_7823_0211
        NL_HaNa_H2_7823_0221
        NL_HaNa_H2_7823_0223
        NL_HaNa_H2_7823_0235
        NL_HaNa_H2_7823_0237
        NL_HaNa_H2_7823_0239
        NL_HaNa_H2_7823_0241

Last modified: April 25, 2013, by Jean-Paul van Oosten
Part of the HWR course