Training character-based classifiers requires a lot of training data. Currently, annotations are available at the word level, but not at the character level. Each team will be given an equal number of pages, each as an image and as an XML file in the format specified on the recognizer page, and will have to label the individual characters.
The character annotations need to be in a specific format so that all teams can use them. The format is an extension of the word coordinate specification. The general format is:
<?xml version="1.0" encoding = "UTF-8"?>
<Image name="IMAGE_NAME">
<TextLine no="LINEID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45">
<Word no="WORDID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45" text="TEXT">
<Character no="CHARID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45" text="TEXT" />
...
</Word>
...
</TextLine>
...
</Image>
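As a rough illustration of how a file in this format can be read, the following sketch uses Python's standard xml.etree.ElementTree module to list the words that do not yet contain any Character tags. It is not one of the supplied course scripts; the sample page, its coordinates and the function name words_without_characters are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the format above; all names and values are invented.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<Image name="example_page">
  <TextLine no="1" top="100" bottom="160" left="50" right="900" shear="45">
    <Word no="1" top="100" bottom="160" left="50" right="210" shear="45" text="de">
      <Character no="1" top="100" bottom="160" left="50" right="130" shear="45" text="d" />
    </Word>
    <Word no="2" top="100" bottom="160" left="240" right="400" shear="45" text="man" />
  </TextLine>
</Image>
"""


def words_without_characters(root):
    """Return (line no, word no, text) for every Word that has no Character child."""
    missing = []
    for line in root.iter("TextLine"):
        for word in line.iter("Word"):
            if word.find("Character") is None:
                missing.append((line.get("no"), word.get("no"), word.get("text")))
    return missing


# Parse from bytes so the encoding declaration in the XML header is honoured.
root = ET.fromstring(SAMPLE.encode("utf-8"))
print(words_without_characters(root))  # [('1', '2', 'man')]
```

A check like this can help a team keep track of which words on a page still need to be labelled.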
The attributes of the Image, TextLine and Word tags are already filled in. You need to add the Character tags and fill in their attributes. These annotations may overlap if necessary. If you use the supplied Python scripts for reading and writing the XML files and leave the top, bottom and shear attributes empty, they will be filled in with the values inherited from the word. Usually you do not need to touch the shear attribute, and you can usually leave the bottom and top attributes empty as well.
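The inheritance rule can be sketched as follows. This is a minimal illustration written with xml.etree.ElementTree, not the supplied scripts themselves; the helper name add_character and the coordinate values are assumptions made for this example. Empty top, bottom and shear values are copied from the parent Word.

```python
import xml.etree.ElementTree as ET


def add_character(word, no, left, right, text, top="", bottom="", shear=""):
    """Append a Character tag to `word` (hypothetical helper).

    Empty top, bottom and shear values are inherited from the Word element.
    """
    def inherit(name, value):
        value = str(value)
        return value if value else word.get(name, "")

    return ET.SubElement(word, "Character", {
        "no": str(no),
        "top": inherit("top", top),
        "bottom": inherit("bottom", bottom),
        "left": str(left),
        "right": str(right),
        "shear": inherit("shear", shear),
        "text": text,
    })


# Invented example word; real pages already contain the Word attributes.
word = ET.fromstring(
    '<Word no="1" top="100" bottom="160" left="50" right="210" shear="45" text="de" />'
)
add_character(word, 1, 50, 130, "d")            # top, bottom, shear inherited
add_character(word, 2, 120, 210, "e", top=110)  # explicit top, overlaps with "d"
print(ET.tostring(word, encoding="unicode"))
```

The second call shows both an explicitly set top value and two character boxes that overlap horizontally, which the format allows.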
The coordinates are the absolute coordinates in the page, not relative to the word or text line.
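If you happen to measure character boxes in a cropped word image, the measurements have to be shifted back to page coordinates before they are written to the XML file. The sketch below assumes the crop starts exactly at the word's left and top attributes and that no de-shearing or scaling was applied; the function name to_page_coords is hypothetical.

```python
def to_page_coords(word_left, word_top, box):
    """Convert a (left, top, right, bottom) box measured inside a word crop
    into absolute page coordinates.

    Assumes the crop origin is (word_left, word_top) and the crop is unscaled.
    """
    left, top, right, bottom = box
    return (word_left + left, word_top + top,
            word_left + right, word_top + bottom)


# A character spanning the first 80 pixels of a word located at (50, 100):
print(to_page_coords(50, 100, (0, 0, 80, 60)))  # (50, 100, 130, 160)
```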
Each team must choose from one of the following four approaches:
On May 16, there will be an opportunity to exchange annotations. You can offer annotations from some of your pages in exchange for annotations of other pages, which you can use to further train your models. The more you offer, the more you get in return from the other teams.
You can start with the pages specified below to increase the likelihood that you have annotations that other teams do not yet have.
Team | Pages |
---|---|
Team 1. | NL_HaNa_H2_7823_0055 NL_HaNa_H2_7823_0057 NL_HaNa_H2_7823_0059 NL_HaNa_H2_7823_0061 NL_HaNa_H2_7823_0063 NL_HaNa_H2_7823_0065 NL_HaNa_H2_7823_0067 NL_HaNa_H2_7823_0069 NL_HaNa_H2_7823_0071 NL_HaNa_H2_7823_0073 NL_HaNa_H2_7823_0075 NL_HaNa_H2_7823_0077 NL_HaNa_H2_7823_0079 NL_HaNa_H2_7823_0081 NL_HaNa_H2_7823_0083 NL_HaNa_H2_7823_0085 NL_HaNa_H2_7823_0087 |
Team 2. | NL_HaNa_H2_7823_0089 NL_HaNa_H2_7823_0091 NL_HaNa_H2_7823_0093 NL_HaNa_H2_7823_0095 NL_HaNa_H2_7823_0097 NL_HaNa_H2_7823_0105 NL_HaNa_H2_7823_0107 NL_HaNa_H2_7823_0109 NL_HaNa_H2_7823_0121 NL_HaNa_H2_7823_0123 NL_HaNa_H2_7823_0125 NL_HaNa_H2_7823_0139 NL_HaNa_H2_7823_0141 NL_HaNa_H2_7823_0143 NL_HaNa_H2_7823_0145 NL_HaNa_H2_7823_0147 NL_HaNa_H2_7823_0149 |
Team 3. | NL_HaNa_H2_7823_0151 NL_HaNa_H2_7823_0153 NL_HaNa_H2_7823_0155 NL_HaNa_H2_7823_0157 NL_HaNa_H2_7823_0159 NL_HaNa_H2_7823_0163 NL_HaNa_H2_7823_0165 NL_HaNa_H2_7823_0167 NL_HaNa_H2_7823_0169 NL_HaNa_H2_7823_0171 NL_HaNa_H2_7823_0173 NL_HaNa_H2_7823_0175 NL_HaNa_H2_7823_0177 NL_HaNa_H2_7823_0179 NL_HaNa_H2_7823_0181 NL_HaNa_H2_7823_0183 NL_HaNa_H2_7823_0185 |
Team 4. | NL_HaNa_H2_7823_0187 NL_HaNa_H2_7823_0189 NL_HaNa_H2_7823_0191 NL_HaNa_H2_7823_0193 NL_HaNa_H2_7823_0195 NL_HaNa_H2_7823_0197 NL_HaNa_H2_7823_0199 NL_HaNa_H2_7823_0205 NL_HaNa_H2_7823_0207 NL_HaNa_H2_7823_0209 NL_HaNa_H2_7823_0211 NL_HaNa_H2_7823_0221 NL_HaNa_H2_7823_0223 NL_HaNa_H2_7823_0235 NL_HaNa_H2_7823_0237 NL_HaNa_H2_7823_0239 NL_HaNa_H2_7823_0241 |
Last modified: April 25, 2013, by Jean-Paul van Oosten
Part of the HWR course