Training character-based classifiers requires a lot of training data. Currently, annotations are available at the word level, but not at the character level. Each team will be given an equal number of pages, each as an image and as an XML file in the format specified on the recognizer page, and will have to label the individual characters.
The character annotations must follow a specific format so that all teams can use them. The format is an extension of the word coordinate specification. The general format is:
<?xml version="1.0" encoding = "UTF-8"?>
<Image name="IMAGE_NAME">
<TextLine no="LINEID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45">
<Word no="WORDID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45" text="TEXT">
<Character no="CHARID" top="TOP" bottom="BOTTOM" left="LEFT" right="RIGHT" shear="45" text="TEXT" />
...
</Word>
...
</TextLine>
...
</Image>
The attributes of the Image, TextLine and Word tags are already filled in. You need to add the Character tags and fill in their attributes. These annotations may overlap if necessary. If you use the supplied Python scripts for reading and writing the XML files and leave the top, bottom and shear attributes empty, they will be filled in with the values inherited from the word. Usually you do not need to touch the shear attribute, and you can usually leave the bottom and top attributes empty as well.
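The inheritance rule described above can be illustrated with a minimal sketch. This is not the supplied script: it uses Python's standard xml.etree.ElementTree module, the helper name add_character is hypothetical, and the coordinate values are made up for the example.

```python
import xml.etree.ElementTree as ET


def add_character(word, no, left, right, text, top=None, bottom=None, shear=None):
    """Append a Character element to a Word element (hypothetical helper).

    top, bottom and shear are copied from the parent word when left empty.
    """
    def inherit(name, value):
        # Fall back to the word's own attribute when no value is given.
        return word.get(name) if value in (None, "") else str(value)

    return ET.SubElement(word, "Character", {
        "no": str(no),
        "top": inherit("top", top),
        "bottom": inherit("bottom", bottom),
        "left": str(left),
        "right": str(right),
        "shear": inherit("shear", shear),
        "text": text,
    })


# Illustrative word element; the values are invented for this example.
word = ET.fromstring(
    '<Word no="1" top="120" bottom="180" left="300" right="420" shear="45" text="de"/>'
)
add_character(word, 1, left=300, right=355, text="d")           # inherits top, bottom, shear
add_character(word, 2, left=350, right=420, text="e", top=130)  # overrides top only
print(ET.tostring(word, encoding="unicode"))
```

The two characters overlap horizontally (350 to 355), which the format allows.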
The coordinates are the absolute coordinates in the page, not relative to the word or text line.
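If you segment characters on a cropped word image, the resulting boxes are relative to that crop and have to be shifted before they are written to the XML file. The following is a minimal sketch under a few assumptions: the crop is the axis-aligned box given by the word's left and top attributes, the origin is the top-left corner of the image, and shear is ignored. The function name to_page_coordinates and the numbers are hypothetical.

```python
def to_page_coordinates(word_box, char_box):
    """Shift a box measured inside a word crop to absolute page coordinates.

    Assumes the crop's origin is the word's (left, top) corner on the page.
    """
    return {
        "left": word_box["left"] + char_box["left"],
        "right": word_box["left"] + char_box["right"],
        "top": word_box["top"] + char_box["top"],
        "bottom": word_box["top"] + char_box["bottom"],
    }


# Illustrative values only.
word_box = {"left": 300, "top": 120, "right": 420, "bottom": 180}
char_in_crop = {"left": 0, "top": 0, "right": 55, "bottom": 60}

page_box = to_page_coordinates(word_box, char_in_crop)
print(page_box)
```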
To get your preliminary character segmentation hypothesis, you can use one of the following approaches:
On May 14, there will be an opportunity to exchange annotations. You can offer a number of annotations from a couple of your pages in exchange for annotations of other pages, which you can use to further train your models. The more you offer, the more you get in return from the other teams. However, don’t wait until the end to build your complete recognizer! Build the entire “pipeline” as soon as possible, and then improve the results as more character annotations become available.
You can start with the pages specified below to increase the likelihood that you have annotations that other teams do not yet have.
Team | Pages |
---|---|
Team 1. Arjen, Gerben, Aliene, Julien | Stanford-CCCC_0002.jpg Stanford-CCCC_0004.jpg Stanford-CCCC_0006.jpg Stanford-CCCC_0008.jpg Stanford-CCCC_0010.jpg Stanford-CCCC_0012.jpg Stanford-CCCC_0014.jpg Stanford-CCCC_0016.jpg Stanford-CCCC_0018.jpg Stanford-CCCC_0020.jpg Stanford-CCCC_0022.jpg Stanford-CCCC_0024.jpg Stanford-CCCC_0026.jpg Stanford-CCCC_0028.jpg Stanford-CCCC_0030.jpg Stanford-CCCC_0032.jpg Stanford-CCCC_0034.jpg Stanford-CCCC_0036.jpg Stanford-CCCC_0038.jpg Stanford-CCCC_0040.jpg Stanford-CCCC_0042.jpg Stanford-CCCC_0044.jpg Stanford-CCCC_0046.jpg Stanford-CCCC_0048.jpg |
Team 2. Arryon, Pieter, Marten, Damian | Stanford-CCCC_0050.jpg Stanford-CCCC_0052.jpg Stanford-CCCC_0054.jpg Stanford-CCCC_0056.jpg Stanford-CCCC_0058.jpg Stanford-CCCC_0060.jpg Stanford-CCCC_0062.jpg Stanford-CCCC_0064.jpg Stanford-CCCC_0066.jpg Stanford-CCCC_0068.jpg Stanford-CCCC_0070.jpg Stanford-CCCC_0072.jpg Stanford-CCCC_0074.jpg Stanford-CCCC_0076.jpg Stanford-CCCC_0078.jpg Stanford-CCCC_0080.jpg Stanford-CCCC_0082.jpg Stanford-CCCC_0084.jpg Stanford-CCCC_0086.jpg Stanford-CCCC_0088.jpg Stanford-CCCC_0090.jpg Stanford-CCCC_0092.jpg KNMP-VIII_F_69______2C2O_0002.jpg KNMP-VIII_F_69______2C2O_0004.jpg |
Team 3. Taylor, Francesco, Tom, Erik | KNMP-VIII_F_69______2C2O_0006.jpg KNMP-VIII_F_69______2C2O_0008.jpg KNMP-VIII_F_69______2C2O_0010.jpg KNMP-VIII_F_69______2C2O_0012.jpg KNMP-VIII_F_69______2C2O_0014.jpg KNMP-VIII_F_69______2C2O_0016.jpg KNMP-VIII_F_69______2C2O_0018.jpg KNMP-VIII_F_69______2C2O_0020.jpg KNMP-VIII_F_69______2C2O_0022.jpg KNMP-VIII_F_69______2C2O_0024.jpg KNMP-VIII_F_69______2C2O_0026.jpg KNMP-VIII_F_69______2C2O_0028.jpg KNMP-VIII_F_69______2C2O_0030.jpg KNMP-VIII_F_69______2C2O_0032.jpg KNMP-VIII_F_69______2C2O_0034.jpg KNMP-VIII_F_69______2C2O_0036.jpg KNMP-VIII_F_69______2C2O_0038.jpg KNMP-VIII_F_69______2C2O_0040.jpg KNMP-VIII_F_69______2C2O_0042.jpg KNMP-VIII_F_69______2C2O_0044.jpg KNMP-VIII_F_69______2C2O_0046.jpg KNMP-VIII_F_69______2C2O_0048.jpg KNMP-VIII_F_69______2C2O_0050.jpg KNMP-VIII_F_69______2C2O_0052.jpg |
Team 4. Anton, Ciarán, Rick, Rik | KNMP-VIII_F_69______2C2O_0054.jpg KNMP-VIII_F_69______2C2O_0056.jpg KNMP-VIII_F_69______2C2O_0058.jpg KNMP-VIII_F_69______2C2O_0060.jpg KNMP-VIII_F_69______2C2O_0062.jpg KNMP-VIII_F_69______2C2O_0064.jpg KNMP-VIII_F_69______2C2O_0066.jpg KNMP-VIII_F_69______2C2O_0068.jpg KNMP-VIII_F_69______2C2O_0070.jpg KNMP-VIII_F_69______2C2O_0072.jpg KNMP-VIII_F_69______2C2O_0074.jpg KNMP-VIII_F_69______2C2O_0076.jpg KNMP-VIII_F_69______2C2O_0078.jpg KNMP-VIII_F_69______2C2O_0080.jpg KNMP-VIII_F_69______2C2O_0082.jpg KNMP-VIII_F_69______2C2O_0084.jpg KNMP-VIII_F_69______2C2O_0086.jpg KNMP-VIII_F_69______2C2O_0088.jpg KNMP-VIII_F_69______2C2O_0090.jpg KNMP-VIII_F_69______2C2O_0092.jpg KNMP-VIII_F_69______2C2O_0094.jpg KNMP-VIII_F_69______2C2O_0096.jpg KNMP-VIII_F_69______2C2O_0098.jpg KNMP-VIII_F_69______2C2O_0100.jpg |
Team 5. Rik, Petr Jankovsky, Arnoud, Ben | KNMP-VIII_F_69______2C2O_0102.jpg KNMP-VIII_F_69______2C2O_0104.jpg KNMP-VIII_F_69______2C2O_0106.jpg KNMP-VIII_F_69______2C2O_0108.jpg KNMP-VIII_F_69______2C2O_0110.jpg KNMP-VIII_F_69______2C2O_0112.jpg KNMP-VIII_F_69______2C2O_0114.jpg KNMP-VIII_F_69______2C2O_0116.jpg KNMP-VIII_F_69______2C2O_0118.jpg KNMP-VIII_F_69______2C2O_0120.jpg KNMP-VIII_F_69______2C2O_0122.jpg KNMP-VIII_F_69______2C2O_0124.jpg KNMP-VIII_F_69______2C2O_0126.jpg KNMP-VIII_F_69______2C2O_0128.jpg KNMP-VIII_F_69______2C2O_0130.jpg KNMP-VIII_F_69______2C2O_0132.jpg KNMP-VIII_F_69______2C2O_0134.jpg KNMP-VIII_F_69______2C2O_0136.jpg KNMP-VIII_F_69______2C2O_0138.jpg KNMP-VIII_F_69______2C2O_0140.jpg KNMP-VIII_F_69______2C2O_0142.jpg KNMP-VIII_F_69______2C2O_0144.jpg KNMP-VIII_F_69______2C2O_0146.jpg KNMP-VIII_F_69______2C2O_0148.jpg |
Last modified: April 30, 2014, by Jean-Paul van Oosten
Part of the HWR course