Assignment 1: manual word labeling

Before a recognizer can be trained and tested, training and test data must exist, so you first have to create this data. For the first assignment you have to manually label all words on 5 pages of the Kabinet der Koningin. The pages will be assigned to you during the practical session.

The tags left, right, top, bottom in the XML describe the rectangle around a word, while 'shear' indicates the deviation angle (in degrees) from a rectangle.
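
To make the role of the shear angle more concrete, here is a minimal Python sketch that turns the five values into the four corners of a slanted box. It rests on assumptions that the framework does not confirm: that coordinates are image pixels with y increasing downwards, that the bottom edge stays fixed, and that the top edge is shifted horizontally by height * tan(shear). The function name sheared_corners is hypothetical; check handyfunctions.py for the interpretation actually used.

```python
import math


def sheared_corners(left, right, top, bottom, shear_degrees):
    """Return the corners of a sheared word box as (x, y) tuples.

    Assumption (not confirmed by the framework): the bottom edge stays
    fixed and the top edge shifts horizontally by height * tan(shear).
    Image coordinates are assumed, so bottom is larger than top.
    """
    height = bottom - top
    offset = height * math.tan(math.radians(shear_degrees))
    return [
        (left + offset, top),    # top-left
        (right + offset, top),   # top-right
        (right, bottom),         # bottom-right
        (left, bottom),          # bottom-left
    ]


if __name__ == "__main__":
    # With shear 0 the result is the plain rectangle.
    print(sheared_corners(10, 110, 20, 70, 0))
    # With a non-zero shear the top edge moves sideways.
    print(sheared_corners(10, 110, 20, 70, 45))
```

With a shear of 0 the corners are those of the plain rectangle; under the assumptions above, a larger angle moves the top edge further sideways while the bottom edge stays in place.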

Framework

You can use the files provided in /student/hwr/framework as a basis. More details about the framework will eventually be available on the Framework page.

To use these files, copy all files in /student/hwr/framework to a directory in your account and type 'make' in that directory. Now run: python handyfunctions.py and see what happens. To force the program to exit, use 'ps -a' to find its process number and then 'kill [processnumber]' in a separate terminal window.

Word annotation tool

You can use word-annotation.py to label your words if you like. You can mark word boundaries using the right mouse button and type the words separated by spaces. The program helps you by showing manually entered text of complete text lines, which is often in sync with the words that you need to label. Be aware that this program is not complete, so triple-check its functionality and output. It has issues that need your attention, which means that you need to update the program before you can start labeling. The images are here: List of files. The user name and password will be mentioned during the practical.

Save and submit

Save all word labels and the corresponding rectangles in a .words file in XML format, according to handyfunctions.py and word-annotation.py.

You can use /student/hwr/framework/xmlwordcut.py to test whether your generated XML files are correct. This program expects the XML files to have a single enclosing start and end tag, so for this to work add

<Document>
and
</Document>
(or similar) to the beginning and end of your .words files, respectively.
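
As a minimal sketch of this wrapping step, the Python fragment below adds an enclosing <Document> element to the contents of a .words file and checks that the result parses as XML. The function names and the Word element in the sample are hypothetical and only serve as illustration; the actual element names are those written by handyfunctions.py and word-annotation.py. The sketch also assumes that the .words file does not start with an XML declaration.

```python
import xml.etree.ElementTree as ET


def wrap_in_document(fragment, root_tag="Document"):
    """Surround the text of a .words file with one enclosing element."""
    return "<%s>\n%s\n</%s>" % (root_tag, fragment.strip(), root_tag)


def is_well_formed(xml_text):
    """Return True if xml_text parses as a single XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical content: two sibling elements without a common root.
    sample = "<Word>de</Word>\n<Word>koning</Word>"
    print(is_well_formed(sample))                    # no single root element
    print(is_well_formed(wrap_in_document(sample)))  # wrapped version
```

This only checks that the file is well-formed XML; it does not replace running xmlwordcut.py, which tests the files against what the framework expects.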

Send the .words files in an e-mail to Axel Brink. He will split them into a train set and a test set. The train set will be available for students in /student/hwr/trainset; the test set will be held back for fair testing.

Criterion: five XML (.words) files must be handed in, and they must be correct.


Last modified: May 24, 2007 by Axel Brink.