Assignment 1: word labels
Goal
During the practical, you will be assigned ten pages. Label these pages by selecting word regions and typing the words. This results in ten files with the extension '.words'. You do not have to start from scratch. Some of the pages were annotated by last year's students, but these annotations have to be corrected. The rest of the pages have been (partially) annotated line by line, not word by word. For those pages, you will create initial .words files from the line annotations using a script and then correct them.
In the following instructions, several directory names are suggested. You may use alternative directories if you insist.
Preparation
- If you work at home, then make sure that you have the required software installed: Python 2.4 (or newer), wxPython 2.6 (or newer).
- Make a directory for the practicals:
mkdir ~/hwr
- Copy the word annotation tool:
cp -r /home/student/hwr/word-annot ~/hwr/
- Download your source images here. Select .tif files, not .kdkxml files. The password is on Nestor under the "Course Documents" button. Store the images in /tmp. If space is limited, you may have to follow the instructions on this page for one page at a time. Type
df -h
to see how much space is available on '/'.
- Make a subdirectory for the annotation files:
mkdir ~/hwr/words
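The free-space check with df -h can also be scripted, for example before downloading a batch of images. The following is a minimal sketch, assuming a current Python 3 interpreter (not the Python 2.4 that the course tools target). The function names and the 0.5 GiB threshold are made up for illustration; the actual size of the .tif files is not stated on this page.

```python
import shutil


def free_gib(path="/"):
    """Return the free space on the filesystem holding `path`, in GiB."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / 2**30


def enough_space(path, needed_gib):
    """Return True if at least `needed_gib` GiB are free at `path`."""
    return free_gib(path) >= needed_gib


if __name__ == "__main__":
    # 0.5 GiB is a hypothetical threshold, not a figure from the course.
    print("free on /: %.1f GiB" % free_gib("/"))
    print("room for 0.5 GiB:", enough_space("/", 0.5))
```

If the check fails, download and annotate one page at a time, as described above.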
Getting initial .words files
- Check which pages are assigned to you here.
- Some pages have been labeled by students last year; they can be found in /home/student/hwr/data/words-2007. Check which of those match the pages assigned to you and copy them to your words directory:
cp /home/student/hwr/data/words-2007/NL_HaNa_H2_7823_xxxx.words ~/hwr/words
(where xxxx is substituted with the page number)
- The rest of the .words files can be generated from existing manual line-by-line annotation (in /home/student/hwr/data/line_annot.txt). Go to your words directory:
cd ~/hwr/words
and generate the missing .words files:
python ../word-annot/init_words_file.py /home/student/hwr/data/line_annot.txt xxxx
(where xxxx is substituted with the page number)
- Change the file permissions so that you can read and modify the files:
chmod u+wr ~/hwr/words/*.words
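The choice between copying last year's file and generating a new one can be sketched as a small helper that prints the command to run for each page. This is only an illustration, assuming a current Python 3 interpreter. The function name plan_initial_files and the page numbers 0012 and 0013 are made up, and the demo uses a temporary directory in place of /home/student/hwr/data/words-2007.

```python
import os
import tempfile

PREFIX = "NL_HaNa_H2_7823_"  # file name prefix used in the commands above


def plan_initial_files(pages, prev_dir, line_annot):
    """Return one shell command per page: copy last year's file if it
    exists in prev_dir, otherwise generate it from the line annotations."""
    commands = []
    for page in pages:
        old = os.path.join(prev_dir, PREFIX + page + ".words")
        if os.path.exists(old):
            commands.append("cp %s ~/hwr/words" % old)
        else:
            # This command is meant to be run from inside ~/hwr/words.
            commands.append(
                "python ../word-annot/init_words_file.py %s %s" % (line_annot, page)
            )
    return commands


# Demo: a throwaway directory stands in for words-2007; page numbers are hypothetical.
with tempfile.TemporaryDirectory() as prev_dir:
    open(os.path.join(prev_dir, PREFIX + "0012.words"), "w").close()
    plan = plan_initial_files(["0012", "0013"], prev_dir, "line_annot.txt")

for command in plan:
    print(command)
```

The helper only prints commands and does not run them, so you can check the list against your page assignment first.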
Correcting the .words files
- The .words files are still far from perfect. You have to correct them. This is not fun at all but it must be done so that you and your fellow students can make fantastic recognizers. I have made a simple interface for you to ease the job a little. Do not expect too much of it.
cd ~/hwr/word-annot/
python word-annotation.py /tmp/NL_HaNa_H2_7823_xxxx.tif ../words/NL_HaNa_H2_7823_xxxx.words &
- Click a word region to select it. (The detection of mouse clicks is not perfect -- if it does not work then be patient and try to click somewhere else in the box.)
- Every word region is bounded by four line segments. Move the line segments by dragging them. The regions should include the entire word as well as possible. They may include a little ink from other words; this is usually unavoidable. Ink from other words should be removed by your future recognizer program.
- You can change the word's annotation by typing it in the box at the bottom and clicking "OK". (Do not forget to click "OK".)
- When you are done changing a page, click "Save". Do not forget this; you will not be reminded to save. It is a good habit to press "Save" often anyway and also to make backups of your .words files, since the annotation tool is not very well tested. (But I think it is much better than last year's version.)
- Some changes require editing the .words file by hand, for example inserting new word boxes. Do not add word boxes for words that you cannot read.
- The ampersand (&) and some other special symbols cannot be stored; leave the transcription for such a symbol blank.
- When you are satisfied with all the .words files, put them in Nestor's digital drop box.
- Remove everything you put in /tmp (to keep the computers clean). The images may not leave the building because they are private data.
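The advice above to back up your .words files often can be followed with a small script. This is a minimal sketch, assuming a current Python 3 interpreter. The function name backup_words and the folder layout (a timestamped words-... folder per backup) are illustrative choices and not part of the course tools. The demo runs on a temporary directory; for real use you might call backup_words on ~/hwr/words with a backup location of your own choosing.

```python
import glob
import os
import shutil
import tempfile
import time


def backup_words(words_dir, backup_root):
    """Copy every .words file in words_dir into a new timestamped
    folder under backup_root; return the folder and the copied names."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = os.path.join(backup_root, "words-" + stamp)
    os.makedirs(target, exist_ok=True)
    copied = []
    for path in sorted(glob.glob(os.path.join(words_dir, "*.words"))):
        shutil.copy2(path, target)  # copy2 also keeps the modification time
        copied.append(os.path.basename(path))
    return target, copied


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    words_dir = os.path.join(tmp, "words")
    os.makedirs(words_dir)
    for name in ("a.words", "b.words", "notes.txt"):
        open(os.path.join(words_dir, name), "w").close()
    target, copied = backup_words(words_dir, os.path.join(tmp, "backup"))
    backed_up = sorted(os.listdir(target))

print(copied)
```

Only the .words files are copied. Keep the backups in your home directory, since everything in /tmp has to be removed at the end.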
Last modified: 24 April 2008