Recognizer: technical details

The goal of the course is to build a recognizer. You will create a program called recognizer that classifies all word zones on a page. Its input is an image of a complete page (in ppm format) and an XML file containing the coordinates of each word zone.

You will be provided with a number of Python and C++ modules that do some of the work for you, such as reading and writing ppm/pgm images. These files can be found in /home/student/vakken/hwr/toolbox. The contents of the toolbox are described below.

Combining C++ and Python

To call C++ functions from another language, such as Python, you can use Swig. This program generates the code that is needed to link the two programming languages. You only need to specify a small “interface file”. The toolbox contains a simple example for the image library.

The Swig website has a short tutorial with a little more information. For the toolbox, you can just call make to create everything you need.

Word coordinate specification

The toolbox contains a small Python module wordio.py that parses .words-files. These files are XML files that accompany image files and describe where the text lines and words are, and what the transcription of each word is.

An example, describing two text lines:

<?xml version="1.0" encoding = "UTF-8"?>
<Image name="NL_HaNa_H2_7823_0055">
    <TextLine no="1" top="599" bottom="803" left="1235" right="2843" shear="45">
        <Word no="1" top="599" bottom="803" left="1235" right="1536" shear="45" text="Rappt"/>
        <Word no="2" top="623" bottom="803" left="1513" right="1610" shear="45" text="JD"/>
        <Word no="3" top="708" bottom="803" left="1526" right="1624" shear="45" text="10"/>
        <Word no="4" top="649" bottom="782" left="1684" right="1846" shear="45" text="Feb"/>
        <Word no="5" top="688" bottom="803" left="1808" right="1884" shear="45" text="no"/>
        <Word no="6" top="708" bottom="803" left="1865" right="2024" shear="45" text="175,"/>
        <Word no="7" top="708" bottom="803" left="2025" right="2212" shear="45" text="om"/>
        <Word no="8" top="708" bottom="803" left="2213" right="2843" shear="45" text="machtiging"/>
    </TextLine>
    <TextLine no="2" top="781" bottom="929" left="1106" right="2890" shear="45">
        <Word no="1" top="786" bottom="903" left="1106" right="1408" shear="45" text="wijzend"/>
        <Word no="2" top="781" bottom="871" left="1425" right="1518" shear="45" text="te"/>
        <Word no="3" top="792" bottom="868" left="1554" right="1949" shear="45" text="beschikken"/>
        <Word no="4" top="812" bottom="929" left="1930" right="2032" shear="45" text="op"/>
        <Word no="5" top="810" bottom="884" left="2036" right="2166" shear="45" text="een"/>
        <Word no="6" top="808" bottom="903" left="2185" right="2724" shear="45" text="verzoekschrift"/>
        <Word no="7" top="808" bottom="897" left="2750" right="2890" shear="45" text="van"/>
    </TextLine>
</Image>
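
The toolbox module wordio.py is the intended way to read these files. Purely to illustrate the structure of the example above, the sketch below reads the same elements with Python's standard xml.etree.ElementTree module. The function name read_word_zones and the dictionary layout are assumptions made for this illustration; they are not part of the toolbox.

```python
import xml.etree.ElementTree as ET


# Hypothetical helper for illustration; wordio.py is the provided parser.
def read_word_zones(path):
    """Return a list of text lines, each a list of word-zone dictionaries."""
    root = ET.parse(path).getroot()  # the <Image> element
    lines = []
    for text_line in root.findall("TextLine"):
        words = []
        for word in text_line.findall("Word"):
            words.append({
                "no": int(word.get("no")),
                "top": int(word.get("top")),
                "bottom": int(word.get("bottom")),
                "left": int(word.get("left")),
                "right": int(word.get("right")),
                "shear": int(word.get("shear")),
                # None when the text attribute is absent from the file
                "text": word.get("text"),
            })
        lines.append(words)
    return lines
```

Calling read_word_zones on the example file would give two lists, one per text line, with eight and seven word dictionaries respectively.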

The coordinates of the <TextLine> tags indicate text line zones that were determined semi-automatically. The shear attribute gives the shear angle, or slant, of the text in degrees. It is always 45 degrees for this dataset.

Because of this shear, the word zones are interpreted as parallelograms. The attributes of the <Word> tags have the following meaning:

top indicates the Y position of the top of the parallelogram
bottom indicates the Y position of the bottom of the parallelogram
left indicates the X position of the top-left vertex of the parallelogram
right indicates the X position of the top-right vertex of the parallelogram
text is the transcription of the handwritten text in the image
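
As a rough sketch of how these attributes could be turned into the four corners of a word zone: the bottom edge is the top edge shifted horizontally by height × tan(shear). The direction of that shift is not stated above, so the code assumes that the text leans to the right (the bottom edge lies to the left of the top edge) and exposes this as a parameter. The function name word_parallelogram and the lean_right parameter are hypothetical, not part of the toolbox; check the result against an actual image before relying on it.

```python
import math


# Hypothetical helper for illustration; not part of the toolbox.
def word_parallelogram(top, bottom, left, right, shear_degrees, lean_right=True):
    """Return the corners as (x, y) tuples in the order
    top-left, top-right, bottom-right, bottom-left.

    Image coordinates are assumed: y increases downwards.
    """
    height = bottom - top
    offset = height * math.tan(math.radians(shear_degrees))
    if lean_right:
        # ASSUMPTION: right-leaning text, so the bottom edge is shifted left.
        offset = -offset
    return [
        (left, top),
        (right, top),
        (right + offset, bottom),
        (left + offset, bottom),
    ]
```

For the first word of the example (top 599, bottom 803), the height is 204 pixels, so with a shear of 45 degrees the bottom edge is shifted by about 204 pixels relative to the top edge.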

Output of the recognizer

The recognizer program will receive three arguments: the input image, a .words file without the text attribute, and the path of the output file. This means that you don’t have to find the coordinates and shear of the words, but can focus on the classification.

The output of your program should be a new .words file. It should be identical to the input file, except that the text attribute of each word is filled in. Your program will be invoked as follows:

$ recognizer input.ppm input.words /path/to/output.words

where recognizer is your program and $ represents the command line prompt. The file at /path/to/output.words does not exist yet; your program creates it and writes the filled-in XML file to it.
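
A minimal sketch of this input/output flow in Python, using only the standard library, is given below. It is not the required implementation: classify_word is a hypothetical placeholder for your own classifier, and fill_in_words and main are names chosen for this illustration. Note that ElementTree may write the XML declaration and whitespace slightly differently from the input file.

```python
import sys
import xml.etree.ElementTree as ET


def classify_word(image_path, word_attributes):
    """Hypothetical placeholder: replace with your own recognition code."""
    return ""


def fill_in_words(image_path, words_path, output_path, classify=classify_word):
    """Copy the .words file to output_path with a text attribute on every <Word>."""
    tree = ET.parse(words_path)
    for word in tree.getroot().iter("Word"):
        # dict(word.attrib) holds top, bottom, left, right and shear as strings
        word.set("text", classify(image_path, dict(word.attrib)))
    tree.write(output_path, encoding="UTF-8", xml_declaration=True)


def main(argv):
    """Handle: recognizer input.ppm input.words /path/to/output.words"""
    if len(argv) != 4:
        sys.stderr.write(
            "usage: recognizer input.ppm input.words /path/to/output.words\n")
        return 2
    fill_in_words(argv[1], argv[2], argv[3])
    return 0
```

In a script, the last line would be sys.exit(main(sys.argv)), so that a wrong number of arguments results in a non-zero exit status.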

Handing in your recognizer

Hand in your code in a .tar.gz file, marked with your name and version. If there are compilation steps (if you use C++ or C in your program), provide a Makefile that compiles your program with a single make command.

Document your code thoroughly: this will make grading easier and will help you write the final paper.

Getting started

Step 1: Get the toolbox files

Make a directory for the practicals:
mkdir ~/hwr
Copy the introduction files:
cp -r /home/student/vakken/hwr/toolbox ~/hwr

Step 2: Get a sample image

Download a source image here.
The password is on Nestor under the “Information” button.
Be careful to select a .tif file named like NL_HaNa_H2_7823_xxxx.tif, not a file with ‘colorized’ in the name or a file with extension .kdkxml.
Store the image in the temporary storage directory /dev/shm (it is flushed at a reboot). These files are big; keep an eye on disk space usage. Type df -h to see how much space is available.
The sample code only works with raw .pbm/.ppm/.pgm files (simple bitmaps; .pbm is black/white, .ppm is color, .pgm is grey scale).
Convert the downloaded image to raw PPM: convert -compress LZW /dev/shm/NL_HaNa_H2_7823_xxxx.tif /dev/shm/NL_HaNa_H2_7823_xxxx.ppm
Note: the -compress LZW option is misleading; the PPM format has no compression, but the option ensures that the file is stored as raw .ppm (instead of ASCII .ppm). After the conversion you can remove the .tif version of the image.
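
To check whether a converted file really is a raw .ppm, you can look at the first two bytes of the file. The Netpbm formats start with a magic number: P1, P2 and P3 for ASCII .pbm, .pgm and .ppm, and P4, P5 and P6 for their raw counterparts. The helper netpbm_kind below is a small illustrative sketch and is not part of the toolbox.

```python
# Magic numbers defined by the Netpbm formats.
NETPBM_KINDS = {
    b"P1": ("pbm", "ascii"),
    b"P2": ("pgm", "ascii"),
    b"P3": ("ppm", "ascii"),
    b"P4": ("pbm", "raw"),
    b"P5": ("pgm", "raw"),
    b"P6": ("ppm", "raw"),
}


# Hypothetical helper for illustration; not part of the toolbox.
def netpbm_kind(path):
    """Return (format, encoding) for a Netpbm file, or None if the
    file does not start with a known magic number."""
    with open(path, "rb") as f:
        magic = f.read(2)
    return NETPBM_KINDS.get(magic)
```

A correctly converted page image should give ("ppm", "raw"); any other result means the sample code will probably not be able to read the file.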

Step 3: Compile and test the toolbox files

Compile the C++ code:
cd ~/hwr/toolbox
make
Try the cropper example:
python example-crop.py /dev/shm/NL_HaNa_H2_7823_xxxx.ppm /dev/shm/cropped.ppm
It will create an image (cropped.ppm) of a part of the original image.
Look at the connected components example (example_cocos.py). It will not work immediately, because you don’t have a grey-scale image (.pgm file) yet, but it is provided so you can learn how to work with the connected components part of the toolbox.

Step 4: Get the .words files

The .words files can be found in /home/student/vakken/hwr/data/words.

Toolbox contents

File	Description
`pamImage.cpp`	Read and write `.pbm` / `.pgm` / `.ppm` files (you don’t need to look inside this file)
`pamImage.h`	Header for `pamImage.cpp`; look here to see what you can do with a PamImage object.
`pamImage.i`	Interface between C++ and Python for `pamImage`. Swig uses this file to create a Python wrapper around the C++ code
`cocos_arnold/`	C++ routines for fast connected components labeling by Arnold Meijster (no need to look inside).
`cocoslib.cpp`	Procedures to compute connected components in document images (uses `cocos_arnold/`).
`cocoslib.h`	Header for `cocoslib.cpp`.
`cocoslib.i`	Interface between C++ and Python for `cocoslib`
`example_cocos.py`	Provides a quickstart for using connected components.
`croplib.cpp`	Crop `.pbm` / `.pgm` / `.ppm` images
`croplib.h`	Header for `croplib.cpp`
`croplib.i`	Interface between C++ and Python for `croplib`
`example-crop.py`	Shows how to use `croplib`. Crops an image.
`word.py`	Class for word zones and transcription
`wordio.py`	Reads and writes `.words` files
`Makefile`	Instructions for `make` to compile the code.

Last modified: April 22, 2014, by Jean-Paul van Oosten
Part of the HWR course