Description of .words files

Under construction!

A .words file is an XML file that accompanies a .TIF file. It describes where the text lines and words are located and what the transcription of each word is. The format is illustrated by the following example, which describes two text lines:

<?xml version="1.0" encoding = "UTF-8"?>
<Image name="NL_HaNa_H2_7823_0055">
    <TextLine no="1" top="599" bottom="803" left="1235" right="2843" shear="45">
        <Word no="1" top="599" bottom="803" left="1235" right="1536" shear="45" text="Rappt"/>
        <Word no="2" top="623" bottom="803" left="1513" right="1610" shear="45" text="JD"/>
        <Word no="3" top="708" bottom="803" left="1526" right="1624" shear="45" text="10"/>
        <Word no="4" top="649" bottom="782" left="1684" right="1846" shear="45" text="Feb"/>
        <Word no="5" top="688" bottom="803" left="1808" right="1884" shear="45" text="no"/>
        <Word no="6" top="708" bottom="803" left="1865" right="2024" shear="45" text="175,"/>
        <Word no="7" top="708" bottom="803" left="2025" right="2212" shear="45" text="om"/>
        <Word no="8" top="708" bottom="803" left="2213" right="2843" shear="45" text="machtiging"/>
    </TextLine>
    <TextLine no="2" top="781" bottom="929" left="1106" right="2890" shear="45">
        <Word no="1" top="786" bottom="903" left="1106" right="1408" shear="45" text="wijzend"/>
        <Word no="2" top="781" bottom="871" left="1425" right="1518" shear="45" text="te"/>
        <Word no="3" top="792" bottom="868" left="1554" right="1949" shear="45" text="beschikken"/>
        <Word no="4" top="812" bottom="929" left="1930" right="2032" shear="45" text="op"/>
        <Word no="5" top="810" bottom="884" left="2036" right="2166" shear="45" text="een"/>
        <Word no="6" top="808" bottom="903" left="2185" right="2724" shear="45" text="verzoekschrift"/>
        <Word no="7" top="808" bottom="897" left="2750" right="2890" shear="45" text="van"/>
    </TextLine>
</Image>

The coordinates of the <TextLine> tags indicate text line zones that were determined semi-automatically. The shear attribute gives the shear angle, or slant, of the text. It is always 45 (degrees).

Because of this shear, the word zones are interpreted as parallelograms. The attributes of the <Word> tags have the following meaning:

The coordinates are visualised in the picture shown to the right.
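As an illustration of the parallelogram interpretation, the sketch below computes four corner points from the attributes of one zone. It rests on an assumption that this description does not confirm: that left and right are measured along the bottom edge and that the top edge is shifted to the right by height * tan(shear). The function name parallelogram_corners is hypothetical. Check the result against the picture before relying on it.

```python
import math


def parallelogram_corners(top, bottom, left, right, shear_deg=45.0):
    """Return corners as (x, y): bottom-left, bottom-right, top-right, top-left.

    ASSUMPTION (not confirmed by the format description): left and right
    are measured along the bottom edge, and the top edge is shifted to the
    right by height * tan(shear).
    """
    height = bottom - top  # image y grows downward, so bottom > top
    shift = height * math.tan(math.radians(shear_deg))
    return [
        (left, bottom),
        (right, bottom),
        (right + shift, top),
        (left + shift, top),
    ]


# Word no="2" of the second text line in the example above.
corners = parallelogram_corners(top=781, bottom=871, left=1425, right=1518)
for x, y in corners:
    print(round(x), y)
```

With a shear of 45 degrees the horizontal shift equals the zone height (up to floating-point rounding), which is why the values are rounded before printing.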

These files can be read and written using wordio.py. A .words file can be inspected and modified using word-annotation.py.
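The interface of wordio.py is not described here, so the following is an independent, minimal sketch of reading the format with Python's standard xml.etree.ElementTree module. The function names parse_zone and parse_words and the dictionary layout are illustrative choices, not part of wordio.py.

```python
import xml.etree.ElementTree as ET

# Integer-valued attributes shared by <TextLine> and <Word>.
INT_ATTRS = ("no", "top", "bottom", "left", "right", "shear")


def parse_zone(elem):
    """Convert one <TextLine> or <Word> element into a plain dict."""
    zone = {name: int(elem.get(name)) for name in INT_ATTRS}
    if elem.tag == "Word":
        zone["text"] = elem.get("text")
    return zone


def parse_words(xml_text):
    """Return (image_name, lines); each line dict carries a 'words' list."""
    root = ET.fromstring(xml_text)
    lines = []
    for line_elem in root.findall("TextLine"):
        line = parse_zone(line_elem)
        # The list is empty when the file contains no <Word> tags.
        line["words"] = [parse_zone(w) for w in line_elem.findall("Word")]
        lines.append(line)
    return root.get("name"), lines


# Shortened copy of the example above.
SAMPLE = """<Image name="NL_HaNa_H2_7823_0055">
    <TextLine no="2" top="781" bottom="929" left="1106" right="2890" shear="45">
        <Word no="1" top="786" bottom="903" left="1106" right="1408" shear="45" text="wijzend"/>
        <Word no="2" top="781" bottom="871" left="1425" right="1518" shear="45" text="te"/>
    </TextLine>
</Image>"""

name, lines = parse_words(SAMPLE)
print(name)
for line in lines:
    print(line["no"], [w["text"] for w in line["words"]])
```

To read from disk instead of a string, ET.parse(path).getroot() yields the same root element.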

During testing, the <Word> tags are not available to your program! The .words files that your program will receive as input contain only the <TextLine> tags.
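To try a program on input that resembles the test condition, one option is to remove the <Word> tags from an annotated file. The sketch below does this with the standard xml.etree.ElementTree module. The helper name strip_words is hypothetical, and the exact layout of the real test files (whitespace, XML declaration) is an assumption.

```python
import xml.etree.ElementTree as ET


def strip_words(xml_text):
    """Return the XML with every <Word> child removed from each <TextLine>."""
    root = ET.fromstring(xml_text)
    for line in root.findall("TextLine"):
        # findall returns a list, so removing while looping is safe.
        for word in line.findall("Word"):
            line.remove(word)
        line.text = None  # drop the whitespace left behind by the removed children
    return ET.tostring(root, encoding="unicode")


# Shortened copy of the example above.
SAMPLE = """<Image name="NL_HaNa_H2_7823_0055">
    <TextLine no="2" top="781" bottom="929" left="1106" right="2890" shear="45">
        <Word no="1" top="786" bottom="903" left="1106" right="1408" shear="45" text="wijzend"/>
        <Word no="2" top="781" bottom="871" left="1425" right="1518" shear="45" text="te"/>
    </TextLine>
</Image>"""

stripped = strip_words(SAMPLE)
print(stripped)
```

The <TextLine> attributes are left untouched, so line zones can still be read from the stripped file.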



Last modified: 29 April 2009 by Axel Brink.