Under construction!
A .words file is an XML file that accompanies a .TIF file. It describes where the text lines and words are, and gives the transcription of each word. The format is explained by the following example, which describes two text lines:
<?xml version="1.0" encoding="UTF-8"?>
<Image name="NL_HaNa_H2_7823_0055">
  <TextLine no="1" top="599" bottom="803" left="1235" right="2843" shear="45">
    <Word no="1" top="599" bottom="803" left="1235" right="1536" shear="45" text="Rappt"/>
    <Word no="2" top="623" bottom="803" left="1513" right="1610" shear="45" text="JD"/>
    <Word no="3" top="708" bottom="803" left="1526" right="1624" shear="45" text="10"/>
    <Word no="4" top="649" bottom="782" left="1684" right="1846" shear="45" text="Feb"/>
    <Word no="5" top="688" bottom="803" left="1808" right="1884" shear="45" text="no"/>
    <Word no="6" top="708" bottom="803" left="1865" right="2024" shear="45" text="175,"/>
    <Word no="7" top="708" bottom="803" left="2025" right="2212" shear="45" text="om"/>
    <Word no="8" top="708" bottom="803" left="2213" right="2843" shear="45" text="machtiging"/>
  </TextLine>
  <TextLine no="2" top="781" bottom="929" left="1106" right="2890" shear="45">
    <Word no="1" top="786" bottom="903" left="1106" right="1408" shear="45" text="wijzend"/>
    <Word no="2" top="781" bottom="871" left="1425" right="1518" shear="45" text="te"/>
    <Word no="3" top="792" bottom="868" left="1554" right="1949" shear="45" text="beschikken"/>
    <Word no="4" top="812" bottom="929" left="1930" right="2032" shear="45" text="op"/>
    <Word no="5" top="810" bottom="884" left="2036" right="2166" shear="45" text="een"/>
    <Word no="6" top="808" bottom="903" left="2185" right="2724" shear="45" text="verzoekschrift"/>
    <Word no="7" top="808" bottom="897" left="2750" right="2890" shear="45" text="van"/>
  </TextLine>
</Image>
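As a rough illustration of this structure, the sketch below reads a shortened copy of the example using Python's standard xml.etree.ElementTree module. It is not the provided wordio.py. The function name read_words and the layout of the returned dictionaries are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example above (one text line, two words).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<Image name="NL_HaNa_H2_7823_0055">
  <TextLine no="1" top="599" bottom="803" left="1235" right="2843" shear="45">
    <Word no="1" top="599" bottom="803" left="1235" right="1536" shear="45" text="Rappt"/>
    <Word no="2" top="623" bottom="803" left="1513" right="1610" shear="45" text="JD"/>
  </TextLine>
</Image>"""

# Attributes that hold integer values on both <TextLine> and <Word>.
NUMERIC = ("no", "top", "bottom", "left", "right", "shear")


def read_words(xml_bytes):
    """Return (image name, list of text lines) parsed from .words content.

    Each text line is a dict with its numeric attributes and a "words" list.
    The "words" list is empty when the file has no <Word> tags.
    """
    root = ET.fromstring(xml_bytes)
    lines = []
    for line_el in root.findall("TextLine"):
        line = {key: int(line_el.get(key)) for key in NUMERIC}
        line["words"] = []
        for word_el in line_el.findall("Word"):
            word = {key: int(word_el.get(key)) for key in NUMERIC}
            word["text"] = word_el.get("text")
            line["words"].append(word)
        lines.append(line)
    return root.get("name"), lines


name, lines = read_words(SAMPLE)
for line in lines:
    print(name, "line", line["no"], [w["text"] for w in line["words"]])
```

To read a file from disk, the same function can be given the result of open(path, "rb").read().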
The coordinates of the <TextLine> tags indicate text line zones that were determined semi-automatically. The shear attribute gives the shear angle, or slant, of the text. It is always 45 (degrees).
Because of this shear, the word zones are interpreted as parallelograms. The attributes of the <Word> tags have the following meaning:
top
indicates the y position of the top of the parallelogram
bottom
indicates the y position of the bottom of the parallelogram
left
indicates the x position of the top-left vertex of the parallelogram
right
indicates the x position of the top-right vertex of the parallelogram
text
is the transcription of the handwritten text in the image

Reading and writing of these files can be done using wordio.py. A .words file can be inspected and modified using word-annotation.py.
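The parallelogram interpretation of a word zone can be made concrete with a small sketch. The text above only states that left and right are the x positions of the top vertices. The direction of the slant used here (bottom edge shifted to the left by height × tan(shear)) is an assumption inferred from the example values: with it, neighbouring words line up. For instance, at y=688 the right edge of "Feb" falls at x=1807, next to the left value 1808 of "no". The function name word_parallelogram is hypothetical.

```python
import math


def word_parallelogram(top, bottom, left, right, shear=45):
    """Return the vertices as (x, y): top-left, top-right, bottom-right, bottom-left.

    Assumption (not stated in the format description): the bottom edge is
    shifted to the left of the top edge by height * tan(shear).
    """
    # Horizontal offset between top and bottom edges, rounded to whole pixels.
    shift = round((bottom - top) * math.tan(math.radians(shear)))
    return [
        (left, top),
        (right, top),
        (right - shift, bottom),
        (left - shift, bottom),
    ]


# Word 4 ("Feb") of the first text line in the example.
corners = word_parallelogram(top=649, bottom=782, left=1684, right=1846)
print(corners)
# prints [(1684, 649), (1846, 649), (1713, 782), (1551, 782)]
```

With a shear of 45 degrees the horizontal shift equals the height of the zone, so in this example the bottom edge sits 133 pixels to the left of the top edge.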
During testing, the <Word> tags are not available for your program! The .words files that your program will get as input only contain the <TextLine> tags.
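To try a program on input that resembles the test data, the <Word> tags can be removed from an annotated file. The following is a minimal sketch using only Python's standard library. The function name strip_words is hypothetical, and the exact whitespace of the real test files is not known, so the output should be treated as an approximation.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example file (two text lines, one word each).
ANNOTATED = b"""<?xml version="1.0" encoding="UTF-8"?>
<Image name="NL_HaNa_H2_7823_0055">
  <TextLine no="1" top="599" bottom="803" left="1235" right="2843" shear="45">
    <Word no="1" top="599" bottom="803" left="1235" right="1536" shear="45" text="Rappt"/>
  </TextLine>
  <TextLine no="2" top="781" bottom="929" left="1106" right="2890" shear="45">
    <Word no="1" top="786" bottom="903" left="1106" right="1408" shear="45" text="wijzend"/>
  </TextLine>
</Image>"""


def strip_words(xml_bytes):
    """Return the XML as a string with every <Word> removed from each <TextLine>."""
    root = ET.fromstring(xml_bytes)
    for line_el in root.findall("TextLine"):
        # findall returns a list, so removing children while looping is safe.
        for word_el in line_el.findall("Word"):
            line_el.remove(word_el)
    return ET.tostring(root, encoding="unicode")


stripped = strip_words(ANNOTATED)
print(stripped)
```

The <TextLine> attributes are left untouched, so a program can be run on the stripped version and its output compared against the original annotations.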
Last modified: 29 April 2009 by Axel Brink.