L.R.B. Schomaker, August 2005.
|Background of the pilot -- The goal is to test Information-Retrieval (IR) approaches for handwritten document retrieval, in which the bag-of-words concept from regular IR is replaced by a "bag of glyphs" in handwriting. Only a minimal number of assumptions about language and content are made: in fact, we only assume a script with horizontal lines of text.|
|Word overlap (Jaccard)||Feature vector||Comment||Hit List button|
|2%||FULL NOISE||minimum attainable performance (on random feature values)||Noise-based hit list|
|20%||WORD FRAGMENT SHAPES||test on most successful feature thus far||Shape-based hit list|
|50%||BAG OF WORDS||optimum attainable performance (if 'perfect' word labels were known)||Human label-based hit list|
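The word-overlap figures in the table are Jaccard scores. A minimal sketch of how such a score can be computed for two transcribed lines is given here. It assumes whitespace tokenization and case folding, neither of which is specified on this page, and the function name and example lines are made up for illustration.

```python
def jaccard_overlap(line_a: str, line_b: str) -> float:
    """Jaccard index of the word sets of two transcribed text lines.

    Assumption: words are split on whitespace and compared case-insensitively.
    """
    words_a = set(line_a.lower().split())
    words_b = set(line_b.lower().split())
    union = words_a | words_b
    if not union:
        # Two empty lines share no words; define the overlap as 0.
        return 0.0
    return len(words_a & words_b) / len(union)


# Hypothetical query line and top hit (not taken from the actual document).
query = "vergunning verleend te Amsterdam"
top_hit = "vergunning te Rotterdam"
print(jaccard_overlap(query, top_hit))  # 2 shared words out of 5 distinct words
```

With this definition a score of 100% is reached only when two lines contain exactly the same set of words.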
1. These results are preliminary. The tests are based on a very large number of experiments, which gives some confidence, but more rigorous testing is still needed.
2. The 20% word overlap for shape matching is promising, but not sufficient. Some queries give hopeful results, others fail. Try to find vergunning, or Amsterdam. Note that no linguistic or probabilistic information whatsoever has been used, so there is much room for improvement.
3. A performance of 100% would require each top hit to be identical to the query. The linguistic word statistics of this particular document evidently lead to a much more moderate ceiling: a word overlap between lines of 50%. In this respect, a 20% overlap between query line and top hit is a reasonable performance for a system that uses no advance knowledge other than the assumption that the handwritten text consists of horizontal patterns, vertically organized in a set of lines.
4. No human or machine effort has been spent on the exact association of a word shape and its ASCII representation. For the performance measurement, a crude labeling was performed by loosely typing the line of text corresponding to a line of handwriting.
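The relation between the noise floor and the label-based ceiling in the notes above can be illustrated with a small sketch. It assumes that performance is the mean Jaccard overlap between each query line and its top hit, and that the label-based ceiling is found by picking, for each query, the other line with the highest word overlap. Both are readings of the table, not a description of the actual experiment code, and all function names and toy lines are hypothetical.

```python
import random
from typing import Callable, Sequence


def jaccard_overlap(line_a: str, line_b: str) -> float:
    """Jaccard index of the whitespace-split, lower-cased word sets."""
    words_a, words_b = set(line_a.lower().split()), set(line_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def mean_top_hit_overlap(labels: Sequence[str],
                         top_hit: Callable[[int], int]) -> float:
    """Average word overlap between every query line and its top hit."""
    scores = [jaccard_overlap(labels[q], labels[top_hit(q)])
              for q in range(len(labels))]
    return sum(scores) / len(scores)


def label_based_top_hit(labels: Sequence[str]) -> Callable[[int], int]:
    """Ceiling: pick the other line whose typed label overlaps most."""
    def pick(q: int) -> int:
        others = [i for i in range(len(labels)) if i != q]
        return max(others, key=lambda i: jaccard_overlap(labels[q], labels[i]))
    return pick


def random_top_hit(labels: Sequence[str], seed: int = 0) -> Callable[[int], int]:
    """Floor: pick an arbitrary other line, ignoring its content."""
    rng = random.Random(seed)
    def pick(q: int) -> int:
        return rng.choice([i for i in range(len(labels)) if i != q])
    return pick


# Toy line labels, invented for illustration only.
toy_labels = [
    "vergunning verleend te amsterdam",
    "vergunning te amsterdam",
    "brief aan den minister",
    "brief van den minister",
]
ceiling = mean_top_hit_overlap(toy_labels, label_based_top_hit(toy_labels))
floor = mean_top_hit_overlap(toy_labels, random_top_hit(toy_labels))
print(f"label-based ceiling: {ceiling:.3f}, random floor: {floor:.3f}")
```

A shape-based retrieval method would supply its own `top_hit` function, and its score would be expected to fall between these two values.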
See also: Annotation harvest of NA KdK 1903, May 2006