NWO/CATCH, Scratch project
Pilot experiment ("nulmeting"), results on:
Sparse-knowledge search methods for handwritten collections

L.R.B. Schomaker, August 2005.

Background of the pilot -- The goal is to set up a test of Information-Retrieval (IR) approaches to handwritten document retrieval, in which the bag-of-words concept from regular IR is replaced by a "bag-of-glyphs" in handwriting. Assumptions on language and content are kept to a minimum: in fact, we only assume a script with horizontal lines of text.

Collection: Kabinet van de Koningin, Nationaal Archief, 50 pages.
Preprocessing: binarization and segmentation into 1100 lines.
Performance measure: Jaccard word overlap on the manual line-label annotation, using a bag-of-words representation (Salton), comparing each query line with its nearest neighbour in the space of word occurrences.
Three experiments:
- 1. Reference minimum performance (Jaccard word match) in case of a feature vector with uniform noise
- 2. The preliminary Scratch method under evaluation: matching based on word-fragment shapes
- 3. Reference 'optimum' performance (Jaccard word match) in case of bag-of-words matching with manually entered line text as labels
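
As an illustration of the performance measure, the Jaccard word overlap between two typed line labels can be sketched as follows. This is a minimal sketch and not the code used in the pilot: the function name, the lowercasing, the whitespace tokenization and the handling of empty lines are all assumptions.

```python
def jaccard_overlap(line_a: str, line_b: str) -> float:
    """Jaccard overlap |A & B| / |A | B| between the word sets of two text lines."""
    # Assumption: words are whitespace-separated and compared case-insensitively.
    words_a = set(line_a.lower().split())
    words_b = set(line_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0  # assumption: two empty lines are scored as no overlap
    return len(words_a & words_b) / len(union)


# Hypothetical line labels, invented for illustration only.
query = "de vergunning te Amsterdam"
top_hit = "vergunning voor Amsterdam"
overlap = jaccard_overlap(query, top_hit)  # 2 shared words out of 5 distinct words
```

The measure is symmetric and ignores word order and word frequency, which fits a crude, loosely typed line annotation.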

Performance for these three Feature Groups

Word overlap (Jaccard)	Feature vector	Comment	Hit List button
2%	FULL NOISE	minimum attainable performance (on random feature values)	Noise-based hit list
20%	WORD FRAGMENT SHAPES	test on most successful feature thus far	Shape-based hit list
50%	BAG OF WORDS	optimum attainable performance (if 'perfect' word labels were known)	Human label-based hit list

Notes

1. These results are preliminary. The tests are based on a very large number of experiments, which gives some confidence, but more rigorous testing is still needed.

2. The 20% word overlap for shape matching is promising, but not sufficient. Some queries give hopeful results, others fail. Try to find "vergunning" or "Amsterdam". Note that no linguistic or probabilistic information whatsoever has been used. There is much room for improvement.

3. A 100% performance would require that each top hit is identical to the query. The linguistic word statistics of this particular document evidently lead to a much more moderate attainable word overlap between lines, namely 50%. In this respect a 20% overlap between query line and top hit is a reasonable performance for a system which uses no other advance knowledge than the assumption that the handwritten text consists of horizontal patterns, vertically organized in a set of lines.

4. No human or machine effort has been spent on the exact association of a word shape and its ASCII representation. For the performance measurement, a crude labeling was performed by loosely typing the line of text corresponding to a line of handwriting.
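
The reasoning in note 3 can be made concrete with a small sketch that averages the Jaccard overlap between each query line and its top hit, once for the label-based nearest neighbour (the attainable ceiling) and once for a randomly chosen hit. All names and the toy labels below are hypothetical. The random pick is only a stand-in for the noise-feature baseline, not the procedure used in the pilot.

```python
import random


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_top_hit_overlap(labels, top_hit):
    """Average overlap between each query line and the line returned as its top hit.

    labels:  list of typed line transcriptions
    top_hit: function mapping a query index to the index of the retrieved line
    """
    bags = [set(text.lower().split()) for text in labels]
    scores = [jaccard(bags[q], bags[top_hit(q)]) for q in range(len(bags))]
    return sum(scores) / len(scores)


def best_label_hit(labels):
    """Top hit = the other line with the highest word overlap (the ceiling)."""
    bags = [set(text.lower().split()) for text in labels]

    def hit(q):
        others = (i for i in range(len(bags)) if i != q)
        return max(others, key=lambda i: jaccard(bags[q], bags[i]))

    return hit


def random_hit(n, seed=0):
    """Top hit = a random other line (assumed stand-in for a noise feature vector)."""
    rng = random.Random(seed)

    def hit(q):
        i = rng.randrange(n - 1)
        return i if i < q else i + 1  # skip the query itself

    return hit


# Toy labels, invented for illustration; not taken from the collection.
labels = [
    "vergunning verleend te amsterdam",
    "vergunning geweigerd te amsterdam",
    "besluit van de koningin",
    "besluit van de minister",
    "brief aan de minister",
]
ceiling = mean_top_hit_overlap(labels, best_label_hit(labels))
floor = mean_top_hit_overlap(labels, random_hit(len(labels)))
```

Because no two toy lines are identical, the ceiling stays below 100% even with perfect labels, which is the effect described in note 3. A shape-based method would supply its own top-hit function and be scored with the same mean_top_hit_overlap.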

Nice examples of shape-based queries

(not all matches are equally impressive).

Read the: Annotation harvest of NA KdK 1903, May 2006
