NWO/CATCH programme - project SCRATCH
SCRipt Analysis Tools for the Cultural Heritage
The CATCH programme
NWO/CATCH (Continuous Access To Cultural Heritage) is a Dutch national research programme. Its goal is to develop methods and technologies for managers of document collections and archives who want to make Dutch cultural heritage more accessible on digital platforms such as public websites. The SCRATCH project (SCRipt Analysis Tools for the Cultural Heritage) focuses on methods for information retrieval in large collections of handwritten-document images.
The automatic recognition of connected-cursive script in an arbitrary script style remains a tremendous challenge for science and technology. For handwritten historical collections, such as those conserved by the Dutch Nationaal Archief, the challenge is more tantalizing still. Even experienced human readers have great difficulty reading old manuscripts. There are, however, many ways in which the computer can be put to good use in this application domain. For one thing, the massive number of images of handwritten pages allows modern statistical techniques to be exploited. Data-mining techniques such as clustering can uncover regularities in handwritten shapes that may act as a bridge to document retrieval and text analysis. In this project we will not aim for a veridical left-to-right transcription of handwritten documents, as this would be an unrealistic target. Instead, by the end of the project we will have delivered tools for keyword-based search in handwritten archives, akin to existing flat-text search methods ("Googling").
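The idea of using clustered shapes as a bridge to retrieval can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the project's actual method: the two-dimensional feature vectors stand in for whatever shape descriptors are extracted from word images, the page names are invented, and plain k-means with a deterministic start is used only because it is the simplest clustering technique to show. The helper names `kmeans`, `build_index` and `search` are hypothetical.

```python
from collections import defaultdict
from math import dist


def kmeans(points, k, iterations=20):
    """Plain k-means. Centroids start at the first k points (deterministic)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, labels


def build_index(pages, labels):
    """Map each shape cluster to the set of pages where it occurs."""
    index = defaultdict(set)
    for page, lab in zip(pages, labels):
        index[lab].add(page)
    return index


def search(query, centroids, index):
    """Return the pages whose shapes fall in the cluster nearest the query."""
    nearest = min(range(len(centroids)),
                  key=lambda c: dist(query, centroids[c]))
    return sorted(index[nearest])


# Toy data (hypothetical): one feature vector per word image,
# and the page on which that word image was found.
features = [(0.1, 0.2), (0.9, 0.8), (0.15, 0.25), (0.85, 0.9), (0.12, 0.18)]
pages = ["page_001", "page_001", "page_002", "page_003", "page_004"]

centroids, labels = kmeans(features, k=2)
index = build_index(pages, labels)

# Query by example: the feature vector of a word image the user selected.
hits = search((0.11, 0.21), centroids, index)
print(hits)
```

In this sketch a query is itself a shape (query by example), so no transcription is needed. Attaching a text label to each cluster would turn the same index into a typed-keyword search, but how such labels are obtained is outside what this example shows.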
The research in SCRATCH can be subdivided into:
(a) the study of character shapes, their statistical characteristics and their classification (pattern recognition);
(b) the study of the textual and linguistic regularities of document content in a given homogeneous archive (statistical computational linguistics).
One of the exemplary archives that will be used is the Kabinet van de[n] Koning[in]. This administrative archive was handwritten by a limited number of office clerks over more than two centuries. Administrative and political decisions are meticulously and neatly written down in individual articles, accompanied by a cross-indexing structure. If our methods for handwritten text retrieval yield usable results on this material, we will attempt to generalize the methodology to other, possibly more complicated collections.
More information can be obtained from: