Monk

The Monk system is developed at the University of Groningen by a research group at the Artificial Intelligence institute ALICE, under the supervision of Prof. Dr. Lambert Schomaker. In cooperation with the Dutch National Archive, we have created methods for accessing historical archive collections that are difficult to process with traditional OCR methods, for example because of their historical character types or because the material is handwritten. The system consists of two major components: (1) a setup for the storage and web-based annotation of scanned page images and parts thereof; (2) a set of handwriting and text recognition algorithms, together with retrieval and search methods.

  • Try our [Monk search engine]

    An example hit list is here.

    Indexed books and overview

    • From the Archief van het Kabinet der Koningin, 1898-1945, Periode 1814-1988 (total extent 598 meters; 4660 inventory numbers)
      we started with the book: Indices op het verbaal, boek 7823, 1903, bladz. 1-1040

    • From Alg. Rijksarch. 1e Afd. Admiraliteitscolleges
      Nr. 1177, Journaal, beginnend 7 Sept. 1779, Capiteyn J.J. van Hoeij, [... s'Lands Schip van Oorloge Rotterdam, 50 stukken canon en 300 manschappen ...] (272 pages)

    • To the internal project site with demos, annotation, and search tools for handwritten or historical manuscripts. Please contact me to obtain access to this site. Volunteers who can read Dutch or Latin are invited to participate in transcription and annotation.

    • There is also a static index of Monk's KdK collection.

    • In the summer of 2010, we ingested two different 15th-century texts, one from Belgium (Schepenbank, Stadsarchief Leuven) and one from the archives of the province of Gelderland, with good success. This broadens the applicability of our algorithms without any change to the basic methods. Each new collection needs some customized layout analysis, after which the scans enter a generic pipeline and are exposed to the web.

    • As of April 2013, Monk contains 32 books/documents and 20,000 page scans of handwritten manuscripts. The machine-learning system has learned over 16,000 word-image classes for (lexical) words, terms, abbreviations and word contractions. The total number of harvested and human-confirmed word labels is 370,000. These numbers are continuously growing at a rate that depends on the available computing resources: Monk is a 24/7 machine-learning effort.

    • Currently, Monk runs on the test bed of the Target project at the High-Performance Computing center of the University of Groningen. The disk storage, available as a single mount point, is more than 2 petabytes, of which Monk is currently using less than 10%.

    Acknowledgements

    This web site is made possible by grants from the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO) (project NWO/Catch "Scratch" and project NWO/EW "Learning to learn") and from project SNN/Target, Groningen, by the Nationaal Archief, The Hague (Mr. H. van Schie), and by continuous support from the University of Groningen.

    People

    The following people have contributed to Monk since 2004: Lambert Schomaker, Henny van Schie, Marius Bulacu, Tijn van der Zant, Sveta Zinger, Fons Laan, Jean-Paul van Oosten, Michiel Holtkamp.

    The following annotators have produced substantial amounts of line transcriptions, word and character labels: J.A. Schomaker, H. van Schie, F. Laan, L. Schomaker, S. Zinger and several anonymous volunteers.


    schomaker with-affiliation ai.rug.nl