The Monk system is developed at the University of
Groningen by a research group at the Artificial Intelligence institute ALICE,
under the supervision of prof. dr. Lambert Schomaker.
In cooperation with the Dutch National Archive, we have created methods
for accessing historical archive collections that are difficult to
process with traditional OCR methods, for example because of their historical
character types or because the material is handwritten.
The system consists of two major components: (1) a setup for the
storage and web-based annotation of scanned page images and parts thereof;
(2) a set of (handwriting and text) recognition algorithms as well
as retrieval and search methods.
Try our [Monk search engine]
An example hit list is here.
Indexed books and overview
- From the Archief van het Kabinet der Koningin, 1898-1945,
Periode 1814-1988 (totale omvang 598 meter; 4660 inventarisnummers)
we started with the book: Indices op het verbaal, boek 7823, 1903, bladz. 1-1040
- From Alg. Rijksarch. 1e Afd. Admiraliteitscolleges
Nr. 1177, Journaal, beginnend 7 Sept. 1779, Capiteyn J.J. van Hoeij,
[... s'Lands Schip van Oorloge Rotterdam, 50 stukken canon en
300 manschappen ...] (272 pages)
- To the internal project site with demos,
annotation and search tools for handwritten or historical manuscripts.
Please contact me to obtain access to this site.
Volunteers who can read Dutch or Latin are invited to participate in transcription and
- There is also a static index
of Monk's KdK collection.
- In the summer of 2010, we ingested two different 15th-century texts, one from Belgium
(Schepenbank, Stadsarchief Leuven) and one from the archives of the province of Gelderland, with good results.
This broadens the applicability of our algorithms, without any change to the basic methods.
For each new collection, some customized layout analysis is needed, after which
the scans enter into a generic pipeline and are exposed to the web.
- As of April 2013, Monk contains 32 books/documents and 20,000 page scans of handwritten manuscripts.
The machine-learning system has learned over 16,000 word-image classes for (lexical) words, terms, abbreviations
and word contractions. The total number of harvested and human-confirmed word labels is 370,000.
These numbers are continuously growing at a rate that depends on the
available computing resources: Monk is a 24/7 machine-learning effort.
- Currently, Monk runs on the test bed of the Target project at the High-Performance Computing center of the University of Groningen.
The disk capacity, available under a single mount point, is more than 2 petabytes, of which Monk currently uses less than 10%.
This web site is made possible by grants from the Nederlandse Organisatie
voor Wetenschappelijk Onderzoek (NWO) (project NWO/Catch "Scratch" and
project NWO/EW "Learning to learn"), by project SNN/Target, Groningen, by the Nationaal
Archief, The Hague (Mr. H. van Schie), and by continuous support from the University of Groningen.
The following people have contributed to Monk since 2004:
Lambert Schomaker, Henny van Schie, Marius Bulacu, Tijn van der Zant,
Sveta Zinger, Fons Laan, Jean-Paul van Oosten, Michiel Holtkamp.
The following annotators have produced substantial amounts of
line transcriptions, word and character labels:
J.A. Schomaker, H. van Schie, F. Laan, L. Schomaker, S. Zinger
and several anonymous volunteers.
schomaker with-affiliation ai.rug.nl