The Monk system is developed at the University of
Groningen by a research group at the Artificial Intelligence institute ALICE,
under the supervision of Prof. Dr. Lambert Schomaker.
In cooperation with the Dutch National Archive, we have created methods
for accessing historical archive collections that are difficult to
process with traditional OCR methods, for example because of their historical
character types or because the material is handwritten.
The system consists of two major components: (1) a setup for the
storage and web-based annotation of scanned page images and parts thereof;
(2) a set of (handwriting and text) recognition algorithms as well
as retrieval and search methods.
Try our [Monk search engine]
An example hit list is here.
Indexed books and overview
- From the Archief van het Kabinet der Koningin, 1898-1945,
Periode 1814-1988 (total extent 598 metres; 4,660 inventory numbers),
we started with the book: Indices op het verbaal, boek 7823, 1903, pp. 1-1040
- From Alg. Rijksarch. 1e Afd. Admiraliteitscolleges
Nr. 1177, Journaal, beginnend 7 Sept. 1779, Capiteyn J.J. van Hoeij,
[... s'Lands Schip van Oorloge Rotterdam, 50 stukken canon en
300 manschappen ...] (272 pages)
- To the internal project site with demos and with
annotation and search tools for handwritten or historical manuscripts.
Please contact me to obtain access to this site.
Volunteers who can read Dutch or Latin are invited to participate in transcription and
annotation.
- There is also a static index
of Monk's collections.
- In the summer of 2009, the Monk system went into autonomous training mode.
A queuing system addresses remote HPC resources and starts
re-learning tasks on the basis of user activity, largely autonomously
and around the clock ('24/7' mode).
- In the summer of 2010, we ingested two different 15th-century texts, one from Belgium
(Schepenbank, Stadsarchief Leuven) and one from the archives of the province of Gelderland, with good success.
This broadens the applicability of our algorithms, without any change to the basic methods.
For each new collection, some customized layout analysis is needed, after which
the scans enter into a generic pipeline and are exposed to the web.
- As of April 2013, Monk contains 32 books/documents and 20,000 page scans of handwritten manuscripts.
The machine-learning system has learned over 16,000 word-image classes for (lexical) words, terms, abbreviations
and word contractions. The total number of harvested and human-confirmed word labels is 370,000.
These numbers are continuously growing at a rate that depends on the
available computing resources: Monk is a 24/7 machine-learning effort.
- Currently, Monk runs on the test bed of the Target project at the High-Performance Computing center of the University of Groningen.
The disk size, 'single mount point', is more than 2 petabytes, of which Monk is currently using less than 10%.
- In May 2014, Monk started processing Chinese characters, handwritten and wood-block printed, from the Harvard-Yenching collection for Grace
Fong in the Digging into Data project 'Global Currents' with Andrew Piper et al. By the end of 2014, Monk was processing 300 Chinese
manuscripts.
- In December 2015, Arabic machine-printed text was provided by Maxim Romanov. Fifty .pdf documents of typically 500-800 pages are now being ingested.
After two weeks, the lexicon contains more than 320 words with 15k instances, and it is growing quickly.
- In April 2016, the system started to work on printed hieroglyphs, mixed with
French machine-printed text.
- In May 2016, we started to process war diaries for the NIOD, The Netherlands, in a pilot project.
The text is in connected-cursive Western script, mostly Dutch, in 1940s style, by a single writer.
- In August 2016, Monk started to process Hieratic script on papyrus, for Leiden University.
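The autonomous training mode mentioned in the overview (a queue that starts re-learning tasks on the basis of user activity) can be illustrated with a small sketch. This is not Monk's actual implementation: the class names, the `min_new_labels` threshold and the rule "most newly labelled word class first" are all assumptions made for illustration, and the dispatch to remote HPC resources is left out entirely.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class RetrainTask:
    # heapq is a min-heap, so the label count is stored negated:
    # the word class with the most new labels is popped first.
    priority: int
    seq: int  # tie-breaker: earlier scheduled tasks come first
    word_class: str = field(compare=False)


class RetrainQueue:
    """Hypothetical queue of re-learning tasks driven by annotation activity."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()
        self._pending = {}  # word class -> new labels since last scheduling

    def record_label(self, word_class):
        """Register one new human-confirmed label for a word class."""
        self._pending[word_class] = self._pending.get(word_class, 0) + 1

    def schedule(self, min_new_labels=2):
        """Queue a task for every class that gathered enough new labels."""
        for word_class, n in list(self._pending.items()):
            if n >= min_new_labels:
                task = RetrainTask(-n, next(self._counter), word_class)
                heapq.heappush(self._heap, task)
                del self._pending[word_class]

    def next_task(self):
        """Return the next word class to re-train, or None if idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap).word_class


# Example: annotators confirm labels, then the scheduler runs once.
queue = RetrainQueue()
for label in ["Rotterdam", "Rotterdam", "Rotterdam", "schip", "schip", "canon"]:
    queue.record_label(label)
queue.schedule()

order = []
while True:
    word_class = queue.next_task()
    if word_class is None:
        break
    order.append(word_class)
print(order)
```

In this sketch, a class with only one new label stays pending until more labels arrive, so compute time goes first to the classes where users have been most active.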
Acknowledgements
This web site is made possible thanks to grants by the Nederlandse Organisatie
voor Wetenschappelijk Onderzoek (NWO), project NWO/Catch "Scratch" and
project NWO/EW "Learning to learn", and project SNN/Target, Groningen, and thanks to the Nationaal
Archief, The Hague (Mr. H. van Schie) and continuous support from the University of Groningen.
People
The following people have contributed to Monk since 2004:
Lambert Schomaker, Henny van Schie, Marius Bulacu, Tijn van der Zant,
Sveta Zinger, Fons Laan, Jean-Paul van Oosten, Michiel Holtkamp.
The following annotators have produced substantial amounts of
line transcriptions, word and character labels:
J.A. Schomaker, H. van Schie, F. Laan, L. Schomaker, S. Zinger
and several anonymous volunteers.
schomaker with-affiliation ai.rug.nl