UNIPEN Database Conditions of Use

The term "user" refers to the person or institution that has obtained the UNIPEN data distribution.

Two major types of use can be identified:

I. Non-commercial use

Non-commercial use refers to university and institutional research that aims at public dissemination of its results. The International Unipen Foundation (iUF) strongly encourages this type of use of UNIPEN data. However, it is subject to a Publication Policy (see below).

II. Commercial use

II.a Commercial use of UNIPEN data proper - the textual content and the point coordinates - is prohibited. An example would be the extraction of handwriting coordinates to sell 'script fonts'.

II.b The usage of UNIPEN data for the training of commercial handwriting recognition systems is allowed.

II.c The UNIPEN logo will be presented by the user in the final documentation of the resulting software product.

II.d Reference to individual writer identities or to the identity of individual data donor companies from within the UNIPEN data distribution should be avoided at all times.

Note: Also in the case of commercial development, the user is kindly asked to present the results of the underlying research and development via an acknowledged science & technology forum (journal or conference).


Ad I. UNIPEN Publication Policy

I.1 - Reference

Users are required to state the UNIPEN release version in their publications, and are strongly urged to use the latest version available.
    Reference example: 

        "As a training set, we used UNIPEN [xx] Train-R01/V07, 
         benchmark ..., subsets ..... 
         As a test set, we used UNIPEN DevTest-R01/V02, 
         benchmark ..., subsets .... 
         To the raw UNIPEN data, the following pre-processing 
         was applied: ...."
             .
             .
             .

        [xx] Guyon, I., Schomaker, L., Plamondon, R., 
             Liberman, M. & Janet, S. (1994). 
             UNIPEN project of on-line data exchange and recognizer 
             benchmarks, Proceedings of the 12th International
             Conference on Pattern Recognition, ICPR'94, 
             pp. 29-33, Jerusalem, Israel, October 1994. IAPR-IEEE.

This example assumes that the set DevTest-R01/V02 has already been released; its actual release will take place in the future.

In case your training set and test set are derived from within a single distribution such as Train-R01/V07, please explain in detail how your random selection of samples from within this distribution was produced. Was the process actually random? Was manual pruning involved? Improvements to the labels (truth values) can be submitted by the users in the form of .SEGMENT... entries via email to the iUF.

I.2 - Which data?

A proper distinction between training and test sets is necessary. The best possible training/test distinction draws the data for the two sets at random from two mutually exclusive groups of writers, one group per set, so that no writer contributes to both.
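A writer-disjoint split of this kind can be sketched as follows. This is only an illustration under stated assumptions: the samples are assumed to be available as (writer_id, sample) pairs, and the function name, the test fraction, and the seed are hypothetical choices, not part of the UNIPEN distribution or its tools.

```python
import random

def split_by_writer(samples, test_fraction=0.2, seed=0):
    """Split (writer_id, sample) pairs so that no writer occurs in both sets.

    The (writer_id, sample) layout is an assumption made for this sketch.
    """
    # Collect the distinct writers in a reproducible order.
    writers = sorted({writer for writer, _ in samples})
    # Shuffle the writers, not the samples, with a documented seed.
    rng = random.Random(seed)
    rng.shuffle(writers)
    # Reserve a fraction of the writers (at least one) for the test set.
    n_test = max(1, round(len(writers) * test_fraction))
    test_writers = set(writers[:n_test])
    train = [s for s in samples if s[0] not in test_writers]
    test = [s for s in samples if s[0] in test_writers]
    return train, test
```

Because the selection is made at the level of writers, the recognizer is never tested on handwriting from a person it has seen during training. Reporting the seed and the test fraction in a publication documents how the selection was produced.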

Note that the use of test sets poses a problem. Iterated use of a particular training/test set pair during development can be considered indirect training! Even if a development set is not formally used for training, it is well known that all parameter adjustments, code improvements, etc., made in response to test results are a form of training, regardless of the type of pattern recognition algorithm used. Therefore, it is good practice to report in publications the effort spent on iterated testing. The tendency to iterate on a single training/test set pair throughout a complete PhD project has led to inflated reported recognition rates in the past. It is good practice to generate a random selection of multiple sets at the start of such projects.
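One way to prepare multiple sets at the start of a project is to fix several seeded partitions of the writers in advance. The sketch below is an assumption-laden illustration: the function name, the number of partitions, the test fraction, and the base seed are all hypothetical, and the writer identifiers are assumed to be plain strings.

```python
import random

def make_split_pool(writer_ids, n_splits=5, test_fraction=0.2, base_seed=1997):
    """Draw several writer-level train/test partitions up front, one per seed."""
    writers = sorted(set(writer_ids))
    # Reserve a fraction of the writers (at least one) for each test set.
    n_test = max(1, round(len(writers) * test_fraction))
    pool = []
    for k in range(n_splits):
        seed = base_seed + k
        shuffled = writers[:]
        random.Random(seed).shuffle(shuffled)
        pool.append({
            "seed": seed,  # keep the seed so the partition can be reported
            "test_writers": sorted(shuffled[:n_test]),
            "train_writers": sorted(shuffled[n_test:]),
        })
    return pool
```

With such a pool, successive development rounds can use different partitions, and one partition can be kept untouched for the final reported test, which limits the indirect training described above.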

I.3 - Benchmark (i.e. database subset) overview

    Benchmark  Description
    ---------  -----------
    1a         isolated digits
    1b         isolated upper case
    1c         isolated lower case
    1d         isolated symbols (punctuations etc.)
    2          isolated characters, mixed case
    3          isolated characters in the context of words or texts
    4          isolated printed words, not mixed with digits and symbols
    5          isolated printed words, full character set
    6          isolated cursive or mixed-style words (without digits and symbols)
    7          isolated words, any style, full character set
    8          text: (minimally two words of) free text, full character set

Note that only Benchmark #8 is a realistic, application-oriented test, because the word segmentation problem must also have been solved by the recognizer. No manual word segmentation is allowed in test Benchmark #8.


Lambert Schomaker, January 1997, October 2000.