Tilburg CHaracters data set - TICH

This data set contains labeled characters, cut out from the Firemaker collection, and was kindly provided to the Unipen Foundation by Laurens van der Maaten for dissemination over the unipen.org web site. The data set is intended for scientific, non-commercial use.

Citation:

  L.J.P. van der Maaten. A New Benchmark Dataset for Handwritten Character Recognition.
  Tilburg University Technical Report, TiCC TR 2009-002, 2009.
  http://lvdmaaten.github.io/publications/papers/TR%20New_Dataset_2009.pdf

This dataset contains more than 40,000 handwritten characters, segmented out of the page scans of the Firemaker data set. The creation of the new dataset is motivated by the ceiling effect that hampers experiments on popular handwritten-digit datasets, such as the MNIST dataset and the USPS dataset. In addition to the character labels, the dataset also contains labels for the 250 writers who wrote the characters, which gives it additional potential for use in forensic applications. The report discusses the data-gathering process, as well as the preprocessing and normalization of the data. The report also presents the results of initial classification and visualization experiments on the new dataset, in an attempt to provide a baseline for the performance of learning techniques on the dataset.

Set A
            Generalization error (+/- 10-fold variation)   Accuracy
  1-NN      21.68% +/- 0.61%                               78.32%
  3-NN      20.79% +/- 0.44%                               79.21%
  5-NN      20.74% +/- 0.64%                               79.26%
  LKC       32.99% +/- 0.74%                               67.01%

Set B
            Generalization error (+/- 10-fold variation)   Accuracy
  1-NN      17.79% +/- 0.53%                               82.21%
  3-NN      17.23% +/- 0.82%                               82.77%
  5-NN      17.28% +/- 0.66%                               82.70%
  LKC       30.20% +/- 0.55%                               69.80%

(See the .pdf for details.)

Laurens van der Maaten

................................................................................................................

Additional tests

Lambert Schomaker / 2014. Unpublished experiment, University of Groningen.

Example performances on the 56x56-pixel normalized images, i.e., with pixel intensities as features, using split-half selection for train/test (odd- and even-numbered samples) and simple nearest-neighbour (1-NN) classification:

  INVCOS         16339/20067 = 81.42%   (inverted cosine similarity)
  BHATTACHARYYA  16339/20067 = 81.42%
  NEGCOR         16306/20067 = 81.26%   (negated Pearson correlation as 'distance')
  CHISQUARE      16303/20067 = 81.24%
  JACCARD        16302/20067 = 81.24%
  DICE           16303/20067 = 81.24%
  TANIMOTO       16303/20067 = 81.24%
  EUCLID         16048/20067 = 79.97%
  MANHATTAN      15831/20067 = 78.89%
  YULESQ         15261/20067 = 76.05%
  SOKAL          15247/20067 = 75.98%
  VARIANCE       14709/20067 = 73.30%
  MINKOWSKI-3    14416/20067 = 71.84%
  KULLBACK       13713/20067 = 68.34%
  NOISE           1077/20067 =  5.37%   (uniform pseudorandom noise from drand48 as the output of the distance function)

These results confirm the findings of Laurens van der Maaten, using independent tools. Euclidean distance may not be optimal.

Also see:

  http://www.ai.rug.nl/~lambert/overslag/5d0c9b69079145f0162c1b74/TICH-plain-characters.zip

which contains a non-Matlab, plain-ASCII version of the character images, one image per record.

................................................................................................................
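Example code (sketch)

The following Python/NumPy sketch illustrates the split-half 1-NN procedure described under "Additional tests": odd-numbered samples form the training set, even-numbered samples the test set, and each test image is assigned the label of its nearest training image. Loading of the images is not shown; the array names X and y, and the two distance functions shown here (Euclidean and negated Pearson correlation), are illustrative assumptions, not the original experiment's tooling.

  import numpy as np

  def euclid_distance(a, B):
      # Euclidean distance between vector a and every row of B.
      return np.linalg.norm(B - a, axis=1)

  def negcor_distance(a, B):
      # Negated Pearson correlation between a and every row of B, used as a 'distance'.
      a0 = a - a.mean()
      B0 = B - B.mean(axis=1, keepdims=True)
      num = B0 @ a0
      den = np.linalg.norm(a0) * np.linalg.norm(B0, axis=1) + 1e-12
      return -(num / den)

  def split_half_1nn(X, y, distance):
      # X: (N, 56*56) array of pixel intensities, y: (N,) array of character labels.
      # Train on odd-numbered samples, test on even-numbered samples, 1-NN decision.
      train_X, train_y = X[0::2], y[0::2]
      test_X, test_y = X[1::2], y[1::2]
      correct = 0
      for x, t in zip(test_X, test_y):
          nearest = np.argmin(distance(x, train_X))
          correct += int(train_y[nearest] == t)
      return correct, len(test_y)

  # Usage (X and y assumed to be loaded beforehand):
  # correct, total = split_half_1nn(X, y, euclid_distance)
  # print("EUCLID %d/%d = %.2f%%" % (correct, total, 100.0 * correct / total))

................................................................................................................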