Tilburg CHaracters data set - TICH

This data set contains labeled characters, cut out from the Firemaker collection, and was kindly provided to the Unipen Foundation by Laurens van der Maaten for dissemination over the unipen.org web site. The data set is intended for scientific, non-commercial use.

Citation:

  L.J.P. van der Maaten. A New Benchmark Dataset for Handwritten Character Recognition.
  Tilburg University Technical Report, TiCC TR 2009-002, 2009.
  http://lvdmaaten.github.io/publications/papers/TR%20New_Dataset_2009.pdf

This dataset contains more than 40,000 handwritten characters, segmented out of the page scans of the Firemaker data set. The creation of the new dataset is motivated by the ceiling effect that hampers experiments on popular handwritten-digit datasets, such as the MNIST dataset and the USPS dataset. In addition to the character labels, the dataset also contains labels for the 250 writers who wrote the characters, which gives it additional potential for use in forensic applications. The report discusses the data-gathering process, as well as the preprocessing and normalization of the data. The report also presents the results of initial classification and visualization experiments on the new dataset, in an attempt to provide a baseline for the performance of learning techniques on the dataset.

Set A
            Generalization error (+/- 10-fold variation)   Accuracy
  1-NN      21.68% +/- 0.61%                               78.32%
  3-NN      20.79% +/- 0.44%                               79.21%
  5-NN      20.74% +/- 0.64%                               79.26%
  LKC       32.99% +/- 0.74%                               67.01%

Set B
            Generalization error (+/- 10-fold variation)   Accuracy
  1-NN      17.79% +/- 0.53%                               82.21%
  3-NN      17.23% +/- 0.82%                               82.77%
  5-NN      17.28% +/- 0.66%                               82.70%
  LKC       30.20% +/- 0.55%                               69.80%

(See the .pdf for details.)

Laurens van der Maaten

................................................................................................................

Additional tests

Lambert Schomaker / 2014. Unpublished experiment, University of Groningen.

Example performances on the 56x56-pixel normalized images, i.e., with pixel intensities as features, using split-half selection for train/test (odd- and even-numbered samples) and simple nearest-neighbour (1-NN) classification:

  INVCOS         16339/20067 = 81.42%   (inverted cosine similarity)
  BHATTACHARYYA  16339/20067 = 81.42%
  NEGCOR         16306/20067 = 81.26%   (negated Pearson correlation as 'distance')
  CHISQUARE      16303/20067 = 81.24%
  JACCARD        16302/20067 = 81.24%
  DICE           16303/20067 = 81.24%
  TANIMOTO       16303/20067 = 81.24%
  EUCLID         16048/20067 = 79.97%
  MANHATTAN      15831/20067 = 78.89%
  YULESQ         15261/20067 = 76.05%
  SOKAL          15247/20067 = 75.98%
  VARIANCE       14709/20067 = 73.30%
  MINKOWSKI-3    14416/20067 = 71.84%
  KULLBACK       13713/20067 = 68.34%
  NOISE           1077/20067 =  5.37%   (uniform pseudorandom noise from drand48 as the output of the distance function)

These results confirm the findings of Laurens van der Maaten, using independent tools. Euclidean distance may not be optimal.

Also see:

  http://www.ai.rug.nl/~lambert/overslag/5d0c9b69079145f0162c1b74/TICH-plain-characters.zip

which contains a non-Matlab, plain-ASCII version of the character images, one image per record.

................................................................................................................
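Example code (sketch)

The following Python/NumPy sketch illustrates the split-half 1-NN procedure described under "Additional tests": odd-numbered samples form the training set, even-numbered samples the test set, and each test image is assigned the label of its nearest training image. Loading of the images is not shown; the array names X and y, and the two distance functions shown here (Euclidean and negated Pearson correlation), are illustrative assumptions, not the original experiment's tooling.

  import numpy as np

  def euclid_distance(a, B):
      # Euclidean distance between vector a and every row of B.
      return np.linalg.norm(B - a, axis=1)

  def negcor_distance(a, B):
      # Negated Pearson correlation between a and every row of B, used as a 'distance'.
      a0 = a - a.mean()
      B0 = B - B.mean(axis=1, keepdims=True)
      num = B0 @ a0
      den = np.linalg.norm(a0) * np.linalg.norm(B0, axis=1) + 1e-12
      return -(num / den)

  def split_half_1nn(X, y, distance):
      # X: (N, 56*56) array of pixel intensities, y: (N,) array of character labels.
      # Train on odd-numbered samples, test on even-numbered samples, 1-NN decision.
      train_X, train_y = X[0::2], y[0::2]
      test_X, test_y = X[1::2], y[1::2]
      correct = 0
      for x, t in zip(test_X, test_y):
          nearest = np.argmin(distance(x, train_X))
          correct += int(train_y[nearest] == t)
      return correct, len(test_y)

  # Usage (X and y assumed to be loaded beforehand):
  # correct, total = split_half_1nn(X, y, euclid_distance)
  # print("EUCLID %d/%d = %.2f%%" % (correct, total, 100.0 * correct / total))

................................................................................................................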