Confidence limits of the recognition rate as a function of Nsamples and level of confidence alpha: How reliable are recognition rates?

In this demo, we show the theoretical (two-sided) confidence limits of the
recognition rate as a function of the number of samples. We assume a Gaussian
distribution of the observed recognition rates (probabilities) over an ensemble
of test sets, which is not realistic for small Nsamples.

The ensemble standard deviation is estimated as *sqrt(p(1-p)/Nsamples)*.
Now, the question is: given a confidence level *alpha* (e.g., 0.01)
and a recognition rate *p* (e.g., 0.90), what are the upper and lower
confidence limits? The program draws a straight line at *p* and the confidence
limits around it, for many values of *Nsamples* on the x-axis.
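
For readers who want to reproduce the theoretical curves, here is a minimal Java sketch of this Gaussian approximation. It is not the original demo program: the class name is ours, and the critical value *z* is hardcoded rather than computed from *alpha* (z = 2.5758 corresponds to the two-sided *alpha* = 0.01 used in the example above).

```java
// A minimal sketch of the Gaussian approximation described above; this is
// not the original demo applet. The class name is ours, and the critical
// value z is hardcoded (z = 2.5758 corresponds to two-sided alpha = 0.01).
public class RateConfidence {

    // Standard error of a rate p estimated from nSamples binomial trials.
    static double stdError(double p, int nSamples) {
        return Math.sqrt(p * (1.0 - p) / nSamples);
    }

    // Two-sided Gaussian confidence limits {lower, upper}, clipped to [0, 1].
    static double[] gaussianLimits(double p, int nSamples, double z) {
        double se = stdError(p, nSamples);
        return new double[] {
            Math.max(0.0, p - z * se),
            Math.min(1.0, p + z * se)
        };
    }

    public static void main(String[] args) {
        double p = 0.90;   // recognition rate
        double z = 2.5758; // two-sided alpha = 0.01
        for (int n = 10; n <= 10000; n *= 10) {
            double[] lim = gaussianLimits(p, n, z);
            System.out.printf("Nsamples=%5d  %.3f .. %.3f%n", n, lim[0], lim[1]);
        }
    }
}
```

Note how slowly the interval tightens: even at *Nsamples* = 1000, the limits around *p* = 0.90 are still roughly ±0.024.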

A *sample* may refer, e.g., to a single character or, alternatively,
to a single word. In English, assuming an average word length of 5 characters
and independent character errors, a character recognition rate of 0.90 implies
a word recognition rate of only 0.90^5, i.e., about 0.59.

(1) It should be stressed that, numerically, things look different from the
presented theoretical curves. The underlying binomial distribution is
skewed, which does not show up here. Furthermore, the underlying distribution is
actually discrete. With 20 word samples, one can only measure the recognition
rate in steps of 5 percent. One word misrecognized? The rate drops by 5%!
At the same time, a jump of 5% corresponds to a discrete step dP in the
binomial distribution of the recognition-rate ensemble. Therefore, the
actual confidence limits for a given *p*, *alpha*, and *Nsamp* are much more
ragged than the smooth Gaussian approximation would suggest, especially for
*Nsamp* < 100 (see the sketch below).
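
The raggedness can be checked against the exact binomial distribution. The following Java sketch is illustrative only (the names and the central-interval construction are our own, not from the original demo): it enumerates the binomial probabilities for *Nsamp* = 20 and *p* = 0.90 and reports the smallest central interval of observable rates containing at least 1 - *alpha* of the probability mass.

```java
// Illustrative sketch of point (1): with a discrete binomial ensemble, the
// achievable recognition rates come in steps of 1/Nsamp, and the exact
// central interval is ragged compared to the smooth Gaussian limits.
public class ExactBinomialLimits {

    // Binomial probability P(K = k) for K ~ Bin(n, p), computed in log
    // space to avoid overflow of the binomial coefficient.
    static double pmf(int n, int k, double p) {
        double logC = 0.0;
        for (int i = 1; i <= k; i++) {
            logC += Math.log(n - k + i) - Math.log(i);
        }
        return Math.exp(logC + k * Math.log(p) + (n - k) * Math.log(1.0 - p));
    }

    public static void main(String[] args) {
        int n = 20;          // e.g., 20 word samples: rate steps of 5%
        double p = 0.90;     // true recognition rate
        double alpha = 0.01; // two-sided confidence level

        // Smallest k whose lower tail mass still exceeds alpha/2.
        double tail = 0.0;
        int kLo = 0;
        while (kLo < n && tail + pmf(n, kLo, p) <= alpha / 2.0) {
            tail += pmf(n, kLo, p);
            kLo++;
        }
        // Largest k whose upper tail mass still exceeds alpha/2.
        tail = 0.0;
        int kHi = n;
        while (kHi > 0 && tail + pmf(n, kHi, p) <= alpha / 2.0) {
            tail += pmf(n, kHi, p);
            kHi--;
        }
        System.out.printf("exact limits: %.2f .. %.2f (steps of %.2f)%n",
                          (double) kLo / n, (double) kHi / n, 1.0 / n);
    }
}
```

With these settings, the exact interval snaps to multiples of 1/20 = 0.05 (here roughly 0.70 to 1.00), whereas the Gaussian limits vary smoothly with *p* and *Nsamp*.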

(2) When using the same test-set samples to test two recognizers, the obtained recognition rates are not statistically independent. This raises the question of whether a test for dependent data, such as the Chi-square test for correlated (dependent) data, must be used. This issue can be resolved by realizing what the goal of the statistical test is.

If one is (pragmatically) interested in the recognition-rate levels of recognizer A vs. B as such, then the underlying differences between the recognizers may be ignored, and a test for independent data (binomial, normal, or Chi-square) is suitable. In this case, only statements regarding the difference between the observed rates can be made.

On the other hand, if one is (scientifically) interested in whether the recognizers can be considered two versions of the same signal generator or are in fact different signal sources, then a test for dependent data must be used. Consider, for example, the extreme case where two recognizers each have a recognition rate of 50%. Applying a statistical test for independent data clearly yields a non-significant difference. However, such a test misses the fact that each recognizer may have recognized a distinct subset of 50% of the characters (or words) in the test set. In such a case, we cannot consider the recognizers as having "similar performance": a smart combination of these recognizers would have yielded a highly desirable 100% recognition rate. The sketch below illustrates this point.
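
The following Java sketch (the names and the encoded counts are hypothetical, not from the original page) records the paired outcomes of two recognizers on the same 100-sample test set as a 2x2 table, encoding the extreme example above. McNemar's Chi-square, a standard test for dependent data, uses only the discordant pairs; the agreement rate then exposes what the marginal rates hide.

```java
// Illustrative sketch for point (2): with both recognizers tested on the
// SAME samples, the paired 2x2 table of outcomes carries more information
// than the two marginal recognition rates alone.
public class PairedRecognizerTest {
    public static void main(String[] args) {
        // Paired outcomes on a 100-sample test set, encoding the extreme
        // example: A and B each recognize a disjoint half of the samples.
        int bothCorrect = 0, onlyA = 50, onlyB = 50, bothWrong = 0;
        int n = bothCorrect + onlyA + onlyB + bothWrong;

        double rateA = (bothCorrect + onlyA) / (double) n; // 0.50
        double rateB = (bothCorrect + onlyB) / (double) n; // 0.50

        // McNemar's Chi-square: tests whether the two rate LEVELS differ,
        // using only the discordant pairs. Here it is 0: the levels agree.
        double mcNemar = (onlyA + onlyB > 0)
                ? Math.pow(onlyA - onlyB, 2) / (double) (onlyA + onlyB)
                : 0.0;

        // Observed agreement vs. the agreement expected for independent
        // sources with these rates. Observed 0.00 vs. expected 0.50 shows
        // the recognizers are not two versions of the same signal generator.
        double agree = (bothCorrect + bothWrong) / (double) n;
        double expAgree = rateA * rateB + (1 - rateA) * (1 - rateB);

        System.out.printf(
            "rateA=%.2f rateB=%.2f McNemar=%.2f agreement=%.2f (expected %.2f)%n",
            rateA, rateB, mcNemar, agree, expAgree);
    }
}
```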

This page was developed as a result of discussions in the UNIPEN consortium, notably involving Isabelle Guyon and John Makhoul.

Please refer to our **Publications**
when using any of the material shown here.


Lambert Schomaker

(this page has been online since April 21, 1995; Java program added 1/3/96)