Confidence limits for a recognition rate p as a function
of Nsamples and level of confidence alpha: How reliable are recognition rates?

In this demo, we show the theoretical (two-sided) confidence limits of recognition rate as a function of the number of samples. We assume a Gaussian distribution of the generated recognition rates (probabilities), which is not realistic for small N.
The ensemble standard deviation is estimated by taking Sqrt(p(1-p)/Nsamples). Now, the question is: Given a level alpha (e.g., 0.01, corresponding to 99% confidence), and a recognition rate p (e.g., 0.90), what are the upper and lower confidence limits? The program draws a straight line at p, and the confidence limits around it, for many values of Nsamples on the x-axis.
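
The limits described above can be sketched in a few lines of Python. This is a minimal illustration under the Gaussian assumption, not the code of the demo itself: the function name gaussian_limits and the clipping to the range [0, 1] are assumptions made here, and the limits are taken as p plus or minus z times Sqrt(p(1-p)/Nsamples), with z the two-sided normal quantile for alpha.

```python
from math import sqrt
from statistics import NormalDist


def gaussian_limits(p, n_samples, alpha=0.01):
    """Two-sided limits around p under the normal approximation.

    Hypothetical helper: the name and the clipping to [0, 1] are
    illustrative choices, not taken from the original program.
    """
    # z-value that leaves alpha/2 in each tail of the standard normal
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sqrt(p * (1.0 - p) / n_samples)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Limits around p = 0.90 for a range of sample sizes
for n in (10, 100, 1000, 10000):
    lower, upper = gaussian_limits(0.90, n, alpha=0.01)
    print(f"Nsamples={n:6d}  lower={lower:.3f}  upper={upper:.3f}")
```

The half-width shrinks with the square root of Nsamples, so a tenfold narrower interval needs roughly a hundred times more samples.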

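A sample may refer, e.g., to a single character or, alternatively, to a single word. In English, a character recognition rate of 0.90 implies a word recognition rate of roughly 0.59 (0.90 to the power 5), assuming an average word length of 5 characters and independent character errors. A word recognition rate of 0.50 would correspond to a character recognition rate of about 0.87.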
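
The relation between character and word rates can be checked with a short calculation. The sketch below assumes that character errors are independent and that a word is correct only if all of its characters are; the function names are hypothetical.

```python
def word_rate(char_rate, word_length=5):
    """Word recognition rate if every character must be correct
    and character errors are independent (an assumption)."""
    return char_rate ** word_length


def char_rate_for(word_rate_target, word_length=5):
    """Character rate needed to reach a given word rate
    under the same independence assumption."""
    return word_rate_target ** (1.0 / word_length)


print(f"{word_rate(0.90):.3f}")      # 0.590
print(f"{char_rate_for(0.50):.3f}")  # 0.871
```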

Caveats

(1) It should be stressed that, numerically, things look different from the presented theoretical curves. The underlying binomial distribution is skewed, which does not show up here. Furthermore, the underlying distribution is actually discrete. With 20 word samples, I can only measure recognition rate in steps of 5 percent. One word misrecognized? The rate drops 5%! At the same time, a jump of 5% corresponds to a discrete step dP in the binomial distribution of the recognition rate ensemble. Therefore, the resulting actual confidence limits for a given p, alpha and Nsamples are much more ragged than the smooth Gaussian approximation would suggest, especially for Nsamples < 100.
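
To make the discreteness and skew visible, the limits can be computed directly from the binomial distribution. The following is a sketch under stated assumptions rather than the method of the demo: it uses an equal-tail rule (at most alpha/2 of the probability mass cut off on each side), and the function names are hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k correct samples out of n."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def exact_limits(p, n, alpha=0.01):
    """Lowest and highest observable rates k/n that remain after
    cutting off at most alpha/2 of the binomial mass in each tail."""
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

    # Walk up from k = 0 while the discarded lower tail stays <= alpha/2
    tail, k_lo = 0.0, 0
    while tail + pmf[k_lo] <= alpha / 2.0:
        tail += pmf[k_lo]
        k_lo += 1

    # Walk down from k = n while the discarded upper tail stays <= alpha/2
    tail, k_hi = 0.0, n
    while tail + pmf[k_hi] <= alpha / 2.0:
        tail += pmf[k_hi]
        k_hi -= 1

    return k_lo / n, k_hi / n


# The limits can only fall on multiples of 1/n, and they need not be
# symmetric around p.
for n in (20, 50, 100):
    lower, upper = exact_limits(0.90, n, alpha=0.01)
    print(f"Nsamples={n:4d}  lower={lower:.2f}  upper={upper:.2f}")
```

For p = 0.90 and 20 samples, the upper limit sits at 1.00 because the probability of recognizing all 20 samples already exceeds alpha/2, while the lower limit lies several steps of 5% below p.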

(2) When the same test set samples are used to test two recognizers, the obtained recognition rates are not statistically independent. This raises the question whether a test for dependent data, such as the Chi-square test for correlated (dependent) data, must be used. The issue can be resolved by considering the goal of the statistical test. If one is (pragmatically) interested in the recognition rate levels of recognizer A vs B as such, then the underlying differences between recognizers may be ignored, and a test for independent data (binomial, normal or Chi-square) is suitable. In this case, only statements regarding the difference between observed rates can be made. On the other hand, if one is (scientifically) interested in whether the recognizers can be considered as two versions of the same signal generator or are in fact different signal sources, then a test for dependent data must be used. Consider, for example, the extremal case where two recognizers both have a recognition rate of 50%. Applying a statistical test for independent data clearly yields a non-significant difference. However, such a test misses the fact that each recognizer may have recognized a distinct subset of 50% of the characters (or words) in that test set. In such a case, we cannot consider the recognizers as having "similar performance". A smart combination of these recognizers would have yielded a highly desirable 100% recognition rate.
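
The extremal case can be reproduced with a small paired tabulation. This is an illustrative sketch, not a prescribed test procedure: the function names are hypothetical, the pooled two-proportion z statistic stands in for "a test for independent data", and the oracle rate (fraction of samples that at least one recognizer got right) is used here as an upper bound on what an ideal combination could reach.

```python
from math import sqrt


def paired_summary(correct_a, correct_b):
    """Cross-tabulate per-sample outcomes of two recognizers
    evaluated on the same test samples."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both recognizers must be scored on the same samples")
    n = len(correct_a)
    pairs = list(zip(correct_a, correct_b))
    both = sum(1 for a, b in pairs if a and b)
    only_a = sum(1 for a, b in pairs if a and not b)
    only_b = sum(1 for a, b in pairs if b and not a)
    neither = n - both - only_a - only_b
    return {
        "rate_a": (both + only_a) / n,
        "rate_b": (both + only_b) / n,
        "both": both,
        "only_a": only_a,
        "only_b": only_b,
        "neither": neither,
        # Fraction recognized by at least one recognizer
        "oracle_rate": (n - neither) / n,
    }


def independent_z(rate_a, rate_b, n):
    """Pooled two-proportion z statistic, treating the two rates as
    if they came from independent test sets of n samples each."""
    pooled = (rate_a + rate_b) / 2.0
    se = sqrt(pooled * (1.0 - pooled) * 2.0 / n)
    return (rate_a - rate_b) / se if se > 0 else 0.0


# Extremal case: each recognizer is right on a different half of 100 samples
correct_a = [True] * 50 + [False] * 50
correct_b = [False] * 50 + [True] * 50
summary = paired_summary(correct_a, correct_b)
print(summary)
print(independent_z(summary["rate_a"], summary["rate_b"], len(correct_a)))
```

Here both rates are 0.5 and the z statistic is zero, yet the table shows that the recognizers never agree on a correct sample and that the oracle rate is 1.0. Only the per-sample pairing reveals this.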

This page was developed as a result of discussions in the UNIPEN consortium, notably involving Isabelle Guyon and John Makhoul.

Please refer to our Publications when using anything from the shown material.
