[161] define a system able to parse 3D and time-varying gestures. Their system captures gesture features such as the posture of the hand (straight, relaxed, closed), its motion (moving, stopped), and its orientation (up, down, left, right, forward, backward, derived from the normal and longitudinal vectors of the palm). Over time, the stream of gestures is then abstracted into more general gestlets (e.g., pointing attack, sweep, end reference). Similarly, low-level eye-tracking input is classified into classes of events (fixations, saccades, and blinks). These multimodal features are integrated in a hybrid representation architecture.
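To make the abstraction step concrete, the following is a minimal sketch of how a stream of per-frame hand features might be mapped onto gestlet labels. The feature vocabulary mirrors the description above, but the `Frame` type, the rule set, and the gestlet names used here are illustrative assumptions, not the actual architecture of [161].

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Frame:
    """One time step of low-level hand features (hypothetical representation)."""
    posture: str      # "straight" | "relaxed" | "closed"
    motion: str       # "moving" | "stopped"
    orientation: str  # "up" | "down" | "left" | "right" | "forward" | "backward"

def abstract_gestlet(frames: List[Frame]) -> Optional[str]:
    """Abstract a short stream of per-frame features into a gestlet label.

    Toy rules for illustration: a straight hand moving forward that comes to
    rest is read as a 'pointing' gestlet; a relaxed hand moving sideways
    throughout the stream is read as a 'sweep'.
    """
    if not frames:
        return None
    first, last = frames[0], frames[-1]
    if (first.posture == "straight" and first.motion == "moving"
            and first.orientation == "forward" and last.motion == "stopped"):
        return "pointing"
    if first.posture == "relaxed" and all(
            f.motion == "moving" and f.orientation in {"left", "right"}
            for f in frames):
        return "sweep"
    return None  # no gestlet recognized in this window
```

A real system would of course operate on continuous sensor data and segment the stream before such rules (or learned classifiers) apply; the point here is only the two-level structure, raw features per frame abstracted into symbolic gestlets.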
It should be noted that the term and concept of 'gesture' is also used in pen computing. Here the recorded action takes place in the 2D plane, and similar phenomena play a role as in 3D hand gesturing, but the signal processing involved is much simpler.
Other interesting references are [233,287,341,364].