In biological systems, acoustical and visual input devices and processing systems have evolved to a very high degree of sophistication, enabling them to receive and extract precise information about the current state of the environment. Acoustical and visual sensory systems also serve complex communication tasks which, at least in the case of humans, can carry information of almost any degree of complexity. Taken separately, the biological functions of the acoustical and visual systems are implemented by neural systems showing a remarkable organization that is not yet fully understood. The lack of a full understanding of each system makes the task of investigating their cooperation very difficult. However, looking at the integrative aspects of the acoustical and visual systems may in fact lead to a better understanding of their fundamental principles.
In many cases the integrative functions of the acoustical and visual systems are similar and complement each other. One of the first tasks of the neural system is to build a representation of the environment composed of objects placed in a physical space (practically Cartesian) with three spatial dimensions and time. Both the visual and the acoustical system are extremely well suited to recovering and building spatial and temporal representations. Both can represent space, through mechanisms such as spatial vision and spatial hearing. These mechanisms are separate, but their operation is integrated in subtle ways in order to build the most probable and consistent representation of the environment. The consistency of the representation raises interesting theoretical problems. One fundamental problem is the nature of the integrated audiovisual representation: whether and how it is built on top of the single-modality representations. Another problem is the usual bottom-up versus top-down division in the organization of neural processing, with both organizations participating in a precisely tuned way.
Sensory integration aims at building a consistent representation and interpretation of the information flowing from the different senses. This integration can deal with various aspects of information and can take place at different levels of processing. As we do not know exactly what the different aspects and levels of processing are, it is hard to devise a precise taxonomy of integration. An approximate taxonomy can be devised by looking into the functions and mechanisms of the different sensory systems. One of the basic functions is the representation of the spatial and temporal properties of the environment. Time and the spatial dimensions are basic variables for every representation, and they obviously have to be represented consistently. The sensory integration process is the tuning of the space and time measurements coming from the different senses so as to build the most consistent representation. For spatial representation, one considers space filled with objects, while for temporal representation one speaks of events. One can also speak of spatiotemporal objects, taking into account both the temporal and the spatial properties of objects. The main problem is how the operation of both systems is integrated into a single audiovisual spatiotemporal representation.
It is widely recognized that our knowledge of audiovisual integration is quite limited. This is because of the complexity of the systems involved. The acoustical and visual systems, taken separately, each have their own sophisticated organization, and the integration of their functions is built on top of them. Our current understanding can be described as being at a basic experimental level. In  there is a review of experimental and physiological facts concerning sensory integration. The topic has been studied mostly through psychological experiments, which try to reveal specific properties without explaining the mechanisms responsible. From the application and multimedia point of view these experiments are interesting as background, but they seem to have a narrow scope and depend very much on the particular experimental conditions. This stems from the fact that the underlying mechanisms are usually well hidden, as emphasized by Radeau . To uncover the mechanisms, one has to devise experiments that put them in conflict, or that create conditions for cross-modal effects, as in [353,318,249,282]. There is one basic issue with these approaches: the results are highly dependent on the test signals and conditions.
At present only one rule can be formulated which seems to hold: the
higher the interaction and complexity of the signals, the more prominent
the cross-modal effects between the acoustical and visual systems. A
relevant taxonomy of interactions and signals is given in section 4.3.