Introduction
Multimodal interaction provides the user with multiple modes of interacting with a system. A multimodal interface provides several distinct tools for the input and output of data. For example, a multimodal question answering system employs multiple modalities (such as text and photo) at both the question (input) and answer (output) level. Multimodal human-computer interaction refers to “interaction with the virtual and physical environment through natural modes of communication”, that is, the modes involving the five human senses. This implies that multimodal interaction enables freer and more natural communication, interfacing users with automated systems in both input and output. Specifically, multimodal systems can offer a flexible, efficient and usable environment, allowing users to interact through input modalities such as speech, handwriting, hand gesture and gaze, and to receive information from the system through output modalities such as speech synthesis, smart graphics and others, appropriately combined.
A multimodal system therefore has to recognize the inputs from the different modalities and combine them according to temporal and contextual constraints in order to allow their interpretation. This process is known as multimodal fusion, and it has been the object of several research works from the nineties to now. The fused inputs are then interpreted by the system. Naturalness and flexibility can produce more than one interpretation for each modality (channel) and for their simultaneous use, and can consequently produce multimodal ambiguity, generally due to imprecision, noise or other similar factors. Several methods have been proposed for resolving such ambiguities. Finally, the system returns output to the user through the various modal channels (disaggregated), arranged according to a consistent feedback; this step is known as fission.
The pervasive use of mobile devices, sensors and web technologies can offer adequate computational resources to manage the complexity implied by multimodal interaction. “Using cloud for involving shared computational resources in managing the complexity of multimodal interaction represents an opportunity. In fact, cloud computing allows delivering shared scalable, configurable computing resources that can be dynamically and automatically provisioned and released.”
Defining Multimodal Interaction
There are two views on multimodal interaction:
The first focuses on the human side: perception and control. Here the word modality refers to human input and output channels.
The second view focuses on using two or more computer input or output modalities to build systems that make synergistic use of parallel input or output of these modalities.
Multimodal Interaction: A Human-Centered View
The focus here is on multimodal perception and control, that is, human input and output channels.
Multimodal Interaction: A System-Centered View
In computer science, multimodal user interfaces have been defined in many ways.
Chatty gives a summary of definitions for multimodal interaction, explaining that most authors describe systems that use multiple input devices (multi-sensor interaction) or derive multiple interpretations of input issued through a single device.
Chatty’s explanation of multimodal interaction is the one most computer scientists use. By the term multimodal user interface they mean a system that accepts many different inputs which are combined in a meaningful way.
Multimodal input
Two major groups of multimodal interfaces have emerged, one concerned with alternate input methods and the other with combined input/output. The first group of interfaces combines various user input modes beyond the traditional keyboard and mouse, such as speech, pen, touch, manual gestures, gaze, and head and body movements. The most common such interface combines a visual modality (e.g. a display, keyboard, and mouse) with a voice modality (speech recognition for input, speech synthesis and recorded audio for output). However, other modalities, such as pen-based input or haptic input/output, may be used. Multimodal user interfaces are a research area in human-computer interaction (HCI).
The advantage of multiple input modalities is increased usability: the weaknesses of one modality are offset by the strengths of another. On a mobile device with a small visual interface and keypad, a word may be quite difficult to type but very easy to say.
Multimodal input user interfaces have implications for accessibility. A well-designed multimodal application can be used by people with a wide variety of impairments. Visually impaired users rely on the voice modality with some keypad input. Hearing-impaired users rely on the visual modality with some speech input. Other users will be "situationally impaired" (e.g. wearing gloves, in a very noisy environment, driving, or needing to enter a credit card number in a public place) and will simply use the appropriate modalities as desired. On the other hand, a multimodal application that requires users to be able to operate all modalities is very poorly designed.
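The sketch below illustrates this design principle in Python (the command names, modality labels and handlers are hypothetical): every command is bound to more than one input modality, so a user who cannot operate one channel simply falls back to another.

handlers = {}   # command -> {modality: handler}

def bind(command, modality, handler):
    # Register one way of issuing `command`; a command may have many.
    handlers.setdefault(command, {})[modality] = handler

def invoke(command, available):
    # Run `command` via the first modality the user can currently operate.
    for modality in available:                     # preference order
        h = handlers.get(command, {}).get(modality)
        if h:
            return h()
    raise RuntimeError(f"no usable modality for {command!r}")

bind("confirm", "speech", lambda: "heard: yes")
bind("confirm", "keypad", lambda: "key: OK pressed")

print(invoke("confirm", ["speech", "keypad"]))   # speech preferred
print(invoke("confirm", ["keypad"]))             # noisy room: keypad only

A real toolkit would add event routing and recognizer integration; the point here is only that no command is reachable through a single modality alone.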
Multimodal input and output
The second group of multimodal systems presents users with multimedia displays and multimodal output, primarily in the form of visual and auditory cues. Interface designers have also started to make use of other modalities, such as touch and olfaction. Proposed benefits of multimodal output systems include synergy and redundancy: information presented via several modalities can be merged to convey various aspects of the same process, while using several modalities to present exactly the same information increases the bandwidth of information transfer. Currently, multimodal output is used mainly to improve the mapping between communication medium and content and to support attention management in data-rich environments where operators face considerable visual attention demands.
An important step in multimodal interface design is the creation of natural mappings between modalities and the information and tasks they convey.
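As a sketch of such a mapping (the information types and modality choices below are illustrative assumptions, not a fixed taxonomy), each kind of content can be routed to the output modality that suits it best, with a fallback for when the preferred channel is unavailable:

# info type -> (preferred output modality, fallback)
OUTPUT_MAP = {
    "alert":        ("audio",  "visual"),
    "spatial_data": ("visual", "speech"),
    "confirmation": ("speech", "visual"),
    "progress":     ("visual", "audio"),
}

def route(info_type, unavailable=()):
    # Choose the preferred modality unless it is ruled out by context
    # (e.g. audio muted, screen occupied by a primary task).
    preferred, fallback = OUTPUT_MAP[info_type]
    return fallback if preferred in unavailable else preferred

print(route("alert"))                          # audio
print(route("alert", unavailable={"audio"}))   # visual when muted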
Multimodal Fusion
The process of integrating information from various input modalities and combining it into a complete command is referred to as multimodal fusion. In the literature, three main approaches to the fusion process have been proposed, according to the architectural level (recognition or decision) at which the fusion of the input signals is performed: recognition-based, decision-based, and hybrid multi-level fusion.
Recognition-based fusion (also known as early fusion) consists of merging the outcomes of each modal recognizer by using integration mechanisms such as statistical integration techniques, agent theory, hidden Markov models, and artificial neural networks. Examples of recognition-based fusion strategies are action frames, input vectors and slots.
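A minimal sketch of the early-fusion idea, using synthetic data and a toy nearest-centroid classifier (the feature dimensions and labels are illustrative assumptions): feature vectors produced by two modality recognizers are concatenated into one joint vector, so a single model can exploit cross-modal correlations at recognition time.

import numpy as np

rng = np.random.default_rng(0)

def speech_features(n):   # stand-in for acoustic feature extraction
    return rng.normal(size=(n, 13))   # e.g. 13 MFCC-like coefficients

def gesture_features(n):  # stand-in for hand-trajectory features
    return rng.normal(size=(n, 6))    # e.g. 6 motion descriptors

# Early fusion: concatenate per-modality features into joint vectors.
X = np.hstack([speech_features(100), gesture_features(100)])
y = rng.integers(0, 2, size=100)      # toy command labels

# Minimal nearest-centroid classifier over the fused feature space.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def classify(x):
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

print("predicted:", classify(X[0]), "actual:", y[0])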
Decision-based fusion (also known as late fusion) merges the semantic information extracted from each modality by using specific dialogue-driven fusion procedures to yield the complete interpretation. Examples of decision-based fusion strategies are typed feature structures, melting pots, semantic frames, and time-stamped lattices.
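A minimal sketch of decision-level fusion in the semantic-frame style, for a "put that there"-like command; the frame layout and the 1.5-second window are illustrative assumptions. Each recognizer emits a partial frame with a timestamp, and fusion unifies frames that satisfy the temporal constraint, letting a pointing gesture fill the slots the speech left open.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Frame:
    intent: Optional[str] = None
    slots: dict = field(default_factory=dict)
    t: float = 0.0          # seconds since interaction start

def fuse(speech, gestures, window=1.5):
    # Fill the speech frame's missing slots from temporally close gestures.
    fused = Frame(speech.intent, dict(speech.slots), speech.t)
    for g in gestures:
        if abs(g.t - speech.t) <= window:       # temporal constraint
            for k, v in g.slots.items():
                fused.slots.setdefault(k, v)    # spoken slots take priority
    return fused

speech = Frame("move", {}, t=2.0)                        # "put that there"
point_at_cup  = Frame(None, {"object": "cup"},    t=2.3)
point_at_desk = Frame(None, {"location": "desk"}, t=3.1)

print(fuse(speech, [point_at_cup, point_at_desk]).slots)
# {'object': 'cup', 'location': 'desk'}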
The potential applications for multimodal fusion include learning environments, consumer relations, security/surveillance, computer animation, and more. Individually, modes are easy to define, but the difficulty lies in getting the technology to treat them as a combined whole. It is hard for fusion algorithms to account for the full dimensionality of the problem; there are variables outside current computational abilities. Consider semantic meaning, for example: two sentences could have the same lexical meaning but carry different emotional information.
In hybrid multi-level fusion, the integration of input modalities is distributed between the recognition and decision levels. Hybrid multi-level fusion includes the following three methodologies: finite-state transducers, multimodal grammars and dialogue moves.
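To make the finite-state flavour concrete, the sketch below compiles a toy multimodal grammar ("put <point> there <point>") into a transition table over (modality, token) pairs, so that integration happens while the interleaved event stream is parsed, rather than purely before or after recognition; the grammar and tokens are illustrative assumptions.

# state -> {(modality, token): next_state}
GRAMMAR = {
    0: {("speech", "put"): 1},
    1: {("gesture", "POINT"): 2},
    2: {("speech", "there"): 3},
    3: {("gesture", "POINT"): 4},   # 4 = accepting state
}

def accepts(events):
    # Walk the transition table over the interleaved multimodal stream.
    state = 0
    for modality, token in events:
        state = GRAMMAR.get(state, {}).get((modality, token))
        if state is None:
            return False
    return state == 4

stream = [("speech", "put"), ("gesture", "POINT"),
          ("speech", "there"), ("gesture", "POINT")]
print(accepts(stream))   # True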
Why is building Multimodal Interaction hard?
Multimodal interfaces often require “recognition” technology:
speech, handwriting, sketches, gesture, etc.
Recognition technology is immature:
it is finally “just good enough”;
single-mode toolkits are only just appearing;
there are no prototyping tools.
It is hard to combine recognition technologies:
building such systems still requires experts;
few toolkits or prototyping tools exist.
Multimodal Interaction – Why?
Provide a transparent, flexible, and powerfully expressive means of HCI.
Easier to learn and use.
Robustness and stability.
If used as front-ends to sophisticated application systems, conducting HCI in modes all users are familiar with, the cost of training users is reduced.
Potentially adaptive to the user, task and environment.
Improved task performance and user preference.
Migration of Human-Computer Interaction away from the desktop
Error recovery and handling
Special situations where mode choice helps