Eye Movement-Based Human-Computer Interaction Techniques:
Toward Non-Command Interfaces
ABSTRACT
User-computer dialogues are typically one-sided, with the bandwidth from computer to user far greater than that from user to computer. The movement of a user’s eyes can provide a convenient, natural, and high-bandwidth source of additional user input, to help redress this imbalance. We therefore investigate the introduction of eye movements as a computer input medium. Our emphasis is on the study of interaction techniques that incorporate eye movements into the user-computer dialogue in a convenient and natural way. This chapter describes research at NRL on developing such interaction techniques and the broader issues raised by non-command-based interaction styles. It discusses some of the human factors and technical considerations that arise in trying to use eye movements as an input medium, describes our approach and the first eye movement-based interaction techniques that we have devised and implemented in our laboratory, reports our experiences and observations on them, and considers eye movement-based interaction as an exemplar of a new, more general class of non-command-based user-computer interaction.
I. INTRODUCTION
In searching for better interfaces between users and their computers, an additional mode of communication between the two parties would be of great use. The problem of human-computer interaction can be viewed as two powerful information processors (human and computer) attempting to communicate with each other via a narrow-bandwidth, highly constrained interface [25]. Faster, more natural, more convenient (and, particularly, more parallel, less sequential) means for users and computers to exchange information are needed to increase the useful bandwidth across that interface.
On the user’s side, the constraints are in the nature of the communication organs and abilities with which humans are endowed; on the computer side, the only constraint is the range of devices and interaction techniques that we can invent and their performance. Current technology has been stronger in the computer-to-user direction than user-to-computer, hence today’s user-computer dialogues are typically one-sided, with the bandwidth from the computer to the user far greater than that from user to computer. We are especially interested in input media that can help redress this imbalance by obtaining data from the user conveniently and rapidly. We therefore investigate the possibility of using the movements of a user’s eyes to provide a high-bandwidth source of additional user input. While the technology for measuring a user’s visual line of gaze (where he or she is looking in space) and reporting it in real time has been improving, what is needed is appropriate interaction techniques that incorporate eye movements into the user-computer dialogue in a convenient and natural way. An interaction technique is a way of using a physical input device to perform a generic task in a human-computer dialogue [7].
Because eye movements are so different from conventional computer inputs, our basic approach to designing interaction techniques has been, wherever possible, to obtain information from the natural movements of the user’s eye while viewing the display, rather than requiring the user to make specific trained eye movements to actuate the system. We therefore begin by studying the characteristics of natural eye movements and then attempt to recognize corresponding patterns in the raw data obtainable from the eye tracker, convert them into tokens with higher-level meaning, and then build dialogues based on the known characteristics of eye movements.
In addition, eye movement-based interaction techniques provide a useful exemplar of a new, non-command style of interaction. Some of the qualities that distinguish eye movement-based interaction from more conventional types of interaction are shared by other newly emerging styles of human-computer interaction that can collectively be characterized as ‘‘non-command-based.’’ In a non-command-based dialogue, the user does not issue specific commands; instead, the computer passively observes the user and provides appropriate responses. Non-command-based interfaces will also have a significant effect on user interface software because of their emphasis on continuous, parallel input streams and real-time timing constraints, in contrast to conventional single-thread dialogues based on discrete tokens. We describe the simple user interface management system and user interface description language incorporated into our system and the more general requirements of user interface software for highly interactive, non-command styles of interaction.
Outline
This chapter begins by discussing the non-command interaction style. Then it focuses on eye movement-based interaction as an instance of this style. It introduces a taxonomy of the interaction metaphors pertinent to eye movements. It describes research at NRL on developing and studying eye movement-based interaction techniques. It discusses some of the human factors and technical considerations that arise in trying to use eye movements as an input medium, describes our approach and the first eye movement-based interaction techniques that we have devised and implemented in our laboratory, and reports our experiences and observations on them. Finally, the chapter returns to the theme of new interaction styles and attempts to identify and separate out the characteristics of non-command styles and to consider the impact of these styles on the future of user interface software.
II. NON-COMMAND INTERFACE STYLES
Eye movement-based interaction is one of several areas of current research in human-computer interaction in which a new interface style seems to be emerging. It represents a change in input from objects for the user to actuate by specific commands to passive equipment that simply senses parameters of the user’s body. Jakob Nielsen describes this property as non-command-based:
The fifth generation user interface paradigm seems to be centered around non-command-based dialogues. This term is a somewhat negative way of characterizing a new form of interaction but so far, the unifying concept does seem to be exactly the abandonment of the principle underlying all earlier paradigms: That a dialogue has to be controlled by specific and precise commands issued by the user and processed and replied to by the computer. The new interfaces are often not even dialogues in the traditional meaning of the word, even though they obviously can be analyzed as having some dialogue content at some level since they do involve the exchange of information between a user and a computer. The principles shown at CHI’90 which I am summarizing as being non-command-based interaction are eye tracking interfaces, artificial realities, play-along music accompaniment, and agents [19].
Previous interaction styles–batch, command line, menu, full-screen, natural language, and even current desktop or "WIMP" (window-icon-menu-pointer) styles–all await, receive, and respond to explicit commands from the user to the computer. In the non-command style, the computer passively monitors the user and responds as appropriate, rather than waiting for the user to issue specific commands. This distinction can be a subtle one, since any user action, even a non-voluntary one, could be viewed as a command, particularly from the point of view of the software designer. The key criterion should therefore be whether the user thinks he or she is issuing an explicit command. It is of course possible to control one’s eye movements, facial expressions, or gestures voluntarily, but that misses the point of a non-command-based interface; rather, it is supposed passively to observe, for example, the user’s natural eye movements, and respond based on them. The essence of this style is thus its non-intentional quality. Following Rich’s taxonomy of adaptive systems [13, 21], we can view this distinction as explicit vs. implicit commands; thus non-command really means implicit commands.
This style of interface requires the invention of new interaction techniques that are helpful but do not annoy the user. Because the inputs are often non-intentional, they must be interpreted carefully to avoid annoying the user with unwanted responses to inadvertent actions. For eye movements, we have called this the "Midas Touch" problem, since the highly responsive interface is both a boon and a curse. Our investigation of eye movement-based interaction techniques, described in this chapter, provides an example of how these problems can be attacked.
III. PERSPECTIVES ON EYE MOVEMENT-BASED INTERACTION
As with other areas of user interface design, considerable leverage can be obtained by drawing analogies that use people’s already-existing skills for operating in the natural environment and searching for ways to apply them to communicating with a computer. Direct manipulation interfaces have enjoyed great success, particularly with novice users, largely because they draw on analogies to existing human skills (pointing, grabbing, moving objects in physical space), rather than trained behaviors; and virtual realities offer the promise of usefully exploiting people’s existing physical navigation and manipulation abilities. These notions are more difficult to extend to eye movement-based interaction, since few objects in the real world respond to people’s eye movements. The principal exception is, of course, other people: they detect and respond to being looked at directly and, to a lesser and much less precise degree, to what else one may be looking at. In describing eye movement-based human-computer interaction we can draw two distinctions, as shown in Figure 1: one is in the nature of the user’s eye movements and the other, in the nature of the responses. Each of these could be viewed as natural (that is, based on a corresponding real-world analogy) or unnatural (no real world counterpart):
• Within the world created by an eye movement-based interface, users could move their eyes to scan the scene, just as they would a real world scene, unaffected by the presence of eye tracking equipment (natural eye movement, on the eye movement axis of Figure 1). The alternative is to instruct users of the eye movement-based interface to move their eyes in particular ways, not necessarily those they would have employed if left to their own devices, in order to actuate the system (unnatural or learned eye movements).
• On the response axis, objects could respond to a user’s eye movements in a natural way, that is, the object responds to the user’s looking in the same way real objects do. As noted, there is a limited domain from which to draw such analogies in the real world. The alternative is unnatural response, where objects respond in ways not experienced in the real world.
This suggests the range of possible eye movement-based interaction techniques shown in Figure 1 (although the two axes are more like continua than sharp categorizations). The natural eye movement/natural response area is a difficult one, because it draws on a limited and subtle domain, principally how people respond to other people’s gaze. Starker and Bolt [23] provide an excellent example of this mode, drawing on the analogy of a tour guide or host who estimates the visitor’s interests by his or her gazes. In the work described in this chapter, we try to use natural (not trained) eye movements as input, but we provide responses unlike those in the real world. This is a compromise between full analogy to the real world and an entirely artificial interface. We present a display and allow the user to observe it with his or her normal scanning mechanisms, but such scans then induce responses from the computer not normally exhibited by real world objects. Most previous eye movement-based systems have used learned ("unnatural") eye movements for operation and thus, of necessity, unnatural responses. Much of that work has been aimed at disabled or hands-busy applications, where the cost of learning the required eye movements ("stare at this icon to activate the device") is repaid by the acquisition of an otherwise impossible new ability. However, we believe that the real benefits of eye movement interaction for the majority of users will be in its naturalness, fluidity, low cognitive load, and almost unconscious operation; these benefits are attenuated if unnatural, and thus quite conscious, eye movements are required. The remaining category in Figure 1, unnatural eye movement/natural response, is anomalous and has not been used in practice.
IV. CHARACTERISTICS OF EYE MOVEMENTS
In order to proceed with the design of effective eye movement-based human-computer interaction, we must first examine the characteristics of natural eye movements, with emphasis on those likely to be exhibited by a user in front of a conventional (non-eyetracking) computer console.
The Eye
The retina of the eye is not uniform. Rather, one small portion near its center contains many densely-packed receptors and thus permits sharp vision, while the rest of the retina permits only much blurrier vision. That central portion (the fovea) covers a field of view approximately one degree in diameter (the width of one word in a book held at normal reading distance or slightly less than the width of your thumb held at the end of your extended arm). Anything outside that area is seen only with ‘‘peripheral vision,’’ with 15 to 50 percent of the acuity of the fovea. It follows that, to see an object clearly, it is necessary to move the eye so that the object appears on the fovea. Conversely, because peripheral vision is so poor relative to foveal vision and the fovea so small, a person’s eye position gives a rather good indication (to within the one-degree width of the fovea) of what specific portion of the scene before the person is being examined.
Types of Eye Movements
Human eye movements can be grouped into several categories [10, 27].
• First, the principal method for moving the fovea to view a different portion of the visual scene is a sudden and rapid motion called a saccade. Saccades take approximately 30-120 milliseconds and traverse a range between 1 and 40 degrees of visual angle (15-20 degrees being most typical). Saccades are ballistic, that is, once begun, their trajectory and destination cannot be altered. Vision is suppressed (but not entirely prevented) during a saccade. There is a 100-300 ms. delay between the onset of a stimulus that might attract a saccade (e.g., an object appearing in peripheral vision) and the saccade itself. There is also a 200 ms. refractory period after one saccade before it is possible to make another one. Typically, a saccade is followed by a 200-600 ms. period of relative stability, called a fixation, during which an object can be viewed. The purpose of a saccade appears to be to get an object that lies somewhere in the visual field onto one’s fovea for sharp viewing. Since the saccade is ballistic, such an object must be selected before the saccade is begun; peripheral vision must therefore be the means for selecting the target of each saccade.
• During a fixation, the eye does not remain still. Several types of small, jittery motions occur, generally less than one degree in size. There is a sequence of a slow drift followed by a sudden, tiny saccade-like jump to correct the effect of the drift (a microsaccade). Superimposed on these is a high-frequency tremor, like the noise seen in an imperfect servomechanism attempting to hold a fixed position.
• Another type of eye movement occurs only in response to a moving object in the visual field. This is a pursuit motion, much slower than a saccade and in synchrony with the moving object being viewed. Smooth pursuit motions cannot be induced voluntarily; they require a moving stimulus.
• Yet another type of movement, called nystagmus, can occur in response to motions of the head. This is a pattern of smooth motion to follow an object (as the head motion causes it to move across the visual field), followed by a rapid motion in the opposite direction to select another object (as the original object moves too far away to keep in view). It can be induced by acceleration detected by the inner ear canals, as when a person spins his or her head around or twirls rapidly, and also by viewing a moving, repetitive pattern.
• The eyes also move relative to one another, to point slightly toward each other when viewing a near object or more parallel for a distant object. Finally, they exhibit a small rotation around an axis extending from the fovea to the pupil, depending on neck angle and other factors.
Thus the eye is rarely entirely still, even when viewing a static display. It constantly moves and fixates different portions of the visual field; it makes small, jittery motions even during a fixation; and it seldom remains in one fixation for long. Visual perception of a static scene appears to require the artificially induced changes caused by moving the eye around the scene. In fact, an image that is artificially fixed on the retina (every time the eye moves, the target immediately moves precisely the same amount) will appear to fade from view after a few seconds [20]. The large and small motions the eye normally makes prevent this fading from occurring outside the laboratory.
Implications
The overall picture of eye movements for a user sitting in front of a computer is, then, a collection of steady (but slightly jittery) fixations connected by sudden, rapid saccades. Figure 2 shows a trace of eye movements (with intra-fixation jitter removed) for a user using a computer for 30 seconds. Compared to the slow and deliberate way people operate a mouse or other manual input device, eye movements careen wildly about the screen.
V. METHODS FOR MEASURING EYE MOVEMENTS
What to Measure
For human-computer dialogues, we wish to measure visual line of gaze, rather than simply the position of the eye in space or the relative motion of the eye within the head. Visual line of gaze is a line radiating forward in space from the eye; the user is looking at something along that line. To illustrate the difference, suppose an eye-tracking instrument detected a small lateral motion of the pupil. It could mean either that the user’s head moved in space (and his or her eye is still looking at nearly the same point) or that the eye rotated with respect to the head (causing a large change in where the eye is looking). We need to measure where the eye is pointing in space; not all eye tracking techniques do this. We do not normally measure how far out along the visual line of gaze the user is focusing (i.e., accommodation), but when viewing a two-dimensional surface like a computer console, it will be easy to deduce. Since both eyes generally point together, it is customary to track only one eye.
Electronic Methods
The simplest eye tracking technique is electronic recording, using electrodes placed on the skin around the eye to measure changes in the orientation of the potential difference that exists between the cornea and the retina. However, this method is more useful for measuring relative eye movements (i.e., AC electrode measurements) than absolute position (which requires DC measurements). It can cover a wide range of eye movements, but gives poor accuracy (particularly in absolute position). It is principally useful for diagnosing neurological problems revealed by eye movement patterns. Further details on this and the other eye tracking methods discussed here can be found in [27].
Mechanical Methods
Perhaps the least user-friendly approach uses a non-slipping contact lens ground to fit precisely over the corneal bulge. A slight suction is applied between the lens and the eye to hold it in place. The contact lens then has either a small mechanical lever, magnetic coil, or mirror attached for tracking. This method is extremely accurate, particularly for investigation of tiny eye movements, but practical only for laboratory studies. It is very awkward and uncomfortable, covers only a limited range, and interferes with blinking.
Optical/Video Methods – Single Point
More practical methods use remote imaging of some visible feature located on the eye, such as the boundary between the sclera (white portion of the front of the eye) and iris (colored portion), which is only partially visible at any one time; the outline of the pupil (this works best for subjects with light-colored eyes, or else the pupil can be illuminated so it appears lighter than the iris regardless of eye color); or the reflection off the front of the cornea of a collimated light beam shone at the eye. Any of these can then be used with photographic or video recording (for retrospective analysis) or with real-time video processing. They all require the head to be held absolutely stationary to be sure that any movement detected represents movement of the eye, rather than the head moving in space; a bite board is customarily used.
Optical/Video Methods – Two Point
However, by simultaneously tracking two features of the eye that move differentially with respect to one another as the line of gaze changes, it is possible to distinguish head movements (the two features move together) from eye movements (the two move with respect to one another). The head no longer need be rigidly fixed, although it must stay within camera range (which is quite small, due to the extreme telephoto lens required). Both the corneal reflection (from the light shining on the eye) and outline of the pupil (illuminated by the same light) are tracked. Infrared light is used, which is not disturbing to the subject. Then absolute visual line of gaze is computed from the relationship between the two tracked points. Temporal resolution is limited to the video frame rate (in particular, it cannot generally capture the dynamics of a saccade).
A related method used in the SRI eye tracker [5] tracks the corneal reflection plus the fourth Purkinje image (reflection from rear of lens); the latter is dim, so a bright illumination of the eye is needed. Reflections are captured by a photocell, which drives a servo-controlled mirror with an analog signal, avoiding the need for discrete sampling. Hence this method is not limited by video frame rate. The technique is accurate, fast, but very delicate to operate; it can also measure accommodation (focus distance).
Implications
While there are many approaches to measuring eye movements, most are more suitable for laboratory experiments than as an adjunct to normal computer use. The most reasonable method is the corneal reflection-plus-pupil outline approach, since nothing contacts the subject and the device permits his or her head to remain unclamped. In fact the eye tracker sits several feet away from the subject. Head motion is restricted only to the extent necessary to keep the pupil of the eye within view of the tracking camera. The camera is panned and focussed by a servomechanism that attempts to follow the eye as the subject’s head moves. The result is that the subject can move within approximately one cubic foot of space without losing contact with the eye tracker. Attached to the camera is an infrared illuminator that lights up the pupil (so that it is a bright circle against the dark iris) and also creates the corneal reflection; because the light is infrared, it is barely visible to the subject. With this method, the video image of the pupil is then analyzed to identify a large, bright circle (pupil) and a still brighter dot (corneal reflection) and compute the center of each; line of gaze is determined from these two points. This type of equipment is manufactured commercially; in our laboratory, we use an Applied Science Laboratories (Waltham, Mass.) Model 3250R corneal reflection eye tracker [17, 27]. Figure 3 shows the components of this type of eye tracker.
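As a rough illustration of how the two tracked points might be turned into a screen coordinate, the pupil-center-to-corneal-reflection difference vector can be mapped to screen position using coefficients obtained while the user fixates a few known calibration targets. This is a minimal sketch only; the commercial tracker's actual computation is more elaborate, and the affine form of the mapping and all names below are illustrative assumptions, not the vendor's algorithm.

    # Sketch: map the pupil-minus-corneal-reflection vector to a screen point.
    # An affine mapping fit to calibration data is a simple first-order model.
    import numpy as np

    def fit_gaze_mapping(diff_vectors, screen_points):
        """Fit screen x and y as affine functions of the difference vector
        (dx, dy), using data gathered while the user fixates known targets."""
        d = np.asarray(diff_vectors, dtype=float)           # shape (n, 2)
        s = np.asarray(screen_points, dtype=float)          # shape (n, 2)
        basis = np.column_stack([np.ones(len(d)), d[:, 0], d[:, 1]])
        coeffs, *_ = np.linalg.lstsq(basis, s, rcond=None)  # shape (3, 2)
        return coeffs

    def gaze_point(coeffs, dx, dy):
        """Estimate the on-screen point of regard for one video frame."""
        return np.array([1.0, dx, dy]) @ coeffs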
VI. PREVIOUS WORK
While the current technology for measuring visual line of gaze is adequate, there has been little research on using this information in real time. There is a considerable body of research using eye tracking, but it has concentrated on using eye movement data as a tool for studying motor and cognitive processes [14, 18]. Such work involves recording the eye movements and subsequently analyzing them; the user’s eye movements do not have any effect on the computer interface while it is in operation.
For use as a component of a user interface, the eye movement data must be obtained in real time and used in some way that has an immediate effect on the dialogue. This situation has been studied most often for disabled (quadriplegic) users, who can use only their eyes for input (e.g., [11, 15, 16] report work for which the primary focus was disabled users). Because all other user-computer communication modes are unavailable, the resulting interfaces are rather slow and tricky to use for non-disabled people, but, of course, a tremendous boon to their intended users.
One other case in which real-time eye movement data has been used in an interface is to create the illusion of a large, ultra-high resolution display in a flight simulator [24]. With this approach, the portion of the display that is currently being viewed is depicted with high resolution, while the larger surrounding area (visible only in peripheral vision) is depicted in lower resolution. Here, however, the eye movements are used essentially to simulate a better display device; the basic user-computer dialogue is not altered by the eye movements.
Our interest is, in contrast, in user-computer dialogues that combine real-time eye movement data with other, more conventional modes of user-computer communication. A relatively small amount of work has focussed on this particular problem. Richard Bolt did some of the earliest work and demonstrated several innovative uses of eye movements [1, 2, 23]. Floyd Glenn [8] used eye movements for several tracking tasks involving moving targets. Ware and Mikaelian [26] reported an experiment in which simple target selection and cursor positioning operations were performed approximately twice as fast with an eye tracker as with any of the more conventional cursor positioning devices. The Fitts’ law relationship as seen in experiments with other cursor positioning devices [4] remained true of the eye tracker; only the speed was different.
VII. PROBLEMS IN USING EYE MOVEMENTS IN A HUMAN-COMPUTER DIALOGUE
The most naive approach to using eye position as an input might be to use it as a direct substitute for a mouse: changes in the user’s line of gaze would directly cause the mouse cursor to move. This turns out to be an unworkable (and annoying) design. Two culprits make direct substitution of an eye tracker for a mouse impossible. The first is the eye itself, the jerky way it moves and the fact that it rarely sits still, even when its owner thinks he or she is looking steadily at a single object; the other is the instability of the available eye tracking hardware. There are significant differences between a manual input source like the mouse and eye position; some are advantages and some, disadvantages; they must all be considered in designing eye movement-based interaction techniques:
• First, as Ware and Mikaelian [26] observed, eye movement input is faster than other current input media. Before the user operates any mechanical pointing device, he or she usually looks at the destination to which he wishes to move. Thus the eye movement is available as an indication of the user’s goal before he or she could actuate any other input device.
• Second, it is easy to operate. No training or particular coordination is required of normal users for them to be able to cause their eyes to look at an object; and the control-to-display relationship for this device is already established in the brain.
• The eye is, of course, much more than a high speed cursor positioning tool. Unlike any other input device, an eye tracker also tells where the user’s interest is focussed. By the very act of pointing with this device, the user changes his or her focus of attention; and every change of focus is available as a pointing command to the computer. A mouse input tells the system simply that the user intentionally picked up the mouse and pointed it at something. An eye tracker input could be interpreted in the same way (the user intentionally pointed his or her eye at something, because he was trained to operate this system that way). But it can also be interpreted as an indication of what the user is currently paying attention to, without any explicit input action on his or her part.
• This same quality is the prime drawback of the eye as a computer input device. Moving one’s eyes is often an almost subconscious act. Unlike a mouse, it is relatively difficult to control eye position consciously and precisely at all times. The eyes continually dart from spot to spot, and it is not desirable for each such move to initiate a computer command.
• Similarly, unlike a mouse, eye movements are always ‘‘on.’’ There is no natural way to indicate when to engage the input device, as there is with grasping or releasing the mouse. Closing the eyes is rejected for obvious reasons–even with eye-tracking as input, the principal function of the eyes in the user-computer dialogue is for communication to the user. Using blinks as a signal is unsatisfactory because it detracts from the naturalness possible with an eye movement-based dialogue by requiring the user to think about when to blink.
• Also, in comparison to a mouse, eye tracking lacks an analogue of the integral buttons most mice have. Using blinks or eye closings for this purpose is rejected for the reason mentioned.
• Finally, the eye tracking equipment is far less stable and accurate than most manual input devices.
‘‘Midas Touch’’ Problem
The problem with a simple implementation of an eye tracker interface is that people are not accustomed to operating devices simply by moving their eyes. They expect to be able to look at an item without having the look ‘‘mean’’ something. At first, it is empowering to be able simply to look at what you want and have it happen, rather than having to look at it (as you would anyway) and then point and click it with the mouse. Before long, though, it becomes like the Midas Touch. Everywhere you look, another command is activated; you cannot look anywhere without issuing a command. The challenge in building a useful eye tracker interface is to avoid this Midas Touch problem. Ideally, the interface should act on the user’s eye input when he or she wants it to and let the user just look around when that’s what he wants, but the two cases are impossible to distinguish in general. Instead, we investigate interaction techniques that address this problem in specific cases.
Jitter of Eye
During a fixation, a user generally thinks he or she is looking steadily at a single object–he is not consciously aware of the small, jittery motions. This suggests that the human-computer dialogue should be constructed so that it, too, ignores those motions, since, ideally, it should correspond to what the user thinks he or she is doing, rather than what his eye muscles are actually doing. This will require filtering of the raw eye position data to eliminate the high-frequency jitter, but at the same time we must not unduly slow response to the high-frequency component of a genuine saccade.
Multiple ‘‘Fixations’’ in a Single ‘‘Gaze’’
A user may view a single object with a sequence of several fixations, all in the general area of the object. Since they are distinct fixations, separated by measurable saccades larger than the jitter mentioned above, they would be reported as individual fixations. Once again, if the user thinks he or she is looking at a single object, the user interface ought to treat the eye tracker data as if there were one event, not several. Therefore, following the approach of Just and Carpenter [14], if the user makes several fixations near the same screen object, connected by small saccades, we group them together into a single ‘‘gaze.’’ Further dialogue processing is performed in terms of these gazes, rather than fixations, since the former should be more indicative of the user’s intentions.
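A minimal sketch of this grouping follows. The object-lookup function and the decision to key gazes by screen object are illustrative assumptions; the chapter describes the grouping criterion only in terms of nearby fixations connected by small saccades.

    # Sketch: group successive fixations on (or near) the same screen object
    # into a single "gaze", following Just and Carpenter's approach.

    def group_fixations_into_gazes(fixations, object_at):
        """fixations: list of (x, y, duration_ms); object_at(x, y) returns the
        displayed object under that point, or None.  Returns a list of
        (object, total_duration_ms) gazes."""
        gazes = []
        for x, y, dur in fixations:
            obj = object_at(x, y)
            if gazes and obj is not None and gazes[-1][0] is obj:
                gazes[-1] = (obj, gazes[-1][1] + dur)   # extend the current gaze
            else:
                gazes.append((obj, dur))                # start a new gaze
        return gazes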
Instability in Eye Tracking Equipment
During operation of the eye tracker, there are often moments when the eye position is not available–the eye tracker fails to obtain an adequate video image of the eye for one or more frames. This could mean that the user blinked or moved his or her head outside the tracked region; if so, such information could be passed to the user interface. However, it could also mean simply that there was a spurious reflection in the video camera or any of a variety of other momentary artifacts. The two cases may not be distinguishable; hence, it is not clear how the user interface should respond to brief periods during which the eye tracker reports no position. The user may indeed have looked away, but he or she may also think he is looking right at some target on the screen, and the system is failing to respond.
Visual Feedback to User
One obvious question is whether the system should provide a screen cursor that follows the user’s eye position (as is done for mice and other conventional devices). If the eye tracker were perfect, the image of such a cursor would become stationary on the user’s retina and thus disappear from perception. In fact, few eye trackers can track small, high-frequency motions rapidly or precisely enough for this to be a problem, but it does illustrate the subtlety of the design issues.
A more immediate problem is that an eye-following cursor will tend to move around and thus attract the user’s attention. Yet it is perhaps the least informative aspect of the display (since it tells you where you are already looking). Further, if there is any systematic calibration error, the cursor will be slightly offset from where the user is actually looking, causing the user’s eye to be drawn to the cursor, which will further displace the cursor, creating a positive feedback loop. This is indeed a practical problem, and we often observe it.
Finally, if the calibration and response speed of the eye tracker were perfect, feedback would not be necessary, since a person knows exactly where he or she is looking (unlike the situation with a mouse cursor, which helps one visualize the relationship between mouse positions and points on the screen).
Implications
Our approach to processing eye movement data is to partition the problem into two stages. First we process the raw data from the eye tracker in order to filter noise, recognize fixations, compensate for local calibration errors, and generally try to reconstruct the user’s more conscious intentions from the available information. This processing stage converts the continuous, somewhat noisy stream of raw eye position reports into discrete tokens (described below) that are claimed to approximate more closely the user’s intentions in a higher-level user-computer dialogue. In doing so, jitter during fixations is smoothed, fixations are grouped into gazes, and brief eye tracker artifacts are removed.
Next, we design generic interaction techniques based on these tokens as inputs. Because eye movements are so different from conventional computer inputs, we achieve best results with a philosophy that tries, as much as possible, to use natural eye movements as an implicit input, rather than to train a user to move the eyes in a particular way to operate the system. We address the ‘‘Midas Touch’’ problem by trying to think of eye position more as a piece of information available to the user-computer dialogue involving a variety of input devices than as the intentional actuation of the principal input device.
VIII. EXPERIENCE WITH EYE MOVEMENTS
Configuration
As noted, we use an Applied Science Laboratories eye tracker in our laboratory. The user sits at a conventional (government-issue) desk, with a 16" Sun computer display, mouse, and keyboard, in a standard chair and office. The eye tracker camera/illuminator sits on the desk next to the monitor. Other than the illuminator box with its dim red glow, the overall set-ting is thus far just like that for an ordinary office computer user. In addition, the room lights are dimmed to keep the user’s pupil from becoming too small. The eye tracker transmits the x and y coordinates for the user’s visual line of gaze every 1/60 second, on a serial port, to a Sun 4/260 computer. The Sun performs all further processing, filtering, fixation recognition, and some additional calibration. Software on the Sun parses the raw eye tracker data stream into tokens that represent events meaningful to the user-computer dialogue. Our user interface management system, closely modeled after that described in [12], multiplexes these tokens with other inputs (such as mouse and keyboard) and processes them to implement the user interfaces under study.
The eye tracker is, strictly speaking, non-intrusive and does not touch the user in any way. Our setting is almost identical to that for a user of a conventional office computer. Nevertheless, we find it is difficult to ignore the eye tracker. It is noisy; the dimmed room lighting is unusual; the dull red light, while not annoying, is a constant reminder of the equipment; and, most significantly, the action of the servo-controlled mirror, which results in the red light following the slightest motions of the user’s head, gives one the eerie feeling of being watched. One further wrinkle is that the eye tracker is designed for use in experiments, where there is a ‘‘subject’’ whose eye is tracked and an ‘‘experimenter’’ who monitors and adjusts the equipment. Operation by a single user playing both roles simultaneously is somewhat awkward because, as soon as you look at the eye tracker control panel to make an adjustment, your eye is no longer pointed where it should be for tracking.
Accuracy and Range
A user generally need not position his or her eye more accurately than the width of the fovea (about one degree) to see an object sharply. Finer accuracy from an eye tracker might be needed for studying the operation of the eye muscles but adds little for our purposes. The eye’s normal jittering further limits the practical accuracy of eye tracking. It is possible to improve accuracy by averaging over a fixation, but not in a real-time interface.
Despite the servo-controlled mirror mechanism for following the user’s head, we find that the steadier the user holds his or her head, the better the eye tracker works. We find that we can generally get two degrees accuracy quite easily, and sometimes can achieve one degree (or approximately 0.4" or 40 pixels on the screen at a 24" viewing distance). The eye tracker should thus be viewed as having a resolution much coarser than that of a mouse or most other pointing devices, perhaps more like a traditional touch screen. An additional problem is that the range over which the eye can be tracked with this equipment is fairly limited. In our configuration, it cannot quite cover the surface of a 19" monitor at a 24" viewing distance.
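The geometry behind the one-degree figure is straightforward; the sketch below works it through. The roughly 95 pixels-per-inch screen density is inferred from the numbers quoted above and is an assumption, not a value stated in the text.

    # Sketch: one degree of visual angle at a 24-inch viewing distance
    # subtends roughly 0.4 inch, or about 40 pixels on a ~95 dpi screen.
    import math

    viewing_distance_in = 24.0
    one_degree_in = 2 * viewing_distance_in * math.tan(math.radians(0.5))
    print(round(one_degree_in, 2))      # ~0.42 inch subtended by one degree
    print(round(one_degree_in * 95))    # ~40 pixels, assuming ~95 pixels/inch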
Local Calibration
Our first step in processing eye tracker data was to introduce an additional layer of calibration into the chain. The eye tracker calibration procedure produces a mapping that is applied uniformly to the whole screen; ideally, no further calibration or adjustment is necessary. In practice, we found that small calibration errors appear in portions of the screen, rather than systematically across it. The additional calibration layer, which runs outside of the eye tracker computer, allows the user to make local modifications to the calibration, based on arbitrary points he or she inputs whenever he feels it would be helpful. The procedure is that, if the user feels the eye tracker is not responding accurately in some area of the screen, he or she moves the mouse cursor to that area, looks at the cursor, and clicks a button. That introduces an offset, which warps future eye tracker reports in the vicinity of the given point, i.e., all reports nearer to that point than to the next-nearest local calibration point. (We found this gave better results than smoothly interpolating the local calibration offsets.) The user can do this at any time and in any position, as needed.
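A minimal sketch of this nearest-point local calibration follows. The class and method names are hypothetical; the nearest-point (non-interpolating) rule follows the description above.

    # Sketch: local calibration by nearest-point offsets.  When the user looks
    # at the mouse cursor and clicks, record the offset between the reported
    # eye position and the cursor; correct future reports by the offset of the
    # nearest recorded calibration point (no interpolation, per the text).

    class LocalCalibration:
        def __init__(self):
            self.points = []          # list of (x, y, dx, dy)

        def add_point(self, cursor_xy, reported_xy):
            cx, cy = cursor_xy
            rx, ry = reported_xy
            self.points.append((cx, cy, cx - rx, cy - ry))

        def correct(self, reported_xy):
            if not self.points:
                return reported_xy
            rx, ry = reported_xy
            # Apply the offset of the nearest local calibration point.
            cx, cy, dx, dy = min(self.points,
                                 key=lambda p: (p[0] - rx) ** 2 + (p[1] - ry) ** 2)
            return (rx + dx, ry + dy)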
Surprisingly, this had the effect of increasing the apparent response speed for object selection and other interaction techniques. The reason is that, if the calibration is slightly wrong in a local region and the user stares at a single target in that region, the eye tracker will report the eye position somewhere slightly outside the target. If the user continues to stare at it, though, his or her eyes will in fact jitter around to a spot that the eye tracker will report as being on the target. The effect feels as though the system is responding too slowly, but it is a problem of local calibration. The local calibration procedure results in a marked improvement in the apparent responsiveness of the interface as well as an increase in the user’s control over the system (since the user can re-calibrate when and where desired).
Fixation Recognition
After improving the calibration, we still observed what seemed like erratic behavior in the user interface, even when the user thought he or she was staring perfectly still. This was caused by both natural and artificial sources: the normal jittery motions of the eye during fixations as well as artifacts introduced when the eye tracker momentarily fails to obtain an adequate video image of the eye.
Figure 4 shows the type of data obtained from the eye tracker. It plots the x coordinate of the eye position output against time over a relatively jumpy three-second period. (A plot of the y coordinate for the same period would show generally the same areas of smooth vs. jumpy behavior, but different absolute positions.) Zero values on the ordinate represent periods when the eye tracker could not locate the line of gaze, due either to eye tracker artifacts, such as glare in the video camera, lag in compensating for head motion, or failure of the processing algorithm, or to actual user actions, such as blinks or movements outside the range of the eye tracker. Unfortunately, the two cases are indistinguishable in the eye tracker output. During the period represented by Figure 4, this subject thought he was simply looking around at a few different points on a CRT screen. Buried in these data, thus, are a few relatively long gazes along with some motions to connect the gazes. Such raw data are quite unusable as input to a human-computer dialogue: while the noise and jumpiness do partly reflect the actual motion of the user’s eye muscles, they do not reflect his intentions nor his impression of what his eyes were doing. The difference is attributable not only to the eye tracker artifacts but to the fact that much of the fine-grained behavior of the eye muscles is not intentional.
The problem is to extract from the noisy, jittery, error-filled stream of position reports produced by the eye tracker some ‘‘intentional’’ components of the eye motions, which make sense as tokens in a user-computer dialogue. Our first solution was to use a simple moving average filter to smooth the data. It improves performance during a fixation, but tends to dampen the sudden saccades that move the eye from one fixation to the next. Since one of the principal benefits we hope to obtain from eye motions as input is speed, damping them is counterproductive. Further, the resulting smoothed data do not correctly reflect the user’s intentions. The user was not slowly gliding from one fixation to another; he was, in fact, fixating a spot and then jumping ballistically to a new fixation.
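A sketch of such a moving average filter is shown below; the window length is an arbitrary illustrative value, not the one used in our system. It steadies fixations but smears the ballistic saccades, which is the drawback just described.

    # Sketch: simple moving average over the last few (x, y) samples.
    def moving_average(samples, window=6):
        out, buf = [], []
        for x, y in samples:
            buf.append((x, y))
            if len(buf) > window:
                buf.pop(0)
            out.append((sum(p[0] for p in buf) / len(buf),
                        sum(p[1] for p in buf) / len(buf)))
        return out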
Instead, we return to the picture of a computer user’s eye movements as a collection of jittery fixations connected by essentially instantaneous saccades. We start with an a priori model of such saccades and fixations and then attempt to recognize those events in the data stream. We then identify and quickly report the start and approximate position of each recognized fixation. We ignore any reports of eye position during saccades themselves, since they are difficult for the eye tracker to catch and their dynamics are not particularly meaningful to the user-computer dialogue.
Specifically, our algorithm, which is based on that used for analyzing previously-recorded files of raw eye movement data and on the known properties of fixations and saccades, watches the input data for a sequence of 100 milliseconds during which the reported eye position remains within approximately 0.5 degrees. As soon as the 100 ms. have passed, it reports the start of a fixation and takes the mean of the 100 ms. worth of data as the location of that fixation. A better estimate of the location of a fixation could be obtained by averaging over more eye tracker data, but this would mean a longer delay before the fixation position could be reported to the user interface software. Our algorithm implies a delay of 100 ms. before reporting the start of a fixation, and, in practice, this delay is nearly undetectable to the user. Further eye positions within approximately one degree are assumed to represent continuations of the same fixation (rather than a saccade to a new one). To terminate a fixation, 50 ms. of data lying outside one degree of the current fixation must be received. Blinks or artifacts of up to 200 ms. may occur during a fixation without terminating it. (These occur when the eye tracker reports a "no position" code.) At first, blinks seemed to present a problem, since, obviously, we cannot obtain eye position data during a blink. However (equally obviously in retrospect), the screen need not respond to the eye during that blink period, since the user can’t see it anyway.
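A minimal sketch of this recognition step follows, assuming 60 Hz position samples already expressed in degrees of visual angle, with None standing for the tracker's "no position" code. The thresholds come from the description above; the helper names, data layout, and exact control structure are illustrative assumptions, not the actual NRL implementation.

    SAMPLE_MS = 1000.0 / 60.0          # one eye tracker report per video field

    def recognize_fixations(samples):
        """Yield ('start'|'continue'|'end', x, y) events from raw samples."""
        window, fix, outside_ms, lost_ms = [], None, 0.0, 0.0
        for s in samples:
            if s is None:                              # tracker lost the eye
                lost_ms += SAMPLE_MS
                if fix is not None and lost_ms > 200:  # >200 ms: end fixation
                    yield ('end', fix[0], fix[1])
                    fix, window = None, []
                continue
            lost_ms = 0.0
            if fix is None:
                window.append(s)
                if len(window) * SAMPLE_MS >= 100 and _spread(window) <= 0.5:
                    fix = _mean(window)                # report start after 100 ms
                    yield ('start', fix[0], fix[1])
                elif len(window) * SAMPLE_MS >= 100:
                    window.pop(0)                      # slide the candidate window
            else:
                if _dist(s, fix) <= 1.0:               # continuation of fixation
                    outside_ms = 0.0
                    yield ('continue', fix[0], fix[1])
                else:
                    outside_ms += SAMPLE_MS
                    if outside_ms >= 50:               # 50 ms outside: fixation over
                        yield ('end', fix[0], fix[1])
                        fix, window, outside_ms = None, [s], 0.0

    def _mean(pts):
        return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))

    def _dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def _spread(pts):
        cx, cy = _mean(pts)
        return max(_dist(p, (cx, cy)) for p in pts)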
After applying this algorithm, the noisy data shown in Figure 4 are found to comprise about 6 fixations, which more accurately reflects what the user thought he was doing (rather than what his eye muscles plus the eye tracking equipment actually did). Figure 5 shows the same data, with a horizontal line marking each recognized fixation at the time and location it would be reported.
Applying the fixation recognition approach to the real-time data coming from the eye tracker yielded a significant improvement in the user-visible behavior of the interface. Filtering the data based on an a priori model of eye motion is an important step in transforming the raw eye tracker output into a user-computer dialogue.
Re-assignment of Off-target Fixations
The processing steps described thus far are open-loop in the sense that eye tracker data are translated into recognized fixations at specific screen locations without reference to what is displayed on the screen. The next processing step is applied to fixations that lie outside the boundaries of the objects displayed on the screen. This step uses knowledge of what is actually on the screen, and serves further to compensate for small inaccuracies in the eye tracker data. It allows a fixation that is near, but not directly on, an eye-selectable screen object to be accepted. Given a list of currently displayed objects and their screen extents, the algorithm will reposition a fixation that lies outside any object, provided it is "reasonably" close to one object and "reasonably" further from all other such objects (i.e., not halfway between two objects, which would lead to unstable behavior). It is important that this procedure is applied only to fixations detected by the recognition algorithm, not to individual raw eye tracker position reports. The result of this step is to improve performance in areas of the screen at which the eye tracker calibration is imperfect, without increasing false selections in areas where the calibration is good, since fixations in those areas fall directly on their targets and would not activate this processing step.
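A minimal sketch of this re-assignment rule follows. The distance thresholds and the object interface are illustrative assumptions; the chapter states the "reasonably close" and "reasonably further" criteria only qualitatively.

    # Sketch: re-assign an off-target fixation to a nearby selectable object,
    # provided it is reasonably close to that object and clearly closer to it
    # than to any other selectable object.

    def reassign_fixation(fix_xy, objects, near_deg=1.0, margin=2.0):
        """objects: items with a distance_to(x, y) method returning the distance
        (in degrees) from the point to the object's screen extent (0 if inside).
        Returns the chosen object or None."""
        ranked = sorted(objects, key=lambda o: o.distance_to(*fix_xy))
        if not ranked:
            return None
        nearest = ranked[0]
        d0 = nearest.distance_to(*fix_xy)
        if d0 == 0:                               # fixation already on the object
            return nearest
        if d0 > near_deg:                         # not reasonably close to anything
            return None
        if len(ranked) > 1:
            d1 = ranked[1].distance_to(*fix_xy)
            if d1 < margin * d0:                  # ambiguous: between two objects
                return None
        return nearest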
IX. USER INTERFACE MANAGEMENT SYSTEM
In order to make the eye tracker data more tractable for use as input to an interactive user interface, we turn the output of the recognition algorithm into a stream of tokens. We report tokens for eye events considered meaningful to the user-computer dialogue, analogous to the way that raw input from a keyboard (shift key went down, letter a key went down, etc.) is turned into meaningful events (one ASCII upper case A was typed). We report tokens for the start, continuation (every 50 ms., in case the dialogue is waiting to respond to a fixation of a certain duration), and end of each detected fixation. Each such token is tagged with the actual fixation duration to date, so an interaction technique that expects a fixation of a particular length will not be skewed by delays in processing by the UIMS (user interface management system) or by the delay inherent in the fixation recognition algorithm. Between fixations, we periodically report a non-fixation token indicating where the eye is, although our current interaction techniques ignore this token in preference to the fixation tokens, which are more filtered. A token is also reported whenever the eye tracker fails to determine eye position for 200 ms. and again when it resumes tracking. In addition, tokens are generated whenever a new fixation enters or exits a monitored region, just as is done for the mouse. Note that jitter during a single fixation will never cause such an enter or exit token, though, since the nominal position of a fixation is determined at the start of a fixation and never changes during the fixation. These tokens, having been processed by the algorithms described above, are suitable for use in a user-computer dialogue in the same way as tokens generated by mouse or keyboard events.
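A sketch of what such eye-event tokens might look like as data records is given below. The fixation token names appear later in the text; the record layout and the names for the tracking-lost and region-crossing tokens are assumptions for illustration.

    # Sketch: eye-event tokens produced by the fixation/gaze processing and
    # multiplexed with mouse and keyboard tokens in the UIMS input stream.
    from dataclasses import dataclass

    @dataclass
    class EyeToken:
        kind: str             # 'EYEFIXSTART', 'EYEFIXCONT', 'EYEFIXEND',
                              # plus assumed names for tracker-lost/resumed and
                              # region enter/exit tokens
        x: float = 0.0        # nominal fixation position (screen coordinates)
        y: float = 0.0
        duration_ms: int = 0  # fixation duration to date, so downstream dialogues
                              # are not skewed by processing delays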
We then multiplex the eye tokens into the same stream with those generated by the mouse and keyboard and present the overall token stream as input to our user interface management system [12]. The desired user interface is specified to the UIMS as a collection of relatively simple individual dialogues, represented by separate interaction objects, which comprise the user interface description language (UIDL). They are connected by an executive that activates and suspends them with retained state, like coroutines.
A typical object might be a screen button, scroll bar, text field, or eye-selectable graphic object. Since, at the level of individual objects, each such object conducts only a single-thread dialogue, with all inputs serialized and with a remembered state whenever the individual dialogue is interrupted by that of another interaction object, the operation of each interaction object is conveniently specified as a simple single-thread state transition diagram that accepts the tokens as input. Each object can accept any combination of eye, mouse, and keyboard tokens, as specified in its own syntax diagram, and provides a standard method that the executive can call to offer it an input token and traverse its diagram. Each interaction object is also capable of redrawing itself upon command (it contains the needed state information or else contains calls to the access functions that will obtain such from its domain object). An interaction object can have different screen extents for purposes of re-drawing, accepting mouse tokens, and accepting eye tokens.
A standard executive is then defined for the outer dialogue loop. It operates by collecting all of the state diagrams of the interaction objects and executing them as a collection of coroutines, assigning input tokens to them and arbitrating among them as they proceed. Whenever the currently-active dialogue receives a token it cannot accept, the executive causes it to relinquish control by coroutine call to whatever dialogue can, given its current state, accept it. If none can, the executive discards the token and proceeds.
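A minimal sketch of that outer loop follows. The accept method, which asks an interaction object to traverse its syntax diagram with a token and report whether it could, is an assumed interface; the coroutine behavior is approximated by a plain loop for illustration.

    # Sketch: offer each incoming token to the currently active interaction
    # object; if it cannot accept the token in its current state, try the other
    # objects, and discard the token if none can.

    def run_executive(interaction_objects, token_stream):
        active = interaction_objects[0] if interaction_objects else None
        for token in token_stream:
            if active is not None and active.accept(token):
                continue                          # active dialogue consumed it
            for obj in interaction_objects:
                if obj is not active and obj.accept(token):
                    active = obj                  # coroutine-style handoff
                    break
            # otherwise no dialogue can accept this token; it is discarded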
For example, each ship (see Figure 6) is a separate interaction object (but all are of the same class, Ship). An additional lower-level interaction object (Gazer) is provided to perform the translation of fixations into gazes, as described above. That is, every interaction object such as Ship also has a Gazer interaction object associated with it. The Gazer accepts fixations on (or near, according to the criteria described above, and by means of a class variable shared by all the active Gazers) its parent object and then combines such consecutive fixations into a single gaze token, which it sends to its parent object (the Ship). Figure 7 shows the syntax diagram for Gazer; it accepts the tokens generated by the fixation recognition algorithm (EYEFIXSTART, EYEFIXCONT, and EYEFIXEND), tests whether they lie within its extent or else meet the criteria for off-target fixations described above (implemented in the call to IsMine), accumulates them into gazes, and sends gaze tokens (EYEGAZESTART, EYEGAZECONT, and EYEGAZEEND) directly to its parent object. The Ship interaction object syntax then need only accept and respond to the gaze tokens sent by its Gazer.
Figure 7 also shows the portion of the Ship interaction object syntax diagram concerned with selecting a ship by looking at it for a given dwell time (for clarity the syntax for dragging and other operations described below is not shown in the figure; also not shown are the tokens that the selected ship sends to the other ships to deselect the previously-selected ship, if any). When a user operation upon a ship causes a semantic-level consequence (e.g., moving a ship changes the track data), the Ship interaction object calls its parent, an application domain object, to do the work. Although the syntax may seem complicated as described here, it is well matched to the natural saccades and fixations of the eye.
X. INTERACTION TECHNIQUES
Interaction techniques provide a useful focus for this type of research because they are specific, yet not bound to a single application. An interaction technique represents an abstraction of some common class of interactive task, for example, choosing one of several objects shown on a display screen. Research in this area studies the primitive elements of human-computer dialogues, which apply across a wide variety of individual applications. The goal is to add new, high-bandwidth methods to the available store of input/output devices, interaction techniques, and generic dialogue components. Mockups of such techniques are then studied by measuring their properties, and attempts are made to determine their composition rules. This section describes the first few eye movement-based interaction techniques that we have implemented and our initial observations from using them.
Object Selection
This task is to select one object from among several displayed on the screen, for example, one of several file icons on a desktop or, as shown in Figure 6, one of several ships on a map in a hypothetical ‘‘command and control’’ system. With a mouse, this is usually done by pointing at the object and then pressing a button. With the eye tracker, there is no natural counterpart of the button press. As noted, we rejected using a blink and instead tested two alternatives. In one, the user looks at the desired object then presses a button on a keypad to indicate his or her choice. In Figure 6, the user has looked at ship ‘‘EF151’’ and caused it to be selected (for attribute display, described below). The second alternative uses dwell time–if the user continues to look at the object for a sufficiently long time, it is selected without further operations. The two techniques are actually implemented simultaneously, where the button press is optional and can be used to avoid waiting for the dwell time to expire, much as an optional menu accelerator key is used to avoid traversing a menu. The idea is that the user can trade between speed and a free hand: if the user needs speed and can push the button he or she need not be delayed by eye tracker dwell time; if the user does not need maximum speed, then object selection reverts to the more passive eye-only mode using dwell time.
At first this seemed like a good combination. In practice, however, the dwell time approach is much more convenient. While a long dwell time might be used to ensure that an inadvertent selection will not be made by simply ‘‘looking around’’ on the display, this mitigates the speed advantage of using eye movements for input and also reduces the responsiveness of the interface. To reduce dwell time, we make a further distinction. If the result of selecting the wrong object can be undone trivially (selection of a wrong object followed by a selection of the right object causes no adverse effect–the second selection instantaneously overrides the first), then a very short dwell time can be used. For example, if selecting an object causes a display of information about that object to appear and the information display can be changed instantaneously, then the effect of selecting wrong objects is immediately undone as long as the user eventually reaches the right one. This approach, using a 150-250 ms. dwell time, gives excellent results. The lag between eye movement and system response (required to reach the dwell time) is hardly detectable to the user, yet long enough to accumulate sufficient data for our fixation recognition and processing. The subjective feeling is of a highly responsive system, almost as though the system is executing the user’s intentions before he or she expresses them. For situations where selecting an object is more difficult to undo, button confirmation is used rather than a longer dwell time. We found no case where a long dwell time (over 3/4 second) alone was useful, probably because that is not a natural eye movement (people do not normally fixate one spot for that long) and also creates the suspicion that the system has crashed.
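A minimal sketch of dwell-time selection with the optional button accelerator follows. The token kinds, fields, and select callback are illustrative assumptions; gaze tokens are assumed to carry their target object and accumulated duration, as produced by a Gazer-style object.

    # Sketch: a gaze of DWELL_MS on an object selects it; a button press
    # selects the currently gazed-at object immediately, without the wait.
    DWELL_MS = 200   # within the 150-250 ms range described above

    def selection_dialogue(tokens, select):
        gazed, selected = None, None
        for t in tokens:
            if t.kind in ('EYEGAZESTART', 'EYEGAZECONT'):
                gazed = t.target
                if t.duration_ms >= DWELL_MS and gazed is not selected:
                    selected = gazed
                    select(selected)          # dwell time expired: select it
            elif t.kind == 'EYEGAZEEND':
                gazed = None
            elif t.kind == 'BUTTONPRESS' and gazed is not None:
                selected = gazed
                select(selected)              # button press bypasses the dwell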
As described in the section on calibration, our initial implementation of this interaction technique performed very poorly. Even when we reduced the dwell time parameter drastically, it seemed too high. The culprit turned out to be small discrepancies between where the user was looking and what the eye tracker reported. Eliminating these errors improved the apparent responsiveness of the object selection technique. Further resistance to calibration errors is provided by an algorithm that accepts fixations outside a selectable object, provided they are fairly close to it and are substantially closer to it than to any other selectable objects. The processing of raw eye position data into fixations and then grouping them into gazes (described above) is also applied to this interaction technique.