ABSTRACT
Skinput is an input technology that uses bio-acoustic sensing to localize finger
taps on the skin. When augmented with a pico-projector, the device can provide a
direct-manipulation graphical user interface on the body. The technology was developed by
Chris Harrison, Desney Tan, and Dan Morris, at Microsoft Research's Computational
User Experiences Group. Skinput represents one way to decouple input from electronic
devices with the aim of allowing devices to become smaller without simultaneously
shrinking the surface area on which input can be performed. While other systems, like
Sixth Sense, have attempted this with computer vision, Skinput employs acoustics, which
take advantage of the human body's natural sound conductive properties (e.g., bone
conduction). This allows the body to be annexed as an input surface without the need for
the skin to be invasively instrumented with sensors, tracking markers, or other items.
INTRODUCTION
Devices with significant computational power and capabilities can now be easily
carried on our bodies. However, their small size typically leads to limited interaction
space (e.g. diminutive screens, buttons, and jog wheels) and consequently diminishes
their usability and functionality. Since we cannot simply make buttons and screens larger
without losing the primary benefit of small size, we consider alternative approaches that
enhance interactions with small mobile systems. One option is to opportunistically
appropriate surface area from the environment for interactive purposes. For example, [10]
describes a technique that allows a small mobile device to turn tables on which it rests
into a gestural finger input canvas. However, tables are not always present, and in a
mobile context, users are unlikely to want to carry appropriated surfaces with them (at
this point, one might as well just have a larger device). However, there is one surface that
has been previously overlooked as an input canvas, and one that happens to always travel
with us: our skin. Appropriating the human body as an input device is appealing not only
because we have roughly two square meters of external surface area, but also because
much of it is easily accessible by our hands (e.g., arms, upper legs, torso). Furthermore,
proprioception – our sense of how our body is configured in three-dimensional space –
allows us to accurately interact with our bodies in an eyes-free manner. For example, we
can readily flick each of our fingers, touch the tip of our nose, and clap our hands
together without visual assistance. Few external input devices can claim this accurate,
eyes-free input characteristic and provide such a large interaction area. In this paper, we
present our work on Skinput – a method that allows the body to be appropriated for finger
input using a novel, non-invasive, wearable bio-acoustic sensor.
The contributions of this paper are:
1) We describe the design of a novel, wearable sensor for bio-acoustic signal acquisition
2) We describe an analysis approach that enables our system to resolve the location of
finger taps on the body.
3) We assess the robustness and limitations of this system through a user study.
RELATED WORK
2.1 Always-Available Input
The primary goal of Skinput is to provide an always-available mobile input system
– that is, an input system that does not require a user to carry or pick up a device. A
number of alternative approaches have been proposed that operate in this space.
Techniques based on computer vision are popular (e.g. [3,26,27], see [7] for a recent
survey). These, however, are computationally expensive and error prone in mobile
scenarios (where, e.g., non-input optical flow is prevalent). Speech input (e.g. [13,15]) is
a logical choice for always-available input, but is limited in its precision in unpredictable
acoustic environments, and suffers from privacy and scalability issues in shared
environments. Other approaches have taken the form of wearable computing. This
typically involves a physical input device built in a form considered to be part of one’s
clothing. For example, glove-based input systems (see [25] for a review) allow users to
retain most of their natural hand movements, but are cumbersome, uncomfortable, and
disruptive to tactile sensation. Post and Orth [22] present a “smart fabric” system that
embeds sensors and conductors into fabric, but taking this approach to always-available
input necessitates embedding technology in all clothing, which would be prohibitively
complex and expensive. The Sixth-Sense project [19] proposes a mobile, always-available
input/output capability by combining projected information with a color-marker-based
vision tracking system. This approach is feasible, but suffers from serious
occlusion and accuracy limitations. For example, determining whether, e.g., a finger has
tapped a button, or is merely hovering above it, is extraordinarily difficult. In the present
work, we briefly explore the combination of on-body sensing with on-body projection.
2.2 Bio-Sensing
Skinput leverages the natural acoustic conduction properties of the human body to
provide an input system, and is thus related to previous work in the use of biological
signals for computer input. Signals traditionally used for diagnostic medicine, such as
heart rate and skin resistance, have been appropriated for assessing a user’s emotional
state (e.g. [16,17,20]). These features are generally subconsciously driven and cannot be
controlled with sufficient precision for direct input. Similarly, brain sensing technologies
such as electroencephalography (EEG) and functional near-infrared spectroscopy (fNIR)
have been used by HCI researchers to assess cognitive and emotional state (e.g.
[9,11,14]); this work also primarily looked at involuntary signals. In contrast, brain
signals have been harnessed as a direct input for use by paralyzed patients (e.g. [8,18]),
but direct brain computer interfaces (BCIs) still lack the bandwidth required for everyday
computing tasks, and require levels of focus, training, and concentration that are
incompatible with typical computer interaction. There has been less work relating to the
intersection of finger input and biological signals. Researchers have harnessed the
electrical signals generated by muscle activation during normal hand movement through
electromyography (EMG) (e.g. [23,24]). At present, however, this approach typically
requires expensive amplification systems and the application of conductive gel for
effective signal acquisition, which would limit the acceptability of this approach for most
users. The input technology most related to our own is that of Amento et al. [2], who
placed contact microphones on a user’s wrist to assess finger movement. However, this
work was never formally evaluated and is constrained to finger motions in one hand. The
Hambone system [6] employs a similar setup, and through an HMM, yields classification
accuracies around 90% for four gestures (e.g., raise heels, snap fingers). Performance of
false positive rejection remains untested in both systems at present. Moreover, both
techniques required the placement of sensors near the area of interaction (e.g., the wrist),
increasing the degree of invasiveness and visibility. Finally, bone conduction
microphones and headphones – now common consumer technologies – represent an
additional bio-sensing technology that is relevant to the present work. These leverage the
fact that sound frequencies relevant to human speech propagate well through bone. Bone
conduction microphones are typically worn near the ear, where they can sense vibrations
propagating from the mouth and larynx during speech. Bone conduction headphones send
sound through the bones of the skull and jaw directly to the inner ear, bypassing
transmission of sound through the air and outer ear, leaving an unobstructed path for
environmental sounds.
2.3 Acoustic Input
Our approach is also inspired by systems that leverage acoustic transmission
through (non-body) input surfaces. Paradiso et al. [21] measured the arrival time of a
sound at multiple sensors to locate hand taps on a glass window. Ishii et al. [12] use a
similar approach to localize a ball hitting a table, for computer augmentation of a
real-world game. Both of these systems use acoustic time-of-flight for localization, which we
explored, but found to be insufficiently robust on the human body, leading to the
fingerprinting approach described in this paper.
SKINPUT
To expand the range of sensing modalities for always-available input systems, we
introduce Skinput, a novel input technique that allows the skin to be used as a finger input
surface. In our prototype system, we choose to focus on the arm (although the technique
could be applied elsewhere). This is an attractive area to appropriate as it provides
considerable surface area for interaction, including a contiguous and flat area for
projection (discussed subsequently). Furthermore, the forearm and hands contain a
complex assemblage of bones that increases acoustic distinctiveness of different
locations. To capture this acoustic information, we developed a wearable armband that is
non-invasive and easily removable. In this section, we discuss the mechanical phenomena
that enable Skinput, with a specific focus on the mechanical properties of the arm. Then
we will describe the Skinput sensor and the processing techniques we use to segment,
analyze, and classify bio-acoustic signals.
3.1 Bio-Acoustics
When a finger taps the skin, several distinct forms of acoustic energy are
produced. Some energy is radiated into the air as sound waves; this energy is not captured
by the Skinput system. Among the acoustic energy transmitted through the arm, the most
readily visible are transverse waves, created by the displacement of the skin from a finger
impact (Figure 2). When shot with a high-speed camera, these appear as ripples, which
propagate outward from the point of contact. The amplitude of these ripples is correlated
to both the tapping force and to the volume and compliance of soft tissues under the
impact area. In general, tapping on soft regions of the arm creates higher amplitude
transverse waves than tapping on bony areas (e.g., wrist, palm, fingers), which have
negligible compliance. In addition to the energy that propagates on the surface of the arm,
some energy is transmitted inward, toward the skeleton. These longitudinal (compressive)
waves travel through the soft tissues of the arm, exciting the bone, which is much less
deformable than the soft tissue but can respond to mechanical excitation by rotating and
translating as a rigid body. This excitation vibrates soft tissues surrounding the entire
length of the bone, resulting in new longitudinal waves that propagate outward to the
skin. We highlight these two separate forms of conduction –transverse waves moving
directly along the arm surface, and longitudinal waves moving into and out of the bone
through soft tissues – because these mechanisms carry energy at different frequencies and
over different distances. Roughly speaking, higher frequencies propagate more readily
through bone than through soft tissue, and bone conduction carries energy over larger
distances than soft tissue conduction. While we do not explicitly model the specific
mechanisms of conduction, or depend on these mechanisms for our analysis, we do
believe the success of our technique depends on the complex acoustic patterns that result
from mixtures of these modalities. Similarly, we also believe that joints play an important
role in making tapped locations acoustically distinct. Bones are held together by
ligaments, and joints often include additional biological structures such as fluid cavities.
This makes joints behave as acoustic filters. In some cases, these may simply dampen
acoustics; in other cases, these will selectively attenuate specific frequencies, creating
location-specific acoustic signatures.
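As a toy illustration of how such location-specific signatures could be exploited (this is not our actual analysis pipeline, and every name and number below is hypothetical), a Java sketch might summarize each tap as a per-sensor energy profile and match it against stored per-location templates with a nearest-neighbor rule:

    import java.util.Map;

    // Toy illustration only: all names and numbers are hypothetical, and the
    // actual Skinput analysis is more sophisticated than nearest-neighbor.
    public class LocationFingerprint {

        // Mean absolute amplitude per sensor. Because different locations excite
        // the sensors differently, this profile acts as a crude acoustic
        // fingerprint of the tapped location.
        static double[] profile(double[][] tap) {  // tap[sensor][sample]
            double[] p = new double[tap.length];
            for (int s = 0; s < tap.length; s++) {
                double sum = 0.0;
                for (double v : tap[s]) sum += Math.abs(v);
                p[s] = sum / tap[s].length;
            }
            return p;
        }

        // Return the stored location whose template profile is nearest
        // (squared Euclidean distance) to the observed profile.
        static String classify(double[] p, Map<String, double[]> templates) {
            String best = null;
            double bestDist = Double.POSITIVE_INFINITY;
            for (Map.Entry<String, double[]> e : templates.entrySet()) {
                double d = 0.0;
                for (int i = 0; i < p.length; i++) {
                    double diff = p[i] - e.getValue()[i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = e.getKey(); }
            }
            return best;
        }
    }

The sensor array described below makes such profiles informative, since each sensing element responds to a different portion of the acoustic spectrum.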
To capture the rich variety of acoustic information described in the previous
section, we evaluated many sensing technologies, including bone conduction
microphones, conventional microphones coupled with stethoscopes [10], piezo contact
microphones [2], and accelerometers. However, these transducers were engineered for
very different applications than measuring acoustics transmitted through the human body.
As such, we found them to be lacking in several significant ways. Foremost, most
mechanical sensors are engineered to provide relatively flat response curves over the
range of frequencies that is relevant to our signal. This is a desirable property for most
applications where a faithful representation of an input signal – uncolored by the
properties of the transducer – is desired. However, because only a specific set of
frequencies is conducted through the arm in response to tap input, a flat response curve
leads to the capture of irrelevant frequencies and thus to a low signal-to-noise ratio.
While bone conduction microphones might seem a suitable choice for Skinput, these
devices are typically engineered for capturing human voice, and filter out energy below
the range of human speech (whose lowest frequency is around 85Hz). Thus most sensors
in this category were not especially sensitive to lower-frequency signals (e.g., 25Hz),
which we found in our empirical pilot studies to be vital in characterizing finger taps. To
overcome these challenges, we moved away from a single sensing element with a flat
response curve, to an array of highly tuned vibration sensors. Specifically, we employ
small, cantilevered piezo films (MiniSense100, Measurement Specialties, Inc.). By
adding small weights to the end of the cantilever, we are able to alter the resonant
frequency, allowing the sensing element to be responsive to a unique, narrow,
low-frequency band of the acoustic spectrum. Adding more mass lowers the range of
excitation to which a sensor responds; we weighted each element such that it aligned with
particular frequencies that pilot studies showed to be useful in characterizing bio-acoustic
input. Figure 4 shows the response curve for one of our sensors, tuned to a resonant
frequency of 78Hz.
The curve shows a ~14dB drop-off ±20Hz away from the resonant frequency.
Additionally, the cantilevered sensors were naturally insensitive to forces parallel to the
skin (e.g., shearing motions caused by stretching). Thus, the skin stretch induced by many
routine movements (e.g., reaching for a doorknob) tends to be attenuated. However, the
sensors are highly responsive to motion perpendicular to the skin plane – perfect for
capturing transverse surface waves (Figure 2) and longitudinal waves emanating from
interior structures (Figure 3). Finally, our sensor design is relatively inexpensive and can
be manufactured in a very small form factor (e.g., MEMS), rendering it suitable for
inclusion in future mobile devices (e.g., an arm-mounted audio player).
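To make the mass-tuning intuition above concrete, the following minimal Java sketch uses the idealized spring-mass model f0 = sqrt(k / m) / (2 * pi); the stiffness and mass values are made-up placeholders for illustration, not MiniSense 100 datasheet figures:

    public class CantileverTuning {

        // Ideal spring-mass resonance: f0 = sqrt(k / m) / (2 * pi).
        static double resonantHz(double stiffnessNPerM, double massKg) {
            return Math.sqrt(stiffnessNPerM / massKg) / (2.0 * Math.PI);
        }

        public static void main(String[] args) {
            double k = 72.0;            // hypothetical effective stiffness (N/m)
            double baseMassKg = 0.3e-3; // hypothetical unweighted effective mass
            // Adding weight to the cantilever tip lowers the resonant frequency.
            for (double addedKg = 0.0; addedKg <= 0.91e-3; addedKg += 0.3e-3) {
                System.out.printf("added %.0f mg -> f0 = %.1f Hz%n",
                        addedKg * 1e6, resonantHz(k, baseMassKg + addedKg));
            }
        }
    }

With these illustrative numbers, the unweighted element resonates near 78Hz and each added tip weight pulls the resonance lower, which is the mechanism we use to align each sensor with a useful frequency band.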
3.2 Armband Prototype
Our final prototype, shown in Figures 1 and 5, features two arrays of five sensing
elements, incorporated into an armband form factor. The decision to have two sensor
packages was motivated by our focus on the arm for input. In particular, when placed on the upper arm (above the elbow), we hoped to collect acoustic information from the
fleshy bicep area in addition to the firmer area on the underside of the arm, with better
acoustic coupling to the Humerus, the main bone that runs from shoulder to elbow.
When the sensor was placed below the elbow, on the forearm, one package was located
near the Radius, the bone that runs from the lateral side of the elbow to the thumb side of
the wrist, and the other near the Ulna, which runs parallel to this on the medial side of the
arm closest to the body. Each location thus provided slightly different acoustic coverage
and information, helpful in disambiguating input location. Based on pilot data collection,
we selected a different set of resonant frequencies for each sensor package (Table 1). We
tuned the upper sensor package to be more sensitive to lower frequency signals, as these
were more prevalent in fleshier areas. Conversely, we tuned the lower sensor array to be
sensitive to higher frequencies, in order to better capture signals transmitted through
(denser) bones.
3.3 Processing
In our prototype system, we employ a Mackie Onyx 1200F audio interface to
digitally capture data from the ten sensors (http://mackie.com). This was connected via
Firewire to a conventional desktop computer, where a thin client written in C interfaced
with the device using the Audio Stream Input/ Output (ASIO) protocol. Each channel was
sampled at 5.5kHz, a sampling rate that would be considered too low for speech or
environmental audio, but was able to represent the relevant spectrum of frequencies
transmitted through the arm. This reduced sample rate (and consequently low processing
bandwidth) makes our technique readily portable to embedded processors.
For example, the ATmega168 processor employed by the Arduino platform can sample
analog readings at 77kHz with no loss of precision, and could therefore provide the full
sampling power required for Skinput (55kHz total). Data was then sent from our thin
client over a local socket to our primary application, written in Java. This program
performed three key functions.
First, it provided a live visualization of the data from our ten sensors, which was useful in
identifying acoustic features (Figure 6). Second, it segmented inputs from the data stream
into independent instances (taps). Third, it classified these input instances. The audio
stream was segmented into individual taps using an absolute exponential average of all
ten channels (Figure 6, red waveform). When an intensity threshold was exceeded (Figure
6, upper blue line), the program recorded the timestamp as a potential start of a tap. If the
intensity did not fall below a second, independent “closing” threshold (Figure 6, lower
purple line) between 100ms and 700ms after the onset crossing (a duration we found to
be common for finger impacts), the event was discarded. If start and end crossings
were detected that satisfied these criteria, the acoustic data in that period (plus a 60ms
buffer on either end) was considered an input event (Figure 6, vertical green regions).
Although simple, this heuristic proved to be highly robust, mainly due to the extreme
noise suppression provided by our sensing approach.
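As a minimal sketch of this segmentation heuristic in Java (the smoothing factor, the two thresholds, and the buffered-array interface are illustrative assumptions, not the exact constants from our implementation):

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of the tap-segmentation heuristic described above.
    public class TapSegmenter {
        static final double SAMPLE_RATE_HZ = 5500.0;
        static final double ALPHA = 0.05;            // assumed EMA smoothing factor
        static final double OPEN_THRESHOLD = 0.10;   // assumed onset intensity
        static final double CLOSE_THRESHOLD = 0.04;  // assumed "closing" intensity
        static final int MIN_LEN = (int) (0.100 * SAMPLE_RATE_HZ);  // 100ms
        static final int MAX_LEN = (int) (0.700 * SAMPLE_RATE_HZ);  // 700ms
        static final int PAD = (int) (0.060 * SAMPLE_RATE_HZ);      // 60ms buffer

        // samples[channel][index]; returns [start, end] index pairs judged to
        // be finger taps, each padded by 60ms on either side.
        static List<int[]> segment(double[][] samples) {
            int n = samples[0].length;
            List<int[]> taps = new ArrayList<>();
            double ema = 0.0;
            int start = -1;
            for (int i = 0; i < n; i++) {
                // Absolute exponential average pooled over all ten channels.
                double sum = 0.0;
                for (double[] channel : samples) sum += Math.abs(channel[i]);
                ema = ALPHA * (sum / samples.length) + (1.0 - ALPHA) * ema;

                if (start < 0 && ema > OPEN_THRESHOLD) {
                    start = i;  // potential onset: record and wait for closing
                } else if (start >= 0 && ema < CLOSE_THRESHOLD) {
                    int len = i - start;
                    // Keep only events whose duration is plausible for a tap.
                    if (len >= MIN_LEN && len <= MAX_LEN) {
                        taps.add(new int[] { Math.max(0, start - PAD),
                                             Math.min(n - 1, i + PAD) });
                    }
                    start = -1;  // otherwise, discard the candidate event
                }
            }
            return taps;
        }
    }

A deployed version would run this logic incrementally over the live audio stream rather than over a buffered array, but the thresholding structure is the same.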