The 8 Requirements of Real-Time Stream Processing
ABSTRACT
Applications that require real-time processing of high-volume
data streams are pushing the limits of traditional data processing
infrastructures. These stream-based applications include market
feed processing and electronic trading on Wall Street, network
and infrastructure monitoring, fraud detection, and command and
control in military environments. Furthermore, as the “sea
change” caused by cheap micro-sensor technology takes hold, we
expect to see everything of material significance on the planet get
“sensor-tagged” and report its state or location in real time. This
sensorization of the real world will lead to a “green field” of
novel monitoring and control applications with high-volume and
low-latency processing requirements.
Recently, several technologies have emerged—including off-the-shelf
stream processing engines—specifically to address the
challenges of processing high-volume, real-time data without
requiring the use of custom code. At the same time, some existing
software technologies, such as main memory DBMSs and rule
engines, are also being “repurposed” by marketing departments to
address these applications.
INTRODUCTION
On Wall Street and other global exchanges, electronic trading
volumes are growing exponentially. Market data feeds can
generate tens of thousands of messages per second. The Options
Price Reporting Authority (OPRA), which aggregates all the
quotes and trades from the options exchanges, estimates peak
rates of 122,000 messages per second in 2005, with rates doubling
every year [13]. This dramatic escalation in feed volumes is
stressing or breaking traditional feed processing systems.
Furthermore, in electronic trading, a latency of even one second is
unacceptable, and the trading operation whose engine produces
the most current results will maximize arbitrage profits. This fact
is causing financial services companies to require very high-volume
processing of feed data with very low latency.
EIGHT RULES FOR STREAM PROCESSING
Keep the Data Moving
To achieve low latency, a system must be able to perform
message processing without having a costly storage operation in
the critical processing path. A storage operation adds a great deal
of unnecessary latency to the process (e.g., committing a database
record requires a disk write of a log record). For many stream
processing applications, it is neither acceptable nor necessary to
require such a time-intensive operation before message processing
can occur. Instead, messages should be processed “in-stream” as
they fly by. See Figure 1 for an architectural illustration of this
straight-through processing paradigm.
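As a minimal sketch of this paradigm (all names hypothetical), the handler below applies its logic to each message as it arrives and defers any persistence to a background thread, so no storage operation sits in the critical path:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Straight-through processing sketch: messages are processed as they
    // arrive; optional persistence runs on a background thread so no disk
    // write sits in the critical processing path. Names are hypothetical.
    public class InStreamProcessor {
        private final ExecutorService archiver = Executors.newSingleThreadExecutor();

        public void onMessage(String message) {
            String result = transform(message);      // in-stream work, no storage first
            emit(result);                            // deliver the output immediately
            archiver.submit(() -> archive(message)); // persistence off the critical path
        }

        private String transform(String message) { return message.toUpperCase(); }

        private void emit(String result) { System.out.println(result); }

        private void archive(String message) {
            // e.g., append to a log file; latency here does not delay processing
        }
    }

The design point is simply that the synchronous path contains only computation; anything that touches disk happens asynchronously.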
An additional latency problem exists with systems that are
passive, as such systems wait to be told what to do by an
application before initiating processing. Passive systems require
applications to continuously poll for conditions of interest.
Unfortunately, polling results in additional overhead on the
system as well as the application, and additional latency, because
(on average) half the polling interval is added to the processing
delay. Active systems avoid this overhead by incorporating built-in
event/data-driven processing capabilities.
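A toy comparison makes the latency difference visible (names hypothetical): the polling loop adds up to a full sleep interval to each message's delay, while the event-driven loop wakes the moment data arrives.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class ActiveVsPassive {
        // Passive style: poll every intervalMillis. A message arriving just
        // after a poll waits up to a full interval (half of it on average).
        static void pollLoop(BlockingQueue<String> queue, long intervalMillis)
                throws InterruptedException {
            while (true) {
                String msg = queue.poll();        // non-blocking check
                if (msg != null) process(msg);
                Thread.sleep(intervalMillis);     // the added latency lives here
            }
        }

        // Active style: block until data arrives and react immediately, so no
        // polling interval is added to the processing delay.
        static void eventLoop(BlockingQueue<String> queue) throws InterruptedException {
            while (true) {
                process(queue.take());            // wakes the instant a message arrives
            }
        }

        static void process(String msg) {
            System.out.println("processed: " + msg);
        }

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> queue = new LinkedBlockingQueue<>();
            queue.put("tick");
            process(queue.take());                // event-driven path, handled on arrival
        }
    }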
Query using SQL on Streams (StreamSQL)
In streaming applications, some querying mechanism must be
used to find output events of interest or compute real-time
analytics. Historically, for streaming applications, general-purpose
languages such as C++ or Java have been used as the
workhorse development and programming tools. Unfortunately,
relying on low-level programming schemes results in long
development cycles and high maintenance costs.
In contrast, it is very much desirable to process moving real-time
data using a high-level language such as SQL. SQL has remained the
standard database language for over three decades.
SQL’s success at expressing complex data transformations derives
from the fact that it is based on a set of very powerful data
processing primitives that do filtering, merging, correlation, and
aggregation.
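To make the contrast concrete, here is a hand-written Java version of a single windowed aggregate; the StreamSQL-style query in the comment is illustrative syntax only, not a verbatim standard.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Roughly what a windowed StreamSQL-style query (syntax illustrative only),
    // e.g. SELECT AVG(price) FROM Ticks [RANGE 5 MINUTES],
    // expresses in one line, written out by hand: a five-minute sliding-window
    // average over one symbol's tick stream.
    public class WindowAverage {
        private final Deque<long[]> window = new ArrayDeque<>(); // {timestampMillis, priceCents}
        private long sum = 0;
        private final long rangeMillis;

        public WindowAverage(long rangeMillis) {
            this.rangeMillis = rangeMillis;
        }

        /** Feed one tick; returns the current windowed average in cents. */
        public double onTick(long timestampMillis, long priceCents) {
            window.addLast(new long[] { timestampMillis, priceCents });
            sum += priceCents;
            // Expire ticks that have slid out of the window.
            while (window.peekFirst()[0] < timestampMillis - rangeMillis) {
                sum -= window.removeFirst()[1];
            }
            return (double) sum / window.size();
        }
    }

Every line of this bookkeeping (and its bugs) is what the declarative query absorbs, which is the maintenance-cost argument above.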
Handle Stream Imperfections (Delayed, Missing, and Out-of-Order Data)
In a conventional database, data is always present before it is
queried, but in a real-time system, since the data is never stored,
the infrastructure must make provision for handling data that is
late or delayed, missing, or out-of-sequence.
One requirement here is the ability to time out individual
calculations or computations. For example, consider a simple real-time
business analytic that computes the average price of the last
tick for a collection of 25 securities. One need only wait for a
tick from each security and then output the average price.
However, suppose one of the 25 stocks is thinly traded, and no
tick for that symbol will be received for the next 10 minutes. This
is an example of a computation that must block, waiting for input
to complete its calculation. Such input may or may not arrive in a
timely fashion. In fact, if the SEC orders a stop to trading in one
of the 25 securities, then the calculation will block indefinitely.
In a real-time system, it is never a good idea to allow a program to
wait indefinitely. Hence, every calculation that can block must be
allowed to time out, so that the application can continue with
partial data. Any real-time processing system must have such
time-outs for any potentially blocking operation.
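A minimal sketch of such a time-out, with all names hypothetical: the basket average below emits a partial result once its deadline passes rather than blocking forever.

    import java.util.HashMap;
    import java.util.Map;

    // Average the most recent tick of each security in a basket, but never
    // block indefinitely: if some symbols stay silent past timeoutMillis,
    // emit the average over whatever ticks arrived.
    public class BasketAverage {
        private final Map<String, Double> lastTick = new HashMap<>();
        private final int basketSize;
        private final long timeoutMillis;
        private final long startMillis = System.currentTimeMillis();

        public BasketAverage(int basketSize, long timeoutMillis) {
            this.basketSize = basketSize;
            this.timeoutMillis = timeoutMillis;
        }

        /**
         * Feed one tick. Returns the average once the basket is complete or
         * the time-out has expired; returns null while still waiting. A real
         * engine would fire the time-out from a timer rather than checking it
         * only when a tick happens to arrive, as done here for brevity.
         */
        public Double onTick(String symbol, double price) {
            lastTick.put(symbol, price);
            boolean complete = lastTick.size() == basketSize;
            boolean timedOut = System.currentTimeMillis() - startMillis > timeoutMillis;
            if (!complete && !timedOut) {
                return null;                      // keep waiting for more symbols
            }
            return lastTick.values().stream()
                    .mapToDouble(Double::doubleValue)
                    .average()
                    .orElse(Double.NaN);          // partial result is acceptable
        }
    }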
Integrate Stored and Streaming Data
For many stream processing applications, comparing “present”
with “past” is a common task. Thus, a stream processing system
must also provide for careful management of stored state. For
example, in on-line data mining applications (such as detecting
credit card or other transactional fraud), identifying whether an
activity is “unusual” requires, by definition, gathering the usual
activity patterns over time, summarizing them as a “signature”,
and comparing them to the present activity in real time. To realize
this task, both historical and live data need to be integrated within
the same application for comparison.
A very popular extension of this requirement comes from firms
with electronic trading applications, who want to write a trading
algorithm and then test it on historical data to see how it would
have performed on alternative scenarios. When the algorithm
works well on historical data, the customer wants to switch it over
to a live feed seamlessly; i.e., without modifying the application
code.
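One way to picture this seamless switch (names hypothetical) is a single source abstraction that the algorithm consumes without knowing what stands behind it:

    import java.util.Iterator;
    import java.util.List;

    // The algorithm consumes a TickSource and never knows whether the ticks
    // come from a historical recording or a live feed.
    interface TickSource {
        /** Returns the next tick, blocking on a live feed; null when exhausted. */
        String nextTick();
    }

    class HistoricalSource implements TickSource {
        private final Iterator<String> replay;

        HistoricalSource(List<String> recordedTicks) {
            this.replay = recordedTicks.iterator();
        }

        public String nextTick() {
            return replay.hasNext() ? replay.next() : null;
        }
    }

    class TradingAlgorithm {
        // The same code backtests against a HistoricalSource and then runs in
        // production against a live implementation of TickSource, unmodified.
        void run(TickSource source) {
            for (String tick = source.nextTick(); tick != null; tick = source.nextTick()) {
                // ... trading logic on each tick ...
            }
        }
    }

A live-feed implementation would simply block in nextTick() until the next market message arrives; the algorithm code is identical in both modes.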
Guarantee Data Safety and Availability
To preserve the integrity of mission-critical information and avoid
disruptions in real-time processing, a stream processing system
must use a high-availability (HA) solution.
High availability is a critical concern for most stream processing
applications. For example, virtually all financial services firms
expect their applications to stay up all the time, no matter what
happens. If a failure occurs, the application needs to fail over to
backup hardware and keep going. Restarting the operating system
and recovering the application from a log incur too much
overhead and are thus not acceptable for real-time processing.
Hence, a “Tandem-style” hot backup and real-time failover
scheme [6], whereby a secondary system frequently synchronizes
its processing state with a primary and takes over when the
primary fails, is the most reasonable alternative for these types of
applications. This HA model is shown in Figure 3.
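A highly simplified sketch of the idea, with all names hypothetical and the network link elided: every state change on the primary is immediately applied to a secondary, so the secondary can take over without log-based recovery.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Tandem-style hot standby, reduced to its essence: the primary applies
    // each state change locally and ships it to a secondary, which stays
    // synchronized and can take over on failover without replaying a log.
    public class HotStandby {
        private final Map<String, Double> state = new ConcurrentHashMap<>();
        private HotStandby secondary;               // stand-in for a network link

        public void attachSecondary(HotStandby standby) {
            this.secondary = standby;
        }

        public void update(String key, double value) {
            state.put(key, value);                  // apply on the primary
            if (secondary != null) {
                secondary.applyReplica(key, value); // keep the standby current
            }
        }

        void applyReplica(String key, double value) {
            state.put(key, value);
        }

        // On primary failure, a monitor promotes the secondary; because its
        // state is already current, processing resumes with no log recovery.
    }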
Process and Respond Instantaneously
None of the preceding rules will make any difference alone unless
an application can “keep up”, i.e., process high-volumes of
streaming data with very low latency. In numbers, this means
capability to process tens to hundreds of thousands of messages
per second with latency in the microsecond to millisecond range
on top of COTS hardware.
To achieve such high performance, the system should have a
highly optimized execution path that minimizes the ratio of
overhead to useful work. As exemplified by the previous rules, a
critical issue here is to minimize the number of “boundary
crossings” by integrating all critical functionality (e.g., processing
and storage) into a single system process. However, this is not
sufficient by itself; all system components need to be designed
with high performance in mind.
How do they measure up?
DBMSs use a “process-after-store” model, where input data are
first stored, potentially indexed, and only then processed. Main-memory
DBMSs are faster because they avoid going to disk
for most updates, but otherwise use the same basic model.
DBMSs are passive; i.e., they wait to be told what to do by an
application. Some have built-in triggering mechanisms, but it is
well known that triggers scale poorly. Rule engines and stream
processing engines (SPEs), on the other hand, are active and do
not require any storage prior to processing. Thus, DBMSs do not
keep the data moving, whereas rule engines and SPEs do.
CONCLUDING REMARKS
There is a large class of existing and newly emerging applications
that require sophisticated, real-time processing of high-volume
data streams. Although these applications have traditionally been
served by “point” solutions built through custom coding,
infrastructure software that specifically targets them has also
recently begun to emerge in research labs and the marketplace.