16-11-2012, 06:32 PM
I want a PPT on this topic:
"Scalable Scheduling of Updates in Streaming Data Warehouses"
19-01-2013, 04:09 PM
Scalable Scheduling of Updates in Streaming Data Warehouses
Scalable Scheduling.doc (Size: 1.52 MB / Downloads: 140)

Abstract

We discuss update scheduling in streaming data warehouses, which combine the features of traditional data warehouses and data stream systems. In our setting, external sources push append-only data streams into the warehouse with a wide range of interarrival times. While traditional data warehouses are typically refreshed during downtimes, streaming warehouses are updated as new data arrive. We model the streaming warehouse update problem as a scheduling problem, where jobs correspond to processes that load new data into tables, and the objective is to minimize data staleness over time (at time t, if a table has been updated with information up to some earlier time r, its staleness is t minus r). We then propose a scheduling framework that handles the complications encountered by a streaming warehouse: view hierarchies and priorities, data consistency, inability to preempt updates, heterogeneity of update jobs caused by different interarrival times and data volumes among different sources, and transient overload. A novel feature of our framework is that scheduling decisions do not depend on properties of update jobs (such as deadlines), but rather on the effect of update jobs on data staleness. Finally, we present a suite of update scheduling algorithms and extensive simulation experiments to map out the factors that affect their performance.

INTRODUCTION

Traditional data warehouses are updated during downtimes [25] and store layers of complex materialized views over terabytes of historical data. Data Stream Management Systems (DSMS), on the other hand, support simple analyses of recently arrived data in real time. Streaming warehouses such as DataDepot [15] combine the features of these two systems by maintaining a unified view of current and historical data. This enables real-time decision support for business-critical applications that receive streams of append-only data from external sources.
Applications include:
- Online stock trading, where recent transactions generated by multiple stock exchanges are compared against historical trends in nearly real time to identify profit opportunities;
- Credit card or telephone fraud detection, where streams of point-of-sale transactions or call details are collected in nearly real time and compared with past customer behavior.

Scheduling Challenges

Real-time scheduling is a well-studied topic with a lengthy literature [7]. However, our problem introduces unique challenges that must be dealt with simultaneously by a streaming warehouse.

Scheduling metric. Many metrics have been considered in the real-time scheduling literature. In a typical hard real-time system, jobs must be completed before their deadlines, a simple metric to understand and to prove results about. In a firm real-time system, jobs can miss their deadlines, and if they do, they are discarded. The performance metric in a firm real-time system is the fraction of jobs that meet their deadlines. However, a streaming warehouse must load all of the data that arrive, so no updates can be discarded. In a soft real-time system, late jobs are allowed to stay in the system, and the performance metric is lateness (or tardiness), the difference between the completion times of late jobs and their deadlines. However, we are not concerned with properties of the update jobs. Instead, we define a scheduling metric in terms of data staleness, roughly defined as the difference between the current time and the time stamp of the most recent record in a table.

Contributions and Organization

We address the update scheduling problem in real-time streaming warehouses. In Section 2, we introduce our system model, formalize the notion of staleness, and describe our solution for maintaining data consistency under arbitrary update schedules. Section 3 presents our scheduling framework.
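To make the staleness metric above concrete, here is a minimal sketch in Python. It only restates the definition from the excerpt (staleness of a table at time t is t minus the timestamp r up to which the table has been loaded); the function names and the idea of priority weights per table are illustrative assumptions, not the paper's implementation.

```python
def staleness(t, freshness):
    """Staleness of one table: current time t minus the timestamp
    up to which the table has been updated (never negative)."""
    return max(0.0, t - freshness)

def total_weighted_staleness(t, tables):
    """tables: list of (priority_weight, freshness) pairs.
    Returns the priority-weighted sum of table stalenesses at time t."""
    return sum(w * staleness(t, r) for w, r in tables)

# Example: at t = 100, a table loaded up to time 95 with weight 2
# and a table loaded up to time 90 with weight 1.
print(total_weighted_staleness(100, [(2, 95), (1, 90)]))  # 2*5 + 1*10 = 20.0
```

A scheduler that minimizes this quantity over time favors running the update jobs whose completion reduces staleness the most, which is the intuition behind scheduling by marginal benefit rather than by deadlines.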
In the proposed framework, a basic algorithm orders jobs by some reasonable means, e.g., by the marginal benefit of executing them.

SYSTEM MODEL

Streaming Warehouse Architecture

Table 1 lists the symbols used in this paper. Fig. 1 illustrates a streaming data warehouse. Each data stream i is generated by an external source, with a batch of new data, consisting of one or more records, being pushed to the warehouse with period Pi. If the period of a stream is unknown or unpredictable, we let the user choose a period with which the warehouse should check for new data. Examples of streams collected by an Internet Service Provider include router performance statistics such as CPU usage, system logs, routing table updates, link-layer alerts, etc.

An important property of the data streams in our motivating applications is that they are append-only, i.e., existing records are never modified or deleted. For example, a stream of average router CPU utilization measurements may consist of records with fields (time stamp, router_name, CPU_utilization), and a new data file with updated CPU measurements for each router may arrive at the warehouse every 5 minutes.

Basic Algorithms

The basic scheduling algorithms prioritize jobs to be executed on individual tracks, and serve as building blocks of the multitrack solutions that we present in Section 3.2. For example, the Earliest-Deadline-First (EDF) algorithm orders released jobs by proximity to their deadlines. EDF is known to be an optimal hard real-time scheduling algorithm for a single track (w.r.t. maximizing the number of jobs that meet their deadlines), if the jobs are preemptible [7]. Since our jobs are prioritized, using EDF directly does not result in the best performance. Instead, we use one of the following basic algorithms.

CONCLUSIONS

In this paper, we motivated, formalized, and solved the problem of nonpreemptively scheduling updates in a real-time streaming warehouse.
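For readers unfamiliar with EDF, the following is a minimal sketch of nonpreemptive Earliest-Deadline-First on a single track. The job tuple format and function name are assumptions for illustration; the paper's actual framework schedules by staleness benefit rather than deadlines, and uses EDF only as a baseline basic algorithm.

```python
import heapq

def edf_schedule(jobs):
    """jobs: list of (job_id, release, deadline, duration) tuples.
    Runs released jobs in deadline order; once started, a job is never
    preempted (matching the nonpreemptive setting in the excerpt).
    Returns a list of (job_id, start_time, finish_time) triples."""
    pending = sorted(jobs, key=lambda j: j[1])   # order by release time
    ready = []                                   # min-heap keyed by deadline
    schedule, t, i = [], 0.0, 0
    while i < len(pending) or ready:
        # admit every job released by the current time t
        while i < len(pending) and pending[i][1] <= t:
            jid, rel, dl, dur = pending[i]
            heapq.heappush(ready, (dl, jid, dur))
            i += 1
        if not ready:                            # idle until next release
            t = float(pending[i][1])
            continue
        dl, jid, dur = heapq.heappop(ready)      # earliest deadline first
        schedule.append((jid, t, t + dur))       # run to completion
        t += dur
    return schedule

# Two jobs released at time 0: B has the earlier deadline, so it runs first.
jobs = [("A", 0, 10, 3), ("B", 0, 5, 2)]
print(edf_schedule(jobs))  # [('B', 0.0, 2.0), ('A', 2.0, 5.0)]
```

Because the warehouse's update jobs cannot be preempted and carry table priorities, a long low-priority load started under plain EDF can block a short high-priority one, which is why the framework layers track assignment and benefit-based ordering on top of such basic algorithms.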
We proposed the notion of average staleness as a scheduling metric and presented scheduling algorithms designed to handle the complex environment of a streaming data warehouse. We then proposed a scheduling framework that assigns jobs to processing tracks and uses basic algorithms to schedule jobs within a track. The main feature of our framework is the ability to reserve resources for short jobs, which often correspond to important, frequently refreshed tables, while avoiding the inefficiencies associated with partitioned scheduling techniques.