09-10-2017, 10:29 AM
While traditional data warehouses are updated during downtime, streaming warehouses are updated as new data arrive. We model the problem of updating a streaming warehouse as a scheduling problem, where jobs correspond to processes that load new data into tables, and whose objective is to minimize data staleness over time (at time t, if a table has been updated with information up to some earlier time r, its staleness is t minus r). We then propose a scheduling framework that handles the complications encountered by a streaming warehouse: view hierarchies and priorities, data consistency, the inability to preempt updates, heterogeneity of update jobs caused by different inter-arrival times and data volumes among different sources, and transient overload. A novel feature of our framework is that scheduling decisions do not depend on properties of the update jobs (such as deadlines), but rather on the effect of update jobs on data staleness. Finally, we present a suite of update scheduling algorithms and extensive simulation experiments to identify the factors that affect their performance.
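To make the staleness idea concrete, here is a minimal sketch (my own illustration, not the paper's algorithm): staleness is t minus the time r up to which a table's data is current, and a greedy scheduler picks the table whose priority-weighted staleness is largest. The function names and the weighting scheme are assumptions for illustration.

```python
def staleness(t, r):
    # At time t, a table loaded with data up to time r has staleness t - r.
    return t - r

def pick_next_table(t, last_update, priority=None):
    """Greedy sketch: choose the table whose priority-weighted staleness
    is largest, reflecting the idea that scheduling decisions depend on
    the effect of update jobs on staleness rather than on job deadlines.

    last_update: dict mapping table name -> time r of its freshest data.
    priority:    optional dict of per-table weights (default 1.0).
    """
    priority = priority or {}
    return max(
        last_update,
        key=lambda tbl: priority.get(tbl, 1.0) * staleness(t, last_update[tbl]),
    )

# Example: at t=10, table B (data up to t=2) is staler than A (up to t=5),
# so B is loaded first unless A carries a higher priority weight.
last = {"A": 5, "B": 2}
print(pick_next_table(10, last))              # B: staleness 8 beats A's 5
print(pick_next_table(10, last, {"A": 3.0}))  # A: weighted 3.0*5 = 15 beats 8
```

In a real warehouse the choice would also respect view hierarchies (a derived view cannot be fresher than its sources) and non-preemptible running loads, which this toy sketch omits.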