25-01-2013, 02:48 PM
Principles of Distributed Database Systems
Introduction
Distributed database system (DDBS) technology is the union of what appear to
be two diametrically opposed approaches to data processing: database system and
computer network technologies. Database systems have taken us from a paradigm
of data processing in which each application defined and maintained its own data
(Figure 1.1) to one in which the data are defined and administered centrally (Figure
1.2). This new orientation results in data independence, whereby the application
programs are immune to changes in the logical or physical organization of the data,
and vice versa.
One of the major motivations behind the use of database systems is the desire
to integrate the operational data of an enterprise and to provide centralized, and thus
controlled, access to that data. The technology of computer networks, on the other
hand, promotes a mode of work that goes against all centralization efforts. At first
glance it might be difficult to understand how these two contrasting approaches can
possibly be synthesized to produce a technology that is more powerful and more
promising than either one alone. The key to this understanding is the realization
that the most important objective of the database technology is integration, not
centralization. It is important to realize that neither of these terms
necessarily implies the other. It is possible to achieve integration without centralization,
and that is exactly what the distributed database technology attempts to achieve.
In this chapter we define the fundamental concepts and set the framework for
discussing distributed databases. We start by examining distributed systems in general
in order to clarify the role of database technology within distributed data processing,
and then move on to topics that are more directly related to DDBS.
Distributed Data Processing
The term distributed processing (or distributed computing) is hard to define precisely.
Obviously, some degree of distributed processing goes on in any computer system,
even on single-processor computers where the central processing unit (CPU) and
input/output (I/O) functions are separated and overlapped. This separation and overlap
can be considered as one form of distributed processing. The widespread emergence
of parallel computers has further complicated the picture, since the distinction between
distributed computing systems and some forms of parallel computers is rather
vague.
In this book we define distributed processing in such a way that it leads to a
definition of a distributed database system. The working definition we use for a
distributed computing system states that it is a number of autonomous processing
elements (not necessarily homogeneous) that are interconnected by a computer
network and that cooperate in performing their assigned tasks. The “processing
element” referred to in this definition is a computing device that can execute a
program on its own. This definition is similar to those given in distributed systems
textbooks (e.g., [Tanenbaum and van Steen, 2002] and [Coulouris et al., 2001]).
A fundamental question that needs to be asked is: What is being distributed?
One of the things that might be distributed is the processing logic. In fact, the
definition of a distributed computing system given above implicitly assumes that the
processing logic or processing elements are distributed. Another possible distribution
is according to function. Various functions of a computer system could be delegated
to various pieces of hardware or software. A third possible mode of distribution is
according to data. Data used by a number of applications may be distributed to a
number of processing sites. Finally, control can be distributed. The control of the
execution of various tasks might be distributed instead of being performed by one
computer system. From the viewpoint of distributed database systems, these modes
of distribution are all necessary and important. In the following sections we talk
about these in more detail.
Another reasonable question to ask at this point is: Why do we distribute at all?
The classical answers to this question indicate that distributed processing better
corresponds to the organizational structure of today’s widely distributed enterprises,
and that such a system is more reliable and more responsive. More importantly,
many of the current applications of computer technology are inherently distributed.
Web-based applications, electronic commerce over the Internet, multimedia
applications such as news-on-demand or medical imaging, and manufacturing control
systems are all examples of such applications.
From a more global perspective, however, it can be stated that the fundamental
reason behind distributed processing is to be better able to cope with the large-scale
data management problems that we face today, by using a variation of the well-known
divide-and-conquer rule. If the necessary software support for distributed processing
can be developed, it might be possible to solve these complicated problems simply
by dividing them into smaller pieces and assigning them to different software groups,
which work on different computers and produce a system that runs on multiple
processing elements but can work efficiently toward the execution of a common task.
Distributed database systems should also be viewed within this framework and
treated as tools that could make distributed processing easier and more efficient. It is
reasonable to draw an analogy between what distributed databases might offer to the
data processing world and what the database technology has already provided. There
is no doubt that the development of general-purpose, adaptable, efficient distributed
database systems has aided greatly in the task of developing distributed software.
What is a Distributed Database System?
We define a distributed database as a collection of multiple, logically interrelated
databases distributed over a computer network. A distributed database management
system (distributed DBMS) is then defined as the software system that permits the
management of the distributed database and makes the distribution transparent to the
users. Sometimes “distributed database system” (DDBS) is used to refer jointly to
the distributed database and the distributed DBMS. The two important terms in these
definitions are “logically interrelated” and “distributed over a computer network.”
They help eliminate certain cases that have sometimes been accepted to represent a
DDBS.
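The idea of making the distribution transparent to users can be illustrated with a small sketch. It is not how any real distributed DBMS is built; the class names (`Site`, `DistributedDBMS`) and the placement rule are assumptions made purely for illustration. The point is only that the caller works with one logical database while the data physically live at several sites.

```python
class Site:
    """One node on the network, holding its own local database (a dict here)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


class DistributedDBMS:
    """Hypothetical facade: users see one logical database, not the sites."""

    def __init__(self, sites):
        self.sites = sites

    def _site_for(self, key):
        # Assumed placement rule (illustrative only): sum of the key's
        # character codes modulo the number of sites.
        return self.sites[sum(map(ord, key)) % len(self.sites)]

    def put(self, key, value):
        self._site_for(key).rows[key] = value

    def get(self, key):
        # The caller never names a site; locating the data is the DBMS's job.
        return self._site_for(key).rows.get(key)


sites = [Site("site1"), Site("site2"), Site("site3")]
db = DistributedDBMS(sites)
db.put("emp:1", "Smith")
db.put("emp:2", "Jones")
print(db.get("emp:1"), db.get("emp:2"))
```

The two records end up at different sites, yet both are read through the same `get` call. A collection of unrelated files at each node would offer no such common interface.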
A DDBS is not a “collection of files” that can be individually stored at each
node of a computer network. To form a DDBS, files should not only be logically
related, but there should also be structure among the files, and access should be via
a common interface. We should note that there has been much recent activity in
providing DBMS functionality over semi-structured data that are stored in files on
the Internet (such as Web pages). In light of this activity, the above requirement
may seem unnecessarily strict. Nevertheless, it is important to make a distinction
between a DDBS where this requirement is met, and more general distributed data
management systems that provide a “DBMS-like” access to data. In various chapters
of this book, we will expand our discussion to cover these more general systems.
It has sometimes been assumed that the physical distribution of data is not the
most significant issue. The proponents of this view would therefore feel comfortable
in labeling as a distributed database a number of (related) databases that reside in the
same computer system. However, the physical distribution of data is important. It
creates problems that are not encountered when the databases reside in the same computer.
These difficulties are discussed in Section 1.5. Note that physical distribution
does not necessarily imply that the computer systems be geographically far apart;
they could actually be in the same room. It simply implies that the communication
between them is done over a network instead of through shared memory or shared
disk (as would be the case with multiprocessor systems), with the network as the only
shared resource.
This suggests that multiprocessor systems should not be considered as DDBSs.
Although shared-nothing multiprocessors, where each processor node has its own
primary and secondary memory, and may also have its own peripherals, are quite
similar to the distributed environment that we focus on, there are differences. The
fundamental difference is the mode of operation. A multiprocessor system design
is rather symmetrical, consisting of a number of identical processor and memory
components, and controlled by one or more copies of the same operating system
that is responsible for a strict control of the task assignment to each processor. This
is not true in distributed computing systems, where heterogeneity of the operating
system as well as the hardware is quite common. Database systems that run over
multiprocessor systems are called parallel database systems and are discussed in
Chapter 14.
A DDBS is also not a system where, despite the existence of a network, the
database resides at only one node of the network (Figure 1.3). In this case, the
problems of database management are no different than the problems encountered in
a centralized database environment (shortly, we will discuss client/server DBMSs
which relax this requirement to a certain extent). The database is centrally managed
by one computer system (site 2 in Figure 1.3) and all the requests are routed to
that site. The only additional consideration has to do with transmission delays. It
is obvious that the existence of a computer network or a collection of “files” is not
sufficient to form a distributed database system. What we are interested in is an
environment where data are distributed among a number of sites (Figure 1.4).
Fig. 1.3 Central Database on a Network (Sites 1 to 5 connected by a communication network)
Fig. 1.4 DDBS Environment (Sites 1 to 5 connected by a communication network)
Data Delivery Alternatives
In distributed databases, data are “delivered” from the sites where they are stored to
where the query is posed. We characterize the data delivery alternatives along three
orthogonal dimensions: delivery modes, frequency and communication methods. The
combinations of alternatives along each of these dimensions (that we discuss next)
provide a rich design space.
The alternative delivery modes are pull-only, push-only and hybrid. In the pull-only
mode of data delivery, the transfer of data from servers to clients is initiated
by a client pull. When a client request is received at a server, the server responds by
locating the requested information. The main characteristic of pull-based delivery is
that the arrival of new data items or updates to existing data items are carried out at a
server without notification to clients unless clients explicitly poll the server. Also, in
pull-based mode, servers must be interrupted continuously to deal with requests from
clients. Furthermore, the information that clients can obtain from a server is limited
to when and what clients know to ask for. Conventional DBMSs offer primarily
pull-based data delivery.
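The main characteristic of pull-based delivery described above can be sketched in a few lines. This is a minimal in-memory illustration, not a real client/server protocol; the names `Server` and `PullClient` are hypothetical.

```python
class Server:
    """Holds the data and answers requests; it never contacts clients itself."""

    def __init__(self):
        self.data = {}
        self.requests_served = 0

    def update(self, key, value):
        # Updates are applied at the server without notifying any client.
        self.data[key] = value

    def handle_request(self, key):
        # Every client pull interrupts the server.
        self.requests_served += 1
        return self.data.get(key)


class PullClient:
    """Learns about data only when it explicitly asks."""

    def __init__(self, server):
        self.server = server
        self.last_seen = None

    def poll(self, key):
        self.last_seen = self.server.handle_request(key)
        return self.last_seen


server = Server()
client = PullClient(server)
server.update("IBM", 100)
client.poll("IBM")
server.update("IBM", 105)      # the client is not told about this update
stale = client.last_seen       # still the old value
fresh = client.poll("IBM")     # only a new pull reveals the change
print(stale, fresh, server.requests_served)
```

Between the two polls the client holds a stale value, and each poll costs the server one request. These are the two drawbacks of the pull-only mode noted in the text.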
In the push-only mode of data delivery, the transfer of data from servers to clients
is initiated by a server push in the absence of any specific request from clients.
The main difficulty of the push-based approach is in deciding which data would be
of common interest, and when to send them to clients; the alternatives are periodic,
irregular, or conditional. Thus, the usefulness of server push depends heavily upon
how accurately a server can predict the needs of clients. In push-based mode, servers
disseminate information either to an unbounded set of clients (random broadcast)
who can listen to a medium, or to a selective set of clients (multicast) who belong to some
categories of recipients that may receive the data.
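The difference between broadcast and multicast in push-based delivery can be sketched as follows. The sketch is a simplification under stated assumptions: the listener list stands in for "clients who can listen to a medium", and the names `PushServer`, `Client`, and the category labels are hypothetical.

```python
class Client:
    """A passive recipient; it never sends a request."""

    def __init__(self, name):
        self.name = name
        self.inbox = []


class PushServer:
    """Initiates every transfer itself, with no request from clients."""

    def __init__(self):
        self.listeners = []  # pairs of (client, set of categories)

    def register(self, client, categories):
        self.listeners.append((client, set(categories)))

    def broadcast(self, item):
        # Broadcast: every listener receives the item.
        for client, _ in self.listeners:
            client.inbox.append(item)

    def multicast(self, category, item):
        # Multicast: only clients belonging to the category receive it.
        for client, categories in self.listeners:
            if category in categories:
                client.inbox.append(item)


push_server = PushServer()
a, b = Client("a"), Client("b")
push_server.register(a, ["stocks"])
push_server.register(b, ["weather"])
push_server.broadcast("system notice")
push_server.multicast("stocks", "IBM 105")
print(a.inbox, b.inbox)
```

Client `b` never sees the stock item. Whether that is correct depends on how well the category assignment reflects what `b` actually needs, which is the prediction problem the text describes.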
The hybrid mode of data delivery combines the client-pull and server-push mechanisms.
The continuous (or continual) query approach (e.g., [Liu et al., 1996], [Terry
et al., 1992], [Chen et al., 2000], [Pandey et al., 2003]) presents one possible way of
combining the pull and push modes: namely, the transfer of information from servers
to clients is first initiated by a client pull (by posing the query), and the subsequent
transfer of updated information to clients is initiated by a server push.
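A continuous query can be sketched as a pull followed by pushes. This is only an illustration of the pattern, not the mechanism of any of the cited systems; `ContinuousQueryServer`, `pose_query`, and the predicate/callback interface are assumptions.

```python
class ContinuousQueryServer:
    """Answers a query once (pull), then pushes later matching updates."""

    def __init__(self):
        self.data = {}
        self.subscriptions = []  # pairs of (predicate, callback)

    def pose_query(self, predicate, callback):
        # Client pull: remember the query and return the current answer.
        self.subscriptions.append((predicate, callback))
        return {k: v for k, v in self.data.items() if predicate(k, v)}

    def update(self, key, value):
        self.data[key] = value
        # Server push: notify every client whose query matches the update.
        for predicate, callback in self.subscriptions:
            if predicate(key, value):
                callback(key, value)


cq_server = ContinuousQueryServer()
cq_server.update("IBM", 100)
cq_server.update("HP", 40)

pushed = []
initial = cq_server.pose_query(
    lambda key, value: value >= 50,                # the query
    lambda key, value: pushed.append((key, value)),  # where pushes arrive
)
cq_server.update("IBM", 101)   # matches the query, so it is pushed
cq_server.update("HP", 41)     # does not match, so nothing is pushed
print(initial, pushed)
```

The client asks once and is then kept up to date without polling, while the server pushes only what the posed query has told it the client wants.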
There are three typical frequency measurements that can be used to classify the
regularity of data delivery. They are periodic, conditional, and ad-hoc or irregular.
In periodic delivery, data are sent from the server to clients at regular intervals.
The intervals can be defined by system default or by clients using their profiles. Both
pull and push can be performed in periodic fashion. Periodic delivery is carried out
on a regular and pre-specified repeating schedule. A client request for IBM’s stock
price every week is an example of a periodic pull. An example of periodic push is
an application that sends out a stock price listing on a regular basis, say every
morning. Periodic push is particularly useful for situations in which clients might not
be available at all times, or might be unable to react to what has been sent, such as in
the mobile setting where clients can become disconnected.
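The benefit of periodic push for intermittently connected clients can be sketched with a simulated clock. Everything here is hypothetical: the tick-based schedule, the `MobileClient` class, and the made-up price function exist only to show the delivery pattern.

```python
class MobileClient:
    """A client that is unreachable during some ticks of the simulated clock."""

    def __init__(self, offline_ticks=()):
        self.offline_ticks = set(offline_ticks)
        self.received = []

    def deliver(self, tick, listing):
        if tick in self.offline_ticks:
            return False          # disconnected: this delivery is missed
        self.received.append((tick, listing))
        return True


def run_periodic_push(get_listing, clients, period, total_ticks):
    """Push the current listing to all clients every `period` ticks."""
    for tick in range(0, total_ticks, period):
        listing = get_listing(tick)
        for client in clients:
            client.deliver(tick, listing)


def prices(tick):
    # Made-up price series, used only to make deliveries distinguishable.
    return {"IBM": 100 + tick}


always_on = MobileClient()
flaky = MobileClient(offline_ticks={24})
run_periodic_push(prices, [always_on, flaky], period=24, total_ticks=72)
print(len(always_on.received), len(flaky.received))
```

The client that was disconnected at tick 24 misses one listing but is brought up to date by the next scheduled push at tick 48, without having to issue any request of its own.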