26-04-2014, 11:06 AM
Storage Layout and I/O Performance Tuning for IBM® Red Brick® Data Warehouse
Abstract
Although Red Brick performance depends primarily on the logical database design, the placement
and allocation of data files in the disk subsystem is another significant performance factor. There
are many choices in the configuration and design of the physical database layout. This article
reviews best practices propagated by storage and database vendors and presents measurements that
compare storage layout alternatives for Red Brick. This leads to specific I/O configuration and
tuning guidelines for IBM Red Brick Data Warehouse.
Introduction
Designing performance for an IBM® Red Brick data warehouse does not begin with storage layout
and I/O tuning. A great storage layout and finely tuned I/O performance cannot make up for poor
join performance caused by inadequate schema or index design. After logical design, you must also
consider capacity planning to estimate the number of required CPUs and disks. Neither logical design nor capacity planning is addressed in this article.
However, eventually there comes a time to optimize the physical database design and its storage
configuration. Because Red Brick is typically used to process complex analytical queries over huge
amounts of data, efficient I/O is often an important performance factor. You must map all tables
and indexes to a set of database files (called physical storage units, or PSUs), which in turn are
allocated in the disk subsystem.
What the vendors say
Hardware and software vendors such as IBM, Oracle, Informix®, Sun, Compaq, EMC, and Veritas
have issued a variety of white papers and tuning recommendations. Some of them contradict each
other and must be applied with care. For an excellent survey of disk striping, RAID technologies,
and their application see [Chen et al. 94].
Vendor recommendations for storage layout
Recommendations for database storage layout vary widely, often with specific qualifiers tied to
the vendor's hardware or software product. For simplicity, we classify the storage layout
guidelines into one of two categories:
Distribute all database objects over all disks.
Separate database objects using disjoint (non-overlapping) sets of disks.
By database objects we mean tables, indexes, temp space, logs, database catalog/system tables,
materialized views, and so on.
Separate database objects using disjoint sets of disks
A common recommendation is to place the recovery log on disks that are separate from the table
and indexes, not only for performance but also for fault tolerance. Other “separation” rules suggest
separating tables from indexes or separating tables from each other if they are joined frequently.
Yet another guideline found is separation of sequential from random access data [Compaq 98],
[Whalen, Schoeb 99]. For example, [Holdsworth 96] advises that tables that are subject to multiple
concurrent parallel scans be given a dedicated set of disks. While this may be beneficial for
recovery logs, it is difficult to achieve for tables; for example, [Larson 00] argues that a table scan running
concurrently with an indexed lookup on the same table is unlikely to enjoy sequential I/O benefits.
Vendor recommendations for stripe unit size
[Holdsworth 96] notes that stripe unit size is a hotly debated issue. He argues that the stripe size
should not be set to the database system's block size because this has been shown to degrade
performance (in Oracle systems). He recommends a stripe size significantly larger than the database
block size, such as 1 MB; on very large systems the stripe size should be 5 MB. [Veritas 01]
advises that the stripe size should be a multiple of the OS (or logical volume) block size. For most
OLTP databases, Veritas recommends their default stripe unit size of 64 KB while Decision
Support Systems (DSS) may perform better with larger stripe sizes, such as 128 KB or 256 KB.
[Loaiza 00] argues that a 1 MB stripe unit size is suitable for any kind of workload because it leads
to a good ratio of transfer times to seek times. He indicates that this value will increase in the future as
disk transfer rates increase faster than seek times decrease. For Red Brick, [Fung 98] recommends
an 8 KB stripe unit size.
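The transfer-to-seek ratio behind [Loaiza 00]'s argument can be made concrete with a back-of-envelope calculation. The disk parameters below (8 ms average positioning time, 20 MB/s sustained transfer rate) are illustrative assumptions for disks of that era, not figures taken from the cited papers:

```python
# Back-of-envelope comparison of positioning vs. transfer time for
# different stripe unit sizes. Disk parameters are assumed values.

AVG_SEEK_MS = 8.0          # assumed average seek + rotational latency
TRANSFER_MB_PER_S = 20.0   # assumed sustained transfer rate

def transfer_fraction(stripe_unit_kb):
    """Fraction of total service time spent transferring data
    (rather than seeking) for one stripe-unit-sized request."""
    transfer_ms = stripe_unit_kb / 1024.0 / TRANSFER_MB_PER_S * 1000.0
    return transfer_ms / (transfer_ms + AVG_SEEK_MS)

for kb in (8, 64, 1024):
    print(f"{kb:5d} KB stripe unit: "
          f"{transfer_fraction(kb):.0%} of service time transferring")
```

Under these assumptions an 8 KB request spends roughly 5% of its service time transferring data and the rest positioning, while a 1 MB request spends over 85% transferring, which is the "good ratio" the 1 MB recommendation aims for.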
Research findings – stripe unit size
[Scheuermann et al. 98] is one of the more recent studies that unifies and confirms existing research. Their
performance model is based on an open queueing network, which seems appropriate for today's
database applications (many users, varying degrees of concurrency). The results show that a small
stripe unit size can provide good response times for a light load, but can severely limit throughput.
Consequently, with small stripe units the response times increase drastically under heavy loads. A
larger stripe unit size tends to cluster portions of a file on disk. While this may or may not improve
the performance of individual I/O requests (depending on the request size), it is shown to increase
overall I/O throughput if the requests are still uniformly distributed over all disks. Since a large
stripe unit size, e.g. 64 KB or 1 MB, is still very small compared to the total database size, this
uniformity assumption seems reasonable. Thus, theoretical results indicate that large stripe unit
sizes provide better performance under medium to heavy workloads, which in database systems can
arise due to multi-user activity, intra-query parallelism, or both.
Stripe 12/4/4/2 (64 KB)
This layout follows recommendations in [Fung 98]; that is, it separates fact table, star indexes, spill
space, and dimensions into different striped volumes on disjoint sets of disks. The number of disks
in each striped volume is again based on the relative size of the objects. The fact table is striped over
12 disks, the star indexes and the spill space over 4 disks each, and the dimensions over 2 disks. We
chose a medium stripe unit size of 64 KB, which is the recommended default value in the Veritas
Volume Manager that we used.
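For reference, the mapping from a logical byte offset in a striped volume to a disk and a per-disk offset can be sketched as below. This is the standard round-robin striping formula, not code from the article; the 12-disk, 64 KB defaults mirror the fact-table volume described above:

```python
# Round-robin striping: map a logical byte offset in a striped volume
# to (disk index, byte offset on that disk).

def stripe_map(logical_offset, stripe_unit=64 * 1024, num_disks=12):
    stripe_no = logical_offset // stripe_unit   # which stripe unit overall
    within = logical_offset % stripe_unit       # position inside that unit
    disk = stripe_no % num_disks                # units assigned round-robin
    disk_stripe = stripe_no // num_disks        # which unit on that disk
    return disk, disk_stripe * stripe_unit + within

# The first 12 stripe units land on disks 0..11, then the mapping wraps:
print(stripe_map(0))                 # → (0, 0)
print(stripe_map(64 * 1024))         # → (1, 0)
print(stripe_map(12 * 64 * 1024))    # → (0, 65536)
```

With a 64 KB unit, a 768 KB sequential read touches all 12 disks once, which is why larger units keep more of a moderately sized request on a single disk while still spreading independent requests across the array.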
Measurement results
We executed and measured each workload multiple times for each of the storage layout alternatives
to ensure reproducibility and consistency of the results. To isolate the performance impact of the
storage layout alternatives and to prevent other performance issues from distorting the results, the
database system was highly tuned for each of the five workloads. To a certain extent, we also
repeated the experiments for different sets of database configuration parameters. For example, we
varied the degree of parallelism per query, the amount of main memory used by Red Brick, etc. We
also experimented with sorted vs. unsorted data in the fact table and with pre-allocated data files vs.
data files that get extended during the load process. These variations confirmed that the results and
conclusions, which we present below, are robust and not very sensitive to such configuration
changes.
Conclusions and tuning recommendations
Our experiments show that distributing all Red Brick database files (and thus the I/O load) evenly
over all available disks by means of striping provides very good performance at low administration
overhead. This confirms the validity of storage layout strategies described as “extreme striping” or
SAME (see Section 2) and is also consistent with research results. Specifically, the separation of
tables from indexes is not required and can adversely affect performance.
Below we summarize the main tuning recommendations for IBM Red Brick Data Warehouse.
These tuning guidelines cannot be expected to provide optimal performance for every possible
query and application scenario. Instead, they are intended to optimize the overall performance of a
representatively mixed workload. We emphasize that these are just rough guidelines and that a
specific data warehouse installation or the characteristics of a particular computing environment
may require adjusting these strategies. Thus, we do not encourage you to follow our
recommendations blindly. We do hope that you come away from this article with a better
understanding of the choices and potential consequences of different storage layouts. This in turn
should help you make educated decisions when designing the storage layout for a particular data
warehouse implementation.