06-09-2012, 03:29 PM
Building data warehouses usingopen source technologies
Building data.pdf (Size: 956.19 KB / Downloads: 76)
Introduction
This article is about building data warehouses. A data warehouse is a computer database that collects, integrates and stores an organization's data with the aim of producing accurate and timely management information and supporting data analysis1. It explains the importance of a good data warehouse and cover the process of building such a specialized database using open source technologies.
Intended audience
This article is meant for software developers, database administrators, integrated software vendors or other people who are facing the challenge of making the analysis of large amounts of data generated by a business or an information system possible. It has been written with the assumption of a basic understanding of the covered concepts like databases, information systems and business transactions in mind.
Why build a data warehouse
There are many reasons to justify building a data warehouse, but almost all of them boil down to the same basic wish: provide means for analysis of data to support management decisions. You are probably already providing management with data like site usage statistics, referrer trends and the number of registered users of their information system. This is basic information, which can be retrieved directly from the information system itself. However, this has some important drawbacks.
First of all, since all of this data is retrieved directly from the information system, it places a considerable burden on this system. Analysis often requires huge amounts of data to be processed, which is often a problem if, for instance, the system's database tables get locked for retrieval. A data warehouse can be completely detached from the information system, even running on a different system.
Secondly, an OLTP2system's data model is rarely optimized for analysis. We all learn to develop systems to use a normalized database modeled after our entity relationships. This is a good thing for the information system, since the underlying model is a close reflection of the system itself, but it makes querying for large sums of aggregated data a costly operation. Furthermore, redundancy is rarely part of this database design, because redundancy is hard to maintain, often causing data inconsistencies or worse. For data analysis, redundancy can be great, because it speeds up the process.
Building a data warehouse
In this chapter, I will guide you through the different phases in building a data warehouse. I will illustrate this by using a simplified webbased information system as an example for creating a data warehouse. This system contains familiar entities such as “requests”, “users” and “pages”. The data model of this system is shown in figure <TODO>.
Designing a dimensional data warehouse
Asking questions
The most important step in building a data warehouse is designing it. You have to ask yourself: “What does the management want to know?”. First, you'll have to figure out which questions need to be answered by analysis of the data warehouse to.be3. For example, is there a correlation between users of a webbased system and the pages it provides? Do certain groups of users visit other pages than others? Where do the visitors from different pages come from? My belief is at least 80% of all management questions about the data generated by an information system can be answered by a decent data warehouse.
Since you are reading this, I assume your company's management has already formulated this kind of questions. If not, make sure they do. Don't think too lightly about this phase, as it is the basis of what the resulting data warehouse can do.
Modeling structures
Once you get a good view on what questions you'll want to ask the data warehouse later on, you have to determine the different dataarrangements that come with these questions. For example, a question about sales will have to operate on a different structure than one about employment. These structures are called “cubes” in the world of OLAP4, because they are essentially an extension to the twodimensional array of data that is stored in a conventional (for example SQL) database. Depending on the data, a cube may have more than two or even three dimensions. We'll model these cubes in a relational database as a star schema, containing a single fact table linking together multiple dimension tables.