29-08-2016, 12:23 PM
1451386599-rep.pdf (Size: 1.69 MB / Downloads: 17)
ABSTRACT
Extraction Transformation Loading (ETL) tools are pieces of software responsible
for the extraction of data from several sources, its cleansing, customization, reformatting, integration,
and insertion into a data warehouse. Building the ETL process is potentially one
of the biggest tasks of building a warehouse; it is complex, time consuming, and consume
most of data warehouse projects implementation efforts, costs, and resources. Building a data
warehouse requires focusing closely on understanding three main areas: the source area, the
destination area, and the mapping area (ETL processes). The source area has standard models
such as entity relationship diagram, and the destination area has standard models such as star
schema, but the mapping area has not a standard model till now. In spite of the importance of
ETL processes, little research has been done in this area due to its complexity. There is a clear
lack of a standard model that can be used to represent the ETL scenarios.Research in the field
of modeling ETL processes can be categorized into three main approaches: Modeling based on
mapping expressions and guidelines, modeling based on conceptual constructs, and modeling
based on UML environment. These projects try to represent the main mapping activities at
the conceptual level. The proposed model is built upon the enhancement of the models in the
previous models to support some missing mapping features.
INTRODUCTION
Data warehouse is a central repository of integrated data from one or more disparate data
sources. They store current and historical data and are used for creating analytical reports for
knowledge workers throughout the enterprise. In order to facilitate and manage the data warehouse
operational processes, specialized tools are already available in the market, under the
general title Extraction-Transformation-Loading (ETL) tools. Although ETL processes area is
very important, it has little research. This is because of its difficulty and lack of formal model
for representing ETL activities that map the incoming data from different DSs to be in a suitable
format for loading to the target Data warehouse. To build a DW we must run the ETL
tool which has three tasks: (1) data is extracted from different data sources, (2) propagated
to the data staging area where it is transformed and cleansed, and then (3) loaded to the data
warehouse. Many research projects try to represent the ETL processes at the conceptual level.
Here a conceptual model is proposed to be used in modelling various ETL processes and cover
the limitations of the previous research projects. The proposed model has the following characteristics:
−Simple: to be understood by the DW designer.
−Complete: to represent all activities of the ETL processes.
−Customizable: to be used in different DW environments.
The proposed model is entity mapping diagram (EMD). EMD model will be discussed
here. And also will make a survey of the previous work done in this area.
ETL MODELLING CONCEPTS
The general framework for ETL processes is shown in Fig. 2.1. Data is extracted from
different data sources, and then propagated to the DSA where it is transformed and cleansed
before being loaded to the data warehouse.
Fig. 2.1: A general framework for ETL processes
2.1 THE ETL PHASES
During the ETL process, data is extracted from an OLTP databases, transformed to match
the data warehouse schema, and loaded into the data warehouse database. Many data warehouses
also incorporate data from non-OLTP systems, such as text files, legacy systems, and
spreadsheets. ETL is often a complex combination of process and technology that consumes a
significant portion of the data warehouse development efforts and requires the skills of business
analysts, database designers, and application developers. The ETL process is not a one-time
event. As data sources change the data warehouse will periodically updated. The ETL processes
must be designed for ease of modification. A solid, well-designed, and documented
2
College of Engineering, Cherthala CHAPTER 2. ETL MODELLING CONCEPTS
ETL system is necessary for the success of a data warehouse project.
An ETL system consists of three consecutive functional steps: extraction, transformation,
and loading:
2.1.1 EXTRACTION
The first step in any ETL scenario is data extraction. The ETL extraction step is responsible
for extracting data from the source systems. Each data source has its distinct set of
characteristics that need to be managed in order to effectively extract data for the ETL process.
The process needs to effectively integrate systems that have different platforms, such as different
database management systems, different operating systems, and different communications
protocols.
The extraction process consists of two phases, initial extraction, and changed data extraction.
In the initial extraction it is the first time to get the data from the different operational
sources to be loaded into the data warehouse. This process is done only one time after building
the DW to populate it with a huge amount of data from source systems. The incremental
extraction is called changed data capture (CDC) where the ETL processes refresh the DW with
the modified and added data in the source systems since the last extraction.
2.1.2 TRANSFORMATION
The second step in any ETL scenario is data transformation. The transformation step
tends to make some cleaning and conforming on the incoming data to gain accurate data which
is correct, complete, consistent, and unambiguous. This process includes data cleaning, transformation,
and integration.
2.1.3 LOADING
Loading data to the target multidimensional structure is the final ETL step. In this step,
extracted and transformed data is written into the dimensional structures actually accessed by
the end users and application systems
MODELS OF ETL PROCESSES
This section will navigate through the efforts done to conceptualize the ETL processes.
Although the ETL processes are critical in building and maintaining the DW systems, there is
a clear lack of a standard model that can be used to represent the ETL scenarios. Research in
the field of modeling ETL processes can be categorized into three main approaches:
1. Modeling based on mapping expressions and guidelines.
2. Modeling based on conceptual constructs.
3. Modeling based on UML environment.
In the following, a brief description of each approach is presented.
3.1 MODELING ETL PROCESS USING MAPPING EXPRESSIONS
A model covering different types of mapping expressions. This model to create an active
ETL tool to achieve the warehousing process. Queries will be used to represent the mapping
between the source and the target data; thus, allowing DBMS to play an expanded role as a data
transformation engine as well as a data store. This approach enables a complete interaction between
mapping metadata and the warehousing tool. In addition, it addresses the efficiency of a
query-based data warehousing ETL tool without suggesting any graphical models. It describes
a query generator for reusable and more efficient data warehouse (DW) processing.
4
College of Engineering, Cherthala CHAPTER 3. MODELS OF ETL PROCESSES
3.1.1 MAPPING GUIDELINE
Mapping guideline means the set of information defined by the developers in order to
achieve the mapping between the attributes of two schemas. Actually, different kinds of mapping
guidelines are used for many applications. Traditionally, these guidelines are defined
manually during the system implementation. In the best case, they are saved as paper documents.
These guidelines are used as references each time there is a need to understand how an
attribute of a target schema has been generated from the sources attributes. To keep updating
these guidelines is a very hard task, especially with different versions of guidelines. To update
the mapping of an attribute in the system, one should include an update for the paper document
guideline as well. Thus, it is extremely difficult to maintain such tasks especially with
simultaneous updates by different users.
3.1.2 MAPPING EXPRESSIONS
Mapping expression of an attribute is the information needed to recognize how a target
attribute is created from the sources attributes. Examples of the applications where mapping
expressions are used are listed as follows:
• Schema mapping: for database schema mapping, the mapping expression is needed to define
the correspondence between matched elements.
• Data warehousing tool (ETL): includes a transformation process where the correspondence
between the sources data and the target DW data is defined.
• EDI message mapping: the need of a complex message translation is required for EDI, where
data must be transformed from one EDI message format into another.
• EAI (enterprise application integration): the integration of information systems and applications
needs a middleware to manage this process .It includes management rules of an enterprises
applications, data spread rules for concerned applications, and data conversion rules. Indeed,
data conversion rules define the mapping expression of integrated data.