ಓಂ Introduction
One of the primary components in a SQL Server business intelligence (BI) solution is the data warehouse. Indeed, the data warehouse is, in a sense, the glue that holds the system together. The warehouse acts as a central repository for heterogeneous data that is to be used for purposes of analysis and reporting.
Because of the essential role that the data warehouse plays in a BI solution, it’s important to understand the fundamental concepts related to data warehousing if you’re working with such a solution, even if you’re not directly responsible for the data warehouse itself. To this end, the article provides a basic overview of what a data warehouse is and how it fits into a relational database management system (RDBMS) such as SQL Server. The article then describes database modeling concepts and the components that make up the model, and concludes with an overview of how the warehouse is integrated with other components in the SQL Server suite of BI tools.
Note:
The purpose of this article is to provide an overview of data warehouse concepts. It is not meant as a recommendation for any specific design. In addition, the article assumes that you have a basic understanding of relational database concepts such as normalization and referential integrity. In addition, the examples used in here tend to be specific to SQL Server 2005 and 2008, although the underlying principles can apply to any RDBMS.
The Data Warehouse
A data warehouse consolidates, standardizes, and organizes data in order to support business decisions that are made through analysis and reporting. The data might originate in RDBMSs such as SQL Server or Oracle, Excel spreadsheets, CSV files, directory services stores such as Active Directory, or other types of data stores, as is often the case in large enterprise networks. Figure 1 illustrates how heterogeneous data is consolidated into a data warehouse.
Figure 1: Using a Data Warehouse to Consolidate Heterogeneous Data
The data warehouse must be able to store data from a variety of data sources in a way that lets tools such as SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS) efficiently access the data. These tools are, in effect, indifferent to the original data sources and are concerned only with the reliability and viability of the data in the warehouse.
A data warehouse is sometimes considered to be a place for archiving data; however, that is not its purpose. Although historical data is stored in a data warehouse, only the historical range necessary to support analysis and reporting is retained there. For example, if a business rule specifies that the warehouse must maintain two years worth of historical data, older data is offloaded to another system for archiving or is deleted, depending on the specified business requirements.
Data Warehouse vs. Data Mart
A data warehouse is different from a data mart, although the terms are sometimes used interchangeable. In addition, there is some debate as to what exactly constitutes a data mart as compared to a data warehouse. However, it is generally accepted that a data warehouse is associated with enterprise-wide business processes and decisions (and consequently is usually a
repository for enterprise-wide data), whereas the data mart tends to focus on a specific business segment of that enterprise. In some cases, a data mart might be considered a subset of the data warehouse, although this is by no means a universal interpretation or practice. For the purposes of this article, we’re concerned only with the enterprise-wide repository known as a data warehouse.
Relational Database vs. Dimensional Database
Because SQL Server, like Oracle and MySQL, is a RDBMS, any database stored within that system can be considered, by extension, a relational database. And that’s where things can get confusing.
The typical relational database supports online transaction processing (OLTP). For example, an OLTP database might support bank transactions or store sales. The transactions are immediate and the data is current, with regard to the most recent transaction. The database conforms to a relational model for efficient transaction processing and data integrity. The database design should, in theory, adhere to the strict rules of normalization which aim, among other things, to ensure that the data is treated as atomic units and there is minimal amount of redundant data.
A data warehouse, on the other hand, generally conforms to a dimensional model, which is more concerned with query efficiency than issues of normalization. Even though a data warehouse is, strictly speaking, a relational database (because it’s stored in a RDBMS), the tables and relationships between those tables are modelled very differently from the tables and relationships defined in the relational database. (The specifics of data warehouse modelling are discussed below.)
Note:
Because of the reasons described above, you might come across documentation that refers to a data warehouse as a relational database. However, for the purposes of this article, I refer to an OLTP database as a relational database and a data warehouse as a dimensional database.
Dimensional Database vs. Multidimensional Database
Another source of confusion at times is the distinction between a data warehouse and an SSAS database. The confusion results because some documentation refers to an SSAS database as a dimensional database. However, unlike the SQL Server database engine, which supports OLTP as well as data warehousing, Analysis Services supports online analytical processing (OLAP), which is designed to quickly process analytical queries. Data in an OLAP database is stored in multidimensional cubes of aggregated data, unlike the typical table/column model found in relational and dimensional databases.
Note:
I’ve also seen a data warehouse referred to as a staging database when used to support SSAS analysis. Perhaps from an SSAS perspective, this might make sense, especially if all data in the data warehouse is always rolled up into SSAS multidimensional cubes. In this sense, SSAS is seen as the primary database and the warehouse as only supporting that database. Personally, I think such a classification diminishes the essential role that the data warehouse plays in a BI solution, so I would tend to avoid this sort of reference.