Why should you have a data warehouse? A typical large company may have many different computer based systems. Many of them includes databases. The typical applications are wage, inventory, and order management systems. From time to time the management of the company needs the latest report on something in order to make better decisions. What they really needs is a database with all the data and that’s a data warehouse. The typical decision maker only needs to use RAM(Random Access Memory) to access this data. *Warehouse- A large building used for storing large packages and food for the poor. Where does a data warehouse fit in? The primary reason to have a data warehouse is the company and without the products that this company makes, there’s no reason having a company. With that in mind the production is always most important and therefore whatever happens data warehouse must not bother production and since all data comes from production data warehouse is placed last in the chain of data. When unloading data from the production systems the load on these systems will increase. The typically solution to this is only to unload the data then there is least load on the production systems, and often this is at night. The process of reading data from a source (like XML, CSV, Excel, databases and maybe a webpage) cleaning the data, reformatting it in order to make it fit in a data warehouse is called Extract, Transform and Load or ETL for short. There’s a number of ETL tools. For example SAS DI Studio, timeXtender and Microsoft SSIS, what comes with the SQL Server. Structuring data in a data warehouse Data in a data warehouse are organized in a way that makes it easier to analyse on the data. Often the data has limited history or even full history. Like the names is telling us, it’s a warehouse of data and when building this warehouse there’s no telling which data the management is going to need, so every piece of data need to be loaded. Actually we might have a clue of which data they might be needing first and we will start by loading these data first. Building a data warehouse is a major task and we are going to store a huge amount of data. Therefore it’s very important to do it right. Like an ordinary warehouse we can’t attract customers if our product is bad and our product is bad if we don’t have the right data, the data aren’t consistent, the performance is to low or our data is faulty. Dimension tables We remember from our production systems that we have attributes on an entity. Here we sort of splits them up. A dimension table contains the attributes. If our product is ‘milk’ at a cost of ‘4’ and placed on shelf ‘10A’ in the store, holding the barcode ’102401’, it’s data for our dimension table. Surrogatkeys In our production systems we often use natural keys like ‘social security number’ or ‘barcode’, because they are unique and fit on the entity of natural reasons. These natural keys might be of many different types like varchar, datetime or decimal number. They may even be a sum of more than one column and they may change over time. In a data warehouse we need keys to be simple, never changing and easy for the server to join on, and that’s where surrogant keys enter the picture. Every dimension table have to have one key column that’s easy to join on, is unique and never going to change. This could be forerunning number or a GUID. Fact tables This is here our real product is located but there’s no data in a fact table. A fact table is only foreign keys to the dimension tables and dates telling us when this was the current status. Confused? This makes good sense. Slowly changing dimension Besides a lot of foreign keys there’s also one or two dates in a fact table to tell us when this was the current status. When the cost of ‘milk’ goes from 4 to 5 or the location goes from shelf ‘10A’ to ‘8B’ it’s a change of row I the corresponding dimension table. It’s called slowly changing dimensions, because they typically don’t change so often. The star schema As we have heard the organisation of data in a data warehouse isn’t like we are used to in relational designs. The attributes are taken out of the entity and therefore we need a new representation of the tables that is still usable by a normal SQL server. As mentioned before the dimension tables relate to the fact table and that way the entity is described, with history. In the top left of the picture the table Production illustrating the production environment. The rest of the tables are a star schema. The naming topology here is D for dimension tables and F for fact tables. The ID’s can be either business related unique keys, running numbers or GUID’s. Example of data in an operational system to the data warehouse No doubt about that the star schema at first, looks more complex than necessarily, but it’s not that hard to use. In this example I have shown first the product table, showing today’s status and the more complex select statement showing the same from the star schema. It’s easy to see the changes over time. First a rise in price and later the milk changed position in the refrigerator. Here the ETL program is running every night, making it precise enough to use dates. In other situations the ETL program are better of running with smaller intervals or you are so lucky that there’s a ‘last changed’ timestamp in the production data, giving the possibility to make the ‘from date’ and ‘to date’ more precise as well. At least it might be advisable to use a ‘last changed’ date on both the fact and dimension tables. This timestamp is updated with Now() by the ETL to display. This information can be vital when correcting errors or when trying to find why an error occurred. Different types of slowly changing dimensions Actually there are multiple types of slowly changing dimensions. The one described above here is the ‘Type 2’, that gives full history. ‘Type1’ overrides old values with new ones and contains therefore no history. The last type in this short analysis is the ‘Type 3’ that contains very limited history. The snowflake schema The snowflake schema is similar to the star schema. However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are normalized with each dimension represented by a single table. Whats next. Are we there yet? And the ansewer is no. We have just begun on the jurny leading tovards buissiness inteligens. There are still lots of tasks and issues to handle in order to make the jurny a success. This might be a project you cant close with a smile and harvest credit from the managers forever on. Data quality If the quality is bad, no mater if its groceries or data, the customers stay away. Make automatic checks in your ETL programs to ensure data quality. Look at the quantity of new lines inserted on your dimentions. If the amount is significantly larger or smaller than normal it could be a sign that something is wrong. You could also calculate the average of a specified numeric field to check large deviations of this. Lots of places you can use ‘Integrity Constraints’ or ‘Referential Integrity’ in order to ensure some data quality this way. Performance With a data amount this big, performance is always an issue. If the response time is too long, your customers might not use your data warehouse. Performance isn’t just bigger and more servers. Quite often bad performance derives from the organization of the data. Maybe spreading data over multiple ‘file groups’ on separate disks. Try analysing how you customers read the data. Either they don’t hit the indexes or you need more indexes. Education Often the users have made their own reports, from the limited production data and now there’s a data warehouse with better possibilities. At first glance start schema with historical data, can be overwhelming. In order for them to see the full potential of the use of data warehouse, some education would help them. William Inmon approach If there’s a Kimball Design approach, then there must be something more out there and there is. There’s also The William Inmon approach, which is outside the scope of this Article.
|
|
|