Size Of Data Warehouse

The storage design (hardware & data modelling) vary based on the size of the data warehouse. Broadly, there are three different sizes of data warehouse: Small-Size Data warehouse: DWH databases ~ 200 GB Medium-Size Data warehouse: DWH databases are between ~300 GB and ~ 5 TB Enterprise (large) Data warehouse: DWH databases are more than ~5 TB Fact tables

  • Carefully choose your data type (use DW best practices).
  • Consider partitioning fact tables if they are greater than 10 GB.
  • Index them properly, make use of column-storage index on SQL Server 2012 or 2014.
  • On medium-size DW and on large Fact tables use partition by date key.

Snapshot facts (Daily/Weekly/Monthly/Quarterly/Yearly facts)

  • Maintain separate fact tables for each grain level of facts.
  • Each time period of facts should be stored separately from each other. For example, Daily and Monthly facts should not be combined, nor should Monthly and Quarterly facts, and so on.
  • Typical users will analyse yearly /quarterly / monthly data more frequently than weekly / daily data.
  • Apply index on date keys of fact tables as this will improve efficiency when retrieving historical data.

Dimensions tables

  • Carefully choose your data type (use DW best practices).
  • Don’t confuse context / activity with measures; divide the context into logical groupings such as customers, store locations, and time, and store them as dimensions.
  • Don’t snowflake when a simple star schema will do. The purpose of data warehouse design isn’t to normalize the data (as you attempt to do when you create a snowflake schema), but rather to organize the data in such a way as to be intuitively meaningful to end users. That often means you need to de-normalize the dimensions.
  • Use integer surrogate keys for all dimensions, other than the Date dimension. Use the smallest possible integer for the dimension surrogate keys. This helps to keep fact table narrow.
  • Use a meaningful date key of integer type derivable from the DATETIME data type (for example: 20141214).
  • Don’t use a surrogate Key for the Date dimension.
  • Leverage easy to write queries that put a WHERE clause on this column, which will allow partition elimination of the fact table.
  • Build a clustered index on the surrogate key for each dimension table, and build a non-clustered index on the Business Key (potentially combined with a row-effective-date) to support surrogate key lookups during loads.
  • Build non-clustered indexes on other frequently searched dimension columns.
  • Avoid partitioning dimension tables.
  • Avoid enforcing foreign key relationships between the fact and the dimension tables, to allow faster data loads. You can create foreign key constraints with NOCHECK to document the relationships; but don’t enforce them. Ensure data integrity though Transform Lookups, or perform the data integrity checks at the source of the data.

Large dimensions Dimensions are often expected to be smaller in size less than 100,000 rows. If your dimension contains more than 100,000 rows you know you have a large dimension. Large dimensions that change slowly

  • If your dimension change frequency per month is around 300 rows out of 100,000 rows, you can treat it as a slowly changing dimensions.
  • Any SCD (Slowly Changing Dimension) types can be applied on this.

Large dimension with frequent changes

  • If your dimension change frequency per month is more than 1000 rows out of ~100,000 rows, you can treat it as a frequently changing dimension.
  • Split dimensions by separating the stable attributes from the attributes that change frequently (i.e., the “hot” attributes), and then link the split dimensions through a fact table.
  • SCD Types must be chosen very carefully as it might need more storage space and will impact on data retrieval performance.
  • Table partitioning on dimensions are not advised; hence is important to plan SCD Type 2 or 4.

If disk space usage becomes an issue, it can be a good idea to un-pivot the changed data set and store it in a separate table with column storage index. This will improve retrieval performance. Thanks very much for reading until the end! Your comments and feedback are most appreciated and will help me to improve my blogging skills. Meet you all again in my next blog. Take it easy!