Wednesday, February 29, 2012

CRM and data mining Day 02

Foreign keys must match corresponding PKs to maintain referential integrity.

Surrogate key is a unique value, usually an integer assigned to each row in the dimension. It becomes the PK.

An OLAP engine such as Analysis Services provides a better query language (than SQL) and better computational performance.

Entity Relationship modelling is a technique used to 'abstract' users' data requirements into a model that can be analyzed and ultimately implemented. The objectives of ER modelling are
  1. Avoid anomalies, and achieve processing and data storage efficiency by reducing data redundancy (storing data elements once).
  2. Provide flexibility and ease of maintenance.
  3. Protect the integrity of data by storing it once.
ER modelling and normalization simplify transaction processing as they make transactions as simple as possible (data is stored in one place only). However, normalized databases become very complex, making queries difficult and inefficient. A 'spider-web of joins' is required for many queries.

Every dimensional model is composed of one table with a multi-part key, called the fact table, and a set of smaller tables called dimension tables.

Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multi-part key in the fact table.

A single ER diagram breaks down into multiple Dimensional Model diagrams, or 'stars'.

ER modelling (arguably) does not really model a business,; rather, it models the micro-relationships among data elements.

Facts are the 'verbs or actions' of the business. E.g. taking an order, displaying a web page, printing a book,  handling a customer support request. 

A value in a fact table is a measurement. E.g. quantity ordered, sale amount, call duration.

The level of detail in a fact table is called the grain. It is recommended that the grain be kept to lowest (or finest) level. E.g. one row per sale, one row per service call.

A fact always 'resolves' a many-to-many relationship between the parent (or dimension) tables.

The most useful facts in a fact table are numeric and additive. E.g. sales by month for the last year.

The best way to identify dimensions is to note down every time some one says the word 'by'. For example,
-Sales by manufacturing plant
-Deliveries by method of shipment


"Sales" will become a measure. Manufacturing plant becomes a dimension table.
"Deliveries" becomes a measure. Method of shipment becomes a dimension table.


All foreign keys in the fact table are not allowed NULL.
Measures columns are allowed NULL values.


References
  1. FIT5158 Monash University Lecture Notes, 2011
Glossary of terms
Dimensional analysis
Multidimensional conceptual view
Intuitive and high-performance retrieval of data.
High-performance access
Fact tables: contain the measurements associated with a specific business process.
Dimension tables
'Star-like' structure
Star join
Star schema
Dimensional model vs ER model
Multi-part primary key: is a primary key consisting of a set of different keys from dimension tables.
One-to-many relationship
Many-to-many relationship
Many-to-one relationship
Referential integrity
Foreign key (FK)
ETL: extract, transfer and load.
SCD: slowly changing dimensions
Dimensional modelling concepts
Conformed dimensions

No comments:

Post a Comment

Mounting USB drives in Windows Subsystem for Linux

Windows Subsystem for Linux can use (mount): SD card USB drives CD drives (CDFS) Network drives UNC paths Local storage / drives Drives form...