Thursday, March 15, 2012

CRM and data mining Day 05

Database normalization process.


An invoice design example.




The invoice is represented in the tabular form.


The invoice without multivalued attributes.


First normal form: eliminating repeating data. A relation is said to be in first normal form when it contains no multivalued attributes. That is, every intersection of a row and column in the relation must contain at most one data value (saying “at most” allows for missing or null values).

Sometimes, we will find a group of attributes that repeat together, as with the line items on the invoice.

To transform unnormalized relations into first normal form, we must move multivalued attributes and repeating groups to new relations.

Because a repeating group is a set of attributes that repeat together, all attributes in a repeating group should be moved to the same new relation.

However, a multivalued attribute (individual attributes that have multiple values) should be moved to its own new relation rather than combined with other multivalued attributes in the new relation.

Second normal form: eliminating partial dependencies i.e. separating data so that any editing of a piece of data only has to be done once

Third normal form: eliminating transitive dependencies i.e. separating data so that when records are deleted other data is not disturbed

Characteristics of high quality data
  1. Correct - accurate
  2. Verifiable - checkable through other existing information
  3. Complete -  data contains all relevant information
  4. Concise - data contains only relevant information
  5. Understandable - most users can readily comprehend what the data means
  6. Current - data is up to date and useful at the current time
  7. Accessible - data available when and where it is required
Problems of poor data quality
  1. Incomplete: not enough information to do a job
  2. out-of-date: does not indicate current state of affairs
  3. not understandable: cannot understand what the data is describing
  4. not where it is needed: cannot use it because it is too far away
  5. not when it is needed: cannot get it until after it is needed

Finish page 202.

References
  1. Andy Oppel (2011), Database Demystified, 2nd Ed, McGraw-Hill.
  2. Melbourne Institute of Technology (Semester 2, 2012). BN105 IT for Users in Organisations Lecture Notes.

No comments:

Post a Comment

Mounting USB drives in Windows Subsystem for Linux

Windows Subsystem for Linux can use (mount): SD card USB drives CD drives (CDFS) Network drives UNC paths Local storage / drives Drives form...