For each algorithm, discuss what it is, its (perceived) strengths and (perceived) weaknesses.
a) (unsupervised and supervised) Classification learning
Classification learning problems are problems in which class labels must be assigned to unknown instances, based on a model learned from a set of known (labelled) instances.
b) Decision Trees (and Decision Graphs)
A decision tree represents a classification model as a tree: internal nodes test attribute values, branches correspond to test outcomes, and leaves hold class labels. Each path from the root to a leaf corresponds to a classification rule.
c) Naive Bayes
Naive Bayes is a probabilistic approach to classification: it applies Bayes' theorem under the ("naive") assumption that attributes are conditionally independent given the class.
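As an illustration, a minimal categorical Naive Bayes classifier might look like the following sketch (the toy weather data and add-one smoothing scheme are invented for the example):

```python
from collections import Counter, defaultdict

# Toy training data (invented): (outlook, windy) -> play
data = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rain", "no"), "yes"),
    (("rain", "yes"), "no"),
    (("overcast", "no"), "yes"),
]

# Count class frequencies and per-feature conditional frequencies.
class_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)  # (feature_index, label) -> value counts
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

def predict(features):
    # Choose the class maximising P(class) * prod_i P(feature_i | class),
    # with add-one smoothing so unseen values do not zero out the product.
    best, best_score = None, -1.0
    total = sum(class_counts.values())
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(features):
            counts = feature_counts[(i, label)]
            score *= (counts[value] + 1) / (count + len(counts) + 1)
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(("sunny", "no")))  # classify an unseen instance
```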
d) Clustering
Clustering involves segmenting a data set so that all members of a segment have similar characteristics, while different segments are dissimilar from one another.
e) Segmentation
Segmentation is a method used to group individual instances/records so that the records in each group have similar attributes.
f) k-means algorithm
The k-means algorithm divides a data set into k subsets (clusters), each represented by a centroid: the mean of the cluster's members. k, the number of clusters, is chosen in advance.
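A minimal sketch of the algorithm in Python, assuming 2-D points and an invented toy data set:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, clusters

# Two well-separated toy groups; k-means should recover them.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
```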
g) Self organising maps
A self-organising map (SOM) is a neural-network-based algorithm, loosely inspired by biological neural processes, that maps records onto a low-dimensional grid so that similar records end up close together.
h) Memory based reasoning (and k nearest neighbour algorithm)
Memory-based reasoning (MBR) and collaborative filtering (CF) are nearest neighbour approaches.
Nearest neighbour techniques are based on the concept of similarity.
Memory-based reasoning bases its results on similar situations in the past.
Strengths
- Ability to adapt
- Good results without lengthy training
- Ability to use data "as is": the format of the records is not a problem, because MBR needs only a distance function and a combination function between records to determine how similar they are.
Weaknesses
- Many samples are required.
- Choosing a 'good' distance metric is difficult.
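The nearest-neighbour idea can be sketched as a simple k-nearest-neighbour classifier (the toy training points are invented for the example):

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """train: list of ((x, y), label). Classify query by majority vote
    of the k training points closest in Euclidean distance."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_classify(train, (1, 1)))  # the nearest neighbours are all "a"
```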
i) Market basket analysis
Market basket analysis examines which items customers purchase together in the same transaction, typically to discover associations between products (e.g. for cross-selling or shelf placement).
j) Association rules
Association rules are rules of the form L → R stating that transactions containing the itemset L tend also to contain the itemset R; their quality is assessed with measures such as support, confidence, lift and leverage.
k) Neural networks
Neural networks are models built from layers of interconnected simple processing units (neurons), loosely inspired by the brain, whose connection weights are adjusted during training to learn classification or prediction functions.
l) Recommender systems
Systems that recommend items (such as products or movies) to users.
m) Collaborative filtering systems
Recommender systems that predict what a user will like from the ratings of other people who have rated items similarly to that user.
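A minimal sketch of user-based collaborative filtering, using cosine similarity over an invented ratings matrix:

```python
import math

# Toy user-item ratings (invented for the example).
ratings = {
    "alice": {"item1": 5, "item2": 4, "item3": 1},
    "bob":   {"item1": 5, "item2": 5, "item3": 2, "item4": 5},
    "carol": {"item1": 1, "item2": 2, "item3": 5, "item4": 1},
}

def cosine(u, v):
    # Cosine similarity over the items both users have rated.
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(user):
    # Find the most similar other user and suggest items they rated
    # highly that `user` has not rated yet.
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    return [i for i, r in ratings[nearest].items()
            if i not in ratings[user] and r >= 4]

print(recommend("alice"))
```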
n) Content-based filtering
Recommender systems based on properties of the items and of the user.
o) Supervised learning
Supervised learning is a machine learning approach in which the program is presented with a dataset containing class labels; it then tries to derive a rule/pattern for each class label.
p) Unsupervised learning
Unsupervised learning is a machine learning approach in which the program is presented with a dataset without class labels and tries to segment it into distinct groups, to which labels can then be assigned.
q) Lift
A rule interestingness measure
The value of confidence(L → R) measures the support for R if we only examine the transactions that match L, and lift is defined as lift(L → R) = confidence(L → R) / support(R). For example, with confidence(L → R) = 0.864 and support(R) = 0.125, purchasing the items in L makes it 0.864/0.125 = 6.91 times more likely that the items in R are purchased.
Lift values greater than 1 are ‘interesting’. They indicate that transactions containing L tend to contain R more often than transactions that do not contain L.
Although lift is a useful measure of interestingness it is not always the best one to use. In some cases a rule with higher support and lower lift can be more interesting than one with lower support and higher lift because it applies to more cases.
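For illustration, support, confidence and lift can be computed directly from a toy transaction set (the transactions below are invented):

```python
# Invented toy transaction set.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
    {"butter"},
    {"bread", "milk"},
    {"milk", "butter"},
]

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

L, R = {"bread"}, {"milk"}
confidence = support(L | R) / support(L)
lift = confidence / support(R)  # lift(L -> R) = confidence(L -> R) / support(R)
print(round(lift, 3))
```

Here lift is only slightly above 1, so bread and milk co-occur barely more often than independence would predict.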
r) Leverage
A rule interestingness measure
Another measure of interestingness that is sometimes used is leverage.
It measures the difference between the support for L ∪ R (i.e. the items in L and R occurring together in the database) and the support that would be expected if L and R were independent.
The former is just support(L ∪ R). The frequencies (i.e. supports) of L and R are support(L) and support(R), respectively.
If L and R were independent the expected frequency of both occurring in the same transaction would be the product of support(L) and support(R).
This gives a formula for leverage:
leverage(L → R) = support(L ∪ R) − support(L) × support(R).
The value of the leverage of a rule is clearly always less than its support.
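A small numeric sketch of the formula, using assumed (illustrative) support values:

```python
# Illustrative support values, invented for the example.
support_LR = 0.108  # support(L ∪ R): L and R occur together
support_L = 0.180   # support(L)
support_R = 0.500   # support(R)

# leverage(L -> R) = support(L ∪ R) - support(L) * support(R)
leverage = support_LR - support_L * support_R
print(round(leverage, 3))  # 0.108 - 0.090 = 0.018
```

A positive leverage means L and R occur together more often than they would if they were independent.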
s) EM clustering algorithm
The Expectation Maximisation (EM) clustering algorithm fits a probabilistic model (commonly a mixture of Gaussians) to the data by alternating two steps: an Expectation (E) step, which computes the probability that each record belongs to each cluster under the current parameters, and a Maximisation (M) step, which re-estimates the parameters from those probabilities. Unlike k-means, it produces "soft" (probabilistic) cluster memberships.
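A minimal sketch of EM for a two-component 1-D Gaussian mixture (the data and the crude initialisation are illustrative):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em(data, iterations=50):
    # Crude initialisation: means at the data extremes, equal weights.
    mu = [min(data), max(data)]
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances from responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = math.sqrt(max(var, 1e-6))
    return mu, sigma, weight

# Two invented, well-separated groups; EM should recover their means.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mu, sigma, weight = em(data)
```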
t) What are the two main definitions of a Data Warehouse according to W. H. Inmon and Ralph Kimball?
According to W.H. Inmon "Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process"
According to Ralph Kimball "Data Warehouse is a copy of transaction data, specifically structured for query and analysis."
u) Describe the following concepts related to data warehouse design:
(a) granularity: the level of detail at which data is stored in the warehouse (e.g. individual transactions versus daily summaries); finer granularity supports more detailed analysis but requires more storage.
(b) data partitioning: splitting the data into smaller physical units (e.g. by date or by region) so that it is easier to load, index, manage and query.
(c) subject orientation: organising data around the major subjects of the enterprise (e.g. customer, product, sales) rather than around individual operational applications.
v) Define Dimension tables and Fact tables in a dimensional data warehouse model.
Dimension tables in a dimensional data warehouse model contain the descriptive attributes (e.g. customer, product, time) used to filter, group and label the facts.
Fact tables in a dimensional data warehouse model consist of the numeric measures of a business process (e.g. sales amount, quantity) together with foreign keys referencing the dimension tables.
w) How is a dimensional model different from the Enterprise data warehouse model suggested by Inmon?
Inmon's Enterprise data warehouse is designed top-down as a single, integrated, normalised repository for the whole organisation, from which departmental data marts are derived. The dimensional model (Kimball) is built bottom-up from star schemas (fact and dimension tables) organised around individual business processes, which are denormalised for ease of querying.
References
- FIT5158 Monash University Lecture Notes, 2012