Sunday, May 27, 2012

Multimedia Day 06


Ethics
Copyright
  1. Jury deals blow to Oracle v Google case: no infringement http://bit.ly/JYCzBz
Fair use
Infringement
Privacy
Digital rights management
Censorship
Intellectual property
Public domain means either that the work was never copyrighted in the first place or its copyright protection has expired over time and not been renewed; you can use public domain material without a license.

Saturday, May 19, 2012

Neural networks for time-series forecasting Day 2

Neural networks can be classified into dynamic and static categories.

Static (feedforward) networks have no feedback elements and contain no delays; the output is calculated directly from the input through feedforward connections.

In dynamic networks, the output depends not only on the current input to the network, but also on the current or previous inputs, outputs, or states of the network.

Dynamic networks can also be divided into two categories: those that have only feedforward connections, and those that have feedback, or recurrent, connections.

References
  1. Grace Rumantir, Monash University FIT5167 Lecture Notes, 2012.
  2. MATLAB 2011b Help Documentation

Thursday, May 17, 2012

CRM and data mining Day 10

For each algorithm, discuss what it is, its (perceived) strengths and (perceived) weaknesses.

a) (unsupervised and supervised) Classification learning
Classification learning is the task of assigning unknown instances to classes, given a set of known (labelled) instances.

b) Decision Trees (and Decision Graphs)
Decision trees represent class labels in terms of a tree; each branch from the root to a leaf corresponds to a classification rule.

c) Naive Bayes
Naive Bayes takes a probabilistic approach to classification, applying Bayes' theorem under the simplifying assumption that attributes are conditionally independent given the class.
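A minimal sketch of the idea on a made-up two-attribute "weather" dataset (the records, labels, and the `train_nb` helper are all invented for illustration; smoothing is omitted for brevity):

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Estimate P(class) and P(attribute value | class) from counts,
    then classify by picking the class with the highest product."""
    class_counts = Counter(labels)
    cond = defaultdict(Counter)  # (attribute index, class) -> value counts
    for rec, y in zip(records, labels):
        for i, v in enumerate(rec):
            cond[(i, y)][v] += 1

    def predict(rec):
        best, best_p = None, -1.0
        for y, cy in class_counts.items():
            p = cy / len(labels)                    # prior P(class)
            for i, v in enumerate(rec):
                p *= cond[(i, y)][v] / cy           # P(value | class)
            if p > best_p:
                best, best_p = y, p
        return best
    return predict

# Toy data: (outlook, windy) -> play?
records = [("sunny", "no"), ("sunny", "yes"), ("rain", "yes"),
           ("rain", "no"), ("overcast", "no")]
labels = ["yes", "yes", "no", "no", "yes"]
predict = train_nb(records, labels)
```

With this toy data, `predict(("sunny", "no"))` picks "yes" because both the prior and the conditional counts favour it.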

d) Clustering
Clustering involves segmenting a data set in such a way that all members in a segment will have similar characteristics, and one segment will be different from another in terms of characteristics.

e) Segmentation
Segmentation is a method that is used to group individual instances/records in such a way that they have similar attributes.

f) k-means algorithm
The k-means algorithm divides a data set into k subsets (clusters), each represented by a centroid; k is the number of clusters to produce.
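The alternation between assigning points to the nearest centroid and recomputing centroids can be sketched as follows (the two-blob data and the `kmeans` helper are invented for illustration; real implementations use random restarts):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means: alternate assignment and centroid-update steps."""
    # First k points as initial centroids (random init is preferable in practice)
    centroids = X[:k].copy()
    for _ in range(iters):
        # Assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its members
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other.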

g) Self organising maps
Self-organising maps (SOMs) group similar records by mapping them onto a low-dimensional grid of neurons; the learning rule is inspired by the organisation of biological neural systems.

h) Memory based reasoning (and k nearest neighbour algorithm)
Memory-based reasoning (MBR) and collaborative filtering (CF) are nearest-neighbour approaches.
Nearest-neighbour techniques are based on the concept of similarity.
Memory-based reasoning bases its results on similar situations in the past.

Strengths
  1. Ability to adapt
  2. Good results without lengthy training
  3. Ability to use data "as is": the format of records does not create a problem, because a distance function and a combination function between records determine how similar they are.
Weaknesses
  1. Many samples are required.
  2. Deciding a 'good' distance metric
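A minimal k-nearest-neighbour classifier in the MBR spirit, assuming Euclidean distance as the distance function and majority vote as the combination function (the toy data and the `knn_predict` name are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest neighbours,
    using Euclidean distance to measure similarity to past records."""
    d = np.linalg.norm(X_train - x, axis=1)       # distance function
    nearest = np.argsort(d)[:k]
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[counts.argmax()]                # combination function

# Toy "past situations": two classes of records
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
```

Note how both weaknesses above show up here: the training set must cover the space well, and the choice of distance metric directly drives the result.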
i) Market basket analysis

j) Association rules

k) Neural networks

l) Recommender systems
Systems that make recommendations

m) Collaborative filtering systems
Recommender systems that base recommendations on the ratings of people who have rated items similarly to the user.

n) Content-based filtering
Recommender systems based on properties of the items and of the user.

o) Supervised learning
Supervised learning is a machine learning approach in which the program is presented with a dataset that includes class labels; it then tries to derive a rule or pattern for each class label.

p) Unsupervised learning
Unsupervised learning is a machine learning approach in which the program is presented with a dataset without class labels and tries to segment it into meaningful groups.

q) Lift
A rule interestingness measure
The value of confidence(L → R) measures the support for R when we examine only the transactions that match L. With confidence(L → R) = 0.864 and support(R) = 0.125, purchasing the items in L makes it 0.864/0.125 = 6.91 times more likely that the items in R are purchased.

Lift values greater than 1 are ‘interesting’. They indicate that transactions containing L tend to contain R more often than transactions that do not contain L.

Although lift is a useful measure of interestingness it is not always the best one to use. In some cases a rule with higher support and lower lift can be more interesting than one with lower support and higher lift because it applies to more cases.

r) Leverage
A rule interestingness measure

Another measure of interestingness that is sometimes used is leverage. It measures the difference between the support for L ∪ R (i.e. the items in L and R occurring together in the database) and the support that would be expected if L and R were independent.

The former is just support(L ∪R). The frequencies (i.e. supports) of L and R are support(L) and support(R), respectively. If L and R were independent the expected frequency of both occurring in the same transaction would be the product of support(L) and support(R).

This gives a formula for leverage:

leverage(L → R) = support(L ∪ R) − support(L) × support(R).

The value of the leverage of a rule can clearly never exceed its support.
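Both measures follow directly from the formulas above; the 0.864/0.125 lift example from the earlier section is reused as a check, while the leverage inputs below are invented numbers:

```python
def lift(confidence_LR, support_R):
    """lift(L -> R) = confidence(L -> R) / support(R)."""
    return confidence_LR / support_R

def leverage(support_LR, support_L, support_R):
    """leverage(L -> R) = support(L u R) - support(L) * support(R)."""
    return support_LR - support_L * support_R

# Example from the lift section: confidence 0.864, support(R) 0.125
lift_example = lift(0.864, 0.125)        # ~6.91: L makes R ~6.91x more likely

# Invented supports, just to exercise the leverage formula
lev_example = leverage(0.3, 0.5, 0.4)    # 0.3 - 0.5*0.4 = 0.1
```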

s) EM clustering algorithm
The Expectation Maximisation (EM) clustering algorithm employs an iterative two-step procedure: an Expectation (E) step that computes the probability of each record belonging to each cluster, and a Maximisation (M) step that re-estimates the cluster parameters from those probabilities.
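As a hedged illustration of the E/M alternation, here is a minimal 1-D two-component Gaussian-mixture version (the initialisation scheme and the toy data are arbitrary choices, not from the notes):

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture.
    E-step: soft responsibilities of each component for each point.
    M-step: re-estimate weights, means and variances from them."""
    mu = np.array([x.min(), x.max()])          # crude initialisation
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = (w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
                / np.sqrt(2 * np.pi * var))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update parameters from the responsibilities
        n = r.sum(axis=0)
        w = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-9
    return w, mu, var

# Two well-separated groups of points around 0 and 8
x = np.concatenate([np.linspace(-0.5, 0.5, 50), np.linspace(7.5, 8.5, 50)])
w, mu, var = em_gmm_1d(x)
```

Unlike k-means, every point gets a soft (probabilistic) membership in each cluster rather than a hard assignment.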

t) What are the two main definitions of a Data Warehouse according to W. H. Inmon and Ralph Kimball?

According to W.H. Inmon "Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process"

According to Ralph Kimball "Data Warehouse is a copy of transaction data, specifically structured for query and analysis."

u) Describe the following concepts related to data warehouse design:

(a) granularity
(b) data partitioning
(c) subject orientation

v) Define Dimension tables and Fact tables in a dimensional data warehouse model.
Dimension tables in a dimensional data warehouse model contain the descriptive attributes (e.g. customer, product, time) used to filter and group the facts.

Fact tables consist of the numeric measures of the business process, together with foreign keys to the dimension tables.

w) How is a dimensional model different from the Enterprise data warehouse model suggested by Inmon?

References
  1. FIT5158 Monash University Lecture Notes, 2012

Tuesday, May 15, 2012

CRM and data mining Day 09

For cluster analysis, the question is how to evaluate the "goodness" of the resulting clusters.

But “clusters are in the eye of the beholder”!

Then why do we want to evaluate them?
  • To compare clustering algorithms
  • To avoid finding patterns in noise
  • To compare two sets of clusters
  • To compare two clusters
Different Aspects of Cluster Validation
  1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
  2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
  3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
    1. Use only the data
  4. Comparing the results of two different sets of cluster analyses to determine which is better.
  5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

Internal Measures: Cohesion and Separation
  1. Cluster Cohesion: measures how closely related the objects in a cluster are.
    • Example: SSE (within-cluster sum of squared errors)
  2. Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
    1. Separation is measured by the between-cluster sum of squares
  3. Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.

For an individual point, i

Calculate a = average distance of i to the points in its cluster

Calculate b = min (average distance of i to points in another cluster)

The silhouette coefficient for a point is then given by 
s = 1 – a/b   if a < b,   
s = b/a - 1    if a > b, not the usual case 

The silhouette coefficient lies between −1 and 1, and is typically between 0 and 1. The closer to 1 the better.


Can calculate the Average Silhouette width for a cluster or a clustering.
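The per-point computation above can be sketched directly; the two-case a/b formula is equivalent to s = (b − a)/max(a, b), which is what this illustrative helper uses (the toy data is invented):

```python
import numpy as np

def silhouette_point(i, X, labels):
    """Silhouette coefficient of point i: a = mean distance to its own
    cluster, b = smallest mean distance to any other cluster."""
    own = labels[i]
    same = (labels == own) & (np.arange(len(X)) != i)
    a = np.linalg.norm(X[same] - X[i], axis=1).mean()
    b = min(np.linalg.norm(X[labels == c] - X[i], axis=1).mean()
            for c in set(labels.tolist()) if c != own)
    # Equals 1 - a/b when a < b, and b/a - 1 when a > b
    return (b - a) / max(a, b)

# Two tight, well-separated clusters -> scores near 1
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
scores = [silhouette_point(i, X, labels) for i in range(len(X))]
```

Averaging `scores` over a cluster (or the whole clustering) gives the average silhouette width mentioned below.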

Thursday, May 10, 2012

Operating Systems Day 06

Problems arise when many processes compete for relatively few resources and the system is unable to service all of them. In particular, two extreme conditions can result from a lack of process synchronization: deadlock and starvation.

Problems may arise if the system is unable to service all of the processes in a synchronous manner.

Deadlock is more serious than starvation because it affects more than one job. There are seven cases of deadlock.
  1. Case 1: Deadlocks on File Requests. Deadlock from multiple process file requests occurs because of jobs being allowed to hold files for the duration of their executions.
  2. Case 2: Deadlocks in Databases. Deadlock may occur if two processes each access and lock records in a database. Locking is the mechanism whereby one user locks out all other users while working with the database.
  3. Case 3: Deadlocks in Dedicated Device Allocation. Deadlock may occur if there are a limited number of dedicated devices and many more processes requesting them.
  4. Case 4: Deadlocks in Multiple Device Allocation. Deadlock may occur with multiple device allocations utilizing different device types when several processes request and hold dedicated devices while other processes act in a similar manner.
  5. Case 5: Deadlocks in Spooling. Deadlock may occur if a sharable device, such as a spooling printer that requires all job output before initiating printing, encounters a full spool disk area with only partially completed output residing on it.
  6. Case 6: Deadlocks in a Network. Deadlock in a network resulting from the absence of network protocols to control message flow.
  7. Case 7: Deadlocks in Disk Sharing. 
Deadlock occurs due to over-commitment of resources. Four conditions must be present simultaneously for deadlock to occur: mutual exclusion, resource holding, no preemption, and circular wait. Each of these four conditions is necessary for the operating system to work smoothly, and none of them can be removed easily without harming the system's overall functioning. Therefore, the system needs to recognize when the combination of conditions is about to occur and threaten to lock up the system.

Starvation occurs when the allocation of resources is restricted: a job never completes because a resource it requests is never allocated to it.

Causes of Deadlock
  1. Deadlock occurs due to over-commitment of resources
There are four methods for dealing with deadlocks: prevention, avoidance, detection, recovery.

Directed graphs are used to model the four simultaneous conditions necessary for deadlock.

In directed graphs, a solid arrow from a resource to a process indicates that the process is holding the resource.

Avoidance: the dynamic strategy of deadlock avoidance that attempts to ensure that resources are never allocated in such a way as to place a system in an unsafe state.

Circular wait: one of four conditions for deadlock through which each process involved is waiting for a resource being held by another; each process is blocked and can’t continue, resulting in deadlock.

Mutual exclusion: one of four conditions for deadlock in which only one process is allowed to have access to a resource.

Process synchronization: (1) the need for algorithms to resolve conflicts between processors in a multiprocessing environment; or (2) the need to ensure that events occur in the proper order even if they are carried out by several processes.

Race: a synchronization problem between two processes vying for the same resource.

Starvation: the result of conservative allocation of resources in which a single job is prevented from execution because it’s kept waiting for resources that never become available.

The following figure shows a deadlock event using a directed graph.


Use Directed graphs
  • Circles represent processes
  • Squares represent resources
  • Solid arrow from resource to process
    • Process holding resource
  • Solid arrow from a process to resource
    • Process waiting for resource
  • Arrow direction indicates flow
  • Dashed arrow indicates Process requests Resource
Cycle in graph indicates deadlock condition
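Detecting the cycle that signals deadlock amounts to a depth-first search for a back edge in the directed graph; a minimal sketch (the edge lists and process/resource names are invented):

```python
def has_deadlock(edges):
    """Detect a cycle in a resource-allocation graph via DFS.
    Edge resource -> process means the process holds the resource;
    edge process -> resource means the process waits for it."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    WHITE, GREY, BLACK = 0, 1, 2
    state = {}

    def dfs(u):
        state[u] = GREY
        for v in graph.get(u, []):
            if state.get(v, WHITE) == GREY:
                return True          # back edge -> cycle -> deadlock
            if state.get(v, WHITE) == WHITE and dfs(v):
                return True
        state[u] = BLACK
        return False

    return any(state.get(u, WHITE) == WHITE and dfs(u) for u in graph)

# P1 holds R1 and waits for R2; P2 holds R2 and waits for R1 -> cycle
deadlocked = has_deadlock([("R1", "P1"), ("P1", "R2"),
                           ("R2", "P2"), ("P2", "R1")])
# Same graph without P2's request for R1 -> no cycle
ok = has_deadlock([("R1", "P1"), ("P1", "R2"), ("R2", "P2")])
```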

Note that deadlock results from the liberal allocation of resources, while starvation results from the conservative allocation of resources.

Strategies for handling deadlocks


  1. Prevention: prevent one of the four necessary deadlock conditions (mutual exclusion, resource holding, no preemption, circular wait) from occurring
  2. Avoidance
  3. Detection

References 

  1. McHoes, A., & Flynn, I.M. (2008). Understanding operating systems (5th ed.). CENGAGE Learning.  
  2. Melbourne Institute of Technology (Semester 1, 2012). BN104 Operating Systems Lecture Notes.

Multimedia Day 05

Animation is possible because of persistence of vision.
Keyframes show the start and end of an action.
The process of filling in the frames between keyframes is called tweening.
Morphing is the process of transitioning from one image to another.
Kinematics is the study of movement of objects that have joints, such as people.

Monday, May 7, 2012

Neural networks for time-series forecasting Day 1

Time series data can be decomposed into four possible components:
  1. the Trend component reflecting the long term progression of the series.
  2. the Cyclical component that describes repeated but non-periodic fluctuations, possibly caused by the economic cycle.
  3. the Seasonal component reflecting seasonality (Seasonal variation)
  4. the Irregular component (or "noise") that describes random, irregular influences. Compared to the other components it represents the residuals of the time series.
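A small illustration of the decomposition idea on a synthetic monthly series: a centred 12-month moving average cancels the seasonal component and recovers the trend (all numbers below are invented):

```python
import numpy as np

# Synthetic monthly series: linear trend + 12-month seasonality + noise
rng = np.random.default_rng(0)
t = np.arange(120)
trend = 0.5 * t
seasonal = 10 * np.sin(2 * np.pi * t / 12)
noise = rng.normal(0, 1, len(t))
y = trend + seasonal + noise

# A 12-month moving average cancels the seasonal component
# (it averages over exactly one full cycle), leaving trend + small noise
kernel = np.ones(12) / 12
trend_hat = np.convolve(y, kernel, mode="valid")
```

Each entry of `trend_hat` is the mean of a 12-month window, so it approximates the trend evaluated at the window centre.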

Time series forecasting is the use of a model to forecast future events based on known past events; that is, it is used to predict data points before they are measured.

Autocorrelation is the correlation of the value of a data item at a particular time with the values of previous data items.

The number of time steps back into the past whose data items have high autocorrelation with the future data item is called the number of time lags.
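A quick sketch of measuring autocorrelation at different lags; for a series with a 12-step period, the lag-12 autocorrelation is high (the `autocorr` helper and the toy series are illustrative):

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of a series with itself 'lag' steps back."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

# A series with a 12-step period: high autocorrelation at lag 12,
# strongly negative at lag 6 (half a period out of phase)
t = np.arange(240)
y = np.sin(2 * np.pi * t / 12)
```

Scanning `autocorr(y, lag)` over candidate lags is a simple way to pick which lagged inputs to feed the network.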

Given the water demand data of the previous 12 months as input, a multi-layer perceptron model can be built to forecast the value of the random component of the water demand in the following month.

Spatio-temporal model not only incorporates time lags of the variable whose future values are of interest, but also other influential variables, and possibly their own lags. For example, for the water demand time series data, rainfall is seen as a lag indicator of water demand.

The output of the MLP is added to the extrapolations of the trend and seasonal components to get the forecast of water demand for the following month.
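As a simplified stand-in for the MLP, here is a linear auto-regressive model fitted by least squares on 12 lagged inputs (a toy periodic series replaces the water-demand data; `make_lagged` is an invented helper):

```python
import numpy as np

def make_lagged(y, n_lags):
    """Build (X, target): each row of X holds the previous n_lags
    values of the series; target is the value one step ahead."""
    X = np.array([y[i:i + n_lags] for i in range(len(y) - n_lags)])
    return X, y[n_lags:]

# Toy stand-in for the demand series: a 12-month cycle
t = np.arange(240)
y = np.sin(2 * np.pi * t / 12)

# Fit a linear model on 12 lagged inputs by least squares
# (a single linear neuron; an MLP would add nonlinear hidden layers)
X, target = make_lagged(y, 12)
X1 = np.hstack([X, np.ones((len(X), 1))])   # bias column
w, *_ = np.linalg.lstsq(X1, target, rcond=None)
pred = X1 @ w
```

Since the toy series is perfectly periodic, the fitted lag model predicts it almost exactly; real demand data would leave a residual for the trend/seasonal extrapolations to absorb.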

Perceptron for forecasting of linear time-series
  1. Linear Auto-Regressive with  Exogenous (external)  input variables networks (ARx)
Neural networks for non-linear time-series forecasting
  1. Time-lagged feedforward networks (sometimes also called focused time-lagged networks)
  2. Dynamically-driven recurrent (feedback) networks
A recurrent network feeds back the predicted output through feedback connection into the input layer with one unit time delay.

A Non-linear Auto-Regressive with Exogenous input variables (NARx) network model is a recurrent network that, on top of the variable being predicted, uses other external/exogenous variables as inputs.

MLP for multi-step-ahead forecasting
  1. Non-linear Auto-Regressive with Exogenous (external) input variables networks (NARx)
The nonlinear autoregressive network with exogenous inputs (NARX) is a recurrent dynamic network, with feedback connections enclosing several layers of the network.

Single step vs. Multi-step-ahead forecasting

Time-lagged and recurrent networks

Performance of ARx and NARx networks

In Spatio-Temporal Time-Lagged networks, the response of the time-series is affected by one or more time-series.

Dynamic networks can be divided into two categories: those that have only feedforward connections, and those that have feedback, or recurrent connections.

Recurrent-dynamic networks typically have a longer response than feedforward-dynamic networks.

For linear networks, feedforward-dynamic networks are called finite impulse response (FIR), because the response to an impulse input will become zero after a finite amount of time.

Linear recurrent-dynamic networks are called infinite impulse response (IIR), because the response to an impulse can decay to zero (for a stable network), but it will never become exactly equal to zero.

An impulse response for a nonlinear network cannot be defined, but the ideas of finite and infinite responses do carry over.
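The FIR/IIR distinction can be seen numerically: a feedforward tapped-delay filter's impulse response becomes exactly zero after its last tap, while a recurrent filter's response decays geometrically but never reaches zero (the coefficients below are arbitrary):

```python
import numpy as np

# Impulse input
impulse = np.zeros(20)
impulse[0] = 1.0

# FIR (feedforward with tapped delays): output is a weighted sum of the
# last 3 inputs, so the impulse response dies out after 3 steps
b = [0.5, 0.3, 0.2]
fir = np.convolve(impulse, b)[:20]

# IIR (recurrent): y[t] = x[t] + 0.5 * y[t-1]; the impulse response
# keeps halving but is never exactly zero
iir = np.zeros(20)
for t in range(20):
    iir[t] = impulse[t] + (0.5 * iir[t - 1] if t > 0 else 0.0)
```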

Dynamic networks are generally more powerful than static networks (although somewhat more difficult to train). Because dynamic networks have memory, they can be trained to learn sequential or time-varying patterns.

References
  1. Grace Rumantir (2012). FIT5167 Lecture Notes. Monash University
  2. MATLAB 2011b Help Documentation
