Knowledge discovery from data
data cleaning
remove noise and inconsistencies
data integration
combine data sources
data selection
retrieve relevant data from the database
data transformation
aggregation, feature extraction, etc.
data mining
machine learning
pattern evaluation
identify truly interesting patterns
knowledge representation
visualise and transfer new knowledge
ARFF file format
attribute-relation file format
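As a rough illustration of the attribute-relation layout, the sketch below builds a tiny ARFF document as a string. The relation name `weather`, the attribute names, the toy rows, and the helper `to_arff` are all made up for this example; only the `@relation`, `@attribute`, and `@data` keywords come from the format itself.

```python
def to_arff(relation, attributes, rows):
    """Build a minimal ARFF document as a string.

    attributes: list of (name, kind) pairs, where kind is either an ARFF
    type keyword such as "numeric" or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            # Nominal attributes list their allowed values in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        # One comma-separated instance per line.
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines)


# Hypothetical toy data, not taken from any real dataset.
arff_text = to_arff(
    "weather",
    [
        ("temperature", "numeric"),
        ("outlook", ["sunny", "rainy"]),
        ("play", ["yes", "no"]),
    ],
    [(21.5, "sunny", "yes"), (12.0, "rainy", "no")],
)
print(arff_text)
```

The header declares the attributes once, and the data section then carries one instance per line in the same attribute order.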
WEKA
Most popular toolbox for data mining and machine learning
www.cs.waikato.ac.nz/~ml/weka
HDF5
Toy data repositories
uci repository
http://archive.ics.uci.edu/ml/
currently has 308 datasets
kaggle
not a static repository of datasets, but a site that manages data mining competitions
example of the modern concept of crowdsourcing
Subtopic 3
DM types of data
Relational database
spatial and spatio-temporal databases
text and multimedia databases
heterogeneous
data stream
DM functionalities
outlier analysis
...
patterns of interest
objective interest
support of a rule
confidence of a rule
Degree of certainty of a detected association
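The two objective measures can be sketched on a toy transaction list. This assumes the usual definitions: the support of a rule A => B is the fraction of transactions containing both A and B, and its confidence is that support divided by the support of A. The transactions and function names below are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule: bread => milk
rule_support = support(transactions, {"bread", "milk"})          # 2 of 4 transactions
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 bread transactions
print(rule_support, rule_confidence)
```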
subjective interest
subjective measures require a human with domain knowledge to provide them
Integration with DBS/Data Warehouses
Semi-tight coupling: uses sorting, indexing, aggregation, histogram analysis
tight coupling
Dirty Data
Real world Data is dirty
incomplete
noisy
containing errors or outliers
inconsistent
containing discrepancies in codes or names
e.g. Age=42, Birthday="22/06/1989"
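An inconsistency like the Age/Birthday example can be detected with a simple cross-field check. The sketch below is only illustrative: the reference date is an assumption (the notes do not say when the record was collected), and the record layout and function names are hypothetical.

```python
from datetime import date, datetime


def age_on(birthday, today):
    """Age in whole years on the given day."""
    years = today.year - birthday.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def is_consistent(record, today):
    """True if the stored Age matches the age derived from Birthday."""
    birthday = datetime.strptime(record["Birthday"], "%d/%m/%Y").date()
    return age_on(birthday, today) == record["Age"]


record = {"Age": 42, "Birthday": "22/06/1989"}
reference_day = date(2020, 1, 1)  # assumed collection date, not from the notes
print(is_consistent(record, reference_day))
```

With this assumed reference date the derived age is 30, so the record is flagged as inconsistent.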
why is data dirty
incomplete data: data value not applicable when collected
inconsistent data
noisy
data collection instruments faulty
errors in data transmission
human, hardware, or software problems
Importance of cleaning
Data quality measures
Accuracy, completeness
Major Prep tasks
Data cleaning
Data integration
Data transformation
Data reduction
Data discretisation
Noisy Data
binning
Cancelling noise by binning
sort data
Create local groups of data
replace original values by the bin mean
or replace the original values by the nearest bin boundary (min/max) value
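The binning steps above can be sketched as follows. This assumes equal-frequency bins of a fixed size (the notes do not say how the local groups are formed), and the values and function names are illustrative.

```python
def make_bins(values, bin_size):
    """Sort the data and cut it into consecutive bins of bin_size values."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
bins = make_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by means pulls every value to the centre of its bin, while smoothing by boundaries keeps the bin's extremes and snaps interior values to whichever is nearer.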
Regression
Fit a parametric function to the data using minimisation of e.g. least squares error
Replace original values by the parametric function value
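A minimal sketch of smoothing by regression, assuming the parametric function is a straight line fitted by ordinary least squares. The notes leave the function family open, so the linear choice, the sample data, and the function names are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y by the value of the fitted line at its x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements around a linear trend.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
smoothed = smooth_by_regression(xs, ys)
print(smoothed)
```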
Clustering
replace original values by means of clusters
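Smoothing by clustering might look like the sketch below. It assumes one-dimensional data and a plain k-means procedure with hand-picked starting centres; the notes do not name a clustering algorithm, so both are assumptions, as are the sample readings.

```python
def kmeans_1d(values, centres, iterations=10):
    """Very small 1-D k-means: assign to the nearest centre, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Keep the old centre if a cluster ends up empty.
        centres = [
            sum(c) / len(c) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
    return centres


def smooth_by_clusters(values, centres):
    """Replace each value by the mean of the cluster it belongs to."""
    return [min(centres, key=lambda c: abs(v - c)) for v in values]


readings = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]  # hypothetical sensor readings
centres = kmeans_1d(readings, centres=[0.0, 6.0])
smoothed = smooth_by_clusters(readings, centres)
print(centres, smoothed)
```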
Data integration
entity identification problem
redundancy detection
correlation analysis
detection and resolution of data value conflicts
e.g. weight units, inclusion/exclusion of taxes
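For the correlation analysis listed under redundancy detection, one common choice for numeric attributes is the Pearson correlation coefficient. The sketch below assumes that measure (the notes do not name one) and uses a hypothetical pair of weight columns recorded in different units, which also illustrates the unit conflict mentioned above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical attributes from two sources: the same weight in kg and in lb.
weight_kg = [50.0, 60.0, 70.0, 80.0]
weight_lb = [w * 2.20462 for w in weight_kg]

r = pearson(weight_kg, weight_lb)
print(r)  # close to 1, so one of the two attributes is redundant
```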
data normalisation
min-max normalisation
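Min-max normalisation linearly rescales an attribute from its observed range [min, max] to a target range, usually [0, 1]. The sketch below assumes that standard formulation; the sample values, the function name, and the handling of a constant attribute are illustrative choices.

```python
def min_max_normalise(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: mapping everything to new_min is an arbitrary choice.
        return [new_min for _ in values]
    return [
        (v - lo) / (hi - lo) * (new_max - new_min) + new_min
        for v in values
    ]


incomes = [12000, 54000, 73600, 98000]  # hypothetical attribute values
normalised = min_max_normalise(incomes)
print(normalised)
```

Because the mapping depends on the observed minimum and maximum, a single extreme outlier can compress all other values into a narrow part of the target range.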
data reduction
Data cube aggregation
Attribute subset selection (feature selection)
Exact solution infeasible
greedy forward selection
backward elimination
forward-backward
decision tree induction
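Since an exhaustive search would have to evaluate every one of the 2^n attribute subsets, the heuristics above add or remove one attribute at a time. A minimal sketch of greedy forward selection follows. The `score` callback stands in for whatever quality measure is used (for example, validation accuracy), and the toy scoring table is purely hypothetical.

```python
def greedy_forward_selection(features, score):
    """Start from the empty set and repeatedly add the single best feature.

    Stops as soon as no remaining feature improves the score.
    """
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no improvement, stop growing the subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


# Hypothetical usefulness of each attribute, with a small penalty per attribute.
USEFULNESS = {"age": 0.5, "income": 0.3, "zip": 0.0, "id": 0.0}


def toy_score(subset):
    return sum(USEFULNESS[f] for f in subset) - 0.05 * len(subset)


chosen = greedy_forward_selection(list(USEFULNESS), toy_score)
print(chosen)
```

Backward elimination works the same way in reverse: it starts from the full attribute set and repeatedly drops the attribute whose removal helps the score most.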
Dimensionality reduction (manifold projection)
Numerosity reduction
Reduces the number of instances rather than attributes
Parametrisation
Discretisation
Sampling
Discretisation
Instance reduction
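As one example of reducing the number of instances, sampling can be sketched as simple random sampling without replacement. This is only one of several sampling schemes; the fraction, the seed, and the function name are assumptions made for illustration.

```python
import random


def simple_random_sample(instances, fraction, seed=0):
    """Draw a random subset of the instances without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = min(len(instances), max(1, round(len(instances) * fraction)))
    return rng.sample(instances, k)


dataset = list(range(100))  # stand-in for 100 data instances
subset = simple_random_sample(dataset, 0.1)
print(len(subset))
```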