data mining

Knowledge discovery from data
1. data cleaning
  1. remove noise and inconsistencies
2. data integration
  1. combine data sources
3. data selection
  1. retrieve relevant data from the database
4. data transformation
  1. aggregation, feature extraction, etc.
5. data mining
  1. machine learning
6. pattern evaluation
  1. identify truly interesting patterns
7. knowledge representation
  1. visualise and transfer new knowledge
ARFF file format
1. attribute-relation file format
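
To make the format concrete, here is a minimal sketch that parses a tiny hand-written ARFF document with the standard library only. The relation, the attribute names, and the `parse_arff` helper are invented for illustration. The parser covers only a small subset of the format: `%` comments, `@relation`, `@attribute`, `@data`, and `?` for missing values. Quoted values, sparse rows, and date attributes are not handled.

```python
SAMPLE_ARFF = """\
% Hypothetical toy relation, invented for illustration
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
rainy,?,yes
"""

def parse_arff(text):
    """Parse a small ARFF subset: header declarations plus comma-separated rows."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):    # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":                # '?' marks a missing value
                    row[name] = None
                elif kind == "numeric":
                    row[name] = float(value)
                else:                           # nominal values stay as strings
                    row[name] = value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind.lower()))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

relation, attributes, rows = parse_arff(SAMPLE_ARFF)
```

Each row comes back as a dictionary keyed by attribute name, so `rows[2]["temperature"]` is `None` for the missing value in the last data line.
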
WEKA
1. One of the most popular toolboxes for data mining and machine learning
2. www.cs.waikato.ac.nz/~ml/weka
HDF5
Toy data repositories
1. uci repository
  1. http://archive.ics.uci.edu/ml/
  2. currently has 308 datasets
2. kaggle
  1. not a static repository of datasets, but a site that manages data mining competitions
  2. example of the modern concept of crowdsourcing
DM types of data
1. Relational database
2. spatial and spatio-temporal databases
3. text and multimedia databases
4. heterogeneous
5. data stream
DM functionalities
1. outlier analysis
2. ...
patterns of interest
objective interest
1. support of a rule
2. confidence of a rule
  1. Degree of certainty of a detected association
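
The two objective measures can be sketched for an association rule A => B as follows. Support is taken as the fraction of transactions containing both A and B, and confidence as support(A and B) / support(A). The shopping-basket transactions and the function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions.

    Assumes the antecedent occurs at least once, otherwise this divides by zero.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical toy transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})          # 2 of 4 transactions
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 bread transactions
```

For the rule bread => milk this gives a support of 0.5 and a confidence of 2/3.
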
subjective interest
1. subjective measures require a human with domain knowledge to provide them
Integration with DBS/Data Warehouses
1. Semi-tight coupling: uses sorting, indexing, aggregation, histogram analysis
2. tight coupling
Dirty Data
1. Real-world data is dirty
  1. incomplete
  2. noisy
    1. containing errors or outliers
  3. inconsistent
    1. containing discrepancies in codes or names
    2. e.g. Age=42, Birthday="22/06/1989"
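
A discrepancy like the Age/Birthday example can be detected mechanically. The sketch below assumes the birthday is written as DD/MM/YYYY and that a reference (collection) date is known. The reference date used here is an arbitrary assumption, not something given in these notes.

```python
from datetime import date

def age_on(birthday, today):
    """Age in whole years on `today`."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)

def is_consistent(record, today):
    """True when the stored age matches the age implied by the stored birthday."""
    day, month, year = (int(part) for part in record["Birthday"].split("/"))
    return record["Age"] == age_on(date(year, month, day), today)

record = {"Age": 42, "Birthday": "22/06/1989"}
reference = date(2020, 1, 1)   # assumed collection date, chosen only for illustration

consistent = is_consistent(record, reference)
```

With this reference date the birthday implies an age of 30, so the record is flagged as inconsistent.
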
2. why is data dirty
  1. incomplete data: data value not applicable when collected
  2. inconsistent data
  3. noisy data
    1. data collection instruments faulty
    2. errors in data transmission
  4. human, hardware, or software problems
3. Importance of cleaning
4. Data quality measures
  1. Accuracy, completeness
5. Major Prep tasks
  1. Data cleaning
  2. Data integration
  3. Data transformation
  4. Data reduction
  5. Data discretisation
6. Noisy Data
  1. binning
    1. Cancelling noise by binning
    2. sort data
    3. Create local groups of data
    4. replace original values by the bin mean
    5. alternatively, replace the original values by the closest bin boundary (min or max value)
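
The binning steps can be sketched as below, assuming equal-frequency bins (each bin holds the same number of sorted values). The helper names and the sample values are illustrative. Sending ties to the lower boundary is an arbitrary choice.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of `bin_size` items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value with the mean of its bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's min and max (ties go to the min)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]          # bins are sorted, so these are min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative values
values = [21, 4, 28, 15, 8, 34, 21, 25, 24]
bins = equal_frequency_bins(values, 3)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Here smoothing by means yields 9, 22, and 29 for the three bins, while smoothing by boundaries yields [4, 4, 15], [21, 21, 24], and [25, 25, 34].
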
  2. Regression
    1. Fit a parametric function to the data using minimisation of e.g. least squares error
    2. Replace original values by the parametric function value
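
As a minimal sketch of smoothing by regression, the code below fits a straight line y = slope * x + intercept by ordinary least squares and replaces each observed y with the fitted value. A straight line is only one possible parametric function. The data points are invented.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)   # must be non-zero: xs cannot all be equal
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def smooth_by_regression(xs, ys):
    """Replace each y with the value of the fitted line at the same x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements that roughly follow y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
smoothed = smooth_by_regression(xs, ys)
```

The smoothed values lie exactly on the fitted line, so the point-to-point noise in `ys` is removed at the cost of assuming a linear relationship.
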
  3. Clustering
    1. replace original values by means of clusters
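
A rough sketch of smoothing by clustering, using a very small one-dimensional k-means. These notes do not name a clustering algorithm, so k-means, the way the starting centres are chosen, and the fixed iteration count are all assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means; returns (cluster centres, cluster label per value)."""
    ordered = sorted(values)
    # Assumed initialisation: spread the starting centres across the sorted data.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assign every value to its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        # Move every centre to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:               # keep the old centre if a cluster is empty
                centres[c] = sum(members) / len(members)
    return centres, labels

def smooth_by_clusters(values, k):
    """Replace every value with the mean of the cluster it belongs to."""
    centres, labels = kmeans_1d(values, k)
    return [centres[label] for label in labels]

# Hypothetical readings forming two obvious groups
readings = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
smoothed_readings = smooth_by_clusters(readings, 2)
```

Values that fall far from every cluster centre are natural candidates for outlier analysis.
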
7. Data integration
  1. entity identification problem
  2. redundancy detection
    1. correlation analysis
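
For numeric attributes, one common form of correlation analysis is the Pearson correlation coefficient: a value near +1 or -1 suggests that one attribute is largely redundant given the other. The sketch below uses invented data in which the same height is stored in two units.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)   # undefined if either attribute is constant

# Hypothetical attributes from two sources: the same quantity in different units
height_cm = [150, 160, 170, 180]
height_in = [h / 2.54 for h in height_cm]

r = pearson(height_cm, height_in)      # very close to 1.0, so one attribute is redundant
```
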
  3. detection and resolution of data value conflicts
    1. e.g. weight units, inclusion/exclusion of taxes
8. data normalisation
  1. min-max normalisation
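
Min-max normalisation maps a value v from the observed range [min, max] linearly onto a target range [new_min, new_max]: v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal sketch follows. The handling of a constant attribute is an assumed convention.

```python
def min_max_normalise(values, new_min=0.0, new_max=1.0):
    """Linearly rescale `values` so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant attribute maps to new_min everywhere.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [10, 20, 30]                               # illustrative values
scaled = min_max_normalise(incomes)                  # [0.0, 0.5, 1.0]
scaled_wide = min_max_normalise(incomes, -1.0, 1.0)  # [-1.0, 0.0, 1.0]
```
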
9. data reduction
  1. Data cube aggregation
  2. Attribute subset selection (feature selection)
    1. Exact solution infeasible
    2. greedy forward selection
    3. backward elimination
    4. forward-backward
    5. decision tree induction
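
An exact search would have to score all 2^n attribute subsets, which is why greedy heuristics are used instead. The sketch below shows greedy forward selection with a caller-supplied `score` function. In practice the score might be cross-validated accuracy. The toy score and its weights here are purely hypothetical.

```python
def forward_selection(features, score, max_features=None):
    """Greedily add the feature that most improves score(subset); stop when nothing helps."""
    selected, best = [], score([])
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:      # no remaining feature improves the score
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected

# Hypothetical usefulness of each attribute, with a small cost per attribute kept
GAIN = {"age": 0.30, "income": 0.25, "postcode": 0.02, "id": 0.0}

def toy_score(subset):
    return sum(GAIN[f] for f in subset) - 0.05 * len(subset)

chosen = forward_selection(list(GAIN), toy_score)   # ['age', 'income']
```

Backward elimination is the mirror image: start from the full attribute set and greedily drop the attribute whose removal improves the score most.
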
  3. Dimensionality reduction (manifold projection)
  4. Numerosity reduction
    1. Reduces the number of instances rather than attributes
    2. Parametrisation
    3. Discretisation
    4. Sampling
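
Sampling is the simplest of these to sketch. The code below shows a simple random sample without replacement and a stratified sample that draws the same fraction from every group. The function names, the fixed seeds, and the "keep at least one row per stratum" rule are assumptions made for illustration.

```python
import random

def simple_random_sample(rows, n, seed=0):
    """Simple random sample of n rows without replacement."""
    return random.Random(seed).sample(rows, n)

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every stratum, keeping at least one row each."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for members in strata.values():
        n = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical instances: 80 of class "a" and 20 of class "b"
instances = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]

srs = simple_random_sample(instances, 10)
strat = stratified_sample(instances, key=lambda row: row[0], fraction=0.1)
```

The stratified sample keeps the 80/20 class proportions (8 and 2 instances), which a simple random sample of the same size does not guarantee.
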
  5. Discretisation
10. Instance reduction