**Phases in a typical Data Mining effort:**

**1. Discovery**

Frame business problem

Identify analytics component

Formulate initial hypotheses

**2. Data Preparation**

Obtain datasets from internal and external sources

Check data consistency: definitions of fields, units of measurement, time periods, etc.

Sample

**3. Data Exploration and Conditioning**

Missing data handling, range reasonability checks, and outlier detection

Graphical or Visual Analysis

Transformation, Creation of new variables, and Normalization

Partitioning into training, validation, and test datasets
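The partitioning step can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `partition`, the 60/20/20 split, and the fixed seed are assumptions made for the example, not values prescribed by these notes.

```python
import random

def partition(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # random order, reproducible via seed
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]      # whatever remains goes to test
    return train, valid, test

# Hypothetical dataset of 100 row identifiers.
train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Every row lands in exactly one partition, so the test data stays unseen while models are built and selected.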

**4. Model Planning**

- Determine data mining task such as prediction, classification etc.

- Select appropriate data mining methods and techniques such as regression, neural networks, clustering etc.

**5. Model Building**

Build different candidate models on the training data using the selected techniques and their variants

Refine and select the final model using validation data

Evaluate the final model on test data
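The three steps above (fit candidates on training data, select on validation data, evaluate once on test data) can be sketched like this. The synthetic data, the two candidate models (`fit_mean`, `fit_line`), and the use of mean squared error are all assumptions chosen to keep the example small.

```python
import random

def fit_mean(xs, ys):
    """Candidate 1: ignore x and always predict the training mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Candidate 2: simple least-squares line y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (assumed for illustration): y = 3 + 2x + noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]
train, valid, test = slice(0, 180), slice(180, 240), slice(240, 300)

# 1. Build candidate models on the training data.
candidates = {"mean": fit_mean, "line": fit_line}
fitted = {name: fit(xs[train], ys[train]) for name, fit in candidates.items()}

# 2. Select the final model using the validation data.
valid_error = {name: mse(m, xs[valid], ys[valid]) for name, m in fitted.items()}
best = min(valid_error, key=valid_error.get)

# 3. Evaluate the final model once on the test data.
test_error = mse(fitted[best], xs[test], ys[test])
print(best, round(test_error, 3))
```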

**6. Results Interpretation**

Model evaluation using key performance metrics
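For a classification task, a few common performance metrics can be computed from the counts of correct and incorrect predictions. The sketch below is illustrative only: the labels are made up, and accuracy, precision, and recall are just three of many metrics one might report.

```python
def confusion_counts(actual, predicted, positive="success"):
    """Return (true pos, false pos, false neg, true neg) for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

# Hypothetical test-set labels and model predictions.
actual    = ["success", "failure", "failure", "success", "failure", "failure"]
predicted = ["success", "failure", "success", "failure", "failure", "failure"]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)   # share of all cases classified correctly
precision = tp / (tp + fp)           # share of predicted successes that are real
recall = tp / (tp + fn)              # share of real successes that were found
print(tp, fp, fn, tn, accuracy, precision, recall)
```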

**7. Model Deployment**

Pilot project to integrate and run the model on operational systems

Similar data mining methodologies associated with SAS and IBM SPSS Modeler (formerly SPSS Clementine) are called SEMMA and CRISP-DM, respectively

Data mining techniques can be divided into Supervised Learning Methods and Unsupervised Learning Methods

**Supervised Learning**

- In supervised learning, algorithms are used to learn the function 'f' that can map input variables (X) into output variables (Y)

Y = f(X)

- The idea is to approximate 'f' so that the output variables (Y) can be predicted from new data on the input variables (X) with the minimum possible error (ε)

Supervised learning problems can be grouped into prediction and classification problems
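As a small illustration of learning 'f' from labeled data, here is a sketch of a classification problem solved with a k-nearest-neighbours vote. The technique, the function name `knn_predict`, and the data are assumptions for the example; the notes do not single out any particular method here.

```python
from collections import Counter

def knn_predict(train_x, train_y, x_new, k=3):
    """Approximate f(x_new) by majority vote among the k nearest labeled points."""
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: abs(pair[0] - x_new))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: X = monthly spend, Y = responded to an offer.
train_x = [5, 8, 12, 40, 45, 52]
train_y = ["no", "no", "no", "yes", "yes", "yes"]

# Predict Y for new values of X that were not in the training data.
predictions = [knn_predict(train_x, train_y, x) for x in (10, 48)]
print(predictions)  # → ['no', 'yes']
```

A prediction problem has the same shape, except that Y is numeric and the vote would be replaced by, for example, an average of the neighbours' values.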

**Unsupervised Learning**

- In Unsupervised Learning, algorithms are used to learn the underlying structure or patterns hidden in the data

Unsupervised Learning problems can be grouped into clustering and association rule learning problems
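A clustering problem can be sketched with a bare-bones one-dimensional k-means loop. No output labels are given; the algorithm groups the values purely by their structure. The function `kmeans_1d`, the starting centers, and the data are assumptions for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and
    moving each center to the mean of its assigned values."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled measurements with two visible groups.
values = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers, clusters)
```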

**Target Population**

- Subset of the population under study

- Results are generalized to the target population

**Sample**

- Subset of the target population

**Simple Random Sampling**

- A sampling method in which each observation has an equal chance of being selected

**Random Sampling**

- A sampling method in which each observation does not necessarily have an equal chance of being selected

**Sampling with Replacement**

- Sample values are independent

**Sampling without Replacement**

- Sample values aren't independent
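The sampling schemes defined above map onto functions in Python's standard `random` module. The population, sample size, and weights below are made up for the example.

```python
import random

rng = random.Random(42)               # fixed seed so the draw is reproducible
population = list(range(1, 101))      # hypothetical population of 100 units

# Simple random sampling without replacement: equal chance, no repeats.
srs = rng.sample(population, k=10)

# Sampling with replacement: equal chance, the same unit can be drawn again.
with_replacement = rng.choices(population, k=10)

# Random sampling with unequal chances (still with replacement):
# units above 50 are twice as likely to be drawn as the rest.
weights = [2 if unit > 50 else 1 for unit in population]
weighted = rng.choices(population, weights=weights, k=10)

print(srs, with_replacement, weighted, sep="\n")
```

Without replacement, each draw changes what is left to draw from, which is why the sample values are not independent.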

Sampling yields fewer observations than the total number of observations in the dataset

Data mining algorithms

- Varying limitations on the number of observations and variables

Limitations due to computing power and storage capacity

Limitations due to the statistical technique being used

How many observations are needed to build accurate models?

Rare events, e.g., low response rates in advertising by traditional mail or email

- Oversampling of 'success' cases

- Arise mainly in classification tasks

- Costs of misclassification

- Costs of failing to identify 'success' cases generally exceed the costs of a detailed review of all cases

- Prediction of 'success' is likely to come at the cost of misclassifying more 'failure' cases as 'success' cases than usual
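Oversampling of 'success' cases can be sketched as random duplication of the rare class until both classes are the same size. This is one simple variant among several; the function `oversample`, the 95/5 class split, and the balancing target are assumptions for illustration. In practice it is usually applied only to the training partition.

```python
import random

def oversample(rows, label_of, success="success", seed=0):
    """Duplicate randomly chosen success rows until both classes are equal in size."""
    rng = random.Random(seed)
    successes = [r for r in rows if label_of(r) == success]
    failures = [r for r in rows if label_of(r) != success]
    shortfall = max(0, len(failures) - len(successes))
    extra = rng.choices(successes, k=shortfall)   # drawn with replacement
    return failures + successes + extra

# Hypothetical mailing data: 5 responders ('success') out of 100 recipients.
rows = ([("f%d" % i, "failure") for i in range(95)]
        + [("s%d" % i, "success") for i in range(5)])

balanced = oversample(rows, label_of=lambda row: row[1])
n_success = sum(1 for row in balanced if row[1] == "success")
print(len(balanced), n_success)  # → 190 95
```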