Sunday 5 August 2018

Analysis of Variance (ANOVA) and F-test in R Language

Irawen August 05, 2018 R No comments

Analysis of Variance (ANOVA)

A statistical method for making simultaneous comparisons between two or more means; a statistical method that yields values that can be tested to determine whether a significant relation exists between variables.

Examples :-

A car company wishes to compare the average petrol consumption of THREE similar models of car and has available six vehicles of each model.

A teacher is interested in a comparison of average percentage marks attained in the examinations of FIVE different subjects and has available the marks of eight students who all completed each examination.

ANOVA compares how much the groups vary internally with how much they differ from one another. Taking the examples above:
ANOVA calculates the mean for each group.
It calculates the mean for all the groups combined - the Overall Mean.
Then it calculates, within each group, the total deviation of each individual's score from the Group Mean - Within Group Variation.
Next, it calculates the deviation of each Group Mean from the Overall Mean - Between Group Variation.
Finally, ANOVA produces the F statistic, which is the ratio of the Between Group Variation to the Within Group Variation.
If the Between Group Variation is significantly greater than the Within Group Variation, then it is likely that there is a statistically significant difference between the groups.
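
The steps above can be sketched in R. The petrol consumption figures below are invented for illustration, and the variable names are assumptions rather than part of any real dataset. The sketch measures each variation as a sum of squared deviations, divides by the degrees of freedom, and checks the hand-computed F against the value reported by aov().

```r
# Hypothetical petrol consumption (km per litre) for three car models,
# six vehicles each; the numbers are made up for illustration.
consumption <- c(14.1, 13.8, 14.5, 14.0, 13.9, 14.3,
                 15.2, 15.0, 15.6, 15.1, 14.9, 15.4,
                 13.2, 13.5, 13.0, 13.4, 13.1, 13.3)
model <- factor(rep(c("A", "B", "C"), each = 6))

group_means  <- tapply(consumption, model, mean)    # mean of each group
overall_mean <- mean(consumption)                   # mean of all groups combined
n_per_group  <- tapply(consumption, model, length)  # observations per group

# Within Group Variation: squared deviations from each group's own mean
ss_within <- sum((consumption - group_means[as.character(model)])^2)

# Between Group Variation: squared deviations of group means from the overall mean
ss_between <- sum(n_per_group * (group_means - overall_mean)^2)

k <- nlevels(model)        # number of groups
N <- length(consumption)   # total number of observations

# F is the ratio of the between-group mean square to the within-group mean square
f_by_hand <- (ss_between / (k - 1)) / (ss_within / (N - k))

# The same statistic as reported by aov()
f_from_aov <- summary(aov(consumption ~ model))[[1]][["F value"]][1]

print(c(by_hand = f_by_hand, from_aov = f_from_aov))
```

The two printed values should agree, which shows that aov() carries out the same decomposition.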

One way Analysis of Variance

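# Set the working directory to the folder that holds the data file
setwd("C:\\PERSONAL\\Irawen_Business_Analytics_With_R\\WIP\\Class-8 data-1")
# Read the comma-separated file; the first row holds the column names
data.ex1 <- read.table("Class-8 data.txt", sep = ",", header = TRUE)
# Fit a one-way ANOVA of Alertness on Dosage and print the ANOVA table
aov.ex1 <- aov(Alertness ~ Dosage, data = data.ex1)
summary(aov.ex1)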

Two way Analysis of Variance

Data are from an experiment in which alertness level of male and female subjects was measured after they had been given one of two possible dosages of a drug. Thus, this is a 2x2 design with the factors being Gender and Dosage
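
A minimal sketch of how such a 2x2 design might be analysed. The scores below are invented, and the names data.ex2 and aov.ex2 are assumptions that mirror the one-way example; they are not the data from the experiment described above.

```r
# Hypothetical 2x2 data: four subjects in each Gender-by-Dosage cell.
# The Alertness scores are invented for illustration only.
data.ex2 <- data.frame(
  Gender    = factor(rep(c("m", "f"), each = 8)),
  Dosage    = factor(rep(rep(c("a", "b"), each = 4), times = 2)),
  Alertness = c(8, 12, 13, 12, 6, 7, 23, 14,
                15, 12, 22, 14, 15, 12, 18, 22)
)

# Gender * Dosage expands to both main effects plus their interaction
aov.ex2 <- aov(Alertness ~ Gender * Dosage, data = data.ex2)
print(summary(aov.ex2))
```

The summary table has one row for each main effect, one for the Gender:Dosage interaction and one for the residuals.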

Uses of Analysis of Variance

→ ANOVA is a particular form of statistical hypothesis testing heavily used in the analysis of experimental data.

→ A statistical hypothesis test is a method of making decisions using data. A test result (calculated from the null hypothesis and the sample) is called statistically significant if it is deemed unlikely to have occurred by chance, assuming the truth of the null hypothesis. A statistically significant result, when a probability (p-value) is less than a threshold (significance level), justifies the rejection of the null hypothesis, but only if the a priori probability of the null hypothesis is not high.

→ In the typical application of ANOVA, the null hypothesis is that all groups are simply random samples of the same population. This implies that all treatments have the same effect (perhaps none). Rejecting the null hypothesis implies that different treatments result in altered effects.

More on ANOVA

ANOVA is the synthesis of several ideas and it is used for multiple purposes. As a consequence, it is difficult to define concisely or precisely.

Classical ANOVA for balanced data does three things at once:
- As exploratory data analysis, an ANOVA is an organization of an additive data decomposition, and its sums of squares indicate the variance of each component of the decomposition (or, equivalently, each set of terms of a linear model).
- Comparisons of mean squares, along with F-tests ... allow testing of a nested sequence of models.
- Closely related to the ANOVA is a linear model fit with coefficient estimates and standard errors. In short, ANOVA is a statistical tool used in several ways to develop and confirm an explanation for the observed data.

Additionally:

- It is computationally elegant and relatively robust against violations of its assumptions.
- ANOVA provides industrial strength (multiple sample comparison) statistical analysis.
- It has been adapted to the analysis of a variety of experimental designs.

F-test

An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis.

It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled.

Exact "F-tests" mainly arise when the models have been fitted to the data using least squares.

The name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher.

Fisher initially developed the statistic as the variance ratio in the 1920s.

The formula for the one-way ANOVA F-test statistic is:
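
In standard textbook notation (the symbols are an assumption, since none are defined here), with K groups, n_i observations in group i, N observations in total, group means \bar{Y}_{i\cdot} and overall mean \bar{Y}, the statistic can be written as:

```latex
F = \frac{\text{between-group variability}}{\text{within-group variability}}
  = \frac{\displaystyle\sum_{i=1}^{K} n_i \left(\bar{Y}_{i\cdot} - \bar{Y}\right)^2 \Big/ (K - 1)}
         {\displaystyle\sum_{i=1}^{K} \sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}_{i\cdot}\right)^2 \Big/ (N - K)}
```

Under the null hypothesis this ratio follows an F-distribution with K - 1 and N - K degrees of freedom.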

Decision Making in R Language

Irawen August 04, 2018 R No comments

Decision Making

A decision making structure evaluates a condition and, depending on the result, runs a different piece of code.

There are three types of decision making statements in R:

If Statement
If-Else Statement
Switch Statement

If Statement

An if statement consists of a Boolean expression followed by one or more statements.

Syntax of if Statement :-

if (Boolean_expression)
{
# Code to be executed when the expression is TRUE
}

If-Else Statements

An if statement can be followed by an optional else statement, which executes when the Boolean expression is false.

The syntax of if else is :

if (Boolean_expression)
{
# Code to be executed when the expression is TRUE
} else {
# Code to be executed when the expression is FALSE
}

Example :-
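
A minimal sketch of an if-else statement. The names x and parity are chosen only for illustration. Note that else sits on the same line as the closing brace, which R requires at the top level of a script.

```r
# x is a hypothetical value chosen for illustration
x <- 7

if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}

print(parity)
```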

Switch Statement

A switch statement allows a variable to be tested for equality against a list of values.

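R has the following switch statement :

switch (expression, case1, case2, ...)
→ We can have any number of cases in switch.
→ There is no default keyword in switch. When the expression is a character string, a final unnamed case is used if nothing else matches.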
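
A short sketch of both forms of switch(). The function name describe and the case labels are hypothetical. Although switch() has no default keyword, a trailing unnamed argument serves as a fallback when the selector is a character string.

```r
# Numeric selector: returns the argument at that position
second <- switch(2, "red", "green", "blue")

# Character selector: matched against the argument names; the final
# unnamed argument is returned when no name matches
describe <- function(op) {
  switch(op,
         add = "addition",
         sub = "subtraction",
         "unknown operation")
}

print(second)
print(describe("add"))
print(describe("pow"))
```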

Operators used in R Language

Irawen August 03, 2018 R No comments

Operators

Operators are special symbols or phrases that programmers use to check, combine or change values.

E.g., '+' is an operator, used to add two values, like 2 + 7

There are two types of operators:

Unary Operators : Operate on one operand. E.g. -4 , !x
Binary Operators : Operate on two operands. E.g. 5 + 7

Operators are further divided into

Arithmetic Operators
Relational Operators
Logical Operators

Arithmetic Operators

They are used to perform arithmetic and mathematical operations.

Examples :-
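
A brief sketch of the arithmetic operators. The values 7 and 2 and the variable names are arbitrary choices for illustration.

```r
a <- 7
b <- 2

sum_ab       <- a + b     # addition
difference   <- a - b     # subtraction
product      <- a * b     # multiplication
quotient     <- a / b     # division
power        <- a ^ b     # exponentiation
remainder    <- a %% b    # remainder after division
int_quotient <- a %/% b   # integer division

print(c(sum_ab, difference, product, quotient, power, remainder, int_quotient))
```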

Relational Operators

Relational operators are used to compare two values or see the relation between values.

Examples :-
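
A brief sketch of the relational operators, using two arbitrary values. Each comparison returns TRUE or FALSE, and comparisons also work element by element on vectors.

```r
a <- 5
b <- 8

results <- c(
  greater          = a > b,
  less             = a < b,
  equal            = a == b,
  not_equal        = a != b,
  greater_or_equal = a >= b,
  less_or_equal    = a <= b
)
print(results)

# Comparisons are vectorised: each element is compared in turn
above_four <- c(1, 5, 9) > 4
print(above_four)
```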

Logical Operators

They are used to perform logical operations on values.

Examples :-
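
A brief sketch of the logical operators on two hypothetical logical vectors, p and q. The single-character forms work element by element, while && and || compare single values.

```r
p <- c(TRUE, TRUE, FALSE)
q <- c(TRUE, FALSE, FALSE)

and_result <- p & q   # element-wise AND
or_result  <- p | q   # element-wise OR
not_result <- !p      # negation

# && and || work on single values and are the usual choice in if() conditions
single_and <- TRUE && FALSE
single_or  <- TRUE || FALSE

print(and_result)
print(or_result)
print(not_result)
```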

Variables and Data types in R Language

Irawen August 03, 2018 R No comments

R- Variables

A variable is named storage that our program can use.

A variable can be assigned a value in three ways:

var_name <- value # leftward operator
var_name = value # assignment operator
value -> var_name # rightward operator

To see the data type of a variable we use class( ) function.

To see the list of variables we use ls( ) function.

To delete a variable we use rm( ) function.
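
A short sketch of the three functions in use. The variable names are made up for illustration.

```r
count <- 10L          # integer
price <- 19.99        # numeric
label <- "widget"     # character

print(class(count))
print(class(price))
print(class(label))

print(ls())           # names of the objects in the current environment

rm(price)             # delete one variable
print(exists("price", inherits = FALSE))
```

After rm(price), the final line reports that price no longer exists in the current environment.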

R- Data types

When we create a variable, R stores it in memory according to the value it holds.

Memory is allocated based on the type of that value.

Following are the basic Data Types:

Vectors : a combination of values.
List : Can contain many different types of objects.
Matrices : A two dimensional Data Set.
Arrays : Multi-dimensional Data Set.
Factors : Factors store Vectors along with labels.
Data Frames : Tabular Data objects, can have multiple types.

Data Management : Sorting and Ordering in R Language

Irawen August 03, 2018 R No comments

Sorting

Sort function sorts the values of a vector in ascending order (by default) or descending order.

Syntax

sort (x, decreasing = FALSE, ...)
sort (x, decreasing = FALSE, na.last = NA, ...)

x                       Vector of values to be sorted

decreasing        Should the sort be increasing or decreasing

na.last               for controlling the treatment of NAs.
                          If TRUE, missing values in the data are put last;
                          If FALSE, they are put first;
                          If NA, they are removed.

Example

> y <- c(8,5,7,6)
> y
[1] 8 5 7 6

> sort (y)
[1] 5 6 7 8

> sort (y , decreasing = TRUE)
[1]   8 7 6 5
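
A short sketch of how na.last changes the result. The vector v is hypothetical and contains one missing value.

```r
# Hypothetical vector containing a missing value
v <- c(8, NA, 5, 7)

dropped <- sort(v)                    # na.last = NA (the default): NA is removed
last    <- sort(v, na.last = TRUE)    # NA is placed at the end
first   <- sort(v, na.last = FALSE)   # NA is placed at the start

print(dropped)
print(last)
print(first)
```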

Ordering

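The order function returns the index positions that would arrange a vector in ascending (by default) or descending order; it does not return the sorted values themselves.

Syntax

order (x, decreasing = FALSE, ...)
order (x, decreasing = FALSE, na.last = TRUE, ...)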

x                         Vector of values to be sorted

decreasing          Should the sort be increasing or decreasing

na.last                 for controlling the treatment of NAs.
   If TRUE, missing values in the data are put last;
   If FALSE, they are put first;
   If NA, they are removed.

Example

> y <-    c(8,5,7,6)
> y
[1] 8 5 7 6

> order (y)
[1]   2 4 3 1

> order (y, decreasing = TRUE)
[1] 1 3 4 2
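
The index vector returned by order() is mostly useful for rearranging other objects. A minimal sketch follows; the data frame scores and its contents are invented for illustration.

```r
y <- c(8, 5, 7, 6)

idx <- order(y)
sorted_y <- y[idx]    # indexing with the order gives the same result as sort(y)

# Hypothetical data frame, reordered by one of its columns
scores <- data.frame(name = c("Asha", "Ben", "Chen", "Dev"), marks = y)
by_marks <- scores[order(scores$marks, decreasing = TRUE), ]

print(sorted_y)
print(by_marks)
```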

Data Management : Sequences in R Language

Irawen August 01, 2018 R No comments

Sequences

The regular sequences can be generated in R.

Syntax :-

seq ( )

seq (from = 1, to = 1, by = ((to - from) / (length.out - 1)), length.out = NULL, along.with = NULL, ...)

Examples:

> seq (10)
[1] 1 2 3 4 5 6 7 8 9 10

is the same as
> seq (1 : 10)

Assignment of an index-vector

> x <- c (9,8,7,6)
> ind <- seq (along = x)
> ind
[1] 1 2 3 4

Accessing a value in the vector through index vector
→ Accessing an element of an index-vector
> x [ind [2] ]
   [1] 8

Generating sequence of dates

Generating current time and date

Sys.time ( ) command provides the current time and date from the computer system.
> Sys.time ( )
[1] "2017-01-01 09:17:01 IST"

Sys.Date ( ) command provides the current date from the computer system.
> Sys.Date ( )
    [1] "2017-01-01"

Usage
seq (from, to, by, length.out = NULL, along.with = NULL, ...)

Arguments
from             starting date (Required)
to                  end date (Optional)
by                 Increment of the sequence. "day" , "week" , "month" , "quarter" or "year".
length.out     Integer, optional. Desired length of the sequence.
along.with    take the length from the length of this argument.

Sequence of first day of years

> seq (as.Date ("2010-01-01") , as.Date ("2017-01-01"), by = "years")
[1] "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01"
[5] "2014-01-01" "2015-01-01" "2016-01-01" "2017-01-01"

Sequence of days

> seq (as.Date ("2017-01-01") , by = "days", length = 6)

Sequence of months

> seq (as.Date ("2017-01-01") , by = "months", length = 6)

Sequence of years

> seq (as.Date ("2017-01-01") , by = "years", length = 6)

To find sequence with defining start and end dates
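
A minimal sketch with both a start and an end date supplied. The dates and variable names are chosen only for illustration. The by argument may also carry a multiplier, such as "2 weeks".

```r
start_date <- as.Date("2017-01-01")
end_date   <- as.Date("2017-06-01")

# First day of each month from start to end, inclusive
monthly <- seq(start_date, end_date, by = "month")
print(monthly)

# Every second week over the same range
fortnightly <- seq(start_date, end_date, by = "2 weeks")
print(fortnightly)
```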

R for Everyone: Advanced Analytics and Graphics

Irawen July 31, 2018 Books, R No comments

Using the open source R language, you can build powerful statistical models to answer many of your most challenging questions. R has traditionally been difficult for non-statisticians to learn, and most R books assume far too much knowledge to be of help. R for Everyone, Second Edition, is the solution.


About the Author -

Jared P. Lander is the Chief Data Scientist of Lander Analytics, a New York-based data science firm that specializes in statistical consulting and training services. He is the organizer of the New York Open Statistical Programming Meetup (the world's largest R meetup) and the New York R Conference, and an adjunct professor of statistics at Columbia University. With an M.A. from Columbia University in statistics and a B.S. from Muhlenberg College in mathematics, he has experience in both academic research and industry. Very active in the data community, Jared is a frequent speaker at conferences, universities and meetups around the world. His writings on statistics have been featured in publications such as Forbes and the Wall Street Journal.

Basic Calculations in R Language

Irawen July 31, 2018 R No comments

- > is the prompt sign in R

- The assignment operators are the left arrow with dash <- and equal sign =.

    > x <- 20 assigns the value 20 to x.
    > x = 20 assigns the value 20 to x.
    Initially only <- was available in R.

- > x = 20 assigns the value 20 to x.
   > y = x * 2 assigns the value 2 * x to y.
   > z = x + y assigns the value x + y to z.

# : The character # marks the beginning of a comment. All characters until the end of the line are ignored.
> # mu is the mean
> # x <- 20 is treated as comment only.

Capital and small letters are different.
> X <- 20 and > x <- 20 are different

The command c (1,2,3,4,5) combines the numbers 1,2,3,4 and 5 to a vector.

R as a calculator

> 2 + 3              # Command
[1] 5                  # Output

> 2 * 3              # Command
[1] 6                 # Output

Multiplication and Division x * y , x/y

> c ( 2,3,5,7) * 3
[1]   6 9 15 21

Addition and Subtraction x + y , x - y

> c (2,3,5,7) + 10
[1] 12 13 15 17

Business Analytics with R Language

Irawen July 30, 2018 R No comments

Business Analytics

Definition
"Study of business data using statistical technique and programming for creating decision support and insights for achieving business goals"

Who uses it ? How?
- Across Domain
* Dashboard
* Models
- Across A Company

Who creates it ? How?
- Skills Needed Business Perception

Business Intelligence
→ Business Intelligence is a set of theories, methodologies, processes, architecture, and technologies that transform raw data into meaningful and useful information for business purposes.

What is Data Science ?
→ Science of Studying Data:
Programming + Statistics + Business

PHP: The Complete Reference

Irawen July 28, 2018 Books, PHP No comments

PHP is a server-side programming language mainly used for web development and is also used as a general purpose programming language. It has become a rage in the Internet world. PHP: The Complete Reference, as the name suggests is a complete reference guide to the widely popular PHP.

This book deals with explaining how to personalize the PHP work space, define operators and variables, manipulate strings and arrays and the way in which one can use HTML. It also covers details on how to access database information, track client-side preferences using cookies, execute FTP and e-mail transactions and publish your applications to the Web. Additionally, this book deals in PHP's next generation Web 2.0 design features including AJAX, XML and RSS.

One can also learn to use PHP's object-oriented tools to build blogs, guest books and feedback pages with server-side file storage. PHP: The Complete Reference is a step by step guide to mastering PHP. Starting from the basic to the most advanced level, this book covers each aspect in great detail. This book was published by McGraw-Hill Education on 30 November 2007 and is available in paperback.

Key Features

Detailed coverage of PHP's next-generation Web 2.0 design features, including AJAX, XML and RSS is included.


Your One-Stop Guide to Web Development with PHP--Covers PHP 5.2

Build dynamic, cross-browser Web applications with PHP--the server-side programming language that's taken the Internet by storm. Through detailed explanations and downloadable code examples, this comprehensive guide shows you, step-by-step, how to configure PHP, create PHP-enabled Web pages, and put every advanced development tool to work.

PHP: The Complete Reference explains how to personalize the PHP work space, define operators and variables, manipulate strings and arrays, deploy HTML forms and buttons, and process user input. You'll learn how to access database information, track client-side preferences using cookies, execute FTP and e-mail transactions, and publish your applications to the Web. You'll also get in-depth coverage of PHP's next-generation Web 2.0 design features, including AJAX, XML, and RSS.

Install PHP and set up a customized development environment
Work with variables, operators, loops, strings, arrays, and functions
Integrate HTML controls, text fields, forms, radio buttons, and checkboxes
Accept and validate user-entered data from Web pages
Simplify programming using PHP's object-oriented tools
Build blogs, guest books, and feedback pages with server-side file storage
Write MySQL scripts that retrieve, modify, and update database information
Set cookies, perform FTP transactions, and send e-mails from PHP sessions
Build AJAX-enabled Web pages
Draw graphics on the server
Create XML components and add RSS feeds

About the Author

Steven Holzner is the author of over 100 technology books, many of which are bestsellers. His works mainly pertain to online applications and components of Ajax including JavaScript, XML, browser objects and Web services. Steven also teaches programming classes at Fortune 500 companies and has been on the faculty at Cornell University and MIT. His well known works include books like Ajax For Dummies and the Ajax Visual Blueprint. Steven has also worked as a contributing editor for PC magazine.

Tasks in Data Mining for R Language

Irawen July 26, 2018 R No comments

Anomaly Detection
→ Identification of unusual patterns, outliers, which help us in understanding the variation in data.
Example:-

Association Rule Mining
→ Also referred to as market basket analysis, this method is used for discovering interesting "association" patterns among the variables.
Example :- The beer-diaper syndrome

Clustering
→ Identifying groups/classes in data which are similar to each other.
The similarity inside the "cluster" is high and between the "clusters" is low.

Classification
→ Classification is the process of identifying which category an observation belongs to.
Example:-

Regression
→ With the help of regression, we can identify the extent of relationship among variables.
Understanding how the "dependent" variable varies with respect to the variation in "independent" variable.

Who uses R?

1. FACEBOOK :- For behavior analysis related to status updates and profile pictures.
2. GOOGLE :- For advertising effectiveness and economic forecasting.
3. TWITTER :- For data visualization and semantic clustering.

Tasks to be performed

Data Importing :- Import the "Houses for sale" dataset.
Data Pre-processing :- Understand the structure of data and find correlation between different data entities.
Data Mining :- Use Linear Regression to predict the rates of houses.
Pattern Evaluation :- Evaluate which model fits better for the dataset.
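
A rough sketch of how these four tasks might look in R. The "Houses for sale" dataset itself is not reproduced here, so the code simulates a stand-in; the column names area, bedrooms and price and all the numbers are assumptions made purely for illustration. With a real file, the first step would be a call such as read.csv() instead.

```r
set.seed(1)

# Data importing: a simulated stand-in for the real dataset
houses <- data.frame(
  area     = runif(50, min = 500, max = 3000),
  bedrooms = sample(1:5, 50, replace = TRUE)
)
houses$price <- 50000 + 120 * houses$area + 8000 * houses$bedrooms +
  rnorm(50, sd = 20000)

# Data pre-processing: inspect the structure and the pairwise correlations
str(houses)
print(cor(houses))

# Data mining: fit two candidate linear regression models
fit_area <- lm(price ~ area, data = houses)
fit_both <- lm(price ~ area + bedrooms, data = houses)

# Pattern evaluation: compare how well the two models fit
print(c(area_only         = summary(fit_area)$adj.r.squared,
        area_and_bedrooms = summary(fit_both)$adj.r.squared))
print(anova(fit_area, fit_both))
```

Adjusted R-squared and the anova() comparison are only two of several ways to judge which model fits better.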

Data Mining using R Language

Irawen July 26, 2018 R No comments

Why Data Mining ?

- I have this financial data with me, I need to find out if any of the transactions are fraudulent.
- I have this email data with me, I need to check how many of the mails are spam.
- I have this telecom data with me, I need to find out how many of the customers will churn out.

Data Mining to the rescue!
How do I obtain knowledge from this data?
→ Hey, you can use data mining techniques to find interesting insights from the data.

What is Data Mining?
→ Data Mining is the computing process of discovering patterns in large datasets involving methods at the intersection of machine learning, statistics, and database systems.

How should the Mined Information be?

New :- The extracted information should give us new patterns, relationships among the data entities.

Correct :- As everything that glitters is not gold, similarly, all the mined information might not be correct/valid. The mined information needs to be evaluated for its correctness before we use it for any other purpose.

Potentially useful :- As we extract useful products such as petrol, diesel etc. from crude oil, similarly, the mined information from raw data should be useful and relevant to us.

Knowledge Discovery in database

Tasks in KDD

1. Data Selection :- a) Data from

b) Data Warehouse

c) Target Data

2. Data Pre-Processing :-

a) The selected data must be appropriate for mining tasks

b) Simple operations such as summarizing, aggregation, normalization can be done to transform/consolidate the data such that it is suitable for mining.

3. Data Mining :-

a) This is the most important step in KDD process

b) Intelligent operations such as clustering, classification and regression are applied in order to extract patterns.

4. Pattern Evaluation :-

Once the data mining techniques have been applied, the obtained results need to be evaluated for their accuracy.

5. Knowledge Representation :-

The identified patterns must be represented using simple, aesthetic graphs.

Data Visualization in R Language

Irawen July 25, 2018 R No comments

Data visualization helps the organizations unleash the power of their most valuable assets:
- Their data and
- Their people

1. Pie Chart :-
Pie charts are best used when you are trying to compare parts of a whole.

2. Bar Chart :-
Bar graphs are used to compare things between different groups or to track changes over time.

3. Boxplot :-
Boxplots are used to summarize data from multiple sources and display the results in a single graph.

4. Histogram :-
Histograms are used to plot the frequency of score occurrences in a continuous data set that has been divided into classes, called bins.

5. Line Graph :-
Line graphs are used to track changes over short and long periods of time.

6. Scatter Plot :-
Scatter plots show how much one variable is affected by another.
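
Each of the six chart types has a counterpart in base R graphics. The sketch below uses the built-in mtcars and pressure datasets; the choice of columns and titles is only an illustration.

```r
# pdf(NULL) opens a graphics device that writes no file, so the sketch also
# runs in a non-interactive session; remove it to see the plots on screen
pdf(NULL)

cyl_counts <- table(mtcars$cyl)   # number of cars per cylinder count

pie(cyl_counts, main = "Share of cars by cylinders")                 # 1. pie chart
barplot(cyl_counts, main = "Cars per cylinder count")                # 2. bar chart
boxplot(mpg ~ cyl, data = mtcars, main = "MPG by cylinders")         # 3. boxplot
hist(mtcars$mpg, breaks = 8, main = "Distribution of MPG")           # 4. histogram
plot(pressure$temperature, pressure$pressure, type = "l",
     main = "Vapor pressure against temperature")                    # 5. line graph
plot(mtcars$wt, mtcars$mpg, main = "MPG against weight")             # 6. scatter plot

invisible(dev.off())
```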

Fundamental Concepts of R Language

Irawen July 24, 2018 R No comments

Variables in R

Variables are nothing but reserved memory locations to store values. This means that when you create a variable, you reserve some space in memory.

Data Operators

1. Arithmetic Operators
2. Assignment Operators
3. Relational Operators
4. Logical Operators
5. Special Operators

1. Arithmetic Operators

(" + ") → Adds two operands, or acts as unary plus.
    > 2 + 3
    [1] 5
    > +2
    [1] 2
(" - ") → Subtracts the right operand from the left, or acts as unary minus.
    > 3 - 1
    [1] 2
    > -2
    [1] -2
(" * ") → Multiplies two operands.
    > 2 * 3
    [1] 6
(" / ") → Divides the left operand by the right; the result is a double.
    > 6 / 3
    [1] 2
(" ^ ") → Raises the left operand to the power of the right.
    > 2 ^ 3
    [1] 8
(" %% ") → Remainder of the division of the left operand by the right.
    > 5 %% 2
    [1] 1
(" %/% ") → Integer division; the quotient is rounded down (toward the left on the number line).
    > 7 %/% 3
    [1] 2

2. Assignment Operators

(" = ") → x = <right operand>
    > x = 5
    > x
    [1] 5
(" <- ") → x <- <right operand>
    > x <- 15
    > x
    [1] 15
(" <<- ") → x <<- <right operand> (assigns in an enclosing environment; at the top level this is the global one)
    > x <<- 2
    > x
    [1] 2
(" -> ") → <left operand> -> x
    > 25 -> x
    > x
    [1] 25

3. Relational Operators

(" > ") → TRUE if the left operand is greater than the right
    > 2 > 3
    [1] FALSE
(" < ") → TRUE if the left operand is less than the right
    > 2 < 3
    [1] TRUE
(" == ") → TRUE if the left operand is equal to the right
    > 2 == 2
    [1] TRUE
(" != ") → TRUE if the left operand is not equal to the right
    > 2 != 3
    [1] TRUE
(" >= ") → TRUE if the left operand is greater than or equal to the right operand
    > 2 >= 3
    [1] FALSE
(" <= ") → TRUE if the left operand is less than or equal to the right operand
    > 2 <= 3
    [1] TRUE

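4. Logical Operators

Numbers are treated as logical values here: zero counts as FALSE and any other number as TRUE.

(" & ") → Element-wise AND: TRUE only if both x and y are TRUE
    > 2 & 3
    [1] TRUE
(" | ") → Element-wise OR: TRUE if x or y (or both) is TRUE
    > 2 | 3
    [1] TRUE
(" ! ") → Negation: TRUE if x is FALSE, FALSE otherwise
    > !1
    [1] FALSE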

5. Special Operators

(" : ") → It creates a series of consecutive numbers as a vector
    > x <- 2:8
    > x
    [1] 2 3 4 5 6 7 8
(" %in% ") → This operator is used to identify if an element belongs to a vector
    > x <- 2:8
    > y <- 5
    > y %in% x
    [1] TRUE

Data Type
We do not need to declare variables before using them.


Vectors :-
A Vector is a sequence of data elements of the same basic type.
      Example :
                  vtr = c(1,3,5,7,9)
                  or
                  vtr <- c(1,3,5,7,9)
There are five commonly used atomic vector types (logical, integer, numeric, complex and character), plus the less common raw type.

Lists :-

Lists are the R objects which contain elements of different types, such as numbers, strings, vectors and even another list.
    > n = c(2,3,5)
    > s = c("aa", "bb", "cc", "dd", "ee")
    > x = list(n, s, TRUE)

Arrays :-

Arrays are the R data objects which can store data in more than two dimensions.
It takes vectors as input and uses the values in the dim parameter to create an array.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
result <- array(c(vector1, vector2), dim = c(3,3,2))

Matrices :-

Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular layout.
A Matrix is created using the matrix() function.
matrix(data, nrow, ncol, byrow, dimnames)

- data is the input vector which becomes the data elements of the matrix.
- nrow is the number of rows to be created
- ncol is the number of columns to be created.
- byrow is a logical value. If TRUE then the vector elements are arranged by row.
- dimnames is the names assigned to rows and columns.
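
A small sketch that uses all five arguments. The values and the row and column names are arbitrary.

```r
# Fill a 2 x 3 matrix row by row and label its rows and columns
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

print(m)
print(m["r2", "b"])   # element in row "r2", column "b"
```

With byrow = TRUE the first row holds 1, 2, 3 and the second row holds 4, 5, 6.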

Factors:-

Factors are the data objects which are used to categorize the data and store it as levels.
They can store both strings and integers.
They are useful in data analysis for statistical modeling.

data <- c("East","West","East","North","North","East","West","West","East")
factor_data <- factor(data)

Data Frames :-

A data frame is a table or two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
emp_id <- c(1:5)
emp_name <- c("Rick","Dan","Michelle","Ryan","Gary")
salary <- c(623.3,515.2,611.0,729.0,843.25)
emp.data <- data.frame(emp_id, emp_name, salary)

Flow Control Statements

if → It evaluates a single condition
if .. else → It evaluates a condition and selects one of two groups of statements
switch → It checks the different known possibilities and selects the matching statements

Loops :-

repeat → Repeats things until a break statement is reached
while → Repeats things as long as the loop condition is true
for → Repeats things a given number of times
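
A minimal sketch of the three loops. The variable names and limits are arbitrary choices for illustration.

```r
# for: runs once for each element of 1:5
total <- 0
for (i in 1:5) {
  total <- total + i
}

# while: the condition is checked before every pass
n <- 1
while (n < 100) {
  n <- n * 2
}

# repeat: has no condition of its own, so it needs an explicit break
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}

print(c(total = total, n = n, count = count))
```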