
Friday 7 December 2018

Nonlinear Least Square in R Language

 Nonlinear Least Square
When modeling real-world data for regression analysis, we observe that it is rarely the case that the equation of the model is a linear equation giving a linear graph. Most of the time, the equation of the model of real-world data involves mathematical functions of higher degree, like an exponent of 3 or a sine function. In such a scenario, the plot of the model gives a curve rather than a line. The goal of both linear and nonlinear regression is to adjust the values of the model's parameters to find the line or curve that comes closest to your data. Having found these values, we can estimate the response variable with good accuracy.
In least-squares regression, we establish a regression model in which the sum of the squares of the vertical distances of the data points from the regression curve is minimized. We generally start with a defined model and assume some values for the coefficients. We then apply the nls() function of R to get the more accurate values along with the confidence intervals.

Syntax

The basic syntax for creating a nonlinear least square test in R is −
nls(formula, data, start)
Following is the description of the parameters used −
  • formula is a nonlinear model formula including variables and parameters.
  • data is a data frame used to evaluate the variables in the formula.
  • start is a named list or named numeric vector of starting estimates.

Example

We will consider a nonlinear model with an assumption of initial values for its coefficients. Next we will see what the confidence intervals of these assumed values are, so that we can judge how well these values fit into the model.
So let's consider the following equation for this purpose −
a = b1*x^2 + b2
Let's assume the initial coefficients to be 1 and 3 and fit these values into the nls() function.
xvalues <- c(1.6,2.1,2,2.23,3.71,3.25,3.4,3.86,1.19,2.21)
yvalues <- c(5.19,7.43,6.94,8.11,18.75,14.88,16.06,19.12,3.21,7.58)

# Give the chart file a name.
png(file = "nls.png")


# Plot these values.
plot(xvalues,yvalues)


# Take the assumed values and fit into the model.
model <- nls(yvalues ~ b1*xvalues^2+b2,start = list(b1 = 1,b2 = 3))

# Plot the chart with new data by fitting it to a prediction from 100 data points.
new.data <- data.frame(xvalues = seq(min(xvalues),max(xvalues),len = 100))
lines(new.data$xvalues,predict(model,newdata = new.data))

# Save the file.
dev.off()

# Get the sum of the squared residuals.
print(sum(resid(model)^2))

# Get the confidence intervals on the chosen values of the coefficients.
print(confint(model))
When we execute the above code, it produces the following result −
[1] 1.081935
Waiting for profiling to be done...
       2.5%    97.5%
b1 1.137708 1.253135
b2 1.497364 2.496484

We can conclude that the value of b1 is closer to 1, while the value of b2 is closer to 2 and not 3.
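The point estimates behind those intervals can also be read off the fitted object directly. A minimal sketch, refitting the same model as above and printing the estimates with coef():

```r
# Same data as in the example above.
xvalues <- c(1.6, 2.1, 2, 2.23, 3.71, 3.25, 3.4, 3.86, 1.19, 2.21)
yvalues <- c(5.19, 7.43, 6.94, 8.11, 18.75, 14.88, 16.06, 19.12, 3.21, 7.58)

# Refit the nonlinear model with the same starting values.
model <- nls(yvalues ~ b1 * xvalues^2 + b2, start = list(b1 = 1, b2 = 3))

# coef() returns the least-squares estimates of b1 and b2,
# which lie near the middles of the confidence intervals shown above.
print(coef(model))
```

The estimates land inside the 2.5%–97.5% intervals printed by confint(), which is exactly why we can say b1 is near 1.2 and b2 near 2.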

Time Series Analysis in R Language

A time series is a series of data points in which each data point is associated with a timestamp. A simple example is the price of a stock in the stock market at different points of time on a given day. Another example is the amount of rainfall in a region in different months of the year. The R language uses many functions to create, manipulate and plot time series data. The data for a time series is stored in an R object called a time-series object. It is also an R data object, like a vector or data frame.
The time series object is created by using the ts() function.

Syntax

The basic syntax for ts() function in time series analysis is −
timeseries.object.name <-  ts(data, start, end, frequency)
Following is the description of the parameters used −
  • data is a vector or matrix containing the values used in the time series.
  • start specifies the start time for the first observation in time series.
  • end specifies the end time for the last observation in time series.
  • frequency specifies the number of observations per unit time.
Except for the "data" parameter, all other parameters are optional.

Example

Consider the annual rainfall details at a place starting from January 2012. We create an R time series object for a period of 12 months and plot it.
# Get the data points in form of a R vector.
rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)

# Convert it to a time series object.
rainfall.timeseries <- ts(rainfall,start = c(2012,1),frequency = 12)

# Print the timeseries data.
print(rainfall.timeseries)

# Give the chart file a name.
png(file = "rainfall.png")

# Plot a graph of the time series.
plot(rainfall.timeseries)

# Save the file.
dev.off()
When we execute the above code, it produces the following result and chart −
Jan    Feb    Mar    Apr    May     Jun    Jul    Aug    Sep
2012  799.0  1174.8  865.1  1334.6  635.4  918.5  685.5  998.6  784.2
        Oct    Nov    Dec
2012  985.0  882.8 1071.0
The Time series chart −


Different Time Intervals

The value of the frequency parameter in the ts() function decides the time intervals at which the data points are measured. A value of 12 indicates that the time series is for 12 months. Other values and their meanings are given below −

  • frequency = 12 pegs the data points for every month of a year.
  • frequency = 4 pegs the data points for every quarter of a year.
  • frequency = 6 pegs the data points for every 10 minutes of an hour.
  • frequency = 24*6 pegs the data points for every 10 minutes of a day.
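For instance, a quarterly series uses frequency = 4. A short sketch with made-up sales figures (the numbers are hypothetical, only the ts() mechanics matter):

```r
# Quarterly data: frequency = 4, starting in the 2nd quarter of 2012.
sales <- c(210, 185, 240, 260, 275, 230)   # hypothetical values
sales.timeseries <- ts(sales, start = c(2012, 2), frequency = 4)

# The six observations span Qtr2 2012 through Qtr3 2013.
print(sales.timeseries)
print(frequency(sales.timeseries))   # 4 observations per year
```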

Multiple Time Series

We can plot multiple time series in one chart by combining both the series into a matrix.
# Get the data points in form of a R vector.
rainfall1 <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
rainfall2 <- c(655,1306.9,1323.4,1172.2,562.2,824,822.4,1265.5,799.6,1105.6,1106.7,1337.8)

# Convert them to a matrix.
combined.rainfall <-  matrix(c(rainfall1,rainfall2),nrow = 12)

# Convert it to a time series object.
rainfall.timeseries <- ts(combined.rainfall,start = c(2012,1),frequency = 12)

# Print the timeseries data.
print(rainfall.timeseries)

# Give the chart file a name.
png(file = "rainfall_combined.png")

# Plot a graph of the time series.
plot(rainfall.timeseries, main = "Multiple Time Series")

# Save the file.
dev.off()
When we execute the above code, it produces the following result and chart −
           Series 1  Series 2
Jan 2012    799.0    655.0
Feb 2012   1174.8   1306.9
Mar 2012    865.1   1323.4
Apr 2012   1334.6   1172.2
May 2012    635.4    562.2
Jun 2012    918.5    824.0
Jul 2012    685.5    822.4
Aug 2012    998.6   1265.5
Sep 2012    784.2    799.6
Oct 2012    985.0   1105.6
Nov 2012    882.8   1106.7
Dec 2012   1071.0   1337.8
The Multiple Time series chart −

Saturday 1 December 2018

Analysis of Covariance in R Language

Introduction

We use regression analysis to create models which describe the effect of variation in predictor variables on the response variable. Sometimes we have a categorical variable with values like Yes/No or Male/Female. Simple regression analysis then gives multiple results, one for each value of the categorical variable. In such a scenario, we can study the effect of the categorical variable by using it along with the predictor variable and comparing the regression lines for each level of the categorical variable. Such an analysis is termed Analysis of Covariance, also called ANCOVA.

Example

Consider the R built-in data set mtcars. In it we observe that the field "am" represents the type of transmission (auto or manual). It is a categorical variable with values 0 and 1. The miles-per-gallon value (mpg) of a car can also depend on it, besides the value of horse power ("hp").

We study the effect of the value of "am" on the regression between "mpg" and "hp". This is done by using the aov() function followed by the anova() function to compare the multiple regressions.

Input Data

Create a data frame containing the fields "mpg", "hp" and "am" from the data set mtcars. Here we take "mpg" as the response variable, "hp" as the predictor variable and "am" as the categorical variable.

input <- mtcars[,c("am","mpg","hp")]
print(head(input))

When we execute the above code, it produces the following result -

                   am  mpg  hp
Mazda RX4           1 21.0 110
Mazda RX4 Wag       1 21.0 110
Datsun 710          1 22.8  93
Hornet 4 Drive      0 21.4 110
Hornet Sportabout   0 18.7 175
Valiant             0 18.1 105


ANCOVA Analysis

We create a regression model taking "hp" as the predictor variable and "mpg" as the response variable, taking into account the interaction between "am" and "hp".

Model with interaction between categorical variable and Predictor variable

#  Get the dataset.
input  <-  mtcars

#  Create the regression model.
result <- aov(mpg ~ hp*am, data = input)
print(summary(result))

When we execute the above code, it produces the following result -

            Df Sum Sq Mean Sq F value   Pr(>F)
hp           1  678.4   678.4  77.391 1.50e-09 ***
am           1  202.2   202.2  23.072 4.75e-05 ***
hp:am        1    0.0     0.0   0.001    0.981
Residuals   28  245.4     8.8
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result shows that both horse power and transmission type have a significant effect on miles per gallon, as the p-value in both cases is less than 0.05. But the interaction between these two variables is not significant, as the p-value is more than 0.05.

Comparing Two Models

Now we can compare the two models to conclude whether the interaction of the variables is truly insignificant. For this we use the anova() function.

# Get the dataset.
input  <- mtcars

# Create the regression models.
result1 <- aov(mpg ~ hp*am, data = input)
result2 <- aov(mpg ~ hp+am, data = input)

# Compare the two models.
print (anova(result1,result2) )

When we execute the above code, it produces the following result -

Model 1: mpg ~ hp * am
Model 2: mpg ~ hp + am
  Res.Df    RSS Df  Sum of Sq     F Pr(>F)
1     28 245.43
2     29 245.44 -1 -0.0052515 6e-04 0.9806

As the p-value is greater than 0.05, we conclude that the interaction between horse power and transmission type is not significant. So miles per gallon will depend in a similar manner on the horse power of the car in both auto and manual transmission modes.
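With the interaction dropped, the additive model's coefficients summarize the final fit: one common slope for horse power and a single intercept shift for transmission type. A minimal sketch (the interpretation in the comments is approximate, based on a standard fit of this model):

```r
# Fit the additive (parallel-slopes) model on the built-in mtcars data.
result2 <- aov(mpg ~ hp + am, data = mtcars)

# The coefficients give one slope for hp and one shift for am:
# mpg drops by roughly 0.059 per unit of horse power, and manual
# transmission (am = 1) adds roughly 5.3 mpg at a given horse power.
print(coefficients(result2))
```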

Friday 30 November 2018

Poisson Regression in R Language

Poisson regression involves regression models in which the response variable is in the form of counts and not fractional numbers. Examples are the number of births or the number of wins in a football match series. The values of the response variable follow a Poisson distribution.

The general mathematical equation for Poisson regression is -


log(y) = a + b1x1 + b2x2 + ... + bnxn

Following is the description of the parameters used -

  • y is the response variable.
  • a and b are the numeric coefficients.
  • x is the predictor variable.

The function used to create the Poisson regression model is the glm() function.

Syntax

The basic syntax for the glm() function in Poisson regression is -

glm(formula, data, family)

Following is the description of the parameters used -
  • formula is the symbol presenting the relationship between the variables.
  • data is the data set giving the values of these variables.
  • family is an R object used to specify the details of the model. Its value is 'poisson' for Poisson regression.
Example

We have the built-in data set "warpbreaks", which describes the effect of wool type (A or B) and tension (low, medium or high) on the number of warp breaks per loom. Let's consider "breaks" as the response variable, which is a count of the number of breaks. The wool "type" and "tension" are taken as predictor variables.

Input Data

input <- warpbreaks
print(head(input))


When we execute the above code, it produces the following result -

  breaks wool tension
1     26    A       L
2     30    A       L
3     54    A       L
4     25    A       L
5     70    A       L
6     52    A       L


Create Regression Model

output <- glm(formula = breaks ~ wool + tension,
              data = warpbreaks,
              family = poisson)
print(summary(output))


When we execute the above code, it produces the following result - 


Call :
glm( formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6871  -1.6503  -0.4269   1.1902   4.2616

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.69196    0.04541  81.302  < 2e-16 ***
woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
tensionM    -0.32132    0.06027  -5.332 9.73e-08 ***
tensionH    -0.51849    0.06396  -8.107 5.21e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1 )

       Null deviance: 297.37  on 53  degrees of freedom
Residual deviance: 210.39  on 50  degrees of freedom
AIC:  493.05

Number of Fisher Scoring iterations: 4



In the summary, we look for the p-value in the last column to be less than 0.05 to conclude that a predictor variable has an impact on the response variable. As seen, wool type B and the tension levels M and H all have a significant impact on the count of breaks.
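Because the model is fitted on the log scale, exponentiating a coefficient turns it into a multiplicative rate ratio, which is often easier to interpret than the raw estimate. A short sketch continuing the example above:

```r
# Refit the model and exponentiate the coefficients to get rate ratios.
output <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)
print(round(exp(coef(output)), 3))

# A ratio below 1 means fewer expected breaks: wool B, medium tension and
# high tension each reduce the expected break count relative to the
# baseline (wool A, low tension).
```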

Friday 19 October 2018

Data Mining Process in R Language

Phases in a typical Data Mining effort:

1.  Discovery
     Frame business problem
     Identify analytics component
     Formulate initial hypotheses

2.  Data Preparation
     Obtain datasets from internal and external sources
     Data consistency checks in terms of definitions of fields, units of measurement, time periods etc.,
     Sample

3.  Data Exploration and Conditioning
     Missing data handling, range reasonability, outliers
     Graphical or Visual Analysis
     Transformation, Creation of new variables, and Normalization
     Partitioning into Training, validation, and Test datasets

4.  Model Planning
     Determine the data mining task, such as prediction, classification etc.
     Select appropriate data mining methods and techniques, such as regression, neural networks, clustering etc.
     
5.  Model Building
     Building different candidate models using selected techniques and their variants using training data
     Refine and select the final model using validation data
     Evaluate the final model on test data
  
6.  Results Interpretation
      Model evaluation using key performance metrics

7. Model Deployment
       Pilot project to integrate and run the model on operational systems

Similar data mining methodologies developed by SAS and IBM Modeler (SPSS Clementine) are called SEMMA and CRISP-DM respectively

Data mining techniques can be divided into Supervised Learning Methods and Unsupervised Learning Methods

Supervised Learning
-  In supervised learning, algorithms are used to learn the function 'f' that can map input variables (X) into output variables (Y)
                        Y = f(X)
- The idea is to approximate 'f' such that new data on input variables (X) can predict the output variables (Y) with the minimum possible error (ε)


Supervised Learning problem can be grouped into prediction and classification problems

Unsupervised Learning
  -  In Unsupervised Learning, algorithms are used to learn the underlying structure or patterns hidden in the data

Unsupervised Learning problems can be grouped into clustering and association rule learning problems
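Both flavours can be sketched in a few lines of R on built-in data sets (cars and iris ship with R); this is only an illustration of the two settings, not a full mining workflow:

```r
# Supervised: learn the function f mapping speed (X) to stopping
# distance (Y) from labelled examples in the built-in cars data.
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))            # intercept and slope of the learned f

# Unsupervised: find structure (3 clusters) in the iris measurements
# without using any output variable.
set.seed(1)                 # kmeans starts randomly; fix the seed
km <- kmeans(iris[, 1:4], centers = 3)
print(table(km$cluster))    # sizes of the discovered clusters
```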

Target Population
  - Subset of the population under study
  - Results are generalized to the target population

Sample 
  - Subset of the target population

Simple Random Sampling
  - A sampling method wherein each observation has an equal chance of being selected.

Random Sampling
  - A sampling method wherein each observation does not necessarily have an equal chance of being selected

Sampling with Replacement
  - Sample values are independent

Sampling without Replacement
  - Sample values aren't independent

Sampling results in fewer observations than the total number of observations in the dataset
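The distinctions above map directly onto R's sample() function; a small sketch:

```r
set.seed(42)                     # make the draws reproducible
x <- 1:10

# Simple random sampling without replacement: 5 distinct values,
# each observation drawn at most once.
print(sample(x, 5))

# Sampling with replacement: draws are independent, so the same
# value may appear more than once.
print(sample(x, 5, replace = TRUE))
```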

Data Mining Algorithms
  - Varying limitations on the number of observations and variables

Limitations due to computing power and storage capacity

Limitations due to the statistical techniques being used

How many observations are needed to build accurate models?

Rare events, e.g., a low response rate in advertising by traditional mail or email
 - Oversampling of 'success' cases
 - Arise mainly in classification tasks
 - Costs of misclassification
 - Costs of failing to identify 'success' cases are generally higher than the costs of a detailed review of all cases
 - Prediction of 'success' is likely to come at the cost of misclassifying more 'failure' cases as 'success' cases than usual

Wednesday 17 October 2018

Steps to write a programme with Examples of Programming in R Language

Steps to write a programme
  • A programme is a set of instructions or commands which are written in a sequence of operations i.e., what comes first and what comes after that.
  • The objective of a programme is to obtain a defined outcome based on input variables.
  • The computer is instructed to perform the defined task.
  • A computer is an obedient worker, but it has its own language.
  • We do not understand the computer's language and the computer does not understand our language.
  • The software helps us and works like an interpreter between us and the computer.
  • We say something in the software's language and the software conveys it to the computer.
  • The computer does the task and reports back to the software.
  • The software translates it to our language and informs us.
  • A programme in R is written as a function, using the function() construct.
  • Write down the objective, i.e., what we want to obtain as an outcome.
  • Translate it in the language of R.
  • Identify the input and output variables.
  • Identify the nature of input and output, i.e., numeric string, factor, matrix etc.
  • Input and output variables can be single variable, vector, matrix or even a function itself.
  • The input variables are the components of the function which are reported in the argument of function().
  • The output of a function can also be input to another function.
  • The output of an outcome can be formatted as per the need and requirement.   

Tips :
  1. Loops usually slow down programmes, so it is better to use vectors and matrices.
  2. Use the # symbol to write comments that explain the syntax.
  3. Use variable names which are easy to understand.
  4. Don't forget to initialize the variables.
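Tip 1 in action: the vectorized sum() replaces an explicit loop and gives the same answer, typically much faster. A quick sketch:

```r
x <- 1:100000

# Loop version: slow in R because every iteration is interpreted.
total <- 0                # initialize the variable first (tip 4)
for (v in x) {
  total <- total + v
}

# Vectorized version: one call, same result.
print(total == sum(x))    # TRUE
```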


Example 1

Input variables : x, y, n (if x and y have different numbers of observations, choose different counts, say n1 and n2)


We need summation, so use the sum() function or alternatively compute it through vectors.
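The worked figures for this example did not survive, so here is one plausible version under the stated inputs: computing the arithmetic mean of x with sum() and, alternatively, through a vector product. The function names are hypothetical, introduced only for this sketch.

```r
# Hypothetical Example 1: arithmetic mean of x via the sum function.
my.mean <- function(x) {
  n <- length(x)                    # number of observations
  sum(x) / n                        # summation via sum()
}

# Alternative: the same summation through a vector product.
my.mean.vec <- function(x) {
  n <- length(x)
  as.numeric(rep(1, n) %*% x) / n   # ones-vector times x sums x
}

print(my.mean(c(2, 4, 6, 8)))       # 5
```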




Monday 15 October 2018

R Programming vs Python Comparison

R Programming

R is a programming language and free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.



Python


Python is an interpreted, high-level programming language for general-purpose programming. It has a design philosophy that notably uses significant whitespace.

 1. Ease of Learning :-
    R is harder to learn, while Python is easy to learn.

2. Speed :-
    R is generally slower, while Python is generally faster.

3. Data Handling Capabilities :-
    R is convenient for large datasets, while Python's data handling is progressing with new releases.

4. Graphics and Visualization :-
    R is easy and better, while in Python visualization is complex and tedious.

5. Deep Learning Support :-
    R is new to deep learning, while Python works amazingly well (TensorFlow).

6. Flexibility :-
    R suits statistical tests and models, while Python suits websites and applications.

7. Code Repository & Libraries :-
    R has a huge repository with more statistical libraries, while Python has fewer statistical libraries.

8. Popularity Index :-
  


9. Job Scenario :-
 


10. Community & Customer Support :-


Wednesday 10 October 2018

Data Management : Repeats in R Language

Repeats
  • The command rep is used to replicate the values in a vector.
  • The syntax rep(x) replicates the values in the vector x.
  • rep(x, times = n)  # Repeat x as a whole n times
  • rep(x, each = n)   # Repeat each cell n times
Help for the command rep

> help("rep")



rep { base }

      Replicate Element of Vectors and Lists

Description

rep replicates the values in x. It is a generic function, and the (internal) default method is described here.
rep.int and rep_len are faster simplified versions for two common cases.
They are not generic.

Usage

rep(x, ...)
rep.int(x, times)
rep_len(x, length.out)

The command rep

Repeat an object n-times:

> rep(3.5, times = 10)
 [1] 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5

> rep(1:4, 2)
[1] 1 2 3 4 1 2 3 4




Repeat an object n times:

rep(x, times = n)

Repeat each cell n times:

rep(x, each = n)

> x <- 1:3
> x
[1] 1 2 3

> rep(x, times = 3)
[1] 1 2 3 1 2 3 1 2 3

> rep(x, each = 3)
[1] 1 1 1 2 2 2 3 3 3

Every object is repeated several times successively:

> rep(1:4, each = 2)
[1] 1 1 2 2 3 3 4 4

> rep(1:4, each = 2, times = 3)
 [1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4


Every object is repeated a different number of times:

> rep(1:4, 2:5)
 [1] 1 1 2 2 2 3 3 3 3 4 4 4 4 4




Sunday 7 October 2018

Basics of Calculations_Calculator_Built in Function Assignments

Integer Division %/%

Integer Division: division in which the fractional part (remainder) is discarded

> c(2, 3, 5, 7) %/% c(2, 3)
[1] 1 1 2 2



Modulo Division (x mod y) %%:

x mod y: the modulo operation finds the remainder after division of one number by another

> c(2, 3, 5, 7) %% 2
[1] 0 1 1 1



Maximum: max



Minimum: min
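The worked examples for max and min did not survive, so here is a minimal sketch of both:

```r
# max() and min() return the largest and smallest element of a vector.
x <- c(2, 3, 5, 7)
print(max(x))   # [1] 7
print(min(x))   # [1] 2
```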




Overview Over Further Functions



Example :

> abs(-4)
[1] 4

> abs(c(-1, -2, -3, 4, 5))
[1] 1 2 3 4 5

> sqrt(4)
[1] 2

> sqrt(c(4, 9, 16, 25))
[1] 2 3 4 5

> sum(c(2, 3, 5, 7))
[1] 17

> prod(c(2, 3, 5, 7))
[1] 210

> round(1.23)
[1] 1

> round(1.83)
[1] 2



Assignments

Assignments can be made in two ways:

> x <- 6
> x
[1] 6

> mode(x)
[1] "numeric"

> x = 8
> x
[1] 8

> mode(x)
[1] "numeric"


An assignments can also be used to save values in variables:

> x1 <- c(1, 2, 3, 4)

> x2 <- x1^2

> x2
[1] 1 4 9 16

ATTENTION: R is case sensitive (X is not the same as x).
