Friday, 10 August 2018

Data Management : Factors in R Language

Categorical variables

Quantitative variables
Example:
Height (in meters) - 1.65, 1.76, ....

Qualitative variables
Example:
Gender - Male, Female
Performance - Excellent, Good, Average, Bad ....

Categorical variables
Example:
x : Gender - Male, Female
x = 0 if a person is male
x = 1 if a person is female
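
The 0/1 coding above can be sketched in R as follows. The sample vector gender is invented for illustration and is not data from the post.

```r
# Hypothetical sample data, invented for illustration
gender <- c("Male", "Female", "Female", "Male")

# x = 0 if a person is male, x = 1 if a person is female
x <- ifelse(gender == "Female", 1, 0)
print(x)   # 0 1 1 0
```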

Factors

Factors represent categorical variables and are used as grouping indicators.

Example:
Suppose we denote the three colors of balls in a basket by the following numbers:
Red = 1,  Blue = 2,  Green = 3

Suppose we draw five balls with the following colors:
Red, Green, Green, Blue, Red

This outcome can be coded by numbers as:
1, 3, 3, 2, 1

Each color name is mapped to a code.
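
As a minimal sketch of this example in R, the draw can be stored as a factor and its numeric codes inspected. The variable names draws, balls and codes are chosen here for illustration.

```r
# The five balls drawn, in order
draws <- c("Red", "Green", "Green", "Blue", "Red")

# Listing the levels explicitly fixes the mapping Red = 1, Blue = 2, Green = 3
balls <- factor(draws, levels = c("Red", "Blue", "Green"))
print(balls)

# The underlying numeric codes
codes <- as.integer(balls)
print(codes)   # 1 3 3 2 1
```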

The categories are stored internally as numeric codes, with labels to provide meaningful names for each code.

The order of the labels is important.
First label is mapped to code 1.
Second label is mapped to code 2 and so on.

The values of the codes are always restricted to 1,2,...,k, to represent k discrete categories.

Here "Red" is mapped to code 1,
"Blue" is mapped to code 2 and 
"Green" is mapped to code 3.
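
The mapping depends on the order of the levels. The sketch below, with illustrative variable names, compares R's default order (sorted distinct values) with an explicitly supplied order for the same data.

```r
draws <- c("Red", "Green", "Green", "Blue", "Red")

# Default: levels are the sorted distinct values, so Blue = 1, Green = 2, Red = 3
f_default <- factor(draws)
print(levels(f_default))       # "Blue" "Green" "Red"
print(as.integer(f_default))   # 3 2 2 1 3

# Explicit order: Red = 1, Blue = 2, Green = 3
f_custom <- factor(draws, levels = c("Red", "Blue", "Green"))
print(as.integer(f_custom))    # 1 3 3 2 1
```

Both factors print the same color names; only the internal codes differ.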

We start with a vector of character strings or integers.
R's term for a categorical variable is a factor.
In R, each possible value of a categorical variable is called a level.
A vector whose elements are levels is called a factor.

A categorical variable is characterized by a (here: finite) number of levels, called factor levels.
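
A short sketch of how the levels of a factor can be inspected. The performance ratings below are invented sample data based on the categories mentioned earlier.

```r
# Hypothetical sample data, invented for illustration
performance <- c("Good", "Excellent", "Bad", "Good", "Average")
perf <- factor(performance)

print(levels(perf))    # "Average" "Bad" "Excellent" "Good"
print(nlevels(perf))   # 4
print(table(perf))     # number of observations in each level
```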

To define a factor, we start with
  • a vector of values,
  • a second vector that gives the collection of possible values, and 
  • a third vector that gives labels to the possible values.
The factor() function encodes the vector of discrete values into a factor:
  factor(x)
          where x is a vector of strings or integers.
If the vector contains only a subset of the possible values rather than all of them, include a second argument that gives the possible levels of the factor:
  factor(x, levels)
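
To illustrate the second form, the sketch below assumes a sample in which no blue ball was drawn. The vector x and the names f1 and f2 are illustrative.

```r
# Only two of the three possible colors appear in the data
x <- c("Red", "Red", "Green")

f1 <- factor(x)                                       # levels taken from the data
f2 <- factor(x, levels = c("Red", "Blue", "Green"))   # all possible levels given

print(nlevels(f1))   # 2
print(nlevels(f2))   # 3
print(table(f2))     # "Blue" is listed with a count of 0
```

Supplying levels keeps the unobserved category in summaries such as table().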

Usage
factor(x = character(), levels, labels = levels, exclude = NA, ...)


  • levels : Determines the categories of the factor variable. Default is the sorted list of all the distinct values of x.
  • labels : (Optional) Vector of values that will be the labels of the categories in the levels argument.
  • exclude : (Optional) Defines which values are left out when the levels are formed; those values appear as NA in any output using the factor variable.
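
A minimal sketch of the labels and exclude arguments, reusing the ball codes from the earlier example. The variable names are illustrative.

```r
codes <- c(1, 3, 3, 2, 1)

# labels gives a meaningful name to each value listed in levels
f <- factor(codes, levels = c(1, 2, 3), labels = c("Red", "Blue", "Green"))
print(f)   # Red Green Green Blue Red

# exclude drops a value from the levels; its occurrences become NA
g <- factor(codes, exclude = 3)
print(levels(g))   # "1" "2"
print(g)           # 1 <NA> <NA> 2 1
```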
