R Programming Tutorial

Want to go from beginner to expert R programmer quickly? R is one of the best programming languages for working with statistics, math, and data science. This tutorial will help you learn R and build your first machine learning model. Let's get started.

Compiling and Running R from the Command Line

There are two ways to run R programs from the command line: Rscript, which is widely used and preferred, and R CMD BATCH, which is less commonly used. We can call either one directly from the command line or from a job scheduler.

You can also call these commands from a shell built into an IDE. The RStudio IDE, for example, comes with tools that enhance or manage Rscript and R CMD BATCH.

The source() function inside R is a good alternative to using the command line. It can also run a script, but to use it you have to be inside the R environment.
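As a minimal sketch of the source() approach, the snippet below writes a two-line script to a temporary file and then runs it from inside an R session. The file path and the variable names (script_path, greeting, total) are made up for illustration.

```r
# Write a tiny script to a temporary file (hypothetical path, for illustration)
script_path <- tempfile(fileext = ".R")
writeLines(c(
  'greeting <- "Hello from a sourced script"',
  'total <- 5 + 6'
), script_path)

# source() evaluates the file inside the current R session,
# so the objects it defines are available afterwards
source(script_path)
print(greeting)
print(total)
```

From a shell, the same kind of file could instead be run with Rscript followed by the file name, or with R CMD BATCH followed by the file name.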

R Language Built-in Data Sets

To list the data sets that are built into R, use the data() command with no arguments. Then find the one you want and pass its name to the data() function, like data(datasetname).

Show Data Sets in R

The question mark (?) can be used to ask for help on a dataset, for example ?mtcars.

To get a summary of every column in a dataset, use summary().

plot() is a function that is used to draw graphs.
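The commands above might be combined as in the following sketch. It uses the built-in mtcars data set; the variable names (available, mpg_summary) are chosen here for illustration, and pdf(NULL) is used only so that the plot does not write a file when the code runs non-interactively.

```r
# List the data sets shipped with the "datasets" package
available <- data(package = "datasets")$results[, "Item"]
print(head(available))

# Load one by name (mtcars is usually available already, so this is optional)
data(mtcars)

# ?mtcars   # opens the help page in an interactive session

# Summarise a single column
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Draw a scatter plot; pdf(NULL) opens a graphics device that creates no file
pdf(NULL)
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
invisible(dev.off())
```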

Let's create a test script and run it. Create a file named p1.R and save it in the home directory with the following contents:

Code Example:

  # Simple hello world code in R
  print("Hello World!")
  print("LinuxHint")
  print(5+6)

Running Hello World

R Data Frames

To store data in tables, R uses a structure called a data frame. A data frame is a list of vectors of equal length. For example, the following variable nm is a data frame containing three vectors x, y, and z:

  x = c(2, 3, 5)
  y = c("aa", "bb", "cc")
  z = c(TRUE, FALSE, TRUE)
  # nm is a data frame built from the three vectors above
  nm = data.frame(x, y, z)
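The following sketch builds the same kind of data frame and inspects it. It also shows, as an illustration of the "equal length" requirement, that combining a vector of length 3 with one of length 2 raises an error. The names w and bad are hypothetical.

```r
x <- c(2, 3, 5)
y <- c("aa", "bb", "cc")
z <- c(TRUE, FALSE, TRUE)

# Each vector becomes one column; column names are taken from the variable names
nm <- data.frame(x, y, z)
str(nm)

# Vectors whose lengths cannot be matched raise an error;
# tryCatch() captures the error message instead of stopping the script
bad <- tryCatch(
  data.frame(x, w = c(1, 2)),
  error = function(e) conditionMessage(e)
)
print(bad)
```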

R also ships with built-in data frames. mtcars is one such built-in data frame, which we will use as an example. See the code below:

  > mtcars
                  mpg  cyl  disp   hp  drat    wt ...
  Mazda RX4      21.0    6   160  110  3.90  2.62 ...
  Mazda RX4 Wag  21.0    6   160  110  3.90  2.88 ...
  Datsun 710     22.8    4   108   93  3.85  2.32 ...
                 ............

mtcars built-in data frame

The header is the top line of the table and contains the column names. Each horizontal line below it denotes a data row; each row starts with the row name, followed by the actual data. Each data member of a row is called a cell.

To retrieve the data in a cell, we enter its row and column coordinates inside the single square bracket '[]' operator, separated by a comma. The order is essential: the row comes first, then the comma, then the column. The cell value in the 2nd row and 2nd column is given as:

  > mtcars[2, 2]
  [1] 6

We can also use the row and column names instead of coordinates:

  > mtcars["Mazda RX4 Wag", "cyl"]
  [1] 6

The nrow function is used to find the number of rows in a data frame.

  > nrow(mtcars)     # number of data rows
  [1] 32

The ncol function is used to find the number of columns in a data frame.

  > ncol(mtcars)     # number of columns
  [1] 11
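The indexing forms described above can be tried together in one short sketch. The variable names on the left-hand side are chosen for illustration; the row and column names are those of the built-in mtcars data frame.

```r
# Cell lookup by position: row 2, column 2
cell_by_position <- mtcars[2, 2]

# The same cell looked up by row name and column name
cell_by_name <- mtcars["Mazda RX4 Wag", "cyl"]

# Leaving the column coordinate empty returns the whole row
one_row <- mtcars["Datsun 710", ]

# A whole column can be extracted with [[ ]] (or with mtcars$mpg)
one_column <- mtcars[["mpg"]]

# dim() returns the number of rows and columns in one call
dims <- dim(mtcars)
print(dims)
```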

R Programming Loops

We use loops when we want to automate some part of the code or repeat a sequence of instructions.

For loop in R

Suppose we want to print the same message for several years. Without a loop, we would have to repeat the statement once per year:

  print(paste("The year is", 2000))
  [1] "The year is 2000"
  print(paste("The year is", 2001))
  [1] "The year is 2001"
  print(paste("The year is", 2002))
  [1] "The year is 2002"
  print(paste("The year is", 2003))
  [1] "The year is 2003"
  print(paste("The year is", 2004))
  [1] "The year is 2004"
  print(paste("The year is", 2005))
  [1] "The year is 2005"

Instead of repeating the statement again and again, it is much easier to use a for loop, like this:

  for (year in c(2000,2001,2002,2003,2004,2005)){
    print(paste("The year is", year))
  }

  [1] "The year is 2000"
  [1] "The year is 2001"
  [1] "The year is 2002"
  [1] "The year is 2003"
  [1] "The year is 2004"
  [1] "The year is 2005"
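As a variation on the loop above, the sketch below stores the messages in a vector instead of printing them one by one, and compares the result with a vectorised call to paste(). The names years, messages, and vectorised are chosen for illustration.

```r
years <- 2000:2005

# Pre-allocate the result, then fill it inside the loop
messages <- character(length(years))
for (i in seq_along(years)) {
  messages[i] <- paste("The year is", years[i])
}
print(messages)

# paste() is vectorised, so the same result is possible without a loop
vectorised <- paste("The year is", years)
print(identical(messages, vectorised))
```

For simple element-by-element work like this, the vectorised form is usually considered more idiomatic in R; the explicit loop is still useful when each step depends on the previous one.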

While Loop in R

  while (expression)
  {
     statement
  }

If the result of expression is TRUE, the body of the loop is entered. The statements inside the loop are performed, and the flow returns to assess the expression again. The loop will repeat itself until the expression evaluates to FALSE, in which case the loop exits.

Example of while Loop:

  # i is initialized to 0
  i = 0
  while (i<5)
  {
    print (i)
    i=i+1
  }

  Output:
  [1] 0
  [1] 1
  [1] 2
  [1] 3
  [1] 4

In the above while loop, the expression is i<5, which evaluates to TRUE since 0 is less than 5. Hence, the body of the loop is executed, and i is printed and incremented. It is important to increment i inside the loop so that the condition eventually becomes FALSE. In the next iteration, the value of i is 1, and the loop continues. It repeats until i equals 5; the condition 5<5 then evaluates to FALSE, and the while loop exits.
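The sketch below restates the same loop while recording each value of i, and then shows one alternative way to end a loop, using break inside a loop whose condition is always TRUE. The variable names visited and j are hypothetical.

```r
# Record every value that the loop body sees
i <- 0
visited <- c()
while (i < 5) {
  visited <- c(visited, i)
  i <- i + 1          # without this increment the loop would never end
}
print(visited)

# An equivalent stopping rule written with an explicit break
j <- 0
while (TRUE) {
  if (j >= 5) break   # leave the loop once the limit is reached
  j <- j + 1
}
print(j)
```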

R Functions

To create a function, we use the function() directive. Functions are R objects of class function.

  f <- function() {
    ## some piece of instructions
  }

Notably, functions can be passed to other functions as arguments, and functions can be nested, which lets you define a function inside another function.

Functions can optionally have named arguments with default values. If an argument has no sensible default, you can set its default to NULL.

Some facts about R Function arguments:

The arguments included in the function definition are the formal arguments
The formals function returns a list of all the formal arguments of a function
Not every function call in R uses all the formal arguments
Function arguments might have default values, or they may be missing

  # Defining a function:
  f <- function (x, y = 1, z = 2, s = NULL) {
  }
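To make these points concrete, here is a small sketch with hypothetical functions (scale_values, apply_twice, outer_fn). It illustrates default values, a NULL default, formals(), passing a function as an argument, and defining one function inside another.

```r
# A function with one required argument, one default, and one NULL default
scale_values <- function(x, factor = 2, offset = NULL) {
  result <- x * factor
  if (!is.null(offset)) {
    result <- result + offset   # only applied when the caller supplies it
  }
  result
}

print(scale_values(1:3))                          # uses the defaults
print(scale_values(1:3, factor = 10, offset = 1)) # overrides both

# formals() lists the formal arguments of a function
arg_names <- names(formals(scale_values))
print(arg_names)

# A function that takes another function as an argument
apply_twice <- function(f, value) f(f(value))
print(apply_twice(function(v) v + 1, 5))

# A function defined inside another function
outer_fn <- function(n) {
  inner_fn <- function(m) m^2
  inner_fn(n) + 1
}
print(outer_fn(3))
```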

Creating a logistic regression model with built-in data set

The glm() function is used in R to fit a logistic regression. glm() is similar to lm(), but it takes some additional parameters. Its format looks like this:

  glm(X~Z1+Z2+Z3, family=binomial(link="logit"), data=mydata)

X depends on the values of Z1, Z2, and Z3, which means Z1, Z2, and Z3 are the independent variables and X is the dependent variable. The function takes an extra parameter, family, whose value binomial(link="logit") means that the link function is logit and the probability distribution of the regression model is binomial.
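As a runnable instance of this call format, the sketch below fits a logistic regression on the built-in mtcars data frame. The choice of am (transmission type, coded 0 or 1) as the dependent variable and hp and wt as the independent variables is an assumption made here for illustration, not part of the original example.

```r
# Fit a logistic regression: am is the 0/1 outcome, hp and wt are predictors
model <- glm(am ~ hp + wt, family = binomial(link = "logit"), data = mtcars)

# Inspect the estimated coefficients (intercept plus one per predictor)
print(coef(model))

# Fitted values on the response scale are probabilities between 0 and 1
fitted_probs <- predict(model, type = "response")
print(range(fitted_probs))
```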

Suppose we have an example where a student gets admission on the basis of two exam results. The data set contains the following items:

result_1 - Result-1 score
result_2 - Result-2 score
admitted - 1 if admitted or 0 if not admitted

In this example, admitted has two values: 1 if a student got admission and 0 if not. We have to build a model to predict whether a student gets admission. For this problem, admitted is the dependent variable, and result_1 and result_2 are the independent variables. The R code for that model is:

  > Model_1 <- glm(admitted ~ result_1 + result_2, family = binomial("logit"), data = data)

Suppose we have two results for a student: result-1 is 65% and result-2 is 90%. To estimate the probability that this student gets admission, our R code is as below:

  > in_frame <- data.frame(result_1=65, result_2=90)
  > predict(Model_1, in_frame, type="response")

  Output:
  0.9894302

The above output is a probability between 0 and 1. If it is less than 0.5, the prediction is that the student did not get admission (FALSE). If it is greater than 0.5, the prediction is TRUE, which means the student got admission. We use the round() function to convert the probability into a 0 or 1 prediction.

R code for that is as shown below:

  > round(predict(Model_1, in_frame, type="response"))

  Output: 1

A student will get admission as the output is 1. Moreover, we may also predict for other observations in the same manner.
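The admission data set itself is not included in this tutorial, so the sketch below generates a synthetic stand-in in order to make the steps runnable end to end. The simulated scores, the coefficients used to generate admitted, and the name admissions are all assumptions for illustration; the probabilities it produces will not match the output shown above.

```r
set.seed(42)   # make the synthetic data reproducible

# Synthetic stand-in for the admission data (NOT the tutorial's real data)
n <- 200
result_1 <- runif(n, 30, 100)
result_2 <- runif(n, 30, 100)
p_admit  <- plogis(-12 + 0.1 * result_1 + 0.1 * result_2)  # assumed rule
admitted <- rbinom(n, 1, p_admit)
admissions <- data.frame(result_1, result_2, admitted)

# Fit the model with the same formula as in the text
Model_1 <- glm(admitted ~ result_1 + result_2,
               family = binomial("logit"), data = admissions)

# Score one new student
in_frame <- data.frame(result_1 = 65, result_2 = 90)
prob <- predict(Model_1, in_frame, type = "response")
predicted_class <- round(prob)   # 1 = admitted, 0 = not admitted

print(prob)
print(predicted_class)
```

An explicit threshold such as ifelse(prob > 0.5, 1, 0) gives the same kind of 0 or 1 prediction and makes the cutoff easier to change.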

Using logistic regression model (scoring) with new data

When needed, we can save the model in a file. The R code for our trained model will look like this:

  the_model <- glm(my_formula, family=binomial(link='logit'),data=model_set)

This model can be saved with:

  save(the_model, file="filename")

After saving the file, you can load the model from it with this piece of R code:

  load(file="filename")

To apply the model to new data, you can use this line of code:

  model_set$pred <- predict(the_model, newdata=model_set, type='response')

NOTE: The result of load() should not be assigned to a variable in order to get the model. load() restores the saved object under the name it was saved with (here, the_model). New observations will not change anything in the model; it remains the same. We use the old model to make predictions about the new data, not to change the model.
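The save, load, and predict steps can be sketched as one round trip. The example assumes a model fitted on the built-in mtcars data (am explained by wt) and a temporary file path; model_path, new_data, before, and after are hypothetical names used only here.

```r
# Train a small model to have something to save (illustrative choice of formula)
the_model <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)
before <- predict(the_model, newdata = mtcars, type = "response")

# Save the model object to a temporary file, then remove it from the session
model_path <- tempfile(fileext = ".RData")
save(the_model, file = model_path)
rm(the_model)

# load() recreates the object under its original name and
# returns the names of the restored objects
restored_names <- load(model_path)
print(restored_names)

# Score new observations with the restored model
new_data <- data.frame(wt = c(2.5, 3.5))
new_data$pred <- predict(the_model, newdata = new_data, type = "response")
print(new_data)

# The restored model gives the same predictions as before saving
after <- predict(the_model, newdata = mtcars, type = "response")
print(isTRUE(all.equal(before, after)))
```

For saving a single object, saveRDS() and readRDS() are a common alternative, because readRDS() returns the object and so its result can be assigned to any variable name.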

Conclusion

I hope you have seen how R programming works in a basic way and how you can quickly get into action doing machine learning and statistics coding with R.