SOME OF MY FAVORITE R FUNCTIONS.

The functions and packages described below are not limited to the use cases discussed in this post. Some can be found in more than one package and can be applied in more than one way. 

I use the iris dataset for the examples. It has 150 rows and 5 columns, namely:

Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species

The pipe operator (%>%)

This operator comes from the magrittr package and is re-exported by dplyr, so loading dplyr makes it available. It is one of those functions you learn early on when you start doing a lot of visualization and deep-dive exploratory analysis. I picture it as a funnel that you sift your bulk data through until you are left with what you want.

While on the same package, there are some functions that are worth mentioning:

ends_with/starts_with: useful in selecting specific columns to keep based on pattern

summarise: collapses the data into a one-row summary, or one row per group when used after grouping, which is where it is most useful.

Example (for the pipe operator and the first two complementary functions):

library(dplyr)

# Keep the two "Length" columns, then take the mean of each
iris %>%
  select(ends_with("Length")) %>%
  summarise(Sepal_Length_Mean = mean(Sepal.Length),
            Petal_Length_Mean = mean(Petal.Length))

The results are:

Sepal_Length_Mean: 5.843
Petal_Length_Mean: 3.758
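
starts_with works the same way from the other end of the column name. Here is a minimal sketch, assuming dplyr is installed; the summary column names are my own choice:

```r
library(dplyr)

# Keep only the columns whose names begin with "Petal"
petal_means <- iris %>%
  select(starts_with("Petal")) %>%
  summarise(Petal_Length_Mean = mean(Petal.Length),
            Petal_Width_Mean  = mean(Petal.Width))

petal_means
```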

summarise_all: Say I want to know the number of distinct values in every one of my columns. Checking column by column is tiresome, and this provides all the counts at a glance. I have used it during data cleaning to eliminate columns that hold the same entry in every row: if we are studying relationships, such a column clearly has no variability and cannot help (you could keep one, as they are also used to uniquely identify each row). It is the alternative to running, one column at a time,

length(unique(iris$Species))

select_if: Select columns only if they meet a specific condition.

Example:

iris%>%
  summarise_all(n_distinct)%>%
  select_if(.!=1)%>%
  names()

The code above counts the distinct values in each column, keeps the columns that have more than one level and returns the names of those columns. The result can be saved into an object, say select_columns, and used to reduce the dimensions of your data set.

Example:

select_columns<-iris%>%
  summarise_all(n_distinct)%>%
  select_if(.!=1)%>%
  names()

data<-iris[,select_columns]

Now you could comfortably work with your new data.  
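
In dplyr 1.0.0 and later, summarise_all and select_if are marked as superseded, and the same idea can be written with select and where. Below is a sketch of that version, assuming a recent dplyr. The constant column Source and the object iris_const are made up here, purely so that there is something to drop:

```r
library(dplyr)

# Hypothetical constant column, added only so there is something to drop
iris_const <- iris %>% mutate(Source = "iris")

# Keep the names of the columns that have more than one distinct value
select_columns <- iris_const %>%
  select(where(~ n_distinct(.x) > 1)) %>%
  names()

data <- iris_const[, select_columns]
names(data)
```

This skips the intermediate one-row summary, because the distinct-value check runs inside select itself.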

group_by: Can be used to evaluate the distribution of the groups.

Example: 

iris %>%
  group_by(Species) %>%
  summarise(Frequency = n(),
            Percentage = 100 * Frequency / nrow(iris))  # multiply by 100 to report a percentage

The results are 50 and 33.33% for each of the three species: setosa, versicolor and virginica.

The three species are equally distributed in the dataset.
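
The same group_by and summarise pair also works for per-group statistics, not just counts. Here is a small sketch, assuming dplyr is loaded; the output column names are my own choice:

```r
library(dplyr)

# Mean sepal and petal length within each species
species_means <- iris %>%
  group_by(Species) %>%
  summarise(Sepal_Length_Mean = mean(Sepal.Length),
            Petal_Length_Mean = mean(Petal.Length))

species_means
```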

where: coupled with a predicate function such as is.numeric, it bulk-selects the columns that meet that condition. I have used it as follows:

iris%>%
  select(where(is.numeric))

This returns all the numeric (integer or double) columns, which for iris means every column except Species.

lapply and sapply

They help a lot when it comes to wrangling and transforming data, e.g., applying a function that changes date columns into, say, POSIXct columns.
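
A minimal sketch of that date conversion in base R. The data frame visits and its columns are hypothetical, invented only for this illustration:

```r
# Hypothetical data frame with two date-time columns stored as text
visits <- data.frame(
  id        = 1:3,
  check_in  = c("2021-01-05 08:30:00", "2021-01-06 09:15:00", "2021-01-07 10:00:00"),
  check_out = c("2021-01-05 17:00:00", "2021-01-06 18:45:00", "2021-01-07 16:30:00"),
  stringsAsFactors = FALSE
)

date_cols <- c("check_in", "check_out")

# lapply returns a list, so the converted columns can be assigned straight back
visits[date_cols] <- lapply(visits[date_cols], as.POSIXct,
                            format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# sapply simplifies its result to a vector when it can;
# here it reports the first class of each column
sapply(visits, function(col) class(col)[1])
```

I reach for lapply when I want to put the results back into the data frame, and sapply when I just want a quick vector to look at.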

scale and log

scale standardizes data (by default it centers each column to mean 0 and rescales it to standard deviation 1) and is mostly used before checking for correlations between numeric variables on different ranges. That is, one column might range from 1 to 20 and another from 0 to 100,000. log takes logs of the data (natural logs by default). I have used this to try to coerce my distributions into something closer to normal, which can be easily checked using a histogram.
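
A short sketch of both on the numeric iris columns, using base R only. The object names num and scaled are my own, and the choice of Petal.Width for the histograms is arbitrary:

```r
# The four numeric columns of iris
num <- iris[, 1:4]

# scale() centers each column to mean 0 and rescales it to standard deviation 1
scaled <- scale(num)
round(colMeans(scaled), 10)  # column means after scaling
apply(scaled, 2, sd)         # column standard deviations after scaling

# log() takes natural logs; compare the histograms before and after
hist(num$Petal.Width, main = "Petal.Width")
hist(log(num$Petal.Width), main = "log(Petal.Width)")
```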

kable

This is in the package knitr. My regression model summaries have never looked any cuter!!
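
A minimal sketch of how that can look, assuming knitr is installed. The model is the same kind of iris regression used elsewhere in this post, and the caption text and coef_table name are my own:

```r
library(knitr)

mod <- lm(Petal.Length ~ ., data = iris)

# coef(summary(mod)) is the coefficient matrix:
# estimate, standard error, t value and p value for each term
coef_table <- coef(summary(mod))

kable(coef_table, digits = 3,
      caption = "Petal.Length regressed on the other iris columns")
```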

While on knitting, I have to mention R Markdown reports. After my first class report using R Markdown, I never want to produce reports any other way. There is something good about combining code and text and, even more so, interpreting results in the same document you ran the analyses in.

tbl_regression

This is in gtsummary, and it also helps print clean tables of regression model output. The plus is that you can easily identify the reference level that the other levels are compared to when the model has categorical predictors.

Example :

library(gtsummary)
mod <- lm(Petal.Length ~ ., data = iris)
tbl_regression(mod)  # pass the fitted model itself, not summary(mod)

In the resulting table, you can see that setosa was taken as the reference level for Species.

Honorable mentions: for loops, arrange, pivot_wider and pivot_longer, model.matrix

PS: All the best as you continuously improve yourself.

~NMN



