Self Joins in R

Sometime back at work, I was faced with a dilemma. 
I am one of those people who can get stuck on something because I have a "feeling" there is a shorter way to do it. I am also not the biggest fan of writing several lines of code when there is an option to write fewer. I know that my tendency towards fewer lines of code is not always the best, because it can hinder readability. 

To explain my dilemma, I have created a sample data frame as follows:

Code:

data <- data.frame(
  ID = c(12, 12, 34, 36, 47, 47, 58, 68, 77, 77),
  Call_Duration = c(20, 20, 25, 27, 40, 40, 50, 55, 20, 20),
  Call_Reason = c("Reason 1", "Reason 2", "Reason 3", "Reason 1", "Reason 3",
                  "Reason 6", "Reason 4", "Reason 1", "Reason 2", "Reason 3")
)

Running the code above produces the data frame below.
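Printed in the console, it should look something like this:

   ID Call_Duration Call_Reason
1  12            20    Reason 1
2  12            20    Reason 2
3  34            25    Reason 3
4  36            27    Reason 1
5  47            40    Reason 3
6  47            40    Reason 6
7  58            50    Reason 4
8  68            55    Reason 1
9  77            20    Reason 2
10 77            20    Reason 3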


The Dilemma
How can I easily find and count duplicate IDs so that I can ration the call duration across all the call reasons? The point of rationing is to avoid double counting when aggregating call durations, which would inflate averages, sums, or any other statistic of interest. 
The source of the duplicate IDs can be thought of like this: say there is a school, and the secretary receives inquiry phone calls from parents about specific things (the call reasons). A record of the time the call started and ended is entered, and the call duration is estimated from the start and end times. But in a single call, a parent can inquire about several things. Each of these reasons is recorded on its own row, with the only difference between the rows being the reason for the call. 
If we want to estimate how long it takes, on average, for the secretary to answer inquiries on a specific reason, we need to find all call durations for that reason and take the average. 
Using the data above and ID 47: the parent with ID 47 called the office with Reason 3 and Reason 6 as the inquiry reasons. The total call time was 40 minutes, so we estimate that Reason 3 took 20 minutes and Reason 6 took the remaining 20 minutes.
There are several ways I thought of doing this. Some are:
  1. Create a function that takes in the data frame and the two columns of interest (ID and Call_Duration), counts how many times each ID appears, and adds that count as a new column (the same count repeated for every row of a duplicated ID). The counts are then used to divide the call duration - very long and winded
  2. Group the data by ID, summarise the count of IDs with n(), save the result in an object, and join it back to the original data to add the count. Then use the count column to divide the call duration - short, but it requires creating an object to use in the join (a sketch follows this list)
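
A rough version of option 2, assuming a throwaway helper object (here called id_counts):

library(dplyr)

## Option 2: store the counts in a helper object, then join it back
id_counts <- data %>%
  group_by(ID) %>%
  summarise(Duplicated_Count = n())

Adjusted_data_v2 <- data %>%
  left_join(id_counts, by = "ID") %>%
  mutate(Call_Duration = round(Call_Duration / Duplicated_Count))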

What I wanted, however, was to count without creating an object to store the counts. So the next idea was to create a frequency table using count(). table() is what we are probably more used to; however, its output requires further work to convert it into a tibble or data frame. 
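To make the difference concrete, a quick sketch (not part of the final pipeline):

## count() comes from dplyr and returns a data frame/tibble, ready to join
count(data, ID, name = "Duplicated_Count")

## table() gives a named contingency table; turning it into something joinable
## takes extra steps, and you still end up renaming columns afterwards
as.data.frame(table(data$ID))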
The next question was how to use count() inside the pipeline without creating a separate object. 
Thankfully, I remembered that you can reference the data coming through a pipeline using a dot (the magrittr placeholder). I tried that and it worked!
In short, I had performed a self join in R.
The resulting data frame looks like this:



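If you run the code further below, the printed result should come out roughly as:

   ID Call_Duration Call_Reason Duplicated_Count
1  12            10    Reason 1                2
2  12            10    Reason 2                2
3  34            25    Reason 3                1
4  36            27    Reason 1                1
5  47            20    Reason 3                2
6  47            20    Reason 6                2
7  58            50    Reason 4                1
8  68            55    Reason 1                1
9  77            10    Reason 2                2
10 77            10    Reason 3                2
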
And the code to produce this is:

library(dplyr)

## The dot below refers to the data coming through the pipe
Adjusted_data <- data %>%
  left_join(count(., ID, name = "Duplicated_Count")) %>% ## THIS IS THE LINE RIGHT HERE
  mutate(Call_Duration = round(Call_Duration / Duplicated_Count))
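
To tie it back to the original goal, the average duration per call reason can now be computed without double counting (a quick sketch on top of the adjusted data):

## Average time per call reason, using the rationed durations
Adjusted_data %>%
  group_by(Call_Reason) %>%
  summarise(Average_Duration = mean(Call_Duration))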

Happy Reading!
~NMN




