R - Split a String in a Data Frame Column and Keep a Piece as a New Variable
I've been having trouble figuring out where to begin with this data blog, so I think I'll start with something pretty simple but ultimately very valuable - splitting a column of values in an R data frame and creating a new variable out of one piece of the split, for every row in your dataset. I use this all the time to create a new variable whose values are a subset of another variable. This might be a niche piece of code, but I looooooove it :)
df$variable2 <- sapply(strsplit(as.character(df$variable1), " "),"[", 1)
Let's break down the pieces of this nifty little trick, from the inside out:
as.character(df$variable1)
We want the variable that we are splitting to be a character variable, if it is not already.
Next comes strsplit(), which splits each value in a variable by a delimiter, which is great. However, say you have a variable1 with the value "Tyler is awesome". Using the strsplit function (and splitting on a space " "), you would end up with "Tyler" "is" "awesome". There's nothing wrong with this, but strsplit returns a list that holds all of the pieces for each value, so if you tried to assign the result straight into a data frame you would not get one clean value per row. And that is just for a single value. This isn't what we are trying to do here - especially if you have a large data frame with lots of different values in variable1. We do want to split the variable though, which is why this is an important piece of the trick.
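To make the shape of that result concrete, here is a minimal sketch using two made-up strings (the vector x and its values are my own assumptions for illustration, not data from this post):

```r
# Hypothetical input values, for illustration only
x <- c("Tyler is awesome", "R is fun")

# strsplit() returns a list with one character vector per input element
pieces <- strsplit(x, " ")

is.list(pieces)   # TRUE
length(pieces)    # 2, one entry per element of x
pieces[[1]]       # "Tyler" "is" "awesome"
```

Each list entry holds every piece of one original value, which is why we still need a step that pulls a single piece out of each entry.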
This is where the magic happens. The sapply() function takes a list, vector or data frame as input and, when the results can be simplified, gives its output as a vector or matrix. The apply family in general is primarily used to avoid explicit loop constructs, which in our case is quite helpful as we have many rows of data that we want to perform some sort of function on.
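The piece "[", 1 is the FUN argument for sapply (plus one extra argument that gets passed along to it) and the part where we tell R to retain just one piece of the split. "[" is R's subsetting function, so sapply applies it to each set of split pieces. The "1" tells R that we want the first piece of the split - we could change that to 2, 3 etc. depending on what we want to keep.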
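Here is a small sketch of that step on its own, assuming a hand-built list that stands in for the output of strsplit (the values are hypothetical):

```r
# Hypothetical list, shaped like the output of strsplit()
pieces <- list(c("Tyler", "is", "awesome"), c("R", "is", "fun"))

# Passing "[" as FUN with the extra argument 1 takes element 1 of each entry
first <- sapply(pieces, "[", 1)

# The same idea written with an explicit anonymous function
first_explicit <- sapply(pieces, function(p) p[1])

first                             # "Tyler" "R"
identical(first, first_explicit)  # TRUE

# Asking for a piece that does not exist gives NA rather than an error
sapply(pieces, "[", 4)            # NA NA
```

The anonymous function version is longer, but it can be easier to read if the "[" shorthand looks cryptic.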
It's probably best to see it in action though, as even some of these intricate details can get complicated for me as well.
Alright, so based on the dataset above, let's say we wanted to split the variable "NBA_Teams" and store the city that each team is from in a new column, called "Cities". Here's the code we would use to do that:
NBA$Cities <- sapply(strsplit(as.character(NBA$NBA_Teams), " "),"[", 1)
If we wanted to just keep the mascot portion of each team (let's call that new variable "Mascot"), we would simply change the "1" to a "2" at the end of the function:
NBA$Mascot <- sapply(strsplit(as.character(NBA$NBA_Teams), " "),"[", 2)
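If you want to run both lines end to end, here is a self-contained sketch. The three rows below are stand-in values I am assuming for illustration; they are not the actual dataset from this post.

```r
# Hypothetical stand-in for the NBA data frame
NBA <- data.frame(
  NBA_Teams = c("Boston Celtics", "Chicago Bulls", "Miami Heat"),
  stringsAsFactors = FALSE
)

# First piece of each split: the city
NBA$Cities <- sapply(strsplit(as.character(NBA$NBA_Teams), " "), "[", 1)

# Second piece of each split: the mascot
NBA$Mascot <- sapply(strsplit(as.character(NBA$NBA_Teams), " "), "[", 2)

NBA$Cities  # "Boston" "Chicago" "Miami"
NBA$Mascot  # "Celtics" "Bulls" "Heat"
```

One thing to watch for: this relies on every team name having exactly two words. A value like "New York Knicks" would give "New" for piece 1 and "York" for piece 2, so names like that would need extra handling.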
So again, instead of just splitting a single value into smaller chunks, we can split an entire column of values based on any delimiter that we want (in the example above we split on a space, but we could split on the letter "t" if we wanted to). No for loops necessary!
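As a quick sketch of other delimiters (the strings here are made up for illustration): strsplit treats the delimiter as a regular expression unless you pass fixed = TRUE, which matters for characters like "." that have a special meaning in a regex.

```r
# Splitting on the letter "t", as mentioned above
t_pieces <- strsplit("Boston Celtics", "t")[[1]]
t_pieces  # "Bos" "on Cel" "ics"

# Hypothetical file names; fixed = TRUE makes "." a literal dot,
# not the regex "match any character"
files <- c("report.csv", "notes.txt")
ext <- sapply(strsplit(files, ".", fixed = TRUE), "[", 2)
ext       # "csv" "txt"
```

Without fixed = TRUE, splitting on "." would match every character and leave you with nothing but empty strings.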