Data frames in R

rstudio

rstats

Quick bits for manipulating data frames in R.

Author

Siobhon Egan

Published

August 2, 2022

Warning

I will try to keep this page very simple: a place to house some quick examples of data manipulations I reach for often. A work in progress.

Some quick bits of code that I reach for often to manipulate data.frames in R.

I ❤️ tidyverse so let’s load it first up!

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::%+%()  masks crayon::%+%()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Matching

Create two data frames, then match the row order of one to the other based on a column they have in common.

# data frame 1
producers <- data.frame(
  surname =  c("Tarantino", "Scorsese", "Spielberg", "Hitchcock", "Polanski"),
  nationality = c("US", "US", "US", "UK", "Poland"),
  stringsAsFactors = FALSE
)

# data frame 2

movies <- data.frame(
  surname = c("Spielberg",
              "Scorsese",
              "Hitchcock",
              "Tarantino",
              "Polanski"),
  title = c("Super 8",
            "Taxi Driver",
            "Psycho",
            "Reservoir Dogs",
            "Chinatown"),
  stringsAsFactors = FALSE
)

Then we reorder the rows of movies to follow the order of producers, using the surname column.

idx <- match(producers$surname, movies$surname)
movies_matched <- movies[idx, ]

Note

Formula:

idx <- match(df1$colInCommon, df2$colInCommon)
matched_df <- df2[idx, ]
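To see what match() is doing here, a minimal sketch on toy data may help. The data frames `df1` and `df2` and their columns are made up for illustration and are not part of the example above.

```r
# Hypothetical toy data: same keys, different row order
df1 <- data.frame(id = c("a", "b", "c"), x = 1:3, stringsAsFactors = FALSE)
df2 <- data.frame(id = c("c", "a", "b"), y = c(30, 10, 20), stringsAsFactors = FALSE)

# For each id in df1, find the row position of that id in df2
idx <- match(df1$id, df2$id)

# Reorder df2 so its rows line up with df1
df2_matched <- df2[idx, ]

# Sanity check: the key columns now agree row by row
stopifnot(identical(df1$id, df2_matched$id))

# A key that is missing from df2 returns NA, which would produce
# an all-NA row when used to index df2
missing_idx <- match(c("a", "z"), df2$id)
```

It is worth checking `idx` for NA values before indexing, since a key that only exists in the first data frame will silently become a row of NAs.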

Join data frames

Use full_join() to merge two data frames based on a common column.

joindf <- movies %>% full_join(producers,
          by = "surname")
DT::datatable(joindf)

Join options include:

inner_join(): includes only rows with a match in both x and y.
left_join(): includes all rows in x.
right_join(): includes all rows in y.
full_join(): includes all rows in x or y.
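The difference between the joins only shows up when the keys do not fully overlap, which is not the case for producers and movies. Here is a small sketch with made-up data frames `x` and `y` (hypothetical names and values) that share only some keys.

```r
library(dplyr)

# Hypothetical toy data: keys "b" and "c" are shared,
# "a" is only in x and "d" is only in y
x <- data.frame(key = c("a", "b", "c"), x_val = 1:3, stringsAsFactors = FALSE)
y <- data.frame(key = c("b", "c", "d"), y_val = 4:6, stringsAsFactors = FALSE)

n_inner <- nrow(inner_join(x, y, by = "key"))  # keys in both: b, c
n_left  <- nrow(left_join(x, y, by = "key"))   # every key in x: a, b, c
n_right <- nrow(right_join(x, y, by = "key"))  # every key in y: b, c, d
n_full  <- nrow(full_join(x, y, by = "key"))   # every key in either: a, b, c, d

c(inner = n_inner, left = n_left, right = n_right, full = n_full)
```

Rows without a partner in the other data frame get NA in the columns that come from that data frame.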

Rename column

Use dplyr::rename() to rename a column by its name.

df <- df %>%
  dplyr::rename(newName = oldName)

Alternatively, without dplyr, rename a column by its position.

colnames(df)[1] <- "newName"

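Say you have a character vector with one name per column, you can assign it directly. (Avoid calling it vector, which is already a base R function.)

colnames(df) <- new_names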

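Maybe the column names are contained in row 2 of the data frame. Convert the row to a character vector first so the values, not a one-row data frame, are used as names.

colnames(df) <- as.character(unlist(df[2, ]))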
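As a worked example of the row-as-header case, here is a rough sketch on a made-up data frame `raw_tbl` (a hypothetical name) where the first row is junk and the second row holds the real column names. It also shows a base R way to rename by name rather than by position.

```r
# Hypothetical toy data: row 1 is a title line, row 2 holds the real names
raw_tbl <- data.frame(
  V1 = c("report", "id", "1", "2"),
  V2 = c("v1", "score", "9", "7"),
  stringsAsFactors = FALSE
)

# Promote row 2 to column names
colnames(raw_tbl) <- as.character(unlist(raw_tbl[2, ]))

# Drop the title and header rows now that they are no longer data
raw_tbl <- raw_tbl[-(1:2), ]

# Base R rename by name instead of position
names(raw_tbl)[names(raw_tbl) == "score"] <- "total_score"
```

Note that every column is still character after this, so numeric columns would need converting, for example with as.numeric().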

Find and replace

# replace every "existing value" in column `colname` with "new value"
df["colname"][df["colname"] == "existing value"] <- "new value"
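Here is the same pattern on a tiny made-up data frame, `penguin_demo` (hypothetical name and values), alongside the `$` form, which I find a little easier to read for a single column.

```r
# Hypothetical toy data with an inconsistent spelling
penguin_demo <- data.frame(
  species = c("Adelie", "adelie", "Gentoo"),
  island = c("Dream", "Dream", "Biscoe"),
  stringsAsFactors = FALSE
)

# Bracket form, as in the one-liner above
penguin_demo["species"][penguin_demo["species"] == "adelie"] <- "Adelie"

# Equivalent `$` form for a single column
penguin_demo$species[penguin_demo$species == "Gentoo"] <- "Gentoo penguin"

penguin_demo$species
```

Both forms match exact values only, so "adelie " with a trailing space would not be replaced.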

Pivot wide and long

Load the palmerpenguins package for some fun example data.

library(palmerpenguins)
data(package = 'palmerpenguins')

df <- penguins_raw


df <- dplyr::select(
  df,
  studyName,
  `Sample Number`,
  Species,
  `Culmen Length (mm)`,
  `Flipper Length (mm)`,
  `Body Mass (g)`
)

Pivot long

df_long <- df %>%
  pivot_longer(
    cols = 4:ncol(df),  # the three measurement columns
    names_to = "measurements",
    values_to = "value")
DT::datatable(df_long)

Pivot wide

df_wide <- df_long %>%
  pivot_wider(
    names_from = "measurements",
    values_from = "value")
DT::datatable(df_wide)
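If the penguins data is too big to eyeball, a round trip on a tiny made-up table can make the reshaping easier to follow. This is only a sketch: `wide`, `long`, `back` and the column names are hypothetical.

```r
library(tidyr)

# Hypothetical toy data: one row per individual, one column per measurement
wide <- data.frame(
  id = c("p1", "p2"),
  bill_mm = c(39.1, 46.5),
  flipper_mm = c(181, 210),
  stringsAsFactors = FALSE
)

# Long: one row per individual per measurement (2 individuals x 2 measurements)
long <- pivot_longer(
  wide,
  cols = c(bill_mm, flipper_mm),
  names_to = "measurement",
  values_to = "value"
)

# Wide again: the measurement names become columns once more
back <- pivot_wider(
  long,
  names_from = "measurement",
  values_from = "value"
)
```

The round trip works cleanly here because the remaining column (`id`) uniquely identifies each row; with duplicated identifiers, pivot_wider() warns and returns list-columns instead.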

Misc.

Paste options

# `paste` separates its arguments with a space by default
df_wide$studyName_sampleNumber <- paste(df_wide$studyName, "-", df_wide$`Sample Number`)

# `paste0` joins its arguments with no separator
df_wide$studyName_sampleNumber2 <- paste0(df_wide$studyName, "-", df_wide$`Sample Number`)

DT::datatable(df_wide)
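A quick side-by-side on two short made-up vectors (`study` and `sample_no` are hypothetical) shows the difference, plus the `sep` argument, which lets paste() behave like paste0().

```r
# Hypothetical toy vectors
study <- c("PAL0708", "PAL0809")
sample_no <- c(1, 2)

with_spaces <- paste(study, "-", sample_no)        # "PAL0708 - 1" "PAL0809 - 2"
no_spaces   <- paste0(study, "-", sample_no)       # "PAL0708-1"   "PAL0809-2"
with_sep    <- paste(study, sample_no, sep = "-")  # same result as paste0 above
```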