Strava Data

I am an avid runner and cyclist. For a few years now, I have been recording almost all my activities with some kind of GPS device.

I record my runs with a Garmin device and my bike rides with a Wahoo device. Both accounts are synchronized with my Strava account, so I figured it would be nice to access all my data directly from Strava.

In the following text, I will describe the process of getting the data into R. Once available in a nice format in R, the data is stored as a pin in a private GitHub repository. By doing so, the data is easily accessible in other analyses or Shiny apps.

In this analysis, the following packages are used:

library(conflicted)
library(tidyverse)
library(lubridate)
library(drake)
library(pins)
library(rava)
library(httr)
library(fs)

conflict_prefer("filter", "dplyr")

The rava package is a package with helper functions for working with Strava data. It can be installed via the following command:

remotes::install_github("duju211/rava")

Data

To get access to your Strava data from R, you have to create a Strava API application. How to do this is documented here.

OAuth Dance from R

The Strava API requires a so-called OAuth dance. The following section describes how this can be done from within R.

Create an OAuth Strava app:

define_strava_app <- function() {
  httr::oauth_app(
    appname = "r_api",
    key = Sys.getenv("STRAVA_KEY"),
    secret = Sys.getenv("STRAVA_SECRET"))
}

You can find the values for STRAVA_KEY and STRAVA_SECRET in the Strava API settings after you have created your own personal API application. The name of the application is chosen during creation; in my case, I named it r_api.

my_app <- define_strava_app()
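
Since define_strava_app() reads both values from environment variables, an empty variable only shows up later as a failed authentication. A minimal sketch of an early check follows; check_env_vars() is a hypothetical helper (not part of httr or rava), and it assumes the variables are set, for example, in ~/.Renviron.

```r
# Hypothetical helper (not part of rava or httr): stop early when a
# required environment variable is missing or empty.
check_env_vars <- function(vars = c("STRAVA_KEY", "STRAVA_SECRET")) {
  values <- Sys.getenv(vars, unset = "")
  missing <- vars[values == ""]
  if (length(missing) > 0) {
    stop(
      "Missing environment variables: ", paste(missing, collapse = ", "),
      call. = FALSE)
  }
  invisible(TRUE)
}

# TRUE when both variables are set, FALSE otherwise
env_ok <- tryCatch(check_env_vars(), error = function(e) FALSE)
```

Calling check_env_vars() directly at the top of a script turns a missing key into an immediate, readable error.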

Define an endpoint:

define_strava_endpoint <- function() {
  httr::oauth_endpoint(
    request = NULL,
    authorize = "https://www.strava.com/oauth/authorize",
    access = "https://www.strava.com/oauth/token")
}

The authorize parameter is the URL the client is sent to for authorization, and the access parameter is the URL used to exchange the authorization code for an access token.

my_endpoint <- define_strava_endpoint()
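
To make the role of the authorize URL more concrete, here is a rough sketch of the kind of URL the browser is sent to during the OAuth dance. httr builds this internally, so the code is for illustration only; the client id and redirect URI are made-up placeholder values.

```r
library(httr)

# Assumed placeholder values: client id and redirect URI are made up
auth_url <- modify_url(
  "https://www.strava.com/oauth/authorize",
  query = list(
    client_id = "12345",
    response_type = "code",
    redirect_uri = "http://localhost:1410/",
    scope = "activity:read_all"))

auth_url
```

After the user approves the request, Strava redirects back with a code, which is then exchanged for a token at the access URL.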

The final authentication step: before the following code can complete, the user has to authorize the application in the web browser.

define_strava_sig <- function(endpoint, app) {
  httr::oauth2.0_token(
    endpoint, app,
    scope = "activity:read_all,activity:read,profile:read_all",
    type = NULL, use_oob = FALSE, as_header = FALSE,
    use_basic_auth = FALSE, cache = FALSE)
}
my_sig <- define_strava_sig(my_endpoint, my_app)

The information in my_sig can now be used to access Strava data.
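
As an alternative to passing the token as a query parameter, it can be sent in an Authorization header. The following is a minimal sketch, not the approach used in the rest of this post; strava_auth_header() is a hypothetical helper name.

```r
library(httr)

# Hypothetical helper: build an Authorization header from the token object
# instead of putting the token into the query string
strava_auth_header <- function(sig) {
  token <- sig$credentials$access_token[[1]]
  add_headers(Authorization = paste("Bearer", token))
}

# Usage (requires a valid token, so it is not run here):
# r <- GET(
#   "https://www.strava.com/api/v3/athlete",
#   strava_auth_header(my_sig))
# stop_for_status(r)
```

Keeping the token out of the URL means it is less likely to end up in logs or printed request URLs.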

Activities

We are now authenticated and can directly access Strava data. First, load an overview table of all available activities (one activity per row). Because the total number of activities is unknown in advance, the function requests page after page in a while loop and stops as soon as a page comes back empty.

read_all_activities <- function(sig) {
  activities_url <- httr::parse_url(
    "https://www.strava.com/api/v3/athlete/activities")

  act_vec <- vector(mode = "list")
  df_act <- tibble::tibble(init = "init")
  i <- 1L

  while (nrow(df_act) != 0) {
    r <- activities_url %>%
      httr::modify_url(
        query = list(
          access_token = sig$credentials$access_token[[1]],
          page = i)) %>%
      httr::GET()

    df_act <- httr::content(r, as = "text") %>%
      jsonlite::fromJSON(flatten = TRUE) %>%
      tibble::as_tibble()
    if (nrow(df_act) != 0)
      act_vec[[i]] <- df_act
    i <- i + 1L
  }

  # Combine all pages and parse the start date as a date-time
  act_vec %>%
    dplyr::bind_rows() %>%
    dplyr::mutate(start_date = lubridate::ymd_hms(start_date))
}
df_act_raw <- read_all_activities(my_sig)
df_act_raw
## # A tibble: 507 x 60
##    resource_state name  distance moving_time elapsed_time total_elevation~ type 
##             <int> <chr>    <dbl>       <int>        <int>            <dbl> <chr>
##  1              2 "Act~   55836.       10461        10960            1273  Virt~
##  2              2 "Act~   34455.        5385         5455             449  Virt~
##  3              2 "Act~   24732.        4198         4198             380  Virt~
##  4              2 "Act~   46553.        6279         6522             439  Virt~
##  5              2 "202~    5745         2215         2218             117  Run  
##  6              2 "Mem~   10482.        3987         4163             167. Run  
##  7              2 "Act~   19607.        2940         3179             336  Virt~
##  8              2 "Act~   88553.       10738        11308             514  Virt~
##  9              2 "Act~   15621.        2514         2514             288  Virt~
## 10              2 "Act~   23100.        3248         3510             332  Virt~
## # ... with 497 more rows, and 53 more variables: id <dbl>, external_id <chr>,
## #   upload_id <dbl>, start_date <dttm>, start_date_local <chr>, timezone <chr>,
## #   utc_offset <dbl>, start_latlng <list>, end_latlng <list>,
## #   location_city <lgl>, location_state <lgl>, location_country <chr>,
## #   start_latitude <dbl>, start_longitude <dbl>, achievement_count <int>,
## #   kudos_count <int>, comment_count <int>, athlete_count <int>,
## #   photo_count <int>, trainer <lgl>, commute <lgl>, manual <lgl>,
## #   private <lgl>, visibility <chr>, flagged <lgl>, gear_id <chr>,
## #   from_accepted_tag <lgl>, upload_id_str <chr>, average_speed <dbl>,
## #   max_speed <dbl>, average_cadence <dbl>, average_watts <dbl>,
## #   weighted_average_watts <int>, kilojoules <dbl>, device_watts <lgl>,
## #   has_heartrate <lgl>, average_heartrate <dbl>, max_heartrate <dbl>,
## #   heartrate_opt_out <lgl>, display_hide_heartrate_option <lgl>,
## #   max_watts <int>, elev_high <dbl>, elev_low <dbl>, pr_count <int>,
## #   total_photo_count <int>, has_kudoed <lgl>, workout_type <int>,
## #   average_temp <int>, athlete.id <int>, athlete.resource_state <int>,
## #   map.id <chr>, map.summary_polyline <chr>, map.resource_state <int>
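
The paging logic can be tried out without touching the API. The sketch below separates the loop from the request: fetch_all_pages() and mock_fetch() are hypothetical names, and the mock data stands in for the Strava response. It also caps the number of pages so that a misbehaving source cannot cause an endless loop.

```r
# Generic paging loop; fetch_page is a hypothetical function that returns
# a data frame for a given page number (zero rows once the data is exhausted)
fetch_all_pages <- function(fetch_page, max_pages = 1000L) {
  pages <- list()
  for (i in seq_len(max_pages)) {
    page <- fetch_page(i)
    if (nrow(page) == 0) break
    pages[[i]] <- page
  }
  do.call(rbind, pages)
}

# Mock data source standing in for the API: 5 rows, served 2 per page
mock_data <- data.frame(
  id = 1:5, type = c("Run", "Ride", "Run", "Ride", "Run"))

mock_fetch <- function(page, per_page = 2L) {
  start <- (page - 1L) * per_page + 1L
  if (start > nrow(mock_data)) return(mock_data[0, ])
  mock_data[start:min(start + per_page - 1L, nrow(mock_data)), ]
}

all_rows <- fetch_all_pages(mock_fetch)
nrow(all_rows)
```

With the real API, fewer requests are needed when more activities are returned per page; as far as I know, Strava's pagination accepts a per_page query parameter for this.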

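Determine which activities have already been scraped. Here, board_name is the name of the GitHub repository that holds the pins (for example "duju211/strava_act"):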

existing_activities <- function(board_name) {
  board_register_github(repo = board_name, name = "strava_act")

  df_all_pins_raw <- pin_find(board = "strava_act")

  df_all_pins <- df_all_pins_raw %>%
    distinct(name) %>%
    filter(str_detect(name, "^act")) %>%
    mutate(name = str_remove(name, "^act_")) %>%
    separate(col = name, into = c("id", "athlete.id"))

  board_disconnect("strava_act")

  return(df_all_pins)
}
df_existing_act <- existing_activities(board_name)
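
The name handling inside existing_activities() can be checked on its own. This sketch uses made-up pin names that follow the "act_{id}_{athlete.id}" pattern and passes sep = "_" explicitly, which is an addition to the original call.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Made-up pin names; "df_act" stands for a pin that is not a measurement
df_pins <- tibble(name = c("act_111_999", "act_222_999", "df_act"))

df_ids <- df_pins %>%
  distinct(name) %>%
  filter(str_detect(name, "^act")) %>%
  mutate(name = str_remove(name, "^act_")) %>%
  separate(col = name, into = c("id", "athlete.id"), sep = "_")

df_ids
```

Both columns come back as character vectors, so they may need converting before they are joined against numeric ids.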

Pin all activities to a private GitHub repository:

pin_new_activities <- function(df_act) {
  board_register_github(
    repo = "duju211/strava_act", name = "strava_act", branch = "master")

  pin(df_act, "df_act", board = "strava_act")

  board_disconnect("strava_act")
}
pin_new_activities(df_act)

Measurements

Read the ‘stream’ data from Strava. A ‘stream’ is a nested list (JSON format) with all available measurements of the corresponding activity.

To get all available variables and turn the result into a data frame, define a helper function. This function takes the id of an activity and the authentication token created earlier.

read_activity_stream <- function(id, sig) {
  act_url <- httr::parse_url(stringr::str_glue(
    "https://www.strava.com/api/v3/activities/{id}/streams"))
  access_token <- sig$credentials$access_token[[1]]

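  # Build the key list without line breaks; a string literal spanning two
  # lines would put a newline into the query parameter
  keys <- paste(
    c("distance", "time", "latlng", "altitude", "velocity_smooth",
      "heartrate", "cadence", "watts", "temp", "moving", "grade_smooth"),
    collapse = ",")

  r <- httr::modify_url(
    act_url,
    query = list(access_token = access_token, keys = keys)) %>%
    httr::GET()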

  jsonlite::fromJSON(httr::content(r, as = "text"), flatten = TRUE) %>%
    tibble::as_tibble()
}

Extract the measurements for each new activity and pin the result to the repository:

pin_new_rides <- function(df_act, df_existing_act, my_sig, board_name) {
  df_act_new <- df_act %>%
    anti_join(df_existing_act, by = c("id", "athlete.id"))

  df_meas <- df_act_new %>%
    transmute(
      id, `athlete.id`,
      stream = map(id, ~ read_activity_stream(id = .x, sig = my_sig))) %>%
    tidy_streams() %>%
    unnest(stream) %>%
    select(id, `athlete.id`, where(is_list)) %>%
    unnest(where(purrr::is_list))

  df_meas_nested <- df_meas %>%
    nest(meas = -c(id, `athlete.id`))

  board_register_github(repo = board_name, name = board_name, branch = "master")

  pwalk(
    list(
      m = df_meas_nested$meas, id = df_meas_nested$id,
      a_id = df_meas_nested$`athlete.id`),
    function(m, id, a_id)
      pin(m, str_glue("act_{id}_{a_id}"), board = board_name))

  board_disconnect(board_name)
}
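
The nesting step and the pin naming can be illustrated with a tiny example. The measurements below are made up; the sketch only shows how one row per activity is produced and how the pin names are derived from the ids.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up measurements for two activities of one athlete
df_meas_demo <- tibble(
  id = c("111", "111", "222"),
  athlete.id = "999",
  time = c(0L, 1L, 0L),
  distance = c(0, 3.2, 0))

# One row per activity; all measurement columns end up in the list column meas
df_nested_demo <- df_meas_demo %>%
  nest(meas = -c(id, athlete.id)) %>%
  mutate(pin_name = as.character(str_glue("act_{id}_{athlete.id}")))

df_nested_demo
```

Each element of meas is a small data frame that can be pinned under its own name, which is what makes it possible to add new activities later without rewriting the old ones.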

Download all measurement data from the repository and combine it into one big data frame:

read_meas_nested <- function(df_act_ids) {
  board_register_github(repo = "duju211/strava_act", branch = "master")

  df_act_meas_nested <- df_act_ids %>%
    anti_join(readd(df_act_meas_nested), by = c("id", "athlete.id")) %>%
    mutate(
      meas = map2(
        id, athlete.id,
        ~ pin_get(str_glue("act_{.x}_{.y}"), board = "github"))) %>%
    bind_rows(readd(df_act_meas_nested))

  board_disconnect("github")

  return(df_act_meas_nested)
}
df_act_meas_nested <- read_meas_nested(df_act_ids)
df_act_meas <- unnest(df_act_meas_nested, meas)
df_act_meas
## # A tibble: 1,788,705 x 12
##    id    watts moving velocity_smooth grade_smooth cadence distance altitude
##    <chr> <int> <lgl>            <dbl>        <dbl>   <int>    <dbl>    <dbl>
##  1 4593~    37 FALSE              0           -3.5      21      2.9      1.8
##  2 4593~    37 TRUE               1.5         -2.4      26      4.4      1.6
##  3 4593~    79 TRUE               1.7         -1.7      31      6.3      1.6
##  4 4593~    79 TRUE               1.9          0        39      8.6      1.6
##  5 4593~    98 TRUE               2.1          0        46     11.3      1.6
##  6 4593~   146 TRUE               2.3          0        53     14.4      1.6
##  7 4593~   146 TRUE               2.8          0        60     18.2      1.6
##  8 4593~   146 TRUE               3.2         -1.1      60     22.5      1.6
##  9 4593~   147 TRUE               3.7         -1.1      60     27.1      1.6
## 10 4593~   144 TRUE               4.1         -1        61     31.8      1.4
## # ... with 1,788,695 more rows, and 4 more variables: heartrate <int>,
## #   time <int>, lat <dbl>, lng <dbl>
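
The idea behind read_meas_nested() is to download only what is not yet in the cache. Here is a minimal sketch of that pattern with made-up data; load_meas() is a hypothetical stand-in for pin_get(), and df_cached plays the role of the result of the previous run.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for pin_get(): returns a tiny measurement table
load_meas <- function(id) tibble(time = 0:1, id_loaded = id)

# Activities already downloaded in an earlier run (made-up data)
df_cached <- tibble(
  id = "111", athlete.id = "999", meas = list(load_meas("111")))

# All activity ids that should be available after the update
df_ids <- tibble(id = c("111", "222"), athlete.id = "999")

df_updated <- df_ids %>%
  anti_join(df_cached, by = c("id", "athlete.id")) %>%  # keep only new ids
  mutate(meas = map(id, load_meas)) %>%                 # load only those
  bind_rows(df_cached)                                  # append the cache

df_updated
```

Only activity "222" is loaded here; the measurements for "111" are taken from the cache unchanged.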

Visualisation

Visualize the final data. Every facet is an activity, and the color represents the activity type.

vis_act_meas <- function(df_act, df_act_meas) {
  df_vis <- df_act_meas %>%
    left_join(
      select(df_act, id, `athlete.id`, type), by = c("id", "athlete.id"))

  df_vis %>%
    ggplot(aes(x = lng, y = lat, color = type)) +
    geom_path() +
    facet_wrap(~ id, scales = "free") +
    theme(
      axis.line = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      legend.position = "bottom",
      panel.background = element_blank(),
      panel.border = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      plot.background = element_blank(),
      strip.text = element_blank()) +
    labs(color = "Type of Activity")
}
vis_act_meas(df_act, df_act_meas)
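
For experimenting with the plot without the full data set, here is a small sketch on made-up coordinates. It swaps the long list of element_blank() calls for theme_void(), which is a simplification and may not match the original theme in every detail.

```r
library(ggplot2)

# Made-up coordinates for two short activities
df_demo <- data.frame(
  id = rep(c("111", "222"), each = 3),
  type = rep(c("Run", "Ride"), each = 3),
  lng = c(9.00, 9.01, 9.02, 8.90, 8.92, 8.95),
  lat = c(48.20, 48.21, 48.20, 48.30, 48.31, 48.33))

p <- ggplot(df_demo, aes(x = lng, y = lat, color = type)) +
  geom_path() +
  facet_wrap(~ id, scales = "free") +
  theme_void() +
  theme(legend.position = "bottom", strip.text = element_blank()) +
  labs(color = "Type of Activity")

p
```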