Strava Data

I am an avid runner and cyclist. For a few years now, I have been recording almost all my activities with some kind of GPS device.

I record my runs with a Garmin device and my bike rides with a Wahoo device. Both accounts get synchronized with my Strava account. I figured that it would be nice to directly access my data from my Strava account.

In the following text, I describe the process of getting the data into R.

Data

Load the necessary libraries and user-defined functions:

library(tidyverse)
library(lubridate)
library(patchwork)
library(jsonlite)
library(getPass)
library(httpuv)
library(drake)
library(here)
library(pins)
library(httr)
theme_set(theme_light())

source(here::here("content", "post", "2019-09-20-scrape-strava_functions.R"))
rerun <- FALSE

Create an OAuth app for Strava:

define_strava_app <- function() {
  oauth_app(
    appname = "r_api",
    key = Sys.getenv("STRAVA_KEY"),
    secret = Sys.getenv("STRAVA_SECRET"))
}
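
Because define_strava_app() reads both credentials from environment variables, an unset variable becomes an empty string and only shows up later as a failed authorisation. A minimal sketch of an early check follows. check_strava_env() is a hypothetical helper, not part of the original functions file, and it assumes the variables are set in ~/.Renviron or with Sys.setenv().

```r
# Hypothetical helper: stop early if a credential variable is unset.
check_strava_env <- function(vars = c("STRAVA_KEY", "STRAVA_SECRET")) {
  # Sys.getenv() returns "" for every variable that is not set.
  values <- Sys.getenv(vars, unset = "")
  missing_vars <- vars[values == ""]
  if (length(missing_vars) > 0) {
    stop("Unset environment variables: ",
         paste(missing_vars, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}
```

Calling check_strava_env() at the top of the script stops with a readable message if either variable is missing.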

You can find the values for STRAVA_KEY and STRAVA_SECRET in the API settings of your Strava account (marked yellow):

Define an endpoint:

define_strava_endpoint <- function() {
  oauth_endpoint(
    request = NULL,
    authorize = "https://www.strava.com/oauth/authorize",
    access = "https://www.strava.com/oauth/token")
}

The final authentication step: before you can execute the following steps, you have to authorize this “application” in the web browser.

define_strava_sig <- function(endpoint, app) {
  oauth2.0_token(
    endpoint, app, 
    scope = "activity:read_all,activity:read,profile:read_all",  
    type = NULL, use_oob = FALSE, as_header = FALSE,
    use_basic_auth = FALSE, cache = FALSE)
}

Now we can load all available activities. Because the total number of activities is not known in advance, the function requests one page after another in a while loop and stops as soon as a page comes back empty.

read_all_activities <- function(sig) {
  activities_url <- parse_url(
    "https://www.strava.com/api/v3/athlete/activities")
  
  act_vec <- vector(mode = "list")
  df_act <- tibble(init = "init")
  i <- 1L
  
  while (nrow(df_act) != 0) {
    r <- activities_url %>% 
      modify_url(
        query = list(
          access_token = sig$credentials$access_token[[1]], 
          page = i)) %>% 
      GET()
    
    df_act <- content(r, as = "text") %>% 
      fromJSON(flatten = TRUE) %>% 
      as_tibble()
    if (nrow(df_act) != 0)
      act_vec[[i]] <- df_act
    i <- i + 1L
  }
  
  # Combine all pages and parse the start date; return the result visibly.
  act_vec %>% 
    bind_rows() %>% 
    mutate(start_date = ymd_hms(start_date))
}
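
The stopping rule is easier to see without the network in the way. The sketch below applies the same idea to a stand-in fetch_page() function that serves three fake pages and then an empty data frame. fetch_page(), read_all_pages(), the fake ids and the page size are all made up for illustration and say nothing about the Strava API itself.

```r
# Stand-in for the API call: pages 1 to 3 hold two activities each,
# every later page is empty (hypothetical data, no network access).
fetch_page <- function(page) {
  if (page > 3) {
    return(data.frame(id = integer(0)))
  }
  data.frame(id = (page - 1L) * 2L + 1:2)
}

# Request page after page until an empty page comes back.
read_all_pages <- function(fetch) {
  pages <- list()
  i <- 1L
  repeat {
    df_page <- fetch(i)
    # An empty page means there is nothing left to read.
    if (nrow(df_page) == 0) break
    pages[[i]] <- df_page
    i <- i + 1L
  }
  do.call(rbind, pages)
}

df_all <- read_all_pages(fetch_page)
nrow(df_all)  # 6 rows, ids 1 to 6
```

The loop always makes one request more than there are filled pages, because the empty page is the only signal that the end has been reached.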

Read the ‘stream’ data from Strava. A ‘stream’ is a nested list (JSON format) with all available information about the corresponding activity.

To get all available variables and turn the result into a data frame, define a helper function. This function takes the id of an activity and the authentication token that we created earlier.

read_activity_stream <- function(id, sig) {
  act_url <- parse_url(str_glue(
    "https://www.strava.com/api/v3/activities/{id}/streams"))
  access_token <- sig$credentials$access_token[[1]]

  r <- modify_url(
    act_url,
    query = list(
      access_token = access_token,
      # Build the key list without embedded line breaks or spaces.
      keys = paste(
        "distance", "time", "latlng", "altitude", "velocity_smooth",
        "heartrate", "cadence", "watts", "temp", "moving", "grade_smooth",
        sep = ","))) %>%
    GET()

  fromJSON(content(r, as = "text"), flatten = TRUE) %>%
    as_tibble()
}
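
To get a feeling for what read_activity_stream() hands back, it helps to parse a tiny hand-written response. The JSON below is a made-up example that mimics the fields used later in this post (type, data, series_type, original_size, resolution). The values are invented, and a real response may look different in detail.

```r
library(jsonlite)

# Made-up stream response with two stream types (illustrative values only).
json <- '[
  {"type": "time", "data": [0, 1, 2],
   "series_type": "distance", "original_size": 3, "resolution": "high"},
  {"type": "latlng", "data": [[48.1, 9.0], [48.2, 9.1], [48.3, 9.2]],
   "series_type": "distance", "original_size": 3, "resolution": "high"}
]'

streams <- fromJSON(json, flatten = TRUE)

# One row per stream type; the measurements sit in the list column `data`.
streams$type
# The latlng entry is a matrix with one row per GPS point.
dim(streams$data[[2]])
```

This is why the later steps first spread the stream types into columns and then treat latlng separately from the other measurements.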

For every activity, extract the corresponding content from the stream.

The activities aren’t in a tidy format yet. Turn the latlng column into two separate columns, so that the following unnesting step runs without problems.

tidy_streams <- function(df_activities_streams_raw) {
  df_activities_streams <- df_activities_streams_raw %>%
    mutate(
      stream = map(
        stream, ~ pivot_wider(.x, names_from = type, values_from = data)),
      contains_latlng = map_lgl(
        stream, ~ any(str_detect(colnames(.x), "latlng"))),
      stream = map_if(
        .x = stream, .p = contains_latlng,
        .f = ~ select(mutate(
          .x, lat = map(latlng, ~ .x[, 1]), lng = map(latlng, ~ .x[, 2])),
          -latlng),
        .else = ~ .x))
}
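
The latlng step is the least obvious part of the function. Each latlng entry is a two-column matrix (latitude first, longitude second), and it is split into two plain vectors. A base R sketch on three example points shows the idea. It illustrates the column split only and is not a replacement for tidy_streams().

```r
# Three example points: one row per GPS point, columns are lat and lng.
latlng <- matrix(
  c(48.21704, 9.028574,
    48.21711, 9.028566,
    48.21717, 9.028557),
  ncol = 2, byrow = TRUE)

# Split the matrix into two vectors of equal length.
lat <- latlng[, 1]
lng <- latlng[, 2]

# Both vectors now line up with the other per-point measurements.
data.frame(lat = lat, lng = lng)
```

Once lat and lng are ordinary vectors, they have the same length as the other streams of the activity and can be unnested together with them.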

Unnest the data so that each observation is one row. This long format is ideal for later visualisations with ggplot2:

extract_meas <- function(df_activities_streams) {
  df_meas <- df_activities_streams %>%
    select(id, stream, start_date, type) %>%
    unnest(stream) %>%
    mutate(time = map2(start_date, time, ~ .x + dseconds(.y))) %>%
    select(-c(series_type, original_size, resolution, start_date)) %>%
    unnest(cols = -c(id))
}
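
The time stream holds offsets in seconds from the start of the activity, which is why extract_meas() adds them to start_date. The same arithmetic in base R looks like the sketch below. The start time and the offsets are invented, and the post itself uses dseconds() from lubridate rather than plain numbers.

```r
# Invented start time; UTC is chosen only to keep the example reproducible.
start_date <- as.POSIXct("2020-06-13 08:00:00", tz = "UTC")
offsets <- c(0, 1, 2, 5)  # seconds since the start of the activity

# Adding a number to a POSIXct value shifts it by that many seconds.
timestamps <- start_date + offsets
format(timestamps, "%H:%M:%S")
# "08:00:00" "08:00:01" "08:00:02" "08:00:05"
```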

Look at the first few rows to get an impression of the final form of the data:

id          altitude  velocity_smooth  cadence  grade_smooth  heartrate  distance  moving  time                 lat       lng       watts  type
3606261427  719.8     6.6              NA       -2.8          NA         37421.4   TRUE    2020-06-13 08:45:59  48.21704  9.028574  NA     Ride
3606261427  720.0     6.3              NA       -2.8          NA         37414.1   TRUE    2020-06-13 08:45:58  48.21711  9.028566  NA     Ride
3606261427  720.2     6.1              NA       -2.9          NA         37406.9   TRUE    2020-06-13 08:45:57  48.21717  9.028557  NA     Ride
3606261427  720.4     6.1              NA       -3.1          NA         37400.3   TRUE    2020-06-13 08:45:56  48.21723  9.028540  NA     Ride
3606261427  720.6     6.2              NA       -4.1          NA         37394.2   TRUE    2020-06-13 08:45:55  48.21728  9.028529  NA     Ride
3606261427  720.8     6.4              NA       -4.2          NA         37388.4   TRUE    2020-06-13 08:45:54  48.21733  9.028552  NA     Ride

Execution Plan

All the preceding steps can be combined into one execution plan with the drake package. Define the individual steps as targets; drake derives the dependencies between them from the target names used in each command:

strava_plan <- drake_plan(
  # strava_global_vars
  special_rides_ids = c(
    2519983912, 2525074116, 2525072832, 2525681356, 2547075904),
  # define strava app
  my_app = define_strava_app(),
  # strava_endpoint 
  my_endpoint = define_strava_endpoint(),
  # direct_user_to_authentication_page
  sig = target(
    define_strava_sig(my_endpoint, my_app),
    trigger = trigger(condition = TRUE)),
  df_meas_old = target(
    pin_get("strava_meas", board = "github"),
    trigger = trigger(condition = TRUE)),
  df_activities = target(
    read_all_activities(sig), trigger = trigger(condition = TRUE)),
  df_activities_new = filter(df_activities, !(id %in% df_meas_old$id)),
  df_activities_new_stream_raw = 
    mutate(
      df_activities_new, 
      stream = map(id, ~ read_activity_stream(.x, sig))),
  df_activities_new_stream = tidy_streams(df_activities_new_stream_raw),
  df_meas_new_raw = extract_meas(df_activities_new_stream),
  df_meas_new = filter_home_zone(df_meas_new_raw),
  df_meas = arrange(bind_rows(df_meas_old, df_meas_new), desc(time)),
  df_meas_public = filter(df_meas, id %in% special_rides_ids),
  gg_activities = plot_activities(df_meas),
  gg_activities_sample = plot_activities(
    df_meas, pos_legend = "none"),
  out = ggsave(
    filename = here::here(
      "static", "post",
      "simple_example_vis_plot-1.png"), 
    plot = gg_activities_sample))

Putting the individual steps into a drake plan makes it much cheaper to change a single step. If, for example, I changed the axis text of the final plot, the make function of the drake package would recognize this. Only the plot would be rebuilt, and the results of the preceding steps would be left unchanged.
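
A toy plan makes this behaviour easy to try out. The sketch below is independent of the Strava plan: the target names raw and total are invented, and it assumes that drake is installed and that the working directory is writable, since make() stores its cache in a .drake folder there.

```r
library(drake)

# Toy plan: `total` depends on `raw`, so drake builds `raw` first.
toy_plan <- drake_plan(
  raw = c(1, 2, 3),
  total = sum(raw))

make(toy_plan)   # builds both targets on the first run
make(toy_plan)   # nothing has changed, so nothing is rebuilt
readd(total)     # reads the stored result from the cache
```

For the plan in this post, the corresponding call would be make(strava_plan). Targets with trigger(condition = TRUE), such as sig and df_activities, are rebuilt on every run regardless of whether anything changed.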

Visualisation

I close this post with a visualisation of all the scraped activities. The colour of each path indicates the type of the activity.

plot_activities <- function(
    df_meas, number_obs = NULL, pos_legend = "bottom") {
  theme_set(theme_light())
  if (is.null(number_obs)) {
    include_meas_id <- unique(df_meas$id)
  } else {
    include_meas_id <- sample(unique(df_meas$id), size = number_obs)
  }
  
  df_meas %>%
    filter(id %in% include_meas_id) %>% 
    ggplot(aes(x = lng, y = lat, color = type)) +
    geom_path() +
    facet_wrap(~ id, scales = "free") +
    theme(
      axis.line = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      legend.position = pos_legend,
      panel.background = element_blank(),
      panel.border = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      plot.background = element_blank(),
      strip.text = element_blank()) +
    labs(color = "Type of Activity")
}