Strava Data

Article on how to effectively scrape and store Strava data

Julian During
2021-06-09

I am an avid runner and cyclist. For a few years now, I have been recording almost all my activities with some kind of GPS device.

I record my runs with a Garmin device and my bike rides with a Wahoo device. Both accounts get synchronized with my Strava account. I figured that it would be nice to directly access my data from my Strava account.

In the following text, I will describe the process of getting the data into R. Once available in a nice format in R, the data is stored as a pin in a private GitHub repository. That way, the data is easily accessible in other analyses or Shiny apps.

In this analysis, the following packages are used:

The rava package provides helper functions for working with Strava data. It can be installed with the following command:

remotes::install_github("duju211/rava")

Data

To get access to your Strava data from R, you have to create a Strava API application. How to do this is described in the Strava API documentation.

OAuth Dance from R

The Strava API requires a so-called OAuth dance. The following section describes how this can be done from within R.

Create an OAuth Strava app:

define_strava_app <- function() {
  httr::oauth_app(
    appname = "r_api",
    key = Sys.getenv("STRAVA_KEY"),
    secret = Sys.getenv("STRAVA_SECRET"))
}

You can find the values for STRAVA_KEY and STRAVA_SECRET in the Strava API settings (shown there as Client ID and Client Secret) after you have created your own personal API application. You choose the name of the application when you create it; I named mine r_api.

my_app <- define_strava_app()
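Because Sys.getenv() silently returns an empty string for an unset variable, a missing credential only shows up later as a failed request. The following is a minimal sketch of an early check; check_strava_env() is a hypothetical helper that is not part of httr or rava, and it assumes the credentials are stored as environment variables (for example in ~/.Renviron).

```r
# Hypothetical helper: stop early if a credential is missing.
check_strava_env <- function(vars = c("STRAVA_KEY", "STRAVA_SECRET")) {
  values <- Sys.getenv(vars, unset = "")
  missing <- vars[values == ""]
  if (length(missing) > 0) {
    stop(
      "Missing environment variable(s): ", paste(missing, collapse = ", "),
      call. = FALSE)
  }
  invisible(TRUE)
}
```

Calling check_strava_env() before define_strava_app() turns a confusing authentication failure into a clear error message.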

Define an endpoint:

define_strava_endpoint <- function() {
  httr::oauth_endpoint(
    request = NULL,
    authorize = "https://www.strava.com/oauth/authorize",
    access = "https://www.strava.com/oauth/token")
}

The authorize argument is the URL the client is sent to for authorization, and the access argument is the URL used to exchange the authorization code for an access token.

my_endpoint <- define_strava_endpoint()
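To make the role of the authorize URL more concrete, here is a rough sketch of the kind of link the browser is sent to during the OAuth dance. httr builds this URL internally, so build_authorize_url() is purely illustrative and hypothetical; the client id and the redirect URI below are placeholder values, not real credentials.

```r
# Hypothetical helper: compose an authorization URL by hand (base R only).
build_authorize_url <- function(client_id, redirect_uri, scope) {
  query <- c(
    client_id = client_id,
    redirect_uri = redirect_uri,
    response_type = "code",
    scope = scope)
  # percent-encode every value, including reserved characters such as ":" and "/"
  encoded <- vapply(query, utils::URLencode, character(1), reserved = TRUE)
  paste0(
    "https://www.strava.com/oauth/authorize?",
    paste(names(encoded), encoded, sep = "=", collapse = "&"))
}

# Placeholder values for illustration only
demo_url <- build_authorize_url(
  "12345", "http://localhost:1410/", "activity:read_all")
```

After the user approves the request in the browser, Strava redirects back with a code, which is then exchanged at the access URL for the actual token.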

The final authentication step: before the following steps can be executed, the user has to authorize the application in the web browser.

define_strava_sig <- function(endpoint, app) {
  httr::oauth2.0_token(
    endpoint, app,
    scope = "activity:read_all,activity:read,profile:read_all",
    type = NULL, use_oob = FALSE, as_header = FALSE,
    use_basic_auth = FALSE, cache = FALSE)
}
my_sig <- define_strava_sig(my_endpoint, my_app)

The information in my_sig can now be used to access Strava data.

Activities

We are now authenticated and can directly access Strava data. First, load an overview table of all available activities (one activity per row). Because the total number of activities is unknown, use a while loop and break out of it as soon as a page returns no more activities.

read_all_activities <- function(sig) {
  activities_url <- httr::parse_url(
    "https://www.strava.com/api/v3/athlete/activities")

  act_vec <- vector(mode = "list")
  df_act <- tibble::tibble(init = "init")
  i <- 1L

  while (nrow(df_act) != 0) {
    r <- activities_url %>%
      httr::modify_url(
        query = list(
          access_token = sig$credentials$access_token[[1]],
          page = i)) %>%
      httr::GET()

    df_act <- httr::content(r, as = "text") %>%
      jsonlite::fromJSON(flatten = TRUE) %>%
      tibble::as_tibble()
    if (nrow(df_act) != 0)
      act_vec[[i]] <- df_act
    i <- i + 1L
  }

  # combine all pages and parse the start date; returned as the function value
  act_vec %>%
    dplyr::bind_rows() %>%
    dplyr::mutate(start_date = lubridate::ymd_hms(start_date))
}
df_act_raw <- read_all_activities(my_sig)
df_act_raw
# A tibble: 555 x 60
   resource_state name               distance moving_time elapsed_time
            <int> <chr>                 <dbl>       <int>        <int>
 1              2 "BBQ \U0001f357 R~   17477.        3028         3028
 2              2 "Rain Break Run \~    8709.        3716         3969
 3              2 "Run-Before-The-E~    8446.        3765         3936
 4              2 "Saturday Breakfa~    9145.        3591         3801
 5              2 "Pidcock Trail "      8780.        3603         3667
 6              2 "Lauf am Nachmitt~    7194.        3022         3189
 7              2 "Sunny Forest Run"   11317         4559         4826
 8              2 "Bear Valley Clim~   61372.        9817        11848
 9              2 "Fahrt am Nachmit~   42834.        7072         7812
10              2 "Coffee Run"          3846.        1549         1549
# ... with 545 more rows, and 55 more variables:
#   total_elevation_gain <dbl>, type <chr>, workout_type <int>,
#   id <dbl>, external_id <chr>, upload_id <dbl>, start_date <dttm>,
#   start_date_local <chr>, timezone <chr>, utc_offset <dbl>,
#   start_latlng <list>, end_latlng <list>, location_city <lgl>,
#   location_state <lgl>, location_country <chr>,
#   start_latitude <dbl>, start_longitude <dbl>,
#   achievement_count <int>, kudos_count <int>, comment_count <int>,
#   athlete_count <int>, photo_count <int>, trainer <lgl>,
#   commute <lgl>, manual <lgl>, private <lgl>, visibility <chr>,
#   flagged <lgl>, gear_id <chr>, from_accepted_tag <lgl>,
#   upload_id_str <chr>, average_speed <dbl>, max_speed <dbl>,
#   device_watts <lgl>, has_heartrate <lgl>, heartrate_opt_out <lgl>,
#   display_hide_heartrate_option <lgl>, elev_high <dbl>,
#   elev_low <dbl>, pr_count <int>, total_photo_count <int>,
#   has_kudoed <lgl>, average_cadence <dbl>, average_heartrate <dbl>,
#   max_heartrate <dbl>, average_temp <int>, athlete.id <int>,
#   athlete.resource_state <int>, map.id <chr>,
#   map.summary_polyline <chr>, map.resource_state <int>,
#   average_watts <dbl>, weighted_average_watts <int>,
#   kilojoules <dbl>, max_watts <int>
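The paging logic can be tried without network access. The sketch below replaces the API call with a hypothetical fetch_page() function that serves five made-up activity ids in pages of two; read_all_pages() is likewise an illustrative name. It only demonstrates the stopping rule: keep requesting pages until one comes back empty.

```r
# Hypothetical stand-in for one API call: returns the activities on `page`
# as a data frame, or a zero-row data frame once the pages are exhausted.
fetch_page <- function(page, per_page = 2L) {
  all_ids <- 1:5
  start <- (page - 1L) * per_page + 1L
  if (start > length(all_ids)) return(data.frame(id = integer(0)))
  end <- min(start + per_page - 1L, length(all_ids))
  data.frame(id = all_ids[start:end])
}

# Request page after page and stop at the first empty one.
read_all_pages <- function(fetch) {
  pages <- list()
  i <- 1L
  repeat {
    df_page <- fetch(i)
    if (nrow(df_page) == 0) break
    pages[[i]] <- df_page
    i <- i + 1L
  }
  do.call(rbind, pages)
}

df_demo <- read_all_pages(fetch_page)
```

The activity list endpoint also accepts a per_page query parameter (30 by default), so requesting larger pages would reduce the number of calls.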

Determine which activities have already been scraped:

existing_activities <- function(board_name) {
  board_register_github(repo = board_name, name = "strava_act")

  df_all_pins_raw <- pin_find(board = "strava_act")

  df_all_pins <- df_all_pins_raw %>%
    distinct(name) %>%
    filter(str_detect(name, "^act")) %>%
    mutate(name = str_remove(name, "^act_")) %>%
    separate(col = name, into = c("id", "athlete.id"))

  board_disconnect("strava_act")

  return(df_all_pins)
}
df_existing_act <- existing_activities(board_name)
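The function relies on the naming scheme act_{id}_{athlete.id} for the measurement pins. A small base R sketch of the same parsing step, using made-up pin names, shows what comes out. Note that the parsed ids are character strings, so they may need converting before they are compared with numeric ids.

```r
# Made-up pin names; only those starting with "act_" hold measurements.
pin_names <- c("act_1234567890_987654", "act_2234567890_987654", "df_act")

act_names <- grep("^act_", pin_names, value = TRUE)
parts <- strsplit(sub("^act_", "", act_names), "_", fixed = TRUE)

df_existing_demo <- data.frame(
  id = vapply(parts, `[`, character(1), 1),
  athlete.id = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE)
```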

Pin all activities to a private GitHub repository:

pin_new_activities <- function(df_act) {
  board_register_github(
    repo = "duju211/strava_act", name = "strava_act", branch = "master")

  pin(df_act, "df_act", board = "strava_act")

  board_disconnect("strava_act")
}
pin_new_activities(df_act)

Measurements

Read the ‘stream’ data from Strava. A ‘stream’ is a nested list (JSON format) with all available measurements of the corresponding activity.

To get all available variables and turn the result into a data frame, define a helper function. It takes the id of an activity and the authentication token created earlier.

read_activity_stream <- function(id, sig) {
  act_url <- httr::parse_url(stringr::str_glue(
    "https://www.strava.com/api/v3/activities/{id}/streams"))
  access_token <- sig$credentials$access_token[[1]]

  r <- httr::modify_url(
    act_url,
    query = list(
      access_token = access_token,
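      # collapse the keys into one comma-separated string without whitespace
      keys = paste(
        c("distance", "time", "latlng", "altitude", "velocity_smooth",
          "heartrate", "cadence", "watts", "temp", "moving", "grade_smooth"),
        collapse = ","))) %>%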
    httr::GET()

  jsonlite::fromJSON(httr::content(r, as = "text"), flatten = TRUE) %>%
    tibble::as_tibble()
}
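Strava activity ids are large numbers, and R may print a large, round numeric value in scientific notation when it is converted to text. The sketch below shows one way to build the request URL defensively; build_stream_url() is a hypothetical helper, not part of rava, and the id used is made up.

```r
# Hypothetical helper: build the request URL for one activity's streams.
build_stream_url <- function(id, keys) {
  # format() avoids scientific notation for large numeric ids
  id_chr <- format(id, scientific = FALSE, trim = TRUE)
  sprintf(
    "https://www.strava.com/api/v3/activities/%s/streams?keys=%s",
    id_chr, paste(keys, collapse = ","))
}

# Made-up activity id
demo_stream_url <- build_stream_url(5400000000, c("time", "distance", "latlng"))
```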

Extract the measurements for each new activity and pin the result to the repository:

pin_new_rides <- function(df_act, df_existing_act, my_sig, board_name) {
  df_act_new <- df_act %>%
    anti_join(df_existing_act, by = c("id", "athlete.id"))

  df_meas <- df_act_new %>%
    transmute(
      id, `athlete.id`,
      stream = map(id, ~ read_activity_stream(id = .x, sig = my_sig))) %>%
    tidy_streams() %>%
    unnest(stream) %>%
    select(id, `athlete.id`, where(is_list)) %>%
    unnest(where(purrr::is_list))

  df_meas_nested <- df_meas %>%
    nest(meas = -c(id, `athlete.id`))

  board_register_github(repo = board_name, name = board_name, branch = "master")

  pwalk(
    list(
      m = df_meas_nested$meas, id = df_meas_nested$id,
      a_id = df_meas_nested$`athlete.id`),
    function(m, id, a_id)
      pin(m, str_glue("act_{id}_{a_id}"), board = board_name))

  board_disconnect(board_name)
}
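The key step in this function is the anti-join: only activities without an existing pin are downloaded again. The idea can be illustrated with made-up data in base R. The function above uses dplyr::anti_join() instead, so the key() helper and the demo data frames are only an illustration.

```r
# Made-up data: three activities, two of which were already scraped.
df_act_demo <- data.frame(id = c(101, 102, 103), athlete.id = c(7, 7, 7))
df_scraped_demo <- data.frame(id = c(101, 103), athlete.id = c(7, 7))

# Combine both join columns into one comparable key per row.
key <- function(df) paste(df$id, df$athlete.id, sep = "_")

# Keep only the rows whose key does not appear among the scraped activities.
df_new_demo <- df_act_demo[!key(df_act_demo) %in% key(df_scraped_demo), ]
```

Here df_new_demo contains only the activity with id 102, which is the one that would be requested from the API.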

Download all measurement data from the repository and combine it into one big data frame:

act_meas <- function(board_name, pin_result) {
  board_register_github(repo = board_name, name = board_name, branch = "master")

  df_act_meas_nested <- pin_find(board = board_name) %>%
    filter(str_detect(name, "^act")) %>%
    mutate(meas = map(name, ~ pin_get(name = .x, board = board_name)))

  df_act_meas <- df_act_meas_nested %>%
    select(name, meas) %>%
    filter(map_lgl(meas, ~ is_tibble(.x))) %>%
    unnest(meas)

  board_disconnect(board_name)

  df_act_meas
}
df_act_meas <- act_meas(board_name, pin_result)
df_act_meas %>%
  select(-name)
# A tibble: 1,960,412 x 11
   moving velocity_smooth grade_smooth distance altitude  time   lat
   <lgl>            <dbl>        <dbl>    <dbl>    <dbl> <int> <dbl>
 1 FALSE              0            0        0       746      0  48.2
 2 TRUE               1.8         -0.3     40.7     746     23  48.2
 3 TRUE               2.1         -3.2     54.7     746     26  48.2
 4 TRUE               7.2         -5.4    120.      746.    34  48.2
 5 TRUE               8.9         -6      188.      740.    41  48.2
 6 TRUE               5           -7.6    249.      735.    60  48.2
 7 TRUE               2.8         -7.1    262.      734.    68  48.2
 8 TRUE               4.5         -6.3    329.      730.    78  48.2
 9 TRUE               5.7         -5.2    359.      728.    85  48.2
10 TRUE               4.1         -1      366.      727.    87  48.2
# ... with 1,960,402 more rows, and 4 more variables: lng <dbl>,
#   heartrate <int>, cadence <int>, watts <int>

Visualisation

Visualize the final data. Every facet shows the path of one activity.

df_act_meas %>%
  filter(!is.na(lat)) %>%
  ggplot(aes(x = lng, y = lat)) +
    geom_path() +
    facet_wrap(~ name, scales = "free") +
    theme(
      axis.line = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      legend.position = "bottom",
      panel.background = element_blank(),
      panel.border = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      plot.background = element_blank(),
      strip.text = element_blank())

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/duju211/pin_strava, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".