What makes me tick? Strava Heartrate Analysis

In this post, I want to explore the heart rate data of my Strava activities.

Heart rate can be considered a measure of the intensity of a physical activity. So in this post, I want to find out what the driving forces behind the intensity level of my activities are. For future activities, I would like to be able to determine whether an activity was more or less exhausting than comparable ones.

The following libraries are used:

library(tidymodels)
library(conflicted)
library(tidyverse)
library(lubridate)
library(janitor)
library(drake)
library(pins)
library(vip)
library(fs)

conflict_prefer("filter", "dplyr")
conflict_prefer("workflow", "workflows")
conflict_prefer("expand", "tidyr")

Data

All my Strava activities are stored in a private GitHub repository. How this can be done is described in this post.

Activities

Load all activities from the repository and filter for those with heart rate data.

act_raw <- function() {
  board_register_github(repo = "duju211/strava_act")

  df_act <- pin_get("df_act", board = "github")

  board_disconnect("github")

  df_act %>%
    filter(has_heartrate) %>%
    clean_names()
}
df_act_raw <- act_raw()

The raw data consists of 227 activities. The aim of this post is to predict the average_heartrate from the available information.
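Printing the raw tibble gives an overview of all available columns:

df_act_raw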

## # A tibble: 227 x 60
##    resource_state name  distance moving_time elapsed_time total_elevation~ type 
##             <int> <chr>    <dbl>       <int>        <int>            <dbl> <chr>
##  1              2 "Act~   46553.        6279         6522             439  Virt~
##  2              2 "202~    5745         2215         2218             117  Run  
##  3              2 "Mem~   10482.        3987         4163             167. Run  
##  4              2 "Act~   19607.        2940         3179             336  Virt~
##  5              2 "Act~   88553.       10738        11308             514  Virt~
##  6              2 "Act~   15621.        2514         2514             288  Virt~
##  7              2 "Alm~    7092.        2678         2741             136. Run  
##  8              2 "Loc~   73058.       12603        13272            1204  Ride 
##  9              2 "Bla~   50241.        7478         7733             739  Ride 
## 10              2 "Hos~   30792.        4990         5052             602  Ride 
## # ... with 217 more rows, and 53 more variables: id <chr>, external_id <chr>,
## #   upload_id <chr>, start_date <dttm>, start_date_local <chr>, timezone <chr>,
## #   utc_offset <dbl>, start_latlng <list>, end_latlng <list>,
## #   location_city <lgl>, location_state <lgl>, location_country <chr>,
## #   start_latitude <dbl>, start_longitude <dbl>, achievement_count <int>,
## #   kudos_count <int>, comment_count <int>, athlete_count <int>,
## #   photo_count <int>, trainer <lgl>, commute <lgl>, manual <lgl>,
## #   private <lgl>, visibility <chr>, flagged <lgl>, gear_id <chr>,
## #   from_accepted_tag <lgl>, upload_id_str <chr>, average_speed <dbl>,
## #   max_speed <dbl>, average_cadence <dbl>, average_watts <dbl>,
## #   weighted_average_watts <int>, kilojoules <dbl>, device_watts <lgl>,
## #   has_heartrate <lgl>, average_heartrate <dbl>, max_heartrate <dbl>,
## #   heartrate_opt_out <lgl>, display_hide_heartrate_option <chr>,
## #   max_watts <int>, elev_high <dbl>, elev_low <dbl>, pr_count <int>,
## #   total_photo_count <int>, has_kudoed <lgl>, workout_type <int>,
## #   average_temp <int>, athlete_id <chr>, athlete_resource_state <int>,
## #   map_id <chr>, map_summary_polyline <chr>, map_resource_state <int>

Preprocess the data: select only the relevant columns and convert start_date from a timestamp to a date.

pre_process_act <- function(df_act_raw) {
  df_act_raw %>%
    select(
      id, distance, moving_time, elapsed_time, total_elevation_gain, type,
      start_date, average_heartrate, elev_high, athlete_id,
      elev_low, max_speed, average_speed
    ) %>%
    mutate(start_date = as_date(start_date)) %>%
    relocate(average_heartrate, id)
}
df_act <- pre_process_act(df_act_raw)
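For a quick overview of the preprocessed data, one could take a glimpse (a small sketch using dplyr's glimpse()):

# Compact overview of the remaining columns and their types
glimpse(df_act)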

EDA Visualisation

Make a first exploratory visualisation of the numeric features of the data, distinguishing between runs and rides.

Runs

vis_eda_act_run <- function(df_act) {
  df_act %>%
    select(id, type, where(is.numeric)) %>%
    filter(type == "Run") %>%
    pivot_longer(
      cols = distance:last_col(), values_to = "value", names_to = "kpi"
    ) %>%
    ggplot(aes(x = value, y = average_heartrate)) +
    geom_point() +
    facet_wrap(~kpi, scales = "free")
}

Rides

vis_eda_act_ride <- function(df_act) {
  df_act %>%
    select(id, type, where(is.numeric)) %>%
    filter(type == "Ride") %>%
    pivot_longer(
      cols = distance:last_col(), values_to = "value", names_to = "kpi"
    ) %>%
    ggplot(aes(x = value, y = average_heartrate)) +
    geom_point() +
    facet_wrap(~kpi, scales = "free")
}
## Warning: Removed 6 rows containing missing values (geom_point).
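For reference, both functions take the preprocessed activities as input and could be called like this (a usage sketch):

# Scatter plots of every numeric feature against average_heartrate
vis_eda_act_run(df_act)
vis_eda_act_ride(df_act)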

Modeling

Data Splitting

Split the data into training and test sets:

set.seed(123)
init_split <- initial_split(df_act, strata = average_heartrate)
bpm_train <- training(init_split)
bpm_test <- testing(init_split)
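A quick sanity check of the resulting set sizes (a small sketch):

# With the default proportion, roughly 3/4 of the activities end up in the training set
nrow(bpm_train)
nrow(bpm_test)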

Split the training data further into multiple bootstrap resamples:

set.seed(234)
bpm_folds <- bootstraps(bpm_train, strata = average_heartrate)

Model Workflow

Define a preprocessing recipe for a random forest model:

rf_rec <- function(df_act_run) {
  recipe(formula = average_heartrate ~ ., data = select(df_act_run, -id)) %>%
    step_string2factor(type, athlete_id) %>%
    step_knnimpute(elev_high, elev_low)
}
rf_recipe <- rf_rec(df_act)
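To see what the recipe actually does to the data, it can be prepped and baked (a sketch; bake(new_data = NULL) returns the preprocessed training data in current recipes versions):

# Estimate the preprocessing steps and apply them to the training data
rf_recipe %>%
  prep() %>%
  bake(new_data = NULL)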

Further specify the random forest model. Define the parameters mtry and min_n as tuneable hyperparameters.

rf_spec <- function() {
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
    set_mode("regression") %>%
    set_engine("ranger")
}
rf_specification <- rf_spec()
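Printing the specification confirms that mtry and min_n are marked for tuning:

rf_specification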

Put the recipe and the model specification into one workflow:

rf_wf <- function(rf_recipe, rf_specification) {
  workflow() %>%
    add_recipe(rf_recipe) %>%
    add_model(rf_specification)
}
rf_workflow <- rf_wf(rf_recipe, rf_specification)

Tune the workflow. Use the bootstrap folds from the data splitting section.

rf_tune_wf <- function(ranger_workflow, bpm_folds) {
  set.seed(41210)
  tune_grid(ranger_workflow, resamples = bpm_folds, grid = 11)
}
rf_tune <- rf_tune_wf(rf_workflow, bpm_folds)

Determine the best parameter set:

show_best(rf_tune, metric = "rmse")
## # A tibble: 5 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     2     4 rmse    standard    7.12    24   0.252 Preprocessor1_Model10
## 2     4    10 rmse    standard    7.13    24   0.263 Preprocessor1_Model03
## 3     5    17 rmse    standard    7.16    24   0.267 Preprocessor1_Model11
## 4     2    22 rmse    standard    7.17    24   0.247 Preprocessor1_Model02
## 5     5    26 rmse    standard    7.18    24   0.270 Preprocessor1_Model07
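The effect of the two hyperparameters on the metrics can also be inspected visually (a sketch using the autoplot() method from tune):

# Marginal plots of mtry and min_n against the resampled metrics
autoplot(rf_tune)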

Fit the model one last time with the determined parameters:

rf_last_fit <- function(rf_workflow, rf_tune, init_split) {
  rf_final_wf <- finalize_workflow(
    rf_workflow, select_best(rf_tune, metric = "rmse"))

  last_fit(rf_final_wf, init_split)
}
rf_fit <- rf_last_fit(rf_workflow, rf_tune, init_split)

Evaluate the performance on the test set to see if the model is overfitting:

collect_metrics(rf_fit)
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       5.16  Preprocessor1_Model1
## 2 rsq     standard       0.432 Preprocessor1_Model1
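Besides the metrics, the test set predictions can be collected and compared with the observed values (a sketch; .pred is the standard prediction column in tidymodels):

collect_predictions(rf_fit) %>%
  ggplot(aes(x = average_heartrate, y = .pred)) +
  # Points on the dashed line are predicted perfectly
  geom_abline(linetype = "dashed") +
  geom_point(alpha = 0.6)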

Variable Importance

Fit the model one more time (with importance set to ‘permutation’) to get the variable importance scores:

vis_vip <- function(mod_spec, mod_tune, mod_rec, mod_method, bpm_train) {
  imp_spec <- mod_spec %>%
    finalize_model(select_best(mod_tune, metric = "rmse")) %>%
    set_engine(mod_method, importance = "permutation")

  workflow() %>%
    add_recipe(mod_rec) %>%
    add_model(imp_spec) %>%
    fit(bpm_train) %>%
    pull_workflow_fit() %>%
    vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))
}
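A call along these lines (an assumption based on the objects defined above, with "ranger" as the engine) produces the importance plot:

vis_vip(rf_specification, rf_tune, rf_recipe, "ranger", bpm_train)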