Vive le Tour

Recently I came across an excellent talk about the gganimate package. In order to try it out, I’ve decided to collect and analyze some data from one of my favorite sporting events: Le Tour de France.

In this post, I will describe how to get the data using the rvest package. After collecting the data, I will describe how to create animations using the gganimate package.

At first, load all important libraries. The two most important ones for this post are the rvest and the gganimate package. The first is for scraping data from the internet and the second is for generating animations.

Data Scraping

Helper function to read the results of one single stage.

This function reads the table for a specific stage number from the above mentioned URL. It uses functions from the rvest package and mainly the ideas from this vignette.

It reads the table and returns everything in one tidy data frame.

read_result <- function(base_url) {
  result_html <- read_html(base_url)
  riders <- html_nodes(result_html, "tbody tr")
  
  tibble(
    rider_name = str_trim(html_text(html_nodes(riders, ".runner a"))),
    rank = as.integer(str_trim(html_text(html_nodes(
      riders, ".is-alignCenter:nth-child(1)")))),
    team = str_trim(html_text(html_nodes(riders, ".js-extend-table , .team"))),
    time = html_text(html_nodes(riders, ".is-alignCenter:nth-child(5)")),
    bonification = html_text(html_nodes(
      riders, ".is-alignCenter:nth-child(7)")),
    bon_seconds = as.integer(str_extract(bonification, "\\d+")),
    penalty = html_text(html_nodes(riders, ".is-alignCenter:nth-child(8)")),
    pen_seconds = as.numeric(seconds(ms(str_remove(penalty, "P :")))),
    stage_position = seq_along(riders)) %>%
    replace_na(list(pen_seconds = 0, bon_seconds = 0)) %>%
    mutate(
      time_seconds = as.numeric(seconds(hms(time))) - bon_seconds + pen_seconds,
      time_loss = abs(min(time_seconds) - time_seconds)) %>%
    select(
      rider_name, rank, team, time_seconds, bon_seconds, pen_seconds, stage_position,
      time_loss)
}

Calculate key performance indicators per rider. For example the total time of each rider after every stage and also the lag time to the leader of the overall race.

calc_kpi_per_rider <- function(df_results) {
  df_results_pro <- df_results %>% 
    nest(results = -rider_name) %>% 
    mutate(
      nr_results = map_int(results, nrow), 
      dnf_tour = nr_results != max(nr_results), 
      total_bon_seconds = map(
        results, ~ cumsum(.x$bon_seconds)), 
      total_time_seconds = map(
        results, ~ cumsum(.x$time_seconds) - cumsum(.x$bon_seconds) 
        + cumsum(.x$pen_seconds)), 
      total_pen_seconds = map(
        results, ~ cumsum(.x$pen_seconds))) %>% 
    unnest(
      cols = c(
        results, total_bon_seconds, total_time_seconds, total_pen_seconds))
  
  df_results_pro <- df_results_pro %>% 
    arrange(total_time_seconds) %>% 
    nest(data = -stage_nr) %>% 
    mutate(
      total_position = map(data, ~ 1:nrow(.x)), 
      leader_time_seconds = map_dbl(data, ~ min(.x$total_time_seconds))) %>% 
    unnest(cols = data) %>% 
    mutate(time_to_leader = abs(leader_time_seconds - total_time_seconds))
}
df_results_pro %>% 
  filter(stage_nr == max(stage_nr)) %>% 
  arrange(total_time_seconds) %>% 
  mutate(
    total_hours = seconds_to_period(total_time_seconds)) %>% 
  select(rider_name, contains("total"), leader_time_seconds)
## # A tibble: 155 x 7
##    rider_name total_bon_secon~ total_time_seco~ total_pen_secon~ total_position
##    <chr>                 <dbl>            <dbl>            <dbl> <list>        
##  1 EGAN BERN~                0           293716                0 <int [155]>   
##  2 GERAINT T~                0           293721                0 <int [155]>   
##  3 STEVEN KR~                4           293737                0 <int [155]>   
##  4 EMANUEL B~                0           293766                0 <int [155]>   
##  5 JULIAN AL~               30           293776                0 <int [155]>   
##  6 NAIRO QUI~               18           293855                0 <int [155]>   
##  7 MIKEL LAN~                8           293905                0 <int [155]>   
##  8 ALEJANDRO~                6           293965                0 <int [155]>   
##  9 RIGOBERTO~                0           293965                0 <int [155]>   
## 10 WARREN BA~                0           294122                0 <int [155]>   
## # ... with 145 more rows, and 2 more variables: total_hours <Period>,
## #   leader_time_seconds <dbl>

Visualisation

Now that the data is available in a tidy data frame, make a visualisation. It would be interesting to see the development of the overall time gap to the race leader.

First make a static plot of all the stage results and the relation of the time loss per stage and the overall time to leader.

Make a subplot for every team, but display all stages in one subplot. Later we can visualise the development by using the gganimate package.

At the moment the plot is a static ggplot.

vis_stage_results_static <- function(df_results_pro) {
  df_results_pro %>% 
    ggplot(aes(x = time_loss, y = time_to_leader, color = team)) +
    geom_point() +
    facet_wrap(~ team) +
    theme(
      axis.text.x = element_text(angle = 90), text = element_text(size = 9)) +
    scale_color_discrete(guide=FALSE) +
    theme(legend.position = "bottom")
}

Adding the animations is fairly simple. Because we already have a ggplot we only have to add a few lines (see comment), to add the animation. You can see some other examples here.

vis_stage_results_dynamic <- function(df_results_pro) {
  df_results_pro %>% 
    mutate(
      time_loss = time_loss + 1, time_to_leader = time_to_leader + 1,
      team = str_wrap(team, width = 10)) %>% 
    ggplot(aes(x = time_loss, y = time_to_leader, color = team)) +
    geom_point() +
    facet_wrap(~ team) +
    theme(
      axis.text.x = element_text(angle = 90), text = element_text(size = 9)) +
    scale_color_discrete(guide=FALSE) +
    labs(
      title = 'Stage Number: {frame_time}',
      x = 'Time Loss To Stage Winner [s]', y = 'Total Time To Leader [s]') +
    # Code for animation
    transition_time(stage_nr) +
    ease_aes('linear')
}

It is striking, that from stage number 10 there are big time jumps. From this stage on, the tour started to get hilly. You can also easily identify teams that concentrate on flat stages. The whole squad starts to move to the upper right corner in these cases.

Also you can easily spot so called ‘transfer stages’. These are flat stages in between hilly stages and you can see that there is almost no additional time loss for the whole peloton.