TidyTuesday Tour de France

The TidyTuesday project is a weekly social data project in R, where participants get the chance to analyse a new dataset every week. The task is to wrangle and explore the data with the tools that R provides.

This weeks TidyTuesday dataset is about Tour de France results. As a big Fan of the Tour, this is a good opportunity to create some interesting analysis.

Data

First load the needed libraries:

library(tidytuesdayR)
library(ggbeeswarm)
library(tidyverse)
library(lubridate)
library(janitor)
library(glue)

theme_set(theme_light())

These are ‘standard’ R libraries for wrangling and exploring data. Some not so common packages are: The tidytuesdayR package to download the data, ggbeeswarm package for later visualisation and the janitor package to clean the variable names of the dataset.

Download the raw data with the help of the tidytuesdayR package. Assign every table to a separate data frame:

tuesdata <- tidytuesdayR::tt_load(2020, week = 15)

df_stages_raw <- tuesdata$tdf_stages
df_stage_data_raw <- tuesdata$stage_data
df_winners_raw <- tuesdata$tdf_winners

Preprocessing Steps

Stages Data

Do some general preprocessing to the stage data. The data frame contains the results of the individual stages (up to 2017):

df_stage_data <- df_stage_data_raw %>% 
  mutate(
    rank = as.integer(rank),
    stage = str_remove(stage_results_id, "^stage-"),
    team = str_to_upper(team),
    team = case_when(
      team == "ASTANA PRO TEAM" ~ "ASTANA",
      team == "SKY PROCYCLING" ~ "TEAM SKY",
      team == "SAXO BANK" ~ "TEAM SAXO BANK",
      team == "CSC PROTEAM" ~ "TEAM CSC",
      team == "CAISSE D'EPARGNE - ILLES BALEARS" ~
        "CAISSE D'EPARGNE–ILLES BALEARS",
      TRUE ~ team),
    team = str_trim(team)) %>%
  mutate_at(vars(time, elapsed), parse_number) %>% 
  select(-c(stage_results_id, bib_number, points, elapsed)) %>% 
  clean_names()

df_stage_data
## # A tibble: 255,752 x 8
##    edition  year  rank  time rider                 age team  stage
##      <dbl> <dbl> <int> <dbl> <chr>               <dbl> <chr> <chr>
##  1       1  1903     1    13 Garin Maurice          32 <NA>  1    
##  2       1  1903     2    55 Pagie Émile            32 <NA>  1    
##  3       1  1903     3    59 Georget Léon           23 <NA>  1    
##  4       1  1903     4    48 Augereau Fernand       20 <NA>  1    
##  5       1  1903     5    53 Fischer Jean           36 <NA>  1    
##  6       1  1903     6    53 Kerff Marcel           37 <NA>  1    
##  7       1  1903     7    55 Catteau Aloïs          25 <NA>  1    
##  8       1  1903     8    47 Pivin Ernest           33 <NA>  1    
##  9       1  1903     9    15 Habets Léon            NA <NA>  1    
## 10       1  1903    10    26 Beaugendre François    22 <NA>  1    
## # ... with 255,742 more rows

Stages

First take a look at all the different ‘types’ of stages that are present in the data:

df_stages_raw %>% 
  distinct(Type)
## # A tibble: 18 x 1
##    Type                         
##    <chr>                        
##  1 Individual time trial        
##  2 Flat stage                   
##  3 Medium mountain stage        
##  4 High mountain stage          
##  5 Mountain time trial          
##  6 Team time trial              
##  7 Flat cobblestone stage       
##  8 Mountain stage               
##  9 Mountain Stage               
## 10 Transition stage             
## 11 Plain stage                  
## 12 Intermediate stage           
## 13 Hilly stage                  
## 14 Stage with mountain(s)       
## 15 Plain stage with cobblestones
## 16 Flat Stage                   
## 17 Half Stage                   
## 18 Stage with mountain

Assign these types in more general categories and include the results (if present) for each stage as a new list column:

df_stages <- df_stages_raw %>% 
  clean_names() %>%
  mutate(
    year = year(date),
    category = case_when(
      str_detect(type, regex("time trial", ignore_case = TRUE)) ~ "time_trial",
      str_detect(type, regex("mountain", ignore_case = TRUE)) ~ "mountain",
      str_detect(type, regex("hilly", ignore_case = TRUE)) ~ "mountain",
      str_detect(type, regex("flat", ignore_case = TRUE)) ~ "flat",
      str_detect(type, regex("transition", ignore_case = TRUE)) ~ "flat",
      str_detect(type, regex("plain", ignore_case = TRUE)) ~ "flat",
      str_detect(type, regex("intermediate", ignore_case = TRUE)) ~ "flat",
      str_detect(type, regex("half", ignore_case = TRUE)) ~ "flat")) %>% 
  nest_join(df_stage_data, by = c("stage", "year"), name = "stage_results")
  
df_stages %>% 
  distinct(type, category)
## # A tibble: 18 x 2
##    type                          category  
##    <chr>                         <chr>     
##  1 Individual time trial         time_trial
##  2 Flat stage                    flat      
##  3 Medium mountain stage         mountain  
##  4 High mountain stage           mountain  
##  5 Mountain time trial           time_trial
##  6 Team time trial               time_trial
##  7 Flat cobblestone stage        flat      
##  8 Mountain stage                mountain  
##  9 Mountain Stage                mountain  
## 10 Transition stage              flat      
## 11 Plain stage                   flat      
## 12 Intermediate stage            flat      
## 13 Hilly stage                   mountain  
## 14 Stage with mountain(s)        mountain  
## 15 Plain stage with cobblestones flat      
## 16 Flat Stage                    flat      
## 17 Half Stage                    flat      
## 18 Stage with mountain           mountain

Unnest the data from the previous step:

df_stages_results <- df_stages %>% 
  unnest(stage_results)

Preprocess the winner data frame, which lists the winners of the editions. Join the data from the previous step as a new list column:

df_winners <- df_winners_raw %>%
  rename(distance_tour = distance) %>% 
  mutate(
    winner_team = str_to_upper(winner_team),
    winner_team = str_remove_all(winner_team, "\\."),
    winner_team = str_trim(winner_team),
    name_winner = map_chr(
      str_split(winner_name, "\\s"), ~ str_glue("{.x[[2]]} {.x[[1]]}"))) %>% 
  nest_join(
    df_stages_results, by = c("edition", "winner_team" = "team"),
    name = "winner_team_stage_results") %>% 
  select(-c(age)) %>% 
  arrange(desc(start_date))

df_winners %>% 
  select(winner_name, edition, winner_team, start_date, winner_team_stage_results) %>% 
  filter(map_lgl(winner_team_stage_results, ~ nrow(.x) == 0))
## # A tibble: 46 x 5
##    winner_name    edition winner_team           start_date winner_team_stage_re~
##    <chr>            <dbl> <chr>                 <date>     <list>               
##  1 Egan Bernal        106 TEAM INEOS            2019-07-06 <tibble [0 x 14]>    
##  2 Geraint Thomas     105 TEAM SKY              2018-07-07 <tibble [0 x 14]>    
##  3 Marco Pantani       85 MERCATONE UNO–BIANCHI 1998-07-11 <tibble [0 x 14]>    
##  4 Greg LeMond         77 Z–TOMASSO             1990-06-30 <tibble [0 x 14]>    
##  5 Greg LeMond         76 AD RENTING–W-CUP–BOT~ 1989-07-01 <tibble [0 x 14]>    
##  6 Stephen Roche       74 CARRERA JEANS–VAGABO~ 1987-07-01 <tibble [0 x 14]>    
##  7 Laurent Fignon      71 RENAULT–ELF           1984-06-29 <tibble [0 x 14]>    
##  8 Laurent Fignon      70 RENAULT–ELF           1983-07-01 <tibble [0 x 14]>    
##  9 Bernard Hinau~      69 RENAULT–ELF–GITANE    1982-07-02 <tibble [0 x 14]>    
## 10 Bernard Hinau~      68 RENAULT–ELF–GITANE    1981-06-25 <tibble [0 x 14]>    
## # ... with 36 more rows

As one can see, not every edition has stage result data available.

Visualisation

Get the data in the final form for visualisation.

Extract all the results of the overall winning team. Filter for year greater than 1995 and exclude time trials. Also discard observations, where there is no result data.

df_winner_team_results <- df_winners %>% 
  filter(map_lgl(winner_team_stage_results, ~ nrow(.x) != 0)) %>% 
  unnest(winner_team_stage_results) %>% 
  mutate(
    year = as.character(year),
    winner_team = str_trim(str_remove(winner_team, "TEAM"))) %>% 
  filter(year > 1995, category != "time_trial")

Extract all the data about the winners, where there is data present. This data frame is used in the visualisation to display the name of the winner over the plot.

df_winners_1996 <- df_winners %>% 
  mutate(year = year(start_date), year = as.character(year)) %>%
  semi_join(df_winner_team_results, by = "year")

Make the final visualisation. For every edition (where there is detailed information about the individual stage results) of the tour (since 1996), visualise the ranks of the overall winning team based on the ‘category’ of the stage:

df_winner_team_results %>% 
  ggplot(aes(x = year, y = rank, color = winner_team)) +
    geom_beeswarm() +
    geom_text(
      data = df_winners_1996,
      aes(label = winner_name), angle = 90, color = "black", y = 130) +
    facet_wrap(~ category, ncol = 1) +
    expand_limits(y = 240) +
    theme(
      legend.position = "bottom", axis.text.x = element_text(angle = 90)) +
    labs(
      title = "Stage results for riders of overall classification winning teams",
      subtitle = "By type of stage (flat / mountainous)",
      x = "Year", y = "Rank", color = "Team")

One can spot the teams that specialized to win the overall classification. These teams don’t achieve very good results on flat stages and seem to often perform well on mountainous stages. Notable examples are US Postal, Astana and Team Sky.