The TidyTuesday project is a weekly social data project in R, where participants get the chance to analyse a new dataset every week. The task is to wrangle and explore the data with the tools that R provides.
This weeks TidyTuesday dataset is about Tour de France results. As a big Fan of the Tour, this is a good opportunity to create some interesting analysis.
First load the needed libraries:
library(tidytuesdayR)
library(ggbeeswarm)
library(tidyverse)
library(lubridate)
library(janitor)
library(glue)
theme_set(theme_light())
These are ‘standard’ R libraries for wrangling and exploring data.
Some not so common packages are:
The tidytuesdayR
package to download the data,
ggbeeswarm
package for later visualisation and the janitor
package to
clean the variable names of the dataset.
Download the raw data with the help of the tidytuesdayR
package.
Assign every table to a separate data frame:
tuesdata <- tidytuesdayR::tt_load(2020, week = 15)
df_stages_raw <- tuesdata$tdf_stages
df_stage_data_raw <- tuesdata$stage_data
df_winners_raw <- tuesdata$tdf_winners
Do some general preprocessing to the stage data. The data frame contains the results of the individual stages (up to 2017):
df_stage_data <- df_stage_data_raw %>%
mutate(
rank = as.integer(rank),
stage = str_remove(stage_results_id, "^stage-"),
team = str_to_upper(team),
team = case_when(
team == "ASTANA PRO TEAM" ~ "ASTANA",
team == "SKY PROCYCLING" ~ "TEAM SKY",
team == "SAXO BANK" ~ "TEAM SAXO BANK",
team == "CSC PROTEAM" ~ "TEAM CSC",
team == "CAISSE D'EPARGNE - ILLES BALEARS" ~
"CAISSE D'EPARGNE–ILLES BALEARS",
TRUE ~ team),
team = str_trim(team)) %>%
mutate_at(vars(time, elapsed), parse_number) %>%
select(-c(stage_results_id, bib_number, points, elapsed)) %>%
clean_names()
df_stage_data
## # A tibble: 255,752 x 8
## edition year rank time rider age team stage
## <dbl> <dbl> <int> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 1903 1 13 Garin Maurice 32 <NA> 1
## 2 1 1903 2 55 Pagie Émile 32 <NA> 1
## 3 1 1903 3 59 Georget Léon 23 <NA> 1
## 4 1 1903 4 48 Augereau Fernand 20 <NA> 1
## 5 1 1903 5 53 Fischer Jean 36 <NA> 1
## 6 1 1903 6 53 Kerff Marcel 37 <NA> 1
## 7 1 1903 7 55 Catteau Aloïs 25 <NA> 1
## 8 1 1903 8 47 Pivin Ernest 33 <NA> 1
## 9 1 1903 9 15 Habets Léon NA <NA> 1
## 10 1 1903 10 26 Beaugendre François 22 <NA> 1
## # ... with 255,742 more rows
First take a look at all the different ‘types’ of stages that are present in the data:
df_stages_raw %>%
distinct(Type)
## # A tibble: 18 x 1
## Type
## <chr>
## 1 Individual time trial
## 2 Flat stage
## 3 Medium mountain stage
## 4 High mountain stage
## 5 Mountain time trial
## 6 Team time trial
## 7 Flat cobblestone stage
## 8 Mountain stage
## 9 Mountain Stage
## 10 Transition stage
## 11 Plain stage
## 12 Intermediate stage
## 13 Hilly stage
## 14 Stage with mountain(s)
## 15 Plain stage with cobblestones
## 16 Flat Stage
## 17 Half Stage
## 18 Stage with mountain
Assign these types in more general categories and include the results (if present) for each stage as a new list column:
df_stages <- df_stages_raw %>%
clean_names() %>%
mutate(
year = year(date),
category = case_when(
str_detect(type, regex("time trial", ignore_case = TRUE)) ~ "time_trial",
str_detect(type, regex("mountain", ignore_case = TRUE)) ~ "mountain",
str_detect(type, regex("hilly", ignore_case = TRUE)) ~ "mountain",
str_detect(type, regex("flat", ignore_case = TRUE)) ~ "flat",
str_detect(type, regex("transition", ignore_case = TRUE)) ~ "flat",
str_detect(type, regex("plain", ignore_case = TRUE)) ~ "flat",
str_detect(type, regex("intermediate", ignore_case = TRUE)) ~ "flat",
str_detect(type, regex("half", ignore_case = TRUE)) ~ "flat")) %>%
nest_join(df_stage_data, by = c("stage", "year"), name = "stage_results")
df_stages %>%
distinct(type, category)
## # A tibble: 18 x 2
## type category
## <chr> <chr>
## 1 Individual time trial time_trial
## 2 Flat stage flat
## 3 Medium mountain stage mountain
## 4 High mountain stage mountain
## 5 Mountain time trial time_trial
## 6 Team time trial time_trial
## 7 Flat cobblestone stage flat
## 8 Mountain stage mountain
## 9 Mountain Stage mountain
## 10 Transition stage flat
## 11 Plain stage flat
## 12 Intermediate stage flat
## 13 Hilly stage mountain
## 14 Stage with mountain(s) mountain
## 15 Plain stage with cobblestones flat
## 16 Flat Stage flat
## 17 Half Stage flat
## 18 Stage with mountain mountain
Unnest the data from the previous step:
df_stages_results <- df_stages %>%
unnest(stage_results)
Preprocess the winner data frame, which lists the winners of the editions. Join the data from the previous step as a new list column:
df_winners <- df_winners_raw %>%
rename(distance_tour = distance) %>%
mutate(
winner_team = str_to_upper(winner_team),
winner_team = str_remove_all(winner_team, "\\."),
winner_team = str_trim(winner_team),
name_winner = map_chr(
str_split(winner_name, "\\s"), ~ str_glue("{.x[[2]]} {.x[[1]]}"))) %>%
nest_join(
df_stages_results, by = c("edition", "winner_team" = "team"),
name = "winner_team_stage_results") %>%
select(-c(age)) %>%
arrange(desc(start_date))
df_winners %>%
select(winner_name, edition, winner_team, start_date, winner_team_stage_results) %>%
filter(map_lgl(winner_team_stage_results, ~ nrow(.x) == 0))
## # A tibble: 46 x 5
## winner_name edition winner_team start_date winner_team_stage_re~
## <chr> <dbl> <chr> <date> <list>
## 1 Egan Bernal 106 TEAM INEOS 2019-07-06 <tibble [0 x 14]>
## 2 Geraint Thomas 105 TEAM SKY 2018-07-07 <tibble [0 x 14]>
## 3 Marco Pantani 85 MERCATONE UNO–BIANCHI 1998-07-11 <tibble [0 x 14]>
## 4 Greg LeMond 77 Z–TOMASSO 1990-06-30 <tibble [0 x 14]>
## 5 Greg LeMond 76 AD RENTING–W-CUP–BOT~ 1989-07-01 <tibble [0 x 14]>
## 6 Stephen Roche 74 CARRERA JEANS–VAGABO~ 1987-07-01 <tibble [0 x 14]>
## 7 Laurent Fignon 71 RENAULT–ELF 1984-06-29 <tibble [0 x 14]>
## 8 Laurent Fignon 70 RENAULT–ELF 1983-07-01 <tibble [0 x 14]>
## 9 Bernard Hinau~ 69 RENAULT–ELF–GITANE 1982-07-02 <tibble [0 x 14]>
## 10 Bernard Hinau~ 68 RENAULT–ELF–GITANE 1981-06-25 <tibble [0 x 14]>
## # ... with 36 more rows
As one can see, not every edition has stage result data available.
Get the data in the final form for visualisation.
Extract all the results of the overall winning team. Filter for year greater than 1995 and exclude time trials. Also discard observations, where there is no result data.
df_winner_team_results <- df_winners %>%
filter(map_lgl(winner_team_stage_results, ~ nrow(.x) != 0)) %>%
unnest(winner_team_stage_results) %>%
mutate(
year = as.character(year),
winner_team = str_trim(str_remove(winner_team, "TEAM"))) %>%
filter(year > 1995, category != "time_trial")
Extract all the data about the winners, where there is data present. This data frame is used in the visualisation to display the name of the winner over the plot.
df_winners_1996 <- df_winners %>%
mutate(year = year(start_date), year = as.character(year)) %>%
semi_join(df_winner_team_results, by = "year")
Make the final visualisation. For every edition (where there is detailed information about the individual stage results) of the tour (since 1996), visualise the ranks of the overall winning team based on the ‘category’ of the stage:
df_winner_team_results %>%
ggplot(aes(x = year, y = rank, color = winner_team)) +
geom_beeswarm() +
geom_text(
data = df_winners_1996,
aes(label = winner_name), angle = 90, color = "black", y = 130) +
facet_wrap(~ category, ncol = 1) +
expand_limits(y = 240) +
theme(
legend.position = "bottom", axis.text.x = element_text(angle = 90)) +
labs(
title = "Stage results for riders of overall classification winning teams",
subtitle = "By type of stage (flat / mountainous)",
x = "Year", y = "Rank", color = "Team")
One can spot the teams that specialized to win the overall classification. These teams don’t achieve very good results on flat stages and seem to often perform well on mountainous stages. Notable examples are US Postal, Astana and Team Sky.