Hanukkah of Data Revisited (2023 / Speedrun)

Revisiting a fun data challenge from last year

Julian During www.datannery.com
2023-12-31

Title photo from Diana Polekhina on Unsplash.

Challenge

‘Hanukkah of Data’ is a data challenge in which you have to solve 8 puzzles surrounding a fictional data set. I already participated in last year’s challenge, but it was a lot of fun to revisit the old puzzles and rework some of my solutions.

The ‘speed run’ version in particular had some twists in it, so I will focus on these puzzles in this blog post. The puzzles are mostly the same as in the normal version, but the data contains some more difficult edge cases.

To solve the puzzles, we use a handful of R libraries. I already used the tidyverse in last year’s challenge. This year I am also using assertr to detect problems with my solutions as early as possible.
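
A minimal setup sketch (the tidyverse and assertr are named in the post; janitor is my assumption, inferred from the clean_names() call further down):

library(tidyverse) # dplyr, tidyr, purrr, stringr, readr, ...
library(lubridate) # floor_date(), hour(), year(); core tidyverse since 2.0.0
library(assertr)   # verify(): fail fast when an assumption about the data breaks
library(janitor)   # assumed: provides clean_names(), used in goat_years() below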

Behind the scenes, the whole analysis is created as a targets pipeline. So if you want to reproduce the analysis, you have to perform the following steps:
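
The usual sequence for a targets pipeline looks like this (a sketch; that the repository uses renv to pin package versions is my assumption):

# clone https://github.com/duju211/hanukkah_of_data, then from the project root:
renv::restore()     # assumed: restore the package versions from renv.lock
targets::tar_make() # run the whole pipeline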

Day 1

To find the rug, we need to contact a private investigator. The investigator’s last name can be spelled out with the letters printed on the phone buttons. For example: 2 has “ABC”, 3 has “DEF”, etc.

The keypad can be represented in R like this:

phone_letter <- function() {
  tibble(letter = letters) |>
    mutate(
      nr = as.character(case_when(
        letter %in% c("a", "b", "c") ~ 2,
        letter %in% c("d", "e", "f") ~ 3,
        letter %in% c("g", "h", "i") ~ 4,
        letter %in% c("j", "k", "l") ~ 5,
        letter %in% c("m", "n", "o") ~ 6,
        letter %in% c("p", "q", "r", "s") ~ 7,
        letter %in% c("t", "u", "v") ~ 8,
        letter %in% c("w", "x", "y", "z") ~ 9,
        TRUE ~ NA_real_)))
}
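
A quick sanity check of the mapping with a made-up name fragment: ‘sam’ should dial 726.

tibble(letter = c("s", "a", "m")) |>
  left_join(phone_letter(), by = "letter") |>
  pull(nr) |>
  str_flatten()
#> [1] "726"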

We then need to find the last name of each person. This can be a little bit tricky: names can also include suffixes like ‘Jr.’ or Roman numerals like ‘III’. Therefore we use a regex that keeps only name parts consisting of an upper case letter followed by one or more lower case letters.
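
A small illustration with a hypothetical name: ‘Jr.’ fails the regex because of the trailing period, so the last remaining part is the actual last name.

str_split("Sam Tanner Jr.", "\\s+")[[1]] |>
  str_subset("^[A-Z][a-z]+$") |>
  last()
#> [1] "Tanner"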

After that we transform the data so that every letter is one row:

last_names <- function(df_customers) {
  df_customers |>
    transmute(
      customerid,
      name_split = str_split(name, "\\s+"),
      name_split = map(
        name_split,
        ~ str_subset(.x, regex("^[A-Z][a-z]+$"))),
      last_name = map_chr(name_split, last),
      letter = str_split(last_name, ""),
      phone_chr = str_remove_all(phone, "-")) |>
    unnest(letter) |>
    mutate(letter = str_to_lower(letter))
}

By combining both data sources, we can answer the question:

investigator <- function(df_last_names, df_phone_letter) {
  df_last_names |>
    left_join(df_phone_letter, by = "letter") |>
    group_by(customerid, phone_chr) |>
    # spell each last name in keypad digits and compare with the phone number
    summarise(phone_pro = str_flatten(nr), .groups = "drop") |>
    filter(phone_chr == phone_pro) |>
    verify(length(customerid) == 1)
}
customerid  name              phone
4249        Tracy Rosenkranz  767-365-7269

Day 2

Now we are looking for the contractor to whom the cleaning company gave the rug.

Look for customers with the initials we are searching for (‘ds’ for the speed run):

initials <- function(df_customers, searched_initials) {
  df_customers |>
    mutate(
      name_split = str_split(name, "\\s+"),
      name_split = map(
        name_split, ~ str_subset(.x, regex("^[a-z]+$", ignore_case = TRUE))),
      first_name = map_chr(name_split, first),
      last_name = map_chr(name_split, last)) |>
    transmute(
      customerid,
      initials = str_to_lower(str_c(
        str_sub(first_name, end = 1), str_sub(last_name, end = 1)))) |>
    filter(initials == searched_initials)
}
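
With a hypothetical customer, the suffix is again dropped before the initials are built:

initials(
  tibble(customerid = 1, name = "David Swanson Jr."),
  searched_initials = "ds")
# returns one row: customerid 1, initials "ds"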

We then look for products whose description contains ‘coffee’ or ‘bagel’ and join the order items to the filtered products:

coffee_bagels <- function(df_products, df_order_items) {
  df_coffee <- df_products |>
    filter(str_detect(desc, regex("coffee", ignore_case = TRUE)))
  
  df_bagel <- df_products |>
    filter(str_detect(desc, regex("bagel", ignore_case = TRUE)))
  
  bind_rows(
    list(bagel = df_bagel, coffee = df_coffee), .id = "coffee_bagel") |>
    left_join(df_order_items, by = "sku")
}
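
The .id argument records which of the two product sets a row came from; a minimal illustration:

bind_rows(
  list(bagel = tibble(sku = "B1"), coffee = tibble(sku = "C1")),
  .id = "coffee_bagel")
# two rows: (bagel, B1) and (coffee, C1)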

Look for 2017 orders in which coffee or bagels were bought. Keep only customers who bought both on the same day at least once and who have the above-mentioned initials:

order_contractor <- function(df_orders, df_coffee_bagels, df_initials) {
  df_orders |>
    filter(year(ordered) == 2017) |>
    inner_join(df_coffee_bagels, by = "orderid") |>
    group_by(customerid, day = floor_date(ordered, unit = "day")) |>
    summarise(
      coffee = any(coffee_bagel == "coffee"),
      bagel = any(coffee_bagel == "bagel"), .groups = "drop_last") |>
    summarise(coffee_and_bagel = any(coffee & bagel)) |>
    filter(coffee_and_bagel) |>
    semi_join(df_initials, by = "customerid") |>
    verify(length(customerid) == 1)
}
customerid  name               phone
5745        David Swanson Jr.  838-351-0370

Day 3

Search for the neighbor with the spider hat. From the clues, the filtering conditions are the following:

- born in a Year of the Goat,
- born between September 23 and October 23 (a Libra),
- living in the same neighborhood as the contractor from Day 2.

These are the specific filter conditions for the speed run; they differ in the other versions of the game, but the general filtering stays the same.

Define the Years of the Goat (pasted from Wikipedia):

goat_years <- function() {
  df_goat_raw <- tibble::tribble(
           ~Start.date,          ~End.date, ~Heavenly.branch,
    "13 February 1907",  "1 February 1908",      "Fire Goat",
     "1 February 1919", "19 February 1920",     "Earth Goat",
    "17 February 1931",  "5 February 1932",     "Metal Goat",
     "5 February 1943",  "24 January 1944",     "Water Goat",
     "24 January 1955", "11 February 1956",      "Wood Goat",
     "9 February 1967",  "29 January 1968",      "Fire Goat",
     "28 January 1979", "15 February 1980",     "Earth Goat",
    "15 February 1991",  "3 February 1992",     "Metal Goat",
     "1 February 2003",  "21 January 2004",     "Water Goat",
    "19 February 2015",  "7 February 2016",      "Wood Goat",
     "6 February 2027",  "25 January 2028",      "Fire Goat",
     "24 January 2039", "11 February 2040",     "Earth Goat",
    "11 February 2051",  "31 January 2052",     "Metal Goat",
     "29 January 2063", "16 February 2064",     "Water Goat",
    "15 February 2075",  "4 February 2076",      "Wood Goat",
     "3 February 2087",  "23 January 2088",      "Fire Goat",
     "21 January 2099",  "8 February 2100",     "Earth Goat") |>
    clean_names()
  
  df_goat_raw |>
    mutate(
      across(c(start_date, end_date), ~ parse_date(.x, "%d %B %Y")))
}
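
A quick check of the interval logic with a hypothetical birthdate: 5 October 2003 falls into a Water Goat year and would also pass the Libra check.

df_goat <- goat_years()
birthdate <- as.Date("2003-10-05")
any(df_goat$start_date <= birthdate & df_goat$end_date >= birthdate)
#> [1] TRUE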

Apply all of these conditions to the customer table:

spider_hat <- function(df_customers, df_contractor, df_chinese_year) {
  df_customers |>
    filter(
      map_lgl(
        birthdate,
        ~ any(
          df_chinese_year$start_date <= .x & df_chinese_year$end_date >= .x)),
      case_when(
        month(birthdate) == 9 ~ day(birthdate) >= 23,
        month(birthdate) == 10 ~ day(birthdate) <= 23,
        TRUE ~ FALSE),
      str_detect(citystatezip, df_contractor$citystatezip)) |>
    select(customerid, name, citystatezip, phone) |>
    verify(length(customerid) == 1)
}
customerid  name         citystatezip              phone
3864        David Perez  Queens Village, NY 11427  914-594-5535

Day 4

Look for order items that are pastries (their SKUs contain ‘BKY’):

order_items_pastries <- function(df_order_items, df_products) {
  df_products_pastries <- df_products |>
    filter(str_detect(sku, regex("bky", ignore_case = TRUE)))
  
  df_order_items |>
    semi_join(df_products_pastries, by = "sku")
}

Look for the person who most often placed the earliest pastry order of the morning (before 9 AM):

tinder_woman <- function(df_orders, df_order_items_pastries, df_customers) {
  df_order_items_pastries |>
    left_join(df_orders, by = "orderid") |>
    filter(hour(ordered) < 9) |>
    arrange(ordered) |>
    group_by(day = floor_date(ordered, "day")) |>
    summarise(
      earliest_order = min(ordered),
      # customer(s) who placed the first pastry order of this morning
      customerid = unique(customerid[ordered == earliest_order])) |>
    count(customerid, sort = TRUE) |>
    slice(1) |>
    left_join(
      select(df_customers, customerid, name, phone),
      by = "customerid")
}
customerid  n   name             phone
6455        13  Brittany Harmon  716-789-4433

Day 5

Look for people who live in Staten Island (not needed for the speed run):

staten_island <- function(df_customers) {
  df_customers |>
    transmute(customerid, city = map_chr(str_split(citystatezip, ","), 1)) |>
    filter(str_detect(city, regex("Staten\\s+Island", ignore_case = TRUE)))
}
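
The first element of each comma split is the city; for example, with a hypothetical address:

map_chr(str_split("Staten Island, NY 10314", ","), 1)
#> [1] "Staten Island"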

Look for products that represent cat food for senior cats:

senior_cat_food <- function(df_products) {
  df_products |>
    filter(
      str_detect(desc, regex("cat\\s+food", ignore_case = TRUE)),
      str_detect(desc, regex("senior", ignore_case = TRUE)))
}

Combine the information and look for the woman we are searching for:

cat_lady <- function(df_order_items, df_orders, df_senior_cat_food,
                     df_staten_island) {
  df_order_items |>
    semi_join(df_senior_cat_food, by = "sku") |>
    left_join(select(df_orders, orderid, customerid), by = "orderid") |>
    # not needed for the speed run:
    # semi_join(df_staten_island, by = "customerid") |>
    count(customerid, sort = TRUE) |>
    slice(1)
}
customerid  name            citystatezip     phone
7957        Theresa Carter  Bronx, NY 10461  347-835-2358

Day 6

Calculate the margin of each order (unit price minus wholesale cost, summed over its items):

order_items_margin <- function(df_order_items, df_products) {
  df_order_items |>
    left_join(df_products, by = "sku") |>
    group_by(orderid) |>
    summarise(margin = sum(unit_price - wholesale_cost))
}
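
A quick check with a hypothetical two-item order, assuming unit_price lives in the order items and wholesale_cost in the products table: (5.00 - 4.50) + (3.00 - 2.80) = 0.70.

order_items_margin(
  df_order_items = tibble(orderid = 1, sku = c("A", "B"), unit_price = c(5, 3)),
  df_products = tibble(sku = c("A", "B"), wholesale_cost = c(4.5, 2.8)))
# one row: orderid 1, margin 0.7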

Determine the customer with the lowest mean margin per order:

frugal_cousin <- function(df_orders, df_order_items_margin) {
  df_orders |>
    left_join(df_order_items_margin, by = "orderid") |>
    group_by(customerid) |>
    summarise(customer_margin = mean(margin)) |>
    arrange(customer_margin) |>
    slice(1)
}
customerid  name           phone
8884        Deborah Green  838-295-7143

Day 7

Find all orders that contain a colored item; the color is given in parentheses in the product description:

color_orders <- function(df_orders, df_order_items, df_products) {
  df_orders |>
    left_join(df_order_items, by = c("orderid")) |>
    left_join(df_products, by = "sku") |>
    mutate(
      color = str_remove_all(str_extract(desc, "\\(.+\\)"), "\\(|\\)"),
      day = as_date(floor_date(ordered, unit = "day"))) |>
    filter(!is.na(color))
}
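
The regex pulls the parenthesized color out of the product description; a hypothetical example:

str_remove_all(str_extract("Manual Mixer (orange)", "\\(.+\\)"), "\\(|\\)")
#> [1] "orange"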

Search for color orders that happened within 0.3 minutes (18 seconds) of one of the frugal cousin’s orders:

ex_boyfriend <- function(df_color_orders, df_frugal_cousin) {
  df_color_orders_fc <- df_color_orders |>
    semi_join(df_frugal_cousin, by = "customerid") |>
    # 0.3 minutes = 18 seconds on either side of each of her orders
    mutate(start = ordered - dminutes(0.3), end = ordered + dminutes(0.3))
  
  df_color_orders |>
    anti_join(df_color_orders_fc, by = join_by(customerid)) |>
    inner_join(
      select(df_color_orders_fc, day, start, end), by = join_by(day)) |>
    filter(ordered >= start & ordered <= end) |>
    verify(length(customerid) == 1)
}
customerid  name          phone
9931        Jeremy Burch  516-544-4187

Day 8

Look for products that are collectibles (their SKU contains ‘COL’):

collectibles <- function(df_products) {
  df_products |>
    filter(str_detect(sku, "COL"))
}

Find the person who bought all the collectibles:

collector <- function(df_orders, df_order_items, df_collectibles) {
  df_order_items |>
    semi_join(df_collectibles, by = "sku") |>
    left_join(df_orders, by = "orderid") |>
    group_by(customerid) |>
    summarise(anz_coll = n_distinct(sku)) |>
    filter(anz_coll == nrow(df_collectibles)) |>
    verify(length(customerid) == 1)
}
customerid  name           phone
2602        Daniel Wilson  516-638-9966

Conclusion

As in last year’s edition, I had a lot of fun solving the Hanukkah of Data challenges. I revisited my previous solutions and improved them to solve the new puzzles. By using functions from the assertr package, I could spot problems early. Especially during the speed run at the end of the challenge, this kind of assertive programming made it easier for me to adjust my solutions to more challenging data and edge cases. I’m already looking forward to next year’s challenges :-).

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/duju211/hanukkah_of_data, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

During (2023, Dec. 31). Datannery: Hanukkah of Data Revisited (2023 / Speedrun). Retrieved from https://www.datannery.com/posts/hanukkah-of-data/

BibTeX citation

@misc{during2023hanukkah,
  author = {During, Julian},
  title = {Datannery: Hanukkah of Data Revisited (2023 / Speedrun)},
  url = {https://www.datannery.com/posts/hanukkah-of-data/},
  year = {2023}
}