Hanukkah of Data

Short data challenge released over 8 days of Hanukkah in 2022

Julian During www.datannery.com
2023-01-15

Title photo from Gaelle Marcel on Unsplash

Challenge

‘Hanukkah of Data’ is a data challenge where you have to solve 8 puzzles surrounding a fictional data set.

We are asked to find a rug for our granduncle Noah who owns a store. The store has data about customer, products and orders. With the help of the data (CSV files) we are asked to solve a total of eight puzzles.

Data

The data for this analysis consists of four CSV files: ‘noahs-customers.csv’, ‘noahs-orders.csv’, ‘noahs-orders_items.csv’ and ‘noahs-products.csv’. The data is read into R with the help of the readr::read_csv function, specifying the correct column types as arguments in the function calls.

Customers

There are a total of 11080 customers in the raw data. Perform the following preprocessing steps:

customers <- function(df_customers_raw) {
  df_customers_raw |>
    mutate(
      phone_chr = map_chr(str_extract_all(phone, "\\d"), str_flatten),
      name_split = str_split(name, "\\s+"),
      first_name = map_chr(name_split, ~ str_flatten(.x[.x != last(.x)])),
      last_name = map_chr(name_split, last),
      initials = str_to_lower(str_glue(
        "{str_sub(first_name, end = 1)}{str_sub(last_name, end = 1)}"))) |>
    select(where(negate(is_list))) |>
    distinct(customerid, .keep_all = TRUE)
}
# A tibble: 11,079 × 10
   customerid name     address citystatezip birthdate  phone phone_chr
   <chr>      <chr>    <chr>   <chr>        <date>     <chr> <chr>    
 1 1001       Jack Qu… 201 E … Los Angeles… 1960-05-14 805-… 80528785…
 2 1002       David P… 224C T… Staten Isla… 1978-04-04 516-… 51676816…
 3 1003       Carrie … 1608 W… Tampa, FL 3… 1969-01-21 727-… 72720904…
 4 1004       Steven … 178½ E… Manhattan, … 1953-08-17 607-… 60794195…
 5 1005       Christi… 270 W … Bronx, NY 1… 1983-06-06 212-… 21275990…
 6 1006       Amanda … 183-48… Saint Alban… 1962-07-08 914-… 91442131…
 7 1007       Mark Co… 14-47 … College Poi… 1967-04-14 585-… 58555419…
 8 1008       Jill St… 735A A… Manhattan, … 1959-06-11 516-… 51630704…
 9 1009       Samuel … 56 Ric… Brooklyn, N… 1988-10-24 929-… 92986987…
10 1010       Brenda … 2821 B… Bronx, NY 1… 1960-09-07 914-… 91420526…
# ℹ 11,069 more rows
# ℹ 3 more variables: first_name <chr>, last_name <chr>,
#   initials <chr>

Products

There are a total of 1124 products. Perform the following preprocessing:

products <- function(df_products_raw) {
  df_products_raw |>
    mutate(
      category = str_remove(sku, "\\d+"),
      add_info = str_extract(desc, "\\(.+\\)"),
      desc = if_else(!is.na(add_info), str_remove(desc, add_info), desc),
      across(c(add_info, desc), ~ str_trim(str_remove_all(.x, "\\)|\\("))))
}
# A tibble: 1,124 × 5
   sku     desc                       wholesale_cost category add_info
   <chr>   <chr>                               <dbl> <chr>    <chr>   
 1 DLI0002 Smoked Whitefish Sandwich            9.33 DLI      <NA>    
 2 PET0005 Vegan Cat Food, Turkey & …           4.35 PET      <NA>    
 3 HOM0018 Power Radio                         21.8  HOM      red     
 4 KIT0034 Azure Ladle                          2.81 KIT      <NA>    
 5 PET0041 Gluten-free Cat Food, Pum…           4.6  PET      <NA>    
 6 PET0045 Gluten-free Cat Food, Sal…           4.32 PET      <NA>    
 7 TOY0048 Electric Doll                       10.2  TOY      <NA>    
 8 CMP0061 Network Printer                    136.   CMP      <NA>    
 9 DLI0066 Pickled Herring Sandwich             9.94 DLI      <NA>    
10 TOY0085 Noah's Toy Soldier                  12.0  TOY      <NA>    
# ℹ 1,114 more rows

Orders

There are a total of 214207 orders with 427258 items ordered. Join the ordered item as list column to make them easily accessible.

orders <- function(df_orders_raw, df_order_items) {
  df_orders_raw |>
    select(-items) |>
    nest_join(df_order_items, by = "orderid", name = "items")
}

Puzzles

Puzzle 1

To find the rug, we will need to contact a private investigator. The last name of the investigator can be spelled by using the letters printed on the phone buttons. For example: 2 has “ABC”, and 3 “DEF”, etc.

The key pad of the phone can be represented in R like this:

phone_letter <- function() {
  tibble(letter = letters) |>
    mutate(
      nr = ((row_number() - 1) %/% 3) + 2,
      nr = as.character(if_else(nr == 10, 0, nr)))
}
# A tibble: 26 × 2
   letter nr   
   <chr>  <chr>
 1 a      2    
 2 b      2    
 3 c      2    
 4 d      3    
 5 e      3    
 6 f      3    
 7 g      4    
 8 h      4    
 9 i      4    
10 j      5    
# ℹ 16 more rows

By combining this representation with the data on hand, we can determine the investigator:

investigator <- function(df_customers, df_phone_letter) {
  df_customers_pro <- df_customers |>
    transmute(
      name_pro = str_split(str_to_lower(str_remove_all(
        last_name, "\\s"), "\\s"), ""),
      customerid) |>
    unnest(name_pro) |>
    left_join(df_phone_letter, by = c("name_pro" = "letter")) |>
    group_by(customerid) |>
    summarise(phone_pro = str_flatten(nr))
  
  df_customers |>
    left_join(df_customers_pro, by = "customerid") |>
    filter(str_length(str_extract(phone_chr, phone_pro)) == str_length(phone_chr))
}
customerid name phone
3188 Sam Guttenberg 488-836-2374

Puzzle 2

Now we are looking for a contractor, to whom the rug was given to by a cleaning company. The following is known:

contractor <- function(df_orders, df_customers, coffee_bagel_regex) {
  df_customers_after_2017 <- df_orders |>
    filter(year(ordered) > 2017) |>
    distinct(customerid)
  
  df_rel_orders <- df_orders |>
    filter(year(ordered) <= 2017) |>
    anti_join(df_customers_after_2017, by = "customerid") |>
    unnest(items) |>
    filter(str_detect(desc, coffee_bagel_regex)) |>
    semi_join(filter(df_customers, initials == "jd"), by = "customerid")
  
  df_customers |>
    semi_join(df_rel_orders, by = "customerid")
}
customerid name citystatezip phone
4164 Jeremy Davis South Ozone Park, NY 11420 212-771-8924

Puzzle 3

We are searching for the neighbor of the contractor. The neighbor has the following characteristics:

Dog years are determined by the following function:

det_dog_years <- function() {
  2018 - (1:9 * 12)
}

Resulting in: 2006, 1994, 1982, 1970, 1958, 1946, 1934, 1922, 1910

With this information we can find out who the neighbor is:

spider_hat <- function(df_customers, df_contractor, dog_years) {
  df_customers |>
    filter(year(birthdate) %in% dog_years) |>
    filter(
      case_when(
        month(birthdate) == 3 ~ day(birthdate) >= 21,
        month(birthdate) == 4 ~ day(birthdate) <= 20,
        TRUE ~ FALSE)) |>
    filter(str_detect(citystatezip, df_contractor$citystatezip))
}
customerid name citystatezip birthdate phone
2274 Brent Nguyen South Ozone Park, NY 11420 1958-03-25 516-636-7397

Puzzle 4

The next persons has the following habit:

Filter products for the ‘BKY’ category and search for the person which always orders these products first:

tinder_woman <- function(df_products, df_orders, df_customers,
                         product_pastries) {
  df_products_pastries <- df_products |>
    filter(category == product_pastries)
  
  df_orders_pastries <- df_orders |>
    unnest(items) |>
    semi_join(df_products_pastries, by = "sku")
  
  df_first_pastries <- df_orders_pastries |>
    mutate(day = floor_date(ordered, unit = "day")) |>
    group_by(day) |>
    filter(ordered == min(ordered)) |>
    ungroup() |>
    count(customerid, sort = TRUE)
  
  df_customers |>
    semi_join(slice(df_first_pastries, 1), by = "customerid")
}
customerid name phone
5375 Christina Booker 718-649-9036

Puzzle 5

In the 5th puzzle we need to find a lady with a lot of cats as pets. We know about her and her pets:

cat_lady <- function(df_products, df_orders, df_customers) {
  df_products_cat <- df_products |>
    filter(str_detect(desc, regex("cat", ignore_case = TRUE))) |>
    filter(str_detect(desc, regex("senior", ignore_case = TRUE)))
  
  df_customers_queens <- df_customers |>
    filter(str_detect(citystatezip, regex("queens", ignore_case = TRUE)))
  
  df_customer_id <- df_orders |>
    semi_join(df_customers_queens, by = "customerid") |>
    unnest(items) |>
    semi_join(df_products_cat, by = "sku") |>
    group_by(customerid) |>
    summarise(sku = str_flatten(unique(sku), ", "), anz = n()) |>
    filter(anz > 2)
  
  df_customers |>
    semi_join(df_customer_id, by = "customerid")
}
customerid name phone
7675 Anita Koch 315-492-7411

Puzzle 6

The cat lady has a very frugal cousin. She tells us: ” Noah […] loses money whenever she comes in the store”:

frugal_cousin <- function(df_orders, df_customers) {
  df_customer_id <- df_orders |>
    mutate(
      profit = map_dbl(items, ~ sum(.x$unit_price - .x$wholesale_cost))) |>
    group_by(customerid) |>
    summarise(profit = sum(profit)) |>
    top_n(n = 1, wt = -profit)
  
  df_customers |>
    semi_join(df_customer_id, "customerid")
}
customerid name phone
8342 Emily Randolph 914-868-0316

Puzzle 7

The frugal cousin tells us how she met her ex-boyfriend. They met while they were buying the same product, but in different colors. So we look at all the orders from the frugal cousin and search for that incident:

ex_boyfriend <- function(df_frugal_cousin, df_orders, df_customers) {
  df_orders_color <- df_orders |>
    semi_join(df_frugal_cousin, by = "customerid") |>
    unnest(items) |>
    mutate(day = floor_date(ordered, "day")) |>
    filter(!is.na(add_info))
  
  df_order_rel <- df_orders |>
    mutate(day = floor_date(ordered, "day")) |>
    semi_join(df_orders_color, by = "day") |>
    unnest(items) |>
    inner_join(
      df_orders_color, by = c("desc", "day"),
      suffix = c("_male", "_female")) |>
    filter(
      add_info_male != add_info_female,
      ordered_male >= ordered_female - dminutes(10)
      & ordered_male <= ordered_female + dminutes(10))
  
  df_customers |>
    semi_join(df_order_rel, by = c("customerid" = "customerid_male"))
}
customerid name phone
8835 Jonathan Adams 315-618-5263

Puzzle 8

Finally, we search for a collector. The collector is in possession of a complete set of Noah’s collectibles. By looking at the data, the relevant category is ‘COL’:

collector <- function(df_products, df_orders, df_customers, product_collect) {
  df_products_collect <- df_products |>
    filter(category == product_collect)
  
  df_orders_unnested <- df_orders |>
    unnest(items)
  
  df_customer_id <- df_orders_unnested |>
    semi_join(df_products_collect, by = c("sku")) |>
    group_by(customerid) |>
    summarise(anz_col = n_distinct(sku)) |>
    arrange(desc(anz_col)) |>
    slice(1)
  
  df_customers |>
    semi_join(df_customer_id, by = "customerid")
}
customerid name phone
4308 Travis Bartlett 929-906-5980

End

Participating in the “Hanukkah of Data” competition was both challenging and rewarding. We successfully solved 8 difficult puzzles, sharpening our problem-solving and coding skills. I look forward to the challenge again next year and am eager to see what it holds.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/duju211/hanukkah_of_data, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

During (2023, Jan. 15). Datannery: Hanukkah of Data. Retrieved from https://www.datannery.com/posts/hanukkah-of-data/

BibTeX citation

@misc{during2023hanukkah,
  author = {During, Julian},
  title = {Datannery: Hanukkah of Data},
  url = {https://www.datannery.com/posts/hanukkah-of-data/},
  year = {2023}
}