# Hanukkah of Data

Short data challenge released over 8 days of Hanukkah in 2022

Julian During www.datannery.com
2023-01-15

Title photo from Gaelle Marcel on Unsplash

# Challenge

‘Hanukkah of Data’ is a data challenge where you have to solve 8 puzzles surrounding a fictional data set.

We are asked to find a rug for our granduncle Noah who owns a store. The store has data about customer, products and orders. With the help of the data (CSV files) we are asked to solve a total of eight puzzles.

# Data

The data for this analysis consists of four CSV files: ‘noahs-customers.csv’, ‘noahs-orders.csv’, ‘noahs-orders_items.csv’ and ‘noahs-products.csv’. The data is read into R with the help of the readr::read_csv function, specifying the correct column types as arguments in the function calls.

## Customers

There are a total of 11080 customers in the raw data. Perform the following preprocessing steps:

• Determine the Initials by splitting the name into first and last name
• Extract all digits from the phone number and save in column phone_chr
• Remove duplicates
customers <- function(df_customers_raw) {
df_customers_raw |>
mutate(
phone_chr = map_chr(str_extract_all(phone, "\\d"), str_flatten),
name_split = str_split(name, "\\s+"),
first_name = map_chr(name_split, ~ str_flatten(.x[.x != last(.x)])),
last_name = map_chr(name_split, last),
initials = str_to_lower(str_glue(
"{str_sub(first_name, end = 1)}{str_sub(last_name, end = 1)}"))) |>
select(where(negate(is_list))) |>
distinct(customerid, .keep_all = TRUE)
}
# A tibble: 11,079 × 10
customerid name     address citystatezip birthdate  phone phone_chr
<chr>      <chr>    <chr>   <chr>        <date>     <chr> <chr>
1 1001       Jack Qu… 201 E … Los Angeles… 1960-05-14 805-… 80528785…
2 1002       David P… 224C T… Staten Isla… 1978-04-04 516-… 51676816…
3 1003       Carrie … 1608 W… Tampa, FL 3… 1969-01-21 727-… 72720904…
4 1004       Steven … 178½ E… Manhattan, … 1953-08-17 607-… 60794195…
5 1005       Christi… 270 W … Bronx, NY 1… 1983-06-06 212-… 21275990…
6 1006       Amanda … 183-48… Saint Alban… 1962-07-08 914-… 91442131…
7 1007       Mark Co… 14-47 … College Poi… 1967-04-14 585-… 58555419…
8 1008       Jill St… 735A A… Manhattan, … 1959-06-11 516-… 51630704…
9 1009       Samuel … 56 Ric… Brooklyn, N… 1988-10-24 929-… 92986987…
10 1010       Brenda … 2821 B… Bronx, NY 1… 1960-09-07 914-… 91420526…
# ℹ 11,069 more rows
# ℹ 3 more variables: first_name <chr>, last_name <chr>,
#   initials <chr>

## Products

There are a total of 1124 products. Perform the following preprocessing:

• Extract the category by removing all digits from the desc string

• Extract additional information that can be found in the bracket text

• Remove remaining brackets from add_info and desc

products <- function(df_products_raw) {
df_products_raw |>
mutate(
category = str_remove(sku, "\\d+"),
add_info = str_extract(desc, "\$$.+\$$"),
}
# A tibble: 1,124 × 5
<chr>   <chr>                               <dbl> <chr>    <chr>
1 DLI0002 Smoked Whitefish Sandwich            9.33 DLI      <NA>
2 PET0005 Vegan Cat Food, Turkey & …           4.35 PET      <NA>
3 HOM0018 Power Radio                         21.8  HOM      red
4 KIT0034 Azure Ladle                          2.81 KIT      <NA>
5 PET0041 Gluten-free Cat Food, Pum…           4.6  PET      <NA>
6 PET0045 Gluten-free Cat Food, Sal…           4.32 PET      <NA>
7 TOY0048 Electric Doll                       10.2  TOY      <NA>
8 CMP0061 Network Printer                    136.   CMP      <NA>
9 DLI0066 Pickled Herring Sandwich             9.94 DLI      <NA>
10 TOY0085 Noah's Toy Soldier                  12.0  TOY      <NA>
# ℹ 1,114 more rows

## Orders

There are a total of 214207 orders with 427258 items ordered. Join the ordered item as list column to make them easily accessible.

orders <- function(df_orders_raw, df_order_items) {
df_orders_raw |>
select(-items) |>
nest_join(df_order_items, by = "orderid", name = "items")
}

# Puzzles

## Puzzle 1

To find the rug, we will need to contact a private investigator. The last name of the investigator can be spelled by using the letters printed on the phone buttons. For example: 2 has “ABC”, and 3 “DEF”, etc.

The key pad of the phone can be represented in R like this:

phone_letter <- function() {
tibble(letter = letters) |>
mutate(
nr = ((row_number() - 1) %/% 3) + 2,
nr = as.character(if_else(nr == 10, 0, nr)))
}
# A tibble: 26 × 2
letter nr
<chr>  <chr>
1 a      2
2 b      2
3 c      2
4 d      3
5 e      3
6 f      3
7 g      4
8 h      4
9 i      4
10 j      5
# ℹ 16 more rows

By combining this representation with the data on hand, we can determine the investigator:

investigator <- function(df_customers, df_phone_letter) {
df_customers_pro <- df_customers |>
transmute(
name_pro = str_split(str_to_lower(str_remove_all(
last_name, "\\s"), "\\s"), ""),
customerid) |>
unnest(name_pro) |>
left_join(df_phone_letter, by = c("name_pro" = "letter")) |>
group_by(customerid) |>
summarise(phone_pro = str_flatten(nr))

df_customers |>
left_join(df_customers_pro, by = "customerid") |>
filter(str_length(str_extract(phone_chr, phone_pro)) == str_length(phone_chr))
}
customerid name phone
3188 Sam Guttenberg 488-836-2374

## Puzzle 2

Now we are looking for a contractor, to whom the rug was given to by a cleaning company. The following is known:

• The contractor has the initials ‘jd’
• The cleaning company and the contractor had meetings at Noah’s over coffee and bagels
• The cleaning company stopped outsourcing a few years ago
contractor <- function(df_orders, df_customers, coffee_bagel_regex) {
df_customers_after_2017 <- df_orders |>
filter(year(ordered) > 2017) |>
distinct(customerid)

df_rel_orders <- df_orders |>
filter(year(ordered) <= 2017) |>
anti_join(df_customers_after_2017, by = "customerid") |>
unnest(items) |>
filter(str_detect(desc, coffee_bagel_regex)) |>
semi_join(filter(df_customers, initials == "jd"), by = "customerid")

df_customers |>
semi_join(df_rel_orders, by = "customerid")
}
customerid name citystatezip phone
4164 Jeremy Davis South Ozone Park, NY 11420 212-771-8924

## Puzzle 3

We are searching for the neighbor of the contractor. The neighbor has the following characteristics:

• Star Sign: Aries
• March 21 - April 19
• Born in the year of the Dog
• Every 12 years (2018 for example)
• Lives in the neighborhood

Dog years are determined by the following function:

det_dog_years <- function() {
2018 - (1:9 * 12)
}

Resulting in: 2006, 1994, 1982, 1970, 1958, 1946, 1934, 1922, 1910

With this information we can find out who the neighbor is:

spider_hat <- function(df_customers, df_contractor, dog_years) {
df_customers |>
filter(year(birthdate) %in% dog_years) |>
filter(
case_when(
month(birthdate) == 3 ~ day(birthdate) >= 21,
month(birthdate) == 4 ~ day(birthdate) <= 20,
TRUE ~ FALSE)) |>
filter(str_detect(citystatezip, df_contractor$citystatezip)) } customerid name citystatezip birthdate phone 2274 Brent Nguyen South Ozone Park, NY 11420 1958-03-25 516-636-7397 ## Puzzle 4 The next persons has the following habit: • Claims the first pastries of the day Filter products for the ‘BKY’ category and search for the person which always orders these products first: tinder_woman <- function(df_products, df_orders, df_customers, product_pastries) { df_products_pastries <- df_products |> filter(category == product_pastries) df_orders_pastries <- df_orders |> unnest(items) |> semi_join(df_products_pastries, by = "sku") df_first_pastries <- df_orders_pastries |> mutate(day = floor_date(ordered, unit = "day")) |> group_by(day) |> filter(ordered == min(ordered)) |> ungroup() |> count(customerid, sort = TRUE) df_customers |> semi_join(slice(df_first_pastries, 1), by = "customerid") } customerid name phone 5375 Christina Booker 718-649-9036 ## Puzzle 5 In the 5th puzzle we need to find a lady with a lot of cats as pets. We know about her and her pets: • The cats are senior • She lives in Queens • She has a lot of cats cat_lady <- function(df_products, df_orders, df_customers) { df_products_cat <- df_products |> filter(str_detect(desc, regex("cat", ignore_case = TRUE))) |> filter(str_detect(desc, regex("senior", ignore_case = TRUE))) df_customers_queens <- df_customers |> filter(str_detect(citystatezip, regex("queens", ignore_case = TRUE))) df_customer_id <- df_orders |> semi_join(df_customers_queens, by = "customerid") |> unnest(items) |> semi_join(df_products_cat, by = "sku") |> group_by(customerid) |> summarise(sku = str_flatten(unique(sku), ", "), anz = n()) |> filter(anz > 2) df_customers |> semi_join(df_customer_id, by = "customerid") } customerid name phone 7675 Anita Koch 315-492-7411 ## Puzzle 6 The cat lady has a very frugal cousin. She tells us: ” Noah […] loses money whenever she comes in the store”: frugal_cousin <- function(df_orders, df_customers) { df_customer_id <- df_orders |> mutate( profit = map_dbl(items, ~ sum(.x$unit_price - .x\$wholesale_cost))) |>
group_by(customerid) |>
summarise(profit = sum(profit)) |>
top_n(n = 1, wt = -profit)

df_customers |>
semi_join(df_customer_id, "customerid")
}
customerid name phone
8342 Emily Randolph 914-868-0316

## Puzzle 7

The frugal cousin tells us how she met her ex-boyfriend. They met while they were buying the same product, but in different colors. So we look at all the orders from the frugal cousin and search for that incident:

ex_boyfriend <- function(df_frugal_cousin, df_orders, df_customers) {
df_orders_color <- df_orders |>
semi_join(df_frugal_cousin, by = "customerid") |>
unnest(items) |>
mutate(day = floor_date(ordered, "day")) |>

df_order_rel <- df_orders |>
mutate(day = floor_date(ordered, "day")) |>
semi_join(df_orders_color, by = "day") |>
unnest(items) |>
inner_join(
df_orders_color, by = c("desc", "day"),
suffix = c("_male", "_female")) |>
filter(
ordered_male >= ordered_female - dminutes(10)
& ordered_male <= ordered_female + dminutes(10))

df_customers |>
semi_join(df_order_rel, by = c("customerid" = "customerid_male"))
}
customerid name phone

## Puzzle 8

Finally, we search for a collector. The collector is in possession of a complete set of Noah’s collectibles. By looking at the data, the relevant category is ‘COL’:

collector <- function(df_products, df_orders, df_customers, product_collect) {
df_products_collect <- df_products |>
filter(category == product_collect)

df_orders_unnested <- df_orders |>
unnest(items)

df_customer_id <- df_orders_unnested |>
semi_join(df_products_collect, by = c("sku")) |>
group_by(customerid) |>
summarise(anz_col = n_distinct(sku)) |>
arrange(desc(anz_col)) |>
slice(1)

df_customers |>
semi_join(df_customer_id, by = "customerid")
}
customerid name phone
4308 Travis Bartlett 929-906-5980

# End

Participating in the “Hanukkah of Data” competition was both challenging and rewarding. We successfully solved 8 difficult puzzles, sharpening our problem-solving and coding skills. I look forward to the challenge again next year and am eager to see what it holds.

### Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

### Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/duju211/hanukkah_of_data, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

### Citation

During (2023, Jan. 15). Datannery: Hanukkah of Data. Retrieved from https://www.datannery.com/posts/hanukkah-of-data/
@misc{during2023hanukkah,
}