Friendly Webscraping

Scraping my local Football Club’s News Data

Julian During
2024-03-28

Idea

Scrape the website of my local football club to get an overview of the content there.

The CSS selectors were extracted using techniques described in this wonderful tutorial, mainly relying on the developer tools of your web browser.

If you want to reproduce this analysis, you have to perform the following steps:

Data

The following libraries are used in this analysis:
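The chunk that loads them is not reproduced in this post. Judging only from the functions called further down, a plausible set looks like the sketch below. This is an assumption, not the original chunk; in particular, `url_parse()` exists in both xml2 and httr2, and it is not clear which one was used.

```r
# Assumed package list, inferred from the functions used in this post.
library(polite)      # bow(), nod(), scrape()
library(rvest)       # html_element(), html_elements(), html_attr(), html_text2()
library(xml2)        # url_parse() (assumption: httr2 offers one as well)
library(purrr)       # map(), map_chr(), map_df()
library(tibble)      # tibble()
library(dplyr)       # filter(), anti_join(), join_by(), count()
library(stringr)     # str_detect()
library(tidytext)    # unnest_tokens(), get_stopwords()
library(ggplot2)     # ggplot(), aes(), scale_size_area(), theme_void()
library(ggwordcloud) # geom_text_wordcloud_area()
```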

Define where to look for the data:

tsg_url <- "https://www.tsg-fussball.de/"

We want to obey the scraping restrictions defined by the host. Therefore, we introduce ourselves to the host and follow the restrictions defined in ‘robots.txt’. This can be done using the bow function from the polite package:

tsg_host <- bow(tsg_url)

For this example, the host reports the following restrictions:

<polite session> https://www.tsg-fussball.de/
    User-agent: polite R package
    robots.txt: 1 rules are defined for 1 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent
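The user agent shown above is the generic default. `bow()` also accepts a `user_agent` and a `delay` argument, so a more personal introduction could look like the sketch below. The agent string and contact address are made up for illustration, and the call needs network access.

```r
library(polite)

# Hypothetical user agent; replace the project name and contact with your own.
my_agent <- "tsg-news-analysis (contact: me@example.org)"

# delay is the desired pause between requests in seconds; the value
# here mirrors the crawl delay reported above.
tsg_host_custom <- bow(
  "https://www.tsg-fussball.de/",
  user_agent = my_agent,
  delay = 5
)
tsg_host_custom
```

As far as the polite documentation describes it, the effective delay is the larger of the requested value and the one mandated by robots.txt, so asking for less should not undercut the host's rule.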

Define the path where the news articles of this website can be found:

news_path <- "aktuelles"

Define the CSS selector that identifies all elements on the website that link to news articles:

articles_css <- ".more-link"

We now want to find all news articles on the website:

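news_links <- function(tsg_host, news_path, articles_css) {
  # Agree with the host on the new path, reusing the session from bow()
  host_news <- nod(tsg_host, path = news_path)

  # Download the overview page while respecting the crawl delay
  html <- scrape(host_news)

  # Select every element that links to a news article
  rows <- html |>
    html_elements(articles_css)

  # Keep only the path component of each link target
  rows |>
    html_attr("href") |>
    map(\(x) url_parse(x)) |>
    map_chr("path")
}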
paths_news <- news_links(tsg_host, news_path, articles_css)

In total we have 418 articles to scrape.

Look at some example paths:

[1] "/2022/04/04/unser-trainer-pascal-kopf-u16/"                   
[2] "/2021/12/13/tombola-des-sparkassen-indoor-cups-2021/"         
[3] "/2021/08/24/zum-tode-von-drago-todorovic/"                    
[4] "/2021/09/20/regionalliga-klarer-41-sieg-in-grossaspach/"      
[5] "/2023/09/10/regionalliga-63-spektakel-gegen-astoria-walldorf/"
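The link extraction can be tried offline on a small hand-written page. The sketch below assumes rvest and xml2 are installed. The markup and the example.org URLs are invented, and `url_parse()` is taken from xml2 here, which may differ from the original setup.

```r
library(rvest)
library(xml2)

# Hypothetical markup imitating a news overview page; not the real site.
page <- minimal_html('
  <div><a class="more-link" href="https://example.org/2024/01/02/first-post/">more</a></div>
  <div><a class="more-link" href="https://example.org/2024/02/03/second-post/">more</a></div>
  <a class="other" href="https://example.org/imprint/">imprint</a>
')

# Only the two ".more-link" anchors match; the imprint link is ignored.
hrefs <- page |>
  html_elements(".more-link") |>
  html_attr("href")

# xml2::url_parse() is vectorised and returns a data frame with a path column.
paths_demo <- url_parse(hrefs)$path
paths_demo
```

Storing only the path keeps the result independent of the host name, which is what `nod()` expects when it moves the existing session to a new page.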

We want to extract the content of every article: the title and the individual text lines of each post, each located with its own CSS selector (title_css and line_css):

news <- function(tsg_host, path_news, title_css, line_css) {
  # Move the polite session to the article page and download it
  host_detail <- nod(tsg_host, path_news)
  html_detail <- scrape(host_detail)
  # One row per text line; the single title and path are recycled
  tibble(
    title = html_element(html_detail, title_css) |> html_text2(),
    line = html_elements(html_detail, line_css) |> html_text2(),
    path = path_news)
}
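To illustrate the shape of the result, here is a small offline sketch. The markup and both selectors (`title_sel`, `line_sel`) are hypothetical and are not the selectors used for the real site.

```r
library(rvest)
library(tibble)

# Invented article page with one title and two text lines.
page <- minimal_html('
  <h1 class="entry-title">Cup win</h1>
  <div class="entry-content"><p>First line.</p><p>Second line.</p></div>
')

# Hypothetical selectors, chosen only to match the markup above.
title_sel <- ".entry-title"
line_sel <- ".entry-content p"

df_demo <- tibble(
  title = html_element(page, title_sel) |> html_text2(),  # exactly one value
  line = html_elements(page, line_sel) |> html_text2(),   # one value per match
  path = "/demo/"
)
df_demo
```

Because `html_element()` returns a single match and `html_elements()` returns all of them, `tibble()` recycles the title and the path, giving one row per text line.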

Apply the function for each path:

df_news <- map_df(paths_news, \(x) news(tsg_host, x, title_css, line_css))

Applying this function many times while obeying the scraping restrictions can be quite time-consuming: with a crawl delay of 5 seconds, the 418 articles alone take roughly 35 minutes. Therefore, the targets pipeline (take a look at ‘_targets.R’) is set up so that the function is executed exactly once per article. Future runs of the pipeline detect whether an article has already been scraped and only scrape newly added articles, which makes them much faster.
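The pipeline file itself is not shown here. One way to get this once-per-article behaviour with targets is dynamic branching, sketched below under that assumption. `scrape_stub()`, `paths_demo` and `df_demo` are hypothetical names, and the stub replaces the real scraper so that the sketch runs offline.

```r
# _targets.R (sketch, not the original pipeline file)
library(targets)

# Hypothetical stand-in for the real scraper; it only echoes its input.
scrape_stub <- function(path) {
  data.frame(path = path, n_char = nchar(path))
}

list(
  # Upstream target: the vector of article paths.
  tar_target(
    paths_demo,
    c("/2024/01/02/first-post/", "/2024/02/03/second-post/")
  ),
  # One branch per path; branches whose input is unchanged are skipped on re-runs.
  tar_target(df_demo, scrape_stub(paths_demo), pattern = map(paths_demo))
)
```

Running `targets::tar_make()` in the project directory builds one branch per path, and `targets::tar_read(df_demo)` returns the combined rows. When a path is added later, only the new branch is built. In a real pipeline the target that collects the paths would also have to be refreshed on every run, for example with `cue = tar_cue(mode = "always")`, so that new articles are noticed at all.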

Sometimes the content seems to be of a solely technical nature. Define a regular expression to search for these lines:

tech_regex <- "xml"

We now want to extract the words from the scraped content. Before doing so with the unnest_tokens function from the tidytext package, we exclude the lines with solely technical content by searching for the keyword ‘xml’:

words_raw <- function(df_news, tech_regex) {
  df_news |>
    filter(str_detect(line, tech_regex, negate = TRUE)) |>
    unnest_tokens(word, line)
}
df_words_raw <- words_raw(df_news, tech_regex)
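A small offline sketch of this step, using invented lines in place of the scraped data:

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up lines; the second one imitates purely technical content.
df_demo <- tibble(
  path = c("/a/", "/a/", "/b/"),
  line = c("Klarer Sieg in Walldorf", "<?xml version=\"1.0\"?>", "Tombola 2021")
)

words_demo <- df_demo |>
  # Drop every line that contains the pattern "xml"
  filter(str_detect(line, "xml", negate = TRUE)) |>
  # One row per word; tokens are lower-cased by default
  unnest_tokens(word, line)
words_demo
```

Note that the pattern is case-sensitive and matches anywhere in the line, so an ordinary sentence that happens to contain "xml" would be dropped as well.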

Before analyzing the content further, exclude German and English stop words as well as purely numeric tokens, which are not relevant for this analysis:

words <- function(df_words_raw) {
  df_words_raw |>
    anti_join(get_stopwords(language = "de"), by = join_by(word)) |>
    anti_join(get_stopwords(language = "en"), by = join_by(word)) |>
    filter(str_detect(word, "^\\d+$", negate = TRUE))
}
df_words <- words(df_words_raw)
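The effect of the three filters can be checked on a handful of invented tokens. The sketch assumes the stopwords package is installed, which `get_stopwords()` relies on.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up tokens: a German stop word, a content word, an English stop word,
# a mixed token and a pure number.
df_demo <- tibble(word = c("der", "sieg", "the", "u16", "2021"))

kept <- df_demo |>
  anti_join(get_stopwords(language = "de"), by = join_by(word)) |>
  anti_join(get_stopwords(language = "en"), by = join_by(word)) |>
  # "^\\d+$" matches tokens made of digits only, so "u16" survives
  filter(str_detect(word, "^\\d+$", negate = TRUE))
kept
```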

Analysis

We finish the analysis by creating a word cloud of the scraped content.

Define the number of top words:

top_n_words <- 200L

Count all words and keep the 200 most frequent:

words_count <- function(df_words, top_n_words) {
  df_words |>
    count(word, sort = TRUE) |>
    slice_max(n, n = top_n_words)
}
df_words_count <- words_count(df_words, top_n_words)
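One detail of rank-based filtering in dplyr is worth knowing: ties at the cutoff are kept by default, so the result can contain more rows than requested. A minimal sketch with invented words:

```r
library(dplyr)

# "sieg" appears twice, "tor" and "spiel" once each.
df_demo <- tibble(word = c("sieg", "sieg", "tor", "spiel"))
counts <- count(df_demo, word, sort = TRUE)

# Asking for the top 2 returns 3 rows because "spiel" and "tor" are tied.
with_ties <- slice_max(counts, n, n = 2)

# with_ties = FALSE cuts the result at exactly 2 rows.
without_ties <- slice_max(counts, n, n = 2, with_ties = FALSE)
```

For a word cloud a few extra tied words rarely matter, but the row count may then differ slightly from the requested number.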

Create word cloud:

vis_word_cloud <- function(df_words_count) {
  df_words_count |>
    ggplot() +
    geom_text_wordcloud_area(aes(label = word, size = n)) +
    scale_size_area(max_size = 50) +
    theme_void()
}
gg_word_cloud <- vis_word_cloud(df_words_count)
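The function only builds the plot object; to see it, the object has to be printed or saved. A minimal sketch with made-up counts, writing the plot to a temporary file (file name and dimensions are arbitrary choices):

```r
library(ggplot2)
library(ggwordcloud)

# Made-up counts, standing in for the real word counts.
df_demo <- data.frame(
  word = c("sieg", "spiel", "mannschaft", "tor"),
  n = c(40L, 25L, 15L, 10L)
)

gg_demo <- ggplot(df_demo) +
  geom_text_wordcloud_area(aes(label = word, size = n)) +
  scale_size_area(max_size = 20) +
  theme_void()

# Typing gg_demo at the console would draw it; ggsave() writes it to a file.
out_file <- file.path(tempdir(), "word_cloud_demo.png")
ggsave(out_file, plot = gg_demo, width = 6, height = 4, dpi = 150)
```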