Friendly Webscraping

Scraping my local Football Club’s News Data

Julian During
2024-03-28

Idea

Scrape the website of my local football club to get an overview of the content there.

The CSS selectors were extracted using techniques described in this wonderful tutorial, mainly relying on the developer tools of your web browser.

If you want to reproduce this analysis, you have to perform the following steps:
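The concrete steps depend on the setup of the source repository; a plausible workflow, assuming a standard targets project, looks like this:

# Hypothetical workflow, assuming a standard {targets} project layout
# 1. Clone the source repository:
#      git clone https://github.com/duju211/rvest_tsg
# 2. Install the packages listed in the 'Data' section below
# 3. Run the pipeline from the project root:
targets::tar_make()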

Data

The following libraries are used in this analysis:
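A likely set, inferred from the functions used below (the package providing url_parse is an assumption, since several packages export a function of that name):

library(polite)      # bow(), nod(), scrape()
library(rvest)       # html_elements(), html_attr(), html_text2()
library(urltools)    # url_parse() -- assumption, see note above
library(purrr)       # map(), map_chr(), map_df()
library(tibble)      # tibble()
library(dplyr)       # filter(), count(), anti_join(), top_n()
library(stringr)     # str_detect()
library(tidytext)    # unnest_tokens(), get_stopwords()
library(ggplot2)
library(ggwordcloud) # geom_text_wordcloud_area()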

Define where to look for the data:

tsg_url <- "https://www.tsg-fussball.de/"

We want to obey the scraping restrictions defined by the host. Therefore, we introduce ourselves and respect the rules defined in ‘robots.txt’. This can be done with the bow function from the polite package:

tsg_host <- bow(tsg_url)

For this example, the session looks like this:

<polite session> https://www.tsg-fussball.de/
    User-agent: polite R package
    robots.txt: 1 rules are defined for 1 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent

Define the path where the news articles of this website can be found:

news_path <- "aktuelles"

Define the CSS selector that identifies all elements on the website that link to news articles:

articles_css <- ".more-link"

We now want to find all news articles on the website:

news_links <- function(tsg_host, news_path, articles_css) {
  # Politely navigate to the news page of the session
  host_news <- nod(tsg_host, path = news_path)

  # Download the page, respecting the crawl delay
  html <- scrape(host_news)

  # Select all elements that link to news articles
  rows <- html |>
    html_elements(articles_css)

  # Extract the href attributes and keep only the path component of each URL
  rows |>
    html_attr("href") |>
    map(\(x) url_parse(x)) |>
    map_chr("path")
}
paths_news <- news_links(tsg_host, news_path, articles_css)

In total we have 418 articles to scrape.

Look at some example paths:

[1] "/2022/04/04/unser-trainer-pascal-kopf-u16/"                   
[2] "/2021/12/13/tombola-des-sparkassen-indoor-cups-2021/"         
[3] "/2021/08/24/zum-tode-von-drago-todorovic/"                    
[4] "/2021/09/20/regionalliga-klarer-41-sieg-in-grossaspach/"      
[5] "/2023/09/10/regionalliga-63-spektakel-gegen-astoria-walldorf/"

We want to extract the content of every article. Using two CSS selectors (title_css and line_css), we look for the title and the individual text lines of each post:

news <- function(tsg_host, path_news, title_css, line_css) {
  # Politely navigate to the article and download it
  host_detail <- nod(tsg_host, path_news)
  html_detail <- scrape(host_detail)
  # One row per text line, together with the article title and path
  tibble(
    title = html_element(html_detail, title_css) |> html_text2(),
    line = html_elements(html_detail, line_css) |> html_text2(),
    path = path_news)
}

Apply the function to each path:

df_news <- map_df(paths_news, \(x) news(tsg_host, x, title_css, line_css))

Applying this function many times while obeying the crawl delay can be quite time-consuming. Therefore, we define in the targets pipeline (take a look at ‘_targets.R’) that the function is executed exactly once per article. Future runs of the pipeline detect which articles have already been scraped and only scrape newly added ones, which makes them much faster. A sketch of what this could look like follows below.
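A minimal sketch of how this per-article caching could look in ‘_targets.R’, assuming dynamic branching over the article paths. The target names mirror the ones used in this post, but the CSS selector values are placeholders and the actual pipeline in the repository may be organized differently:

library(targets)

# The functions from this post (news_links(), news(), ...) are assumed to be
# loaded here, e.g. via tar_source()
tar_option_set(packages = c(
  "polite", "rvest", "urltools", "purrr",
  "tibble", "dplyr", "stringr", "tidytext"
))

list(
  tar_target(tsg_host, bow("https://www.tsg-fussball.de/")),
  tar_target(paths_news, news_links(tsg_host, "aktuelles", ".more-link")),
  # Placeholder selectors; the real values are defined in the repository
  tar_target(title_css, ".entry-title"),
  tar_target(line_css, "p"),
  # Dynamic branching: one branch per article path. Branches whose inputs are
  # unchanged are skipped on later runs, so only new articles get scraped.
  tar_target(
    df_news,
    news(tsg_host, paths_news, title_css, line_css),
    pattern = map(paths_news)
  )
)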

Sometimes the content is of a purely technical nature. Define a regular expression to find these lines:

tech_regex <- "xml"

We now want to extract the words from the scraped content. Before we do so with the unnest_tokens function from the tidytext package, we exclude lines with purely technical content by searching for the keyword ‘xml’:

words_raw <- function(df_news, tech_regex) {
  df_news |>
    # Drop lines that match the technical keyword
    filter(str_detect(line, tech_regex, negate = TRUE)) |>
    # Split the remaining lines into one row per word
    unnest_tokens(word, line)
}
df_words_raw <- words_raw(df_news, tech_regex)

Before analyzing the content further, exclude words that are not relevant here: German and English stopwords as well as tokens consisting only of digits:

words <- function(df_words_raw) {
  df_words_raw |>
    # Remove German and English stopwords
    anti_join(get_stopwords(language = "de"), by = join_by(word)) |>
    anti_join(get_stopwords(language = "en"), by = join_by(word)) |>
    # Remove tokens that consist only of digits
    filter(str_detect(word, "^\\d+$", negate = TRUE))
}
df_words <- words(df_words_raw)

Analysis

We want to finish the analysis by creating a word cloud of the scraped content.

Define the number of top words:

top_n_words <- 200L

Count all words and keep the 200 most frequent ones:

words_count <- function(df_words, top_n_words) {
  df_words |>
    count(word, sort = TRUE) |>
    top_n(top_n_words, wt = n)
}
df_words_count <- words_count(df_words, top_n_words)

Create word cloud:

vis_word_cloud <- function(df_words_count) {
  df_words_count |>
    ggplot() +
    geom_text_wordcloud_area(aes(label = word, size = n)) +
    scale_size_area(max_size = 50) +
    theme_void()
}
gg_word_cloud <- vis_word_cloud(df_words_count)
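To actually render the word cloud, print the ggplot object; saving it to a file could look like this (file name and dimensions are just an example):

gg_word_cloud
ggsave("word_cloud.png", gg_word_cloud, width = 8, height = 8, dpi = 300)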

And there you go! A complete website scraped in a polite way and displayed as a nice word cloud. Future updates of this analysis are quick, because only new content is scraped; old content is already stored by the pipeline. Happy times! I’m looking forward to further adventures using the techniques introduced in this blog post.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/duju211/rvest_tsg, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

During (2024, March 28). Datannery: Friendly Webscraping. Retrieved from https://www.datannery.com/posts/friendly-webscraping/

BibTeX citation

@misc{during2024friendly,
  author = {During, Julian},
  title = {Datannery: Friendly Webscraping},
  url = {https://www.datannery.com/posts/friendly-webscraping/},
  year = {2024}
}