library(tidyverse)
library(robotstxt)
library(rvest)
University of Edinburgh Art Collection
The University of Edinburgh Art Collection “supports the world-leading research and teaching that happens within the University. Comprised of an astonishing range of objects and ideas spanning two millennia and a multitude of artistic forms, the collection reflects not only the long and rich trajectory of the University, but also major national and international shifts in art history.”
In this practical we’ll scrape data on all art pieces in the Edinburgh College of Art collection.
Learning goals
- Working with R scripts
- Web scraping from a single page
- Writing functions
- Iteration by mapping functions
- Writing data out
In order to complete this assignment you will need a Chrome browser with the Selector Gadget extension installed.
R scripts vs. Quarto documents
Today you’ll be using both R scripts and Quarto documents:
- Use R scripts in the web scraping stage and ultimately save the scraped data as a csv.
- Use a Quarto document in the analysis stage, where we start off by reading in the csv file we wrote out in the scraping stage.
Packages
We’ll use the tidyverse package for much of the data wrangling and visualisation, the robotstxt package to check if we’re allowed to scrape the data, and the rvest package for data scraping.
Data
This assignment does not come with any prepared datasets. Instead you’ll be scraping the data! But before doing so, let’s check that a bot has permissions to access pages on this domain.
paths_allowed("https://collections.ed.ac.uk/art")
collections.ed.ac.uk
[1] TRUE
Exercises
Scraping a single page
We will start off by scraping data on the first 10 pieces in the collection from here.
First, we define a new object called first_url, which is the link above. Then, we read the page at this url with the read_html() function from the rvest package. The code for this is already provided in 01-scrape-page-one.R.
# set url
first_url <- "https://collections.ed.ac.uk/art/search/*:*/Collection:%22edinburgh+college+of+art%7C%7C%7CEdinburgh+College+of+Art%22?offset=0"

# read html page
page <- read_html(first_url)
For the ten pieces on this page we will extract title, artist, and link information, and put these three variables in a data frame.
Titles
Let’s start with titles. We make use of the SelectorGadget to identify the tags for the relevant nodes:
page %>%
  html_nodes(".iteminfo") %>%
  html_node("h3 a")
{xml_nodeset (10)}
[1] <a href="./record/20696?highlight=*:*">South Frieze of the Parthenon Fri ...
[2] <a href="./record/53701?highlight=*:*">Espresso Cup ...
[3] <a href="./record/99347?highlight=*:*">Untitled - Two Apes Sun Bathing ...
[4] <a href="./record/21212?highlight=*:*">Portrait of a Seated Woman ...
[5] <a href="./record/21289?highlight=*:*">Seated Male Nude ...
[6] <a href="./record/99370?highlight=*:*">Nighttime Scene of the City and R ...
[7] <a href="./record/21178?highlight=*:*">Portrait of Man in Red Jacket ...
[8] <a href="./record/20743?highlight=*:*">Harbour Scene 'KY16' ...
[9] <a href="./record/21568?highlight=*:*">Untitled ...
[10] <a href="./record/102688?highlight=*:*">Machine stitched net ...
Then we extract the text with html_text():
page %>%
  html_nodes(".iteminfo") %>%
  html_node("h3 a") %>%
  html_text()
[1] "South Frieze of the Parthenon Frieze (1836-1837)"
[2] "Espresso Cup "
[3] "Untitled - Two Apes Sun Bathing (1963)"
[4] "Portrait of a Seated Woman (1954)"
[5] "Seated Male Nude (1961)"
[6] "Nighttime Scene of the City and River (1962)"
[7] "Portrait of Man in Red Jacket (1968)"
[8] "Harbour Scene 'KY16' (1964)"
[9] "Untitled (May 1987)"
[10] "Machine stitched net (1946)"
And get rid of all the spurious white space in the text with str_squish(), which reduces repeated whitespace inside a string.
Take a look at the help for str_squish() to find out more about how it works and how it’s different from str_trim().
page %>%
  html_nodes(".iteminfo") %>%
  html_node("h3 a") %>%
  html_text() %>%
  str_squish()
[1] "South Frieze of the Parthenon Frieze (1836-1837)"
[2] "Espresso Cup"
[3] "Untitled - Two Apes Sun Bathing (1963)"
[4] "Portrait of a Seated Woman (1954)"
[5] "Seated Male Nude (1961)"
[6] "Nighttime Scene of the City and River (1962)"
[7] "Portrait of Man in Red Jacket (1968)"
[8] "Harbour Scene 'KY16' (1964)"
[9] "Untitled (May 1987)"
[10] "Machine stitched net (1946)"
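To see the difference between the two functions, here is a small sketch on a made-up string (the string is an assumption for illustration, not scraped data):

```r
library(stringr)

# A made-up title with leading, trailing, and repeated internal spaces
messy <- "  Espresso    Cup  "

# str_trim() removes whitespace at the start and end only
trimmed <- str_trim(messy)

# str_squish() does the same and also collapses internal runs to one space
squished <- str_squish(messy)

trimmed
squished
```

Only the squished version has the single internal space we want in a clean title.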
And finally save the resulting data as a vector of length 10:
titles <- page %>%
  html_nodes(".iteminfo") %>%
  html_node("h3 a") %>%
  html_text() %>%
  str_squish()
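One reason to select the .iteminfo nodes first and then call html_node() inside each of them is that the result keeps one entry per item, with a missing value where an item has no match. The sketch below illustrates this on hypothetical markup built with minimal_html(); the HTML and the object names are assumptions for illustration, not the collection's real page.

```r
library(rvest)
library(stringr)

# Hypothetical markup: three items, the second has no linked title
toy_page <- minimal_html('
  <div class="iteminfo"><h3><a href="./record/1">First   piece (1961)</a></h3></div>
  <div class="iteminfo"><h3>No link here</h3></div>
  <div class="iteminfo"><h3><a href="./record/3">Third piece </a></h3></div>
')

toy_titles <- toy_page %>%
  html_nodes(".iteminfo") %>%  # one node per item
  html_node("h3 a") %>%        # one result per item, missing if there is no match
  html_text() %>%
  str_squish()

toy_titles
```

The vector has length three and its second element is NA, so titles stay aligned with any other per-item vectors. In recent rvest versions html_elements() and html_element() are the preferred names for these two functions.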
Links
The same nodes that contain the text for the titles also contain information on the links to individual art piece pages for each title. We can extract this information using a new function from the rvest package, html_attr(), which extracts attributes.
A mini HTML lesson! The following is how we define hyperlinked text in HTML:
<a href="https://www.google.com">Search on Google</a>
And this is what the text would look like on a webpage: Search on Google.
Here the text is Search on Google and the href attribute contains the url of the website you’d go to if you click on the hyperlinked text: https://www.google.com.
The moral of the story is: the link is stored in the href attribute.
page %>%
  html_nodes(".iteminfo") %>%  # same nodes
  html_node("h3 a") %>%        # as before
  html_attr("href")            # but get href attribute instead of text
[1] "./record/20696?highlight=*:*" "./record/53701?highlight=*:*"
[3] "./record/99347?highlight=*:*" "./record/21212?highlight=*:*"
[5] "./record/21289?highlight=*:*" "./record/99370?highlight=*:*"
[7] "./record/21178?highlight=*:*" "./record/20743?highlight=*:*"
[9] "./record/21568?highlight=*:*" "./record/102688?highlight=*:*"
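The mini HTML lesson can be tried out offline. This is a minimal sketch that parses the Google link from the lesson with minimal_html() and pulls out its text and its href attribute; the object names are made up for illustration.

```r
library(rvest)

# Parse the hyperlink from the mini lesson and select the <a> node
link <- minimal_html('<a href="https://www.google.com">Search on Google</a>') %>%
  html_node("a")

link_text <- html_text(link)          # the visible text
link_href <- html_attr(link, "href")  # the value of the href attribute

link_text
link_href
```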
These don’t really look like URLs as we know them, though. They’re relative links.
See the help for str_replace() to find out how it works. Remember that the first argument is passed in from the pipeline, so you just need to define the pattern and replacement arguments.
- Click on one of the art piece titles in your browser and take note of the url of the webpage it takes you to. How does that url compare to what we scraped above? How is it different? Using str_replace(), fix the URLs. You’ll note something special happening in the pattern to replace. We want to replace the ., but we have it as \\.. This is because the period . is a special character in regular expressions (it matches any character), so we need to escape it first with two backslashes, \\.
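The effect of escaping is easiest to see on a tiny made-up string. The sketch below is not the solution to the exercise; the string and object names are assumptions chosen only to show how an unescaped and an escaped period behave.

```r
library(stringr)

x <- "a.b"  # made-up string

# Unescaped: "." matches any character, so the first character is replaced
unescaped <- str_replace(x, ".", "-")

# Escaped: "\\." matches a literal period
escaped <- str_replace(x, "\\.", "-")

# fixed() is an alternative that treats the pattern as plain text
literal <- str_replace(x, fixed("."), "-")

unescaped
escaped
literal
```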
Artists
- Fill in the blanks to scrape artist names.
Put it all together
- Fill in the blanks to organize everything in a tibble.
Scrape the next page
- Click on the next page, and grab its url. Fill in the blank to define a new object: second_url. Copy-paste code from the top of the R script to scrape the new set of art pieces, and save the resulting data frame as second_ten.
Functions
You’ve been using R functions, now it’s time to write your own!
Let’s start simple. Here is a function that takes in an argument x, and adds 2 to it.
add_two <- function(x){
  x + 2
}
Let’s test it:
add_two(3)
[1] 5
add_two(10)
[1] 12
The skeleton for defining functions in R is as follows:
function_name <- function(input){
  # do something with the input(s)
  # return something
}
Reminder: Function names should be short but evocative verbs.
Then, a function for scraping a page should look something like:
function_name <- function(url){
  # read page at url
  # extract title, link, artist info for n pieces on page
  # return an n x 3 tibble
}
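As a rough illustration of this skeleton, here is a sketch of a function that turns a page into a tibble. It is not the scrape_page() solution: the helper name parse_toy_items(), the .item class, and the HTML are all hypothetical, and the function takes HTML text rather than a url so that it runs without a network connection.

```r
library(rvest)
library(tibble)

# Hypothetical helper: parses HTML text and returns one row per item
parse_toy_items <- function(html) {
  # read the page
  items <- minimal_html(html) %>%
    html_nodes(".item")

  # extract the pieces of information and return them as a tibble;
  # the value of the last expression is what the function returns
  tibble(
    name = items %>% html_node("a") %>% html_text(),
    link = items %>% html_node("a") %>% html_attr("href")
  )
}

toy_html <- '
  <div class="item"><a href="/a">Alpha</a></div>
  <div class="item"><a href="/b">Beta</a></div>
'

toy_items <- parse_toy_items(toy_html)
toy_items
```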
- Fill in the blanks using code you already developed in the previous exercises. Name the function scrape_page.
Test out your new function by running the following in the console. Does the output look right? Discuss with teammates whether you’re getting the same results as before.
scrape_page(first_url)
scrape_page(second_url)
Iteration
We went from manually scraping individual pages to writing a function to do the same. Next, we will work on making our workflow a little more efficient by using R to iterate over all pages that contain information on the art collection.
That means we develop a list of URLs (of pages that each have 10 art pieces), and write some code that applies the scrape_page() function to each page, and combines the resulting data frames from each page into a single data frame with 3289 rows and 3 columns.
List of URLs
Click through the first few of the pages in the art collection and observe their URLs to confirm the following pattern:
[sometext]offset=0 # Pieces 1-10
[sometext]offset=10 # Pieces 11-20
[sometext]offset=20 # Pieces 21-30
[sometext]offset=30 # Pieces 31-40
...
[sometext]offset=3280 # Pieces 3281-3289
We can construct these URLs in R by pasting together two pieces: (1) a common (root) text for the beginning of the URL, and (2) numbers starting at 0, increasing by 10, all the way up to 3280, the offset of the last page. Two new functions are helpful for accomplishing this: glue() for pasting two pieces of text and seq() for generating a sequence of numbers.
- Fill in the blanks to construct the list of URLs.
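A small sketch of how glue() and seq() combine, using a made-up root rather than the collection's real URL (the root text and object names are assumptions for illustration):

```r
library(glue)

# Hypothetical root text; the real root is the part of the URL before the offset value
root <- "https://example.com/search?offset="

# 0, 10, 20, 30, 40
offsets <- seq(from = 0, to = 40, by = 10)

# glue() is vectorised, so this gives one URL per offset
toy_urls <- glue("{root}{offsets}")
toy_urls
```

Note that seq() stops at the last value that does not exceed `to`, so seq(from = 0, to = 45, by = 10) would give the same five numbers.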
Mapping
Finally, we’re ready to iterate over the list of URLs we constructed. We will do this by mapping the function we developed over the list of URLs. There is a series of mapping functions in R and they each take the following form:
map([x], [function to apply to each element of x])
In our case x is the list of URLs we constructed and the function to apply to each element of x is the function we developed earlier, scrape_page. As a result we want a data frame, so we use the map_dfr() function:
map_dfr(urls, scrape_page)
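To see what map_dfr() does without scraping anything, here is a minimal sketch with a toy function that returns a one-row tibble. The function square_row() is made up for illustration. map_dfr() needs the dplyr package to be installed, and list_rbind() assumes purrr 1.0.0 or later.

```r
library(purrr)
library(tibble)

# Toy function: returns a one-row tibble for each input
square_row <- function(i) {
  tibble(id = i, square = i^2)
}

# Apply the function to each element and stack the rows into one data frame
stacked <- map_dfr(1:3, square_row)

# Newer purrr versions recommend map() followed by list_rbind() instead
stacked_alt <- list_rbind(map(1:3, square_row))

stacked
```

Both approaches give a data frame with three rows and two columns, one row per element of the input.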
- Fill in the blanks to scrape all pages, and to create a new data frame called uoe_art.
Write out data
- Finally, write out the data frame you constructed into the data folder so that you can use it in the analysis section.
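Writing a data frame to csv and reading it back can be sketched as follows. The toy data and the temporary file path are assumptions for illustration; in the exercise you would write your own data frame to a file in the data folder instead.

```r
library(readr)
library(tibble)

# Toy data frame standing in for the scraped data
toy <- tibble(
  title  = c("Alpha", "Beta"),
  artist = c("A. Artist", NA)
)

# Write to a temporary location so the sketch does not touch your project
out_file <- file.path(tempdir(), "toy-art.csv")
write_csv(toy, out_file)

# Read it back in, as you would at the start of the analysis stage
toy_back <- read_csv(out_file, show_col_types = FALSE)
toy_back
```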
Analysis
For the rest of the exercises you can work in Quarto/R Markdown.
Now that we have a tidy dataset that we can analyze, let’s do that!
We’ll start with some data cleaning, to clean up the dates that appear at the end of some title text in parentheses. Some of these are years, others are more specific dates, some art pieces have no date information whatsoever, and others have some non-date information in parentheses. This should be interesting to clean up!
First thing we’ll try is to separate the title column into two: one for the actual title and the other for the date if it exists. In human speak, we need to
“separate the title column at the first occurrence of ( and put the contents on one side of the ( into a column called title and the contents on the other side into a column called date”
Luckily, there’s a function that does just this: separate()!
And once we have completed separating the single title column into title and date, we need to do further clean-up in the date column to get rid of extraneous )s with str_remove(), capture year information, and save the data as a numeric variable.
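The steps described above can be sketched on three made-up rows. This is one possible reading of the description, not the exercise's official solution: the toy titles, the fill = "right" argument, and the extra str_squish() call are assumptions added for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy titles: a plain year, no date at all, and a more specific date
toy <- tibble(
  title = c("Seated Male Nude (1961)", "Espresso Cup", "Untitled (May 1987)")
)

toy_clean <- toy %>%
  # split at "(", which must be escaped because it is special in regular expressions;
  # fill = "right" puts NA in date when there is no "(" (without it, separate() warns)
  separate(title, into = c("title", "date"), sep = "\\(", fill = "right") %>%
  mutate(
    title = str_squish(title),                      # drop the trailing space left by the split
    year  = str_remove(date, "\\)") %>% as.numeric() # "May 1987" cannot be converted, so it becomes NA
  )

toy_clean
```

Running this produces a coercion warning for "May 1987", which is the kind of warning the text says is acceptable when the aim is only to capture years where that is convenient.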
- Fill in the blanks to implement the data wrangling we described above. Note that this will result in some warnings when you run the code, and that’s OK! Read the warnings, and explain what they mean, and why we are ok with leaving them in given that our objective is to just capture year where it’s convenient to do so.
- Print out a summary of the data frame using the skim() function from the skimr package. How many pieces have artist info missing? How many have year info missing?
- Make a histogram of years. Use a reasonable binwidth. Do you see anything out of the ordinary?
- Find which piece has the out-of-the-ordinary year and go to its page on the art collection website to find the correct year for it. Can you tell why our code didn’t capture the correct year information? Correct the error in the data frame and visualize the data again.
Hint: You’ll want to use mutate() and if_else() or case_when() to implement the correction.
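A correction of this kind can be sketched with mutate() and if_else() on made-up data. The titles and years below are hypothetical and have nothing to do with the real collection.

```r
library(dplyr)

# Toy data: the year 2 is clearly implausible
toy <- tibble(
  title = c("Alpha", "Beta"),
  year  = c(1961, 2)
)

# Replace the year for the affected piece only; keep all other years as they are
toy_fixed <- toy %>%
  mutate(year = if_else(title == "Beta", 1962, year))

toy_fixed
```

if_else() requires the replacement and the existing column to have compatible types, which is why the replacement here is a number rather than a string.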
- Who is the most commonly featured artist in the collection? Do you know them? Any guess as to why the university has so many pieces from them?
- Final question! How many art pieces have the word “child” in their title? Try to figure it out, and ask for help if you’re stuck.
Hint: str_subset() can be helpful here. You should consider how you might capture titles where the word appears as “child” and “Child”.
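The hint can be tried on a few made-up titles first. The titles below are assumptions for illustration; the sketch shows two ways to ignore case, and how a plain pattern also matches longer words such as “Children”.

```r
library(stringr)

# Made-up titles
toy_titles <- c("Child with Doll", "Portrait of a child", "Harbour Scene", "Children Playing")

# A character class matches either capitalisation
either_case <- str_subset(toy_titles, "[Cc]hild")

# regex() with ignore_case = TRUE is an alternative
ignore_case <- str_subset(toy_titles, regex("child", ignore_case = TRUE))

# Word boundaries (\\b) restrict the match to the whole word
whole_word <- str_subset(toy_titles, "\\b[Cc]hild\\b")

either_case
whole_word
```

Whether titles containing “Children” should count is a judgment call, so it is worth stating which choice you made when reporting the number.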