Countries of the world
In order to complete this assignment you will need a Chrome browser with the Selector Gadget extension installed.
This website lists the names of 250 countries, as well as their flag, capital, population and area in square kilometres. Our goal is to read this information into R for each country so that we can analyse it further.
Before we start, we load the required packages (this time we also need the tidyverse package), read the website with the function read_html() and assign the result to an R object.
library(tidyverse)
library(rvest)
library(DT)
page <- read_html("https://scrapethissite.com/pages/simple/")
Country names
Use the Selector Gadget to identify the CSS selectors needed to extract country names.
country <- page %>%
  html_elements(".country-name") %>%
  html_text(trim = TRUE)
head(country)
[1] "Andorra" "United Arab Emirates" "Afghanistan"
[4] "Antigua and Barbuda" "Anguilla" "Albania"
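To see what the class selector does without depending on the live site, the same calls can be tried on a small inline document. The following is a minimal sketch: the HTML fragment and the object names demo_page and demo_names are made up for illustration and are not the actual page source.

```r
library(rvest)

# Hypothetical HTML fragment (an assumption, not the real page markup)
demo_page <- minimal_html('
  <h3 class="country-name">
    Andorra
  </h3>
  <h3 class="country-name">
    Albania
  </h3>
')

# ".country-name" matches every element that carries the class country-name
demo_names <- demo_page %>%
  html_elements(".country-name") %>%
  html_text(trim = TRUE)  # trim = TRUE strips the surrounding whitespace

demo_names
```

Without trim = TRUE the line breaks and indentation around each name would remain part of the extracted text.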
Capitals, population and area
Let us now turn to the additional information for each country. Again, use the Selector Gadget to identify the required CSS selector, which in this case is .country-info:
page %>%
  html_elements(".country-info") %>%
  html_text(trim = TRUE) %>%
  head(n = 10)
[1] "Capital: Andorra la VellaPopulation: 84000Area (km2): 468.0"
[2] "Capital: Abu DhabiPopulation: 4975593Area (km2): 82880.0"
[3] "Capital: KabulPopulation: 29121286Area (km2): 647500.0"
[4] "Capital: St. John'sPopulation: 86754Area (km2): 443.0"
[5] "Capital: The ValleyPopulation: 13254Area (km2): 102.0"
[6] "Capital: TiranaPopulation: 2986952Area (km2): 28748.0"
[7] "Capital: YerevanPopulation: 2968000Area (km2): 29800.0"
[8] "Capital: LuandaPopulation: 13068161Area (km2): 1246700.0"
[9] "Capital: NonePopulation: 0Area (km2): 1.4E7"
[10] "Capital: Buenos AiresPopulation: 41343201Area (km2): 2766890.0"
This returns the names of the capitals, but also the population and the area of each country, all fused into a single string. The selector was not specific enough, so we have to tell html_elements() more precisely which of these we are interested in. Each of the three pieces of information has its own CSS selector:
- The selector .country-capital gives us the capital of each country:
capital <- page %>%
  html_elements(".country-capital") %>%
  html_text(trim = TRUE)
head(capital)
[1] "Andorra la Vella" "Abu Dhabi" "Kabul" "St. John's"
[5] "The Valley" "Tirana"
- The selector .country-population gives us the population of each country:
population <- page %>%
  html_elements(".country-population") %>%
  html_text() %>%
  as.numeric()
head(population)
[1] 84000 4975593 29121286 86754 13254 2986952
- The selector .country-area gives us the area of each country:
area <- page %>%
  html_elements(".country-area") %>%
  html_text() %>%
  as.numeric()
head(area)
[1] 468 82880 647500 443 102 28748
Note that we need to tell R to interpret the text read from the HTML code as numbers using the function as.numeric().
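The difference between the broad and the specific selectors, and the effect of as.numeric(), can be reproduced on a single made-up entry. This is only a sketch: the markup below is an assumed approximation of one country block, and demo_page, info_text and area_text are illustrative names.

```r
library(rvest)

# Hypothetical fragment mimicking one country entry (assumed markup)
demo_page <- minimal_html('
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">None</span><br>
    <strong>Population:</strong> <span class="country-population">0</span><br>
    <strong>Area (km2):</strong> <span class="country-area">1.4E7</span>
  </div>
')

# The parent selector returns the text of all its children as one string
info_text <- demo_page %>%
  html_elements(".country-info") %>%
  html_text(trim = TRUE)

# The child selector returns only the area, still as a character value
area_text <- demo_page %>%
  html_elements(".country-area") %>%
  html_text()

class(area_text)       # "character"
as.numeric(area_text)  # "1.4E7" is parsed as the number 14000000
```

as.numeric() understands scientific notation, which is why the area of Antarctica appears as an ordinary number after the conversion.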
Merge into one tibble
We could already continue working with these vectors, but for many applications it is more practical to combine the data into a single table with one row per country:
countries <- tibble(
  country = country,
  capital = capital,
  population = population,
  area = area
)
countries
# A tibble: 250 × 4
   country              capital          population     area
   <chr>                <chr>                 <dbl>    <dbl>
 1 Andorra              Andorra la Vella      84000      468
 2 United Arab Emirates Abu Dhabi           4975593    82880
 3 Afghanistan          Kabul              29121286   647500
 4 Antigua and Barbuda  St. John's            86754      443
 5 Anguilla             The Valley            13254      102
 6 Albania              Tirana              2986952    28748
 7 Armenia              Yerevan             2968000    29800
 8 Angola               Luanda             13068161  1246700
 9 Antarctica           None                      0 14000000
10 Argentina            Buenos Aires       41343201  2766890
# ℹ 240 more rows
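Combining the vectors this way only gives a meaningful table if they all have the same length and the same order. A quick check before building the tibble might look like the following sketch. The demo_ vectors are small stand-ins that reuse the first three rows of the output above, and lengths_ok is an illustrative helper name.

```r
library(tibble)

# Toy vectors standing in for the scraped results (first three rows from above)
demo_country    <- c("Andorra", "United Arab Emirates", "Afghanistan")
demo_capital    <- c("Andorra la Vella", "Abu Dhabi", "Kabul")
demo_population <- c(84000, 4975593, 29121286)
demo_area       <- c(468, 82880, 647500)

# All vectors should have the same length, otherwise rows would not line up
vector_lengths <- lengths(list(demo_country, demo_capital, demo_population, demo_area))
lengths_ok <- length(unique(vector_lengths)) == 1
stopifnot(lengths_ok)

countries_demo <- tibble(
  country    = demo_country,
  capital    = demo_capital,
  population = demo_population,
  area       = demo_area
)
countries_demo
```

tibble() itself stops with an error when columns of different lengths (other than length 1) are supplied, so the explicit check mainly serves to make the assumption visible.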
All in one step
If we are sure that we do not need the individual vectors, we can also read the data and create the tibble in a single step. Below you can see how the complete scraping process fits into relatively few lines.
page <- "https://scrapethissite.com/pages/simple/" %>%
  read_html()

countries_2 <- tibble(
  Land = page %>%
    html_elements(css = ".country-name") %>%
    html_text(trim = TRUE),
  capital = page %>%
    html_elements(css = ".country-capital") %>%
    html_text(),
  population = page %>%
    html_elements(css = ".country-population") %>%
    html_text() %>%
    as.numeric(),
  area = page %>%
    html_elements(css = ".country-area") %>%
    html_text() %>%
    as.numeric()
)
countries_2
# A tibble: 250 × 4
   Land                 capital          population     area
   <chr>                <chr>                 <dbl>    <dbl>
 1 Andorra              Andorra la Vella      84000      468
 2 United Arab Emirates Abu Dhabi           4975593    82880
 3 Afghanistan          Kabul              29121286   647500
 4 Antigua and Barbuda  St. John's            86754      443
 5 Anguilla             The Valley            13254      102
 6 Albania              Tirana              2986952    28748
 7 Armenia              Yerevan             2968000    29800
 8 Angola               Luanda             13068161  1246700
 9 Antarctica           None                      0 14000000
10 Argentina            Buenos Aires       41343201  2766890
# ℹ 240 more rows
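Building each column from a separate page-wide query relies on every country having exactly one match per selector. A possible alternative, which is not the approach used above, is to select one container per country first and then call html_element() (singular) inside it. html_element() returns exactly one result per container and NA where nothing matches. The sketch below uses made-up markup, and the .country container class is an assumption for illustration.

```r
library(rvest)
library(tibble)

# Hypothetical markup; the second entry deliberately lacks a capital
demo_page <- minimal_html('
  <div class="country">
    <h3 class="country-name">Andorra</h3>
    <span class="country-capital">Andorra la Vella</span>
    <span class="country-population">84000</span>
  </div>
  <div class="country">
    <h3 class="country-name">Antarctica</h3>
    <span class="country-population">0</span>
  </div>
')

# One node per country entry (assumed container class)
entries <- demo_page %>% html_elements(".country")

# html_element() keeps one value per entry, so the columns stay aligned
countries_demo <- tibble(
  country    = entries %>% html_element(".country-name") %>% html_text(trim = TRUE),
  capital    = entries %>% html_element(".country-capital") %>% html_text(trim = TRUE),
  population = entries %>% html_element(".country-population") %>% html_text() %>% as.numeric()
)
countries_demo
```

With page-wide html_elements() calls the missing capital would shorten the capital vector by one element, and tibble() would refuse to combine the columns.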