Countries of the world

Author

Termeh Shafie

In order to complete this assignment you will need a Chrome browser with the Selector Gadget extension installed.

The website https://scrapethissite.com/pages/simple/ lists the names of 250 countries, as well as their flag, capital, population and area in square kilometres. Our goal is to read this information into R for each country so that we can analyse it further.

Before we start, we should load the required packages (we will also need the tidyverse package this time) and read the website with the function read_html() and assign it to an R object.

library(tidyverse)
library(rvest)
library(DT)

page <- read_html("https://scrapethissite.com/pages/simple/")

Country names

Use the Selector Gadget to identify the CSS selectors needed to extract country names.

country <- page %>%
  html_elements(".country-name") %>%
  html_text(trim = TRUE) 

head(country)
[1] "Andorra"              "United Arab Emirates" "Afghanistan"         
[4] "Antigua and Barbuda"  "Anguilla"             "Albania"             
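
To see what the class selector does without depending on the live site, the following minimal sketch runs the same steps on a hand-written HTML fragment. The fragment and the names toy_page and toy_names are assumptions made for illustration; the real page's markup may differ in detail.

```r
library(rvest)

# Hypothetical fragment imitating the page's structure; not copied from the site
toy_page <- minimal_html('
  <div class="country">
    <h3 class="country-name"> Andorra </h3>
    <span class="country-capital">Andorra la Vella</span>
  </div>
  <div class="country">
    <h3 class="country-name"> Albania </h3>
    <span class="country-capital">Tirana</span>
  </div>
')

# ".country-name" selects every element that carries the class country-name;
# trim = TRUE removes the whitespace around the text
toy_names <- toy_page %>%
  html_elements(".country-name") %>%
  html_text(trim = TRUE)

toy_names
```

The result is a character vector with one entry per matched element, in document order.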

Capitals, population and area

Let us now turn to the remaining information for each country. Again use the Selector Gadget to identify the CSS selector needed, which in this case is .country-info:

page %>%
  html_elements(".country-info") %>%
  html_text(trim = TRUE) %>% 
  head(n = 10)
 [1] "Capital: Andorra la VellaPopulation: 84000Area (km2): 468.0"   
 [2] "Capital: Abu DhabiPopulation: 4975593Area (km2): 82880.0"      
 [3] "Capital: KabulPopulation: 29121286Area (km2): 647500.0"        
 [4] "Capital: St. John'sPopulation: 86754Area (km2): 443.0"         
 [5] "Capital: The ValleyPopulation: 13254Area (km2): 102.0"         
 [6] "Capital: TiranaPopulation: 2986952Area (km2): 28748.0"         
 [7] "Capital: YerevanPopulation: 2968000Area (km2): 29800.0"        
 [8] "Capital: LuandaPopulation: 13068161Area (km2): 1246700.0"      
 [9] "Capital: NonePopulation: 0Area (km2): 1.4E7"                   
[10] "Capital: Buenos AiresPopulation: 41343201Area (km2): 2766890.0"

So we get the names of the capitals, but also the population and the area of each country, all run together in a single string. The selector was not specific enough, and we have to tell html_elements() more precisely which of these we are interested in. Each of the three pieces of information has its own CSS selector:

  1. The selector .country-capital gives us the capital of each country:
capital <- page %>%
  html_elements(".country-capital") %>%
  html_text(trim = TRUE) 

head(capital)
[1] "Andorra la Vella" "Abu Dhabi"        "Kabul"            "St. John's"      
[5] "The Valley"       "Tirana"          
  2. The selector .country-population gives us the population of each country:
population <-  page %>%
  html_elements(".country-population") %>%
  html_text() %>% 
  as.numeric()
head(population)
[1]    84000  4975593 29121286    86754    13254  2986952
  3. The selector .country-area gives us the area of each country:
area <-  page %>%
  html_elements(".country-area") %>%
  html_text() %>% 
  as.numeric()
head(area)
[1]    468  82880 647500    443    102  28748

Note that we need to tell R to interpret the “text” read from the HTML code as numbers using the function as.numeric().
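
As a small illustration of this conversion, the sketch below applies as.numeric() to a few strings of the kind the page returns. The vector raw_area is an assumption for illustration: its values are taken from the output above, with extra spaces added to one of them.

```r
# Area values as text, as they arrive from html_text()
# (the padding around the first value is added here for illustration)
raw_area <- c(" 468.0 ", "82880.0", "1.4E7")

# as.numeric() ignores surrounding whitespace and understands
# scientific notation such as 1.4E7
parsed_area <- as.numeric(raw_area)
parsed_area

# Text that is not a number becomes NA, and R issues a warning;
# the warning is suppressed here only to keep the output short
not_a_number <- suppressWarnings(as.numeric("None"))
not_a_number
```

Checking the converted vector for NA values is a quick way to spot entries that were not numbers to begin with.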

Merge into one tibble

We could already continue working with these four vectors, but for many applications it is more practical to combine them into a single table (a tibble) with one row per country and one column per variable:

countries <- tibble(
  country = country,
  capital = capital,
  population = population,
  area = area
)
countries
# A tibble: 250 × 4
   country              capital          population     area
   <chr>                <chr>                 <dbl>    <dbl>
 1 Andorra              Andorra la Vella      84000      468
 2 United Arab Emirates Abu Dhabi           4975593    82880
 3 Afghanistan          Kabul              29121286   647500
 4 Antigua and Barbuda  St. John's            86754      443
 5 Anguilla             The Valley            13254      102
 6 Albania              Tirana              2986952    28748
 7 Armenia              Yerevan             2968000    29800
 8 Angola               Luanda             13068161  1246700
 9 Antarctica           None                      0 14000000
10 Argentina            Buenos Aires       41343201  2766890
# ℹ 240 more rows
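
Once the data are in a tibble, further analysis can follow directly. The sketch below is one possible next step and not part of the assignment: it uses three rows copied from the output above, treats the literal text "None" as a missing capital (an assumption about what the site means by it), and adds a population density column. The names toy_countries, toy_clean and density are made up for illustration.

```r
library(dplyr)

# Three rows copied from the tibble printed above
toy_countries <- tibble(
  country = c("Andorra", "Antarctica", "Argentina"),
  capital = c("Andorra la Vella", "None", "Buenos Aires"),
  population = c(84000, 0, 41343201),
  area = c(468, 14000000, 2766890)
)

toy_clean <- toy_countries %>%
  mutate(
    # Assumption: the text "None" stands for "no capital recorded"
    capital = na_if(capital, "None"),
    # Inhabitants per square kilometre
    density = population / area
  ) %>%
  arrange(desc(density))

toy_clean
```

The same pipeline would apply unchanged to the full countries tibble.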

All in one step

If we are sure that we do not need the individual vectors, we can also perform the reading of the data and the creation of the tibble in a single step. Below you can see how the complete scraping process can be completed in relatively few lines.

page <- "https://scrapethissite.com/pages/simple/" %>%
  read_html()

countries_2 <- tibble(
  Land = page %>%
    html_elements(css = ".country-name") %>% 
    html_text(trim = TRUE),
  capital = page %>% 
    html_elements(css = ".country-capital") %>% 
    html_text(),
  population = page %>% 
    html_elements(css = ".country-population") %>% 
    html_text() %>% 
    as.numeric(),
  area = page %>% 
    html_elements(css = ".country-area") %>% 
    html_text() %>% 
    as.numeric()
)

countries_2
# A tibble: 250 × 4
   Land                 capital          population     area
   <chr>                <chr>                 <dbl>    <dbl>
 1 Andorra              Andorra la Vella      84000      468
 2 United Arab Emirates Abu Dhabi           4975593    82880
 3 Afghanistan          Kabul              29121286   647500
 4 Antigua and Barbuda  St. John's            86754      443
 5 Anguilla             The Valley            13254      102
 6 Albania              Tirana              2986952    28748
 7 Armenia              Yerevan             2968000    29800
 8 Angola               Luanda             13068161  1246700
 9 Antarctica           None                      0 14000000
10 Argentina            Buenos Aires       41343201  2766890
# ℹ 240 more rows
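
Building the tibble from separately scraped vectors relies on every country having every field, so that all vectors have the same length and order. A more defensive variant, sketched below on a hand-written fragment, first selects one container element per country and then uses html_element() (singular) inside each container, which returns exactly one result per container and NA where a field is absent. The fragment, the container class .country and the names toy_page, blocks and toy_rows are assumptions for illustration and should be checked against the real page with the Selector Gadget.

```r
library(rvest)
library(tibble)

# Hypothetical fragment; the second entry deliberately has no capital
toy_page <- minimal_html('
  <div class="country">
    <h3 class="country-name">Andorra</h3>
    <span class="country-capital">Andorra la Vella</span>
    <span class="country-population">84000</span>
  </div>
  <div class="country">
    <h3 class="country-name">Antarctica</h3>
    <span class="country-population">0</span>
  </div>
')

# One node per country (assumed container class)
blocks <- toy_page %>% html_elements(".country")

# html_element() keeps one result per block, so the columns stay aligned
toy_rows <- tibble(
  country = blocks %>% html_element(".country-name") %>% html_text(trim = TRUE),
  capital = blocks %>% html_element(".country-capital") %>% html_text(trim = TRUE),
  population = blocks %>% html_element(".country-population") %>% html_text() %>% as.numeric()
)

toy_rows
```

With html_elements() (plural) on the whole page, the missing capital would simply shorten the capital vector by one, and tibble() would stop with an error about incompatible column lengths.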