NYC flights

Author

Termeh Shafie

Packages and Data

For the below exercies we use the tidyverse package for much of the data wrangling and visualisation and the data lives in the nycflights13 package. This is data a dataset of flights departing from New York City (NYC) airports in the year 2013.

library(tidyverse)
library(nycflights13)

The data sets available in the package can be viewed using the following syntax:

You will need one or combinations of these to solve the following exercises. These tables are organized as the figure below shows.

Familiarizing ourselves with the dataset

  1. What variables are included in the flights dataset? How many rows are there?

  2. What variables are included in the airports dataset? How many rows are there?

  3. Which variables are included in the airlines dataset? How many rows are there?

Focusing on Atlanta

  1. Let’s focus on flights from NYC area airports to Atlanta GA (FAA code ATL). Create a new object atlanta that includes only these flights. Hint: use filter()). How many flights to Atlanta were there in 2013?

Seasonality

  1. Is there a difference in the number of flights per month?
    Summarize the number of flights for each month and provide a sorted list with the months with the most flights first. Hint: use group_by() in combination with summarize()).

Use filter()

  1. Find all flights that
  • Had an arrival delay of two or more hours.

  • Flew to Houston (IAH or HOU)

  • Were operated by United, American, or Delta. Hint: In the flights dataset, the column carrier indicates the airline, but it uses two-character carrier codes. You can find the carrier codes for the airlines in the airlines dataset. Since the carrier code dataset only has 16 rows, and the names of the airlines in that dataset are not exactly “United”, “American”, or “Delta”, it is easiest to manually look up their carrier codes in that data.

  • Departed in summer (July, August, and September). Hint: the summer flights are those that departed in months 7 (July), 8 (August), and 9 (September).

  • Arrived more than two hours late, but didn’t leave late. Hint: Flights that arrived more than two hours late, but didn’t leave late will have an arrival delay of more than 120 minutes (arr_delay > 120) and a non-positive departure delay (dep_delay <=0)

  • Were delayed by at least an hour, but made up over 30 minutes in flight. Hint: If a flight was delayed by at least an hour, then dep_delay >= 60. If the flight didn’t make up any time in the air, then its arrival would be delayed by the same amount as its departure, meaning dep_delay == arr_delay, or alternatively, dep_delay - arr_delay == 0. If it makes up over 30 minutes in the air, then the arrival delay must be at least 30 minutes less than the departure delay, which is stated as dep_delay - arr_delay > 30.

  • Departed between midnight and 6 am (inclusive). Hint: In dep_time, midnight is represented by 2400, not 0. You can verify this by checking the minimum and maximum of dep_time.

Arrange rows with arrange()

  1. How could you use arrange() to sort all missing values to the start? Hint: use is.na()) and add an indicator of whether the column has a missing value, the flights will first be sorted by desc(is.na(dep_time)). Since desc(is.na(dep_time)) is either TRUE when dep_time is missing, or FALSE, when it is not, the rows with missing values of dep_time will come first, since TRUE > FALSE.

  2. Sort flights to find the most delayed flights. Find the flights that left earliest.

  3. Sort flights to find the fastest flights.

Seelct variables with select()

  1. What does the one_of() function do? Why might it be helpful in conjunction with this vector?

Add new variables with mutate()

  1. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

  2. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

  3. Look at the number of cancelled flights per day. Is there a pattern? Create a plot to visualize your answers.

  4. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

  5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag() to explore how the delay of a flight is related to the delay of the immediately preceding flight. Use a plot to visualize this.

More Viz

  1. Visualize the distribution of on time departure rate across the three airports using a segmented bar plot. Hint: Remove NA’s and suppose that a flight that is delayed for less than 5 minutes is basically “on time”.

Advanced Exercises:

  1. Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables from the package you loaded would you need to combine?

  2. This plots the approximate flight paths of the first 100 flights in the flights dataset. Try reproducing it. Hint: you can create a layer of map borders using borders(state).

  1. We know that some days of the year are “special”, and fewer people than usual fly on them. Since it is US data for 2013 we will consider: New Years Day, Independence Day, Thanksgiving Day, Christmas Day.

How might you represent that data as a data frame? What would be the primary keys of that table? How would it connect to the existing tables?

We can add a table of special dates, similar to the following table.

special_days <- tribble(
  ~year, ~month, ~day, ~holiday,
  2013, 01, 01, "New Years Day",
  2013, 07, 04, "Independence Day",
  2013, 11, 29, "Thanksgiving Day",
  2013, 12, 25, "Christmas Day"
)

The primary key of the table would be the (year, month, day) columns. The (year, month, day) columns could be used to join special_days with other tables.

  1. Create a visualization fo your own to illustrate if indeed fewer people than usual fly on the above special days.

  2. Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States (can you understand why we choose semi-join?):

airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  ggplot(aes(lon, lat)) +
  borders("state") +
  geom_point() +
  coord_quickmap() + 
  theme_void()

Hint: You might want to use the size or color of the points to display the average delay for each airport.

  1. What weather conditions make it more likely to see a delay? Use the variable precip (precipitation) from the weather dataset to answer this.

  2. What happened on June 13, 2013? Reproduce the following plot which displays the spatial pattern of delays, and then use Google to cross-reference with the weather. Hint: use library(viridis) to get the same colors.