Earlier this month, the City of Chicago became the first major American city to make rideshare data public. Following a number of recent controversies over user anonymity and privacy in publicly released location data, the City performed extensive de-identification before making the datasets public.
With datasets for trips, drivers, and vehicles all available, I thought it would be fun to play around and see what we might be able to find in the data! Note that only November and December 2018 are in the scope of this data, so all activity below reflects those two months.
Considerations with the Data
As part of the de-identification process, the City altered some details of the data that one should be aware of when working with it:
- Trip Data
- Census Tracts are suppressed in some cases, and times are rounded to the nearest 15 minutes
- Fares are rounded to the nearest $2.50 and tips are rounded to the nearest $1.00
- Inclusion of a vehicle in a monthly report indicates that the vehicle was eligible for trips in Chicago in that month for at least one day, regardless of whether it actually provided any rides
- If a vehicle is eligible in multiple months, which is common, it will have records in each of these reporting months
- With no distinct identifier for drivers (understandable given the privacy concerns), it is impossible to trace drivers month over month or to produce any cohorted or per-driver statistics
- Data for June, 2018 is unreliable due to Q2 2018 data for the prior months being batched into that month
- Similarly to the drivers dataset, lack of a unique vehicle ID restricts our ability to trace some month-over-month activity
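To get a feel for what the rounding in the Trip Data bullets does to individual records, here is a minimal sketch in R. The City does not spell out its exact rounding procedure, so the helper `round_to()` and the sample values are my own illustration, not the City's method.

```r
# Hypothetical helper: round x to the nearest multiple of `step`.
# (Illustrative only; the City's exact procedure is not documented here.)
round_to <- function(x, step) round(x / step) * step

# Made-up fares and tips, rounded to the nearest $2.50 and $1.00
fare <- c(13.40, 7.10)
tip  <- c(2.30, 0.40)
rounded_fare <- round_to(fare, 2.50)
rounded_tip  <- round_to(tip, 1.00)

# A made-up pickup time, rounded to the nearest 15 minutes (900 seconds)
pickup <- as.POSIXct("2018-11-15 17:38:00", tz = "UTC")
rounded_pickup <- as.POSIXct(round_to(as.numeric(pickup), 15 * 60),
                             origin = "1970-01-01", tz = "UTC")

print(rounded_fare)
print(rounded_tip)
print(format(rounded_pickup, "%H:%M"))
```

The practical upshot is that averages computed from the public data are averages of rounded values, so small differences between neighborhoods should be read with some caution.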
Totals and Averages
Right off the bat, I took a peek at some summary statistics in the data. I found the figures for % of Trips Shared and Avg. Trip Total most interesting: the share of shared trips was lower than I would have expected based on anecdotal experience with UberPool and Lyft Line, and the average trip total was higher than expected.
| Total Trips | Avg. Duration | Avg. Length |
|---|---|---|
| 34,864,022 | 17.8 minutes | 5.9 miles |
|Avg. Fare||Avg. Tip||Avg. Fees||Avg. Trip Total|
|% of Trips Shared||Riders per Shared Trip|
Max Trip Statistics
It’s always fun to take a peek at outlier data and wonder what the hell was happening when someone took a $1,400 Uber, or the kinds of characters you might see in an UberPool that picks up 11 more passengers before completing your ride…
| Max Duration | Max Length | Max Trips Pooled |
|---|---|---|
| 1,329 minutes | 389.9 miles | 12 |
|Max Fare||Max Tip||Max Fees||Max Trip Total|
Now, getting to the meat of the matter - I’ve put together a number of interactive maps and static charts to display some of the more interesting slices of this dataset. All maps were built using Leaflet and charts using ggplot2. Take a peek!
Where are People Coming From?
At first glance, I wasn’t quite sure what the large yellow patch towards the top left of the map was. A quick Google search revealed that this is where O’Hare Airport sits, and then everything made a lot more sense! It is also interesting to note that the North Side of the City has significantly higher trip volume than the South Side.
Where are they Going?
Turns out, pretty much the same places they are coming from! Surprisingly little change between these two maps - the Loop and North Side are the hottest spots in the city.
Trip Lengths by Start Location
Average trip lengths appear to increase pretty universally as one moves outside of Downtown - indicating that most rideshare trips are going into the city. Short average trip lengths in River North, Lake View, Logan Square, and more indicate that riders in those neighborhoods are staying relatively close by.
While the South Side may not have as many rides in total as the North Side or the rest of the city, it does show a noticeable trend. The percentage of trips that are eligible to be shared (i.e., UberPool or Lyft Line) is around 50% in much of this part of the city. The Loop, by contrast, shows about 20% of trips eligible to be shared, against a citywide average of 26.7%.
Most Common Trips by Pickup and Dropoff
I wanted to include this map (even though it’s a little ugly) because it shows just how concentrated ridesharing trips are in a few neighborhoods. Traffic to and from areas in and surrounding the Loop and to and from O’Hare are by far the most frequent kinds of trips.
Trips by Day
Ridesharing activity drops significantly around Christmastime, where the effect is much stronger than what we see with Thanksgiving. I suspect most of the effects here are just as strong on the supply-side as on the demand-side - I would imagine most drivers just don’t want to leave their families and hit the road on a day like Christmas.
Trips by Day of Week
|Avg. Monthly Trips||% Drive for Multiple Providers|
Driver Home Zip Codes
This is perhaps my favorite chart from this dataset. It shows how the driver population for City of Chicago rideshare providers is spread quite far outside of the city, with many drivers hailing from Indiana, Central Illinois, or even Wisconsin. This spread is much broader than what we observed in the ridership and trip data.
I suspect that this “spread” has two main sources:
- Individuals living within the City of Chicago proper are less likely to own an automobile and more likely to rely on Chicago’s extensive public transit to get around. With less automobile ownership in the City itself, proportionally more car ownership is concentrated in surrounding areas
- Drivers for Uber/Lyft/other rideshare companies are unlikely to be particularly high income. This is a broad generalization, but it is probably fair to say that the areas surrounding Chicago are typically cheaper than areas close to downtown, and therefore more affordable to those who drive for rideshare companies. As a result, rideshare drivers have naturally drifted outside of the City proper into areas where they can more easily afford to live, which can sometimes be quite far from downtown itself
No crazy maps here, but there is a ton of vehicle data available for us to slice by make, model, color, and more. I thought it might be most interesting to look at car models and colors - if you’ve ever taken a Lyft in a major city before, I’m sure you won’t be surprised to see that a black Toyota Camry is the single most popular vehicle in the Chicago dataset.
Building the Visualizations
I wanted to leave a few code snippets that might be helpful to others looking to put together Census Tract or ZIP code based maps - I found a number of extremely helpful online resources as I was putting together these maps and charts and wanted to share some of the best!
Obtaining Census Data
Download Census Data with Tidycensus
Tidycensus is an amazing package for R made by Kyle Walker which allows users to easily and quickly retrieve and manipulate data from the decennial Census and American Community Survey. Using Tidycensus, we can retrieve census tract geography data, which we can later combine with our rideshare dataset to create maps at the tract level.
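    library(tidycensus)
    library(tidyverse)

    # Cache downloaded shapefiles so repeated runs are faster
    options(tigris_use_cache = TRUE)

    # Store your Census API key for this and future sessions
    census_api_key("YOUR_KEY_HERE", install = TRUE)

    # Tract-level data for Cook County, IL. B19013_001 (median household
    # income) is just a convenient variable; the geometry is what we need.
    cookcnty <- get_acs(state = "IL",
                        county = "Cook",
                        geography = "tract",
                        variables = "B19013_001",
                        geometry = TRUE)

    head(cookcnty)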
Set up R to Query AWS Athena
Given how large some of the files in the dataset are (trips is over 4 GB), I chose to query the data with AWS Athena over an S3 bucket (after using a Glue crawler to build a table schema to query). Any database tool supporting JDBC can query Athena - I chose to use the tools below to make my queries in R so that I could easily work with the output dataframes.
    library(AWR.Athena)    # provides the Athena() DBI driver used below
    library(aws.signature) # provides use_credentials()
    library(DBI)

    # Log in with saved AWS credentials
    use_credentials()

    # Set up the database connection to Athena
    con <- dbConnect(AWR.Athena::Athena(),
                     region = "us-east-1",
                     S3OutputLocation = "s3://YOUR_BUCKET_LOCATION/",
                     Schema = "chicago-rideshare")

    # List the tables the Glue crawler created
    dbListTables(con)
Get Trip Metrics by Start Tract from Athena
    # One row per pickup tract: trip counts, averages, and shares
    start_tracts <- dbGetQuery(con, "
      SELECT pickup_census_tract,
             COUNT(*) AS trips,
             AVG(trip_miles) AS avg_miles,
             AVG(tip) AS avg_tip,
             CAST(SUM(CASE WHEN tip > 0 THEN 1 ELSE 0 END) AS double) / COUNT(*) AS pct_rides_w_tip,
             AVG(additional_charges) AS avg_additional_charges,
             AVG(trip_total) AS avg_trip_total,
             CAST(SUM(CASE WHEN shared_trip_authorized = TRUE THEN 1 ELSE 0 END) AS double) / COUNT(*) AS pct_trips_shared,
             AVG(trips_pooled) AS avg_trips_pooled
      FROM trips_parquet
      GROUP BY 1
      ORDER BY 1
    ")

    start_tracts
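If you would rather prototype the aggregation locally before pointing it at Athena, the same per-tract metrics can be sketched with dplyr on a small data frame. The toy rows below are made up for illustration (the tract IDs and values are not from the real dataset), and only a subset of the query's columns is reproduced.

```r
library(dplyr)

# Made-up trip rows standing in for the real trips table
trip_rows <- data.frame(
  pickup_census_tract    = c("17031081500", "17031081500", "17031081500", "17031320100"),
  trip_miles             = c(1.0, 2.0, 3.0, 10.0),
  tip                    = c(0, 2, 0, 3),
  shared_trip_authorized = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors       = FALSE
)

# mean() of a logical vector is the share of TRUE values, which plays the
# role of the SUM(CASE WHEN ...) / COUNT(*) pattern in the SQL
by_tract <- trip_rows %>%
  group_by(pickup_census_tract) %>%
  summarise(
    trips            = n(),
    avg_miles        = mean(trip_miles),
    avg_tip          = mean(tip),
    pct_rides_w_tip  = mean(tip > 0),
    pct_trips_shared = mean(shared_trip_authorized),
    .groups = "drop"
  )

print(by_tract)
```

This is only practical on a sample; for the full 4 GB trips file, pushing the aggregation down to Athena as above is the more sensible route.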
To Map, Combine Results with Census Data
    merge(x = cookcnty,
          y = start_tracts,
          by.x = "GEOID",
          by.y = "pickup_census_tract")
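One thing worth checking with a merge like this is whether the join keys line up. The sketch below uses plain data frames with made-up tract IDs to show two things: converting a numeric tract ID to text safely, and using `all.x = TRUE` to keep tracts with no trips. It assumes the tract column came back from the database as a number, which may or may not be the case in your setup.

```r
# Made-up tract geography table: GEOID is character, as tidycensus returns it
tract_shapes <- data.frame(
  GEOID = c("17031081500", "17031320100", "17031980000"),
  stringsAsFactors = FALSE
)

# Made-up query result where the tract ID arrived as a number (an assumption)
tract_trips <- data.frame(
  pickup_census_tract = c(17031081500, 17031320100),
  trips = c(120, 45)
)

# sprintf("%.0f") avoids scientific notation, which as.character() can
# produce for large round numbers
tract_trips$pickup_census_tract <- sprintf("%.0f", tract_trips$pickup_census_tract)

# Default merge is an inner join: tracts without trips are dropped
inner <- merge(tract_shapes, tract_trips,
               by.x = "GEOID", by.y = "pickup_census_tract")

# all.x = TRUE keeps every tract; tracts with no trips get NA
left <- merge(tract_shapes, tract_trips,
              by.x = "GEOID", by.y = "pickup_census_tract", all.x = TRUE)

print(nrow(inner))
print(left)
```

Whether to keep the empty tracts is a design choice: dropping them leaves holes in the map, while keeping them means handling `NA` values in the color palette.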
Create Maps in Leaflet
    library(leaflet)
    library(stringr)
    library(sf)
    library(tidyverse)

    # Decile color palette based on trip counts. The trips column lives in
    # start_tracts, not in cookcnty.
    pal_trips <- colorQuantile(palette = "viridis",
                               domain = start_tracts$trips,
                               n = 10)

    map_start_tracts <- merge(x = cookcnty,
                              y = start_tracts,
                              by.x = "GEOID",
                              by.y = "pickup_census_tract") %>%
      # Leaflet expects WGS84 longitude/latitude
      st_transform(crs = 4326) %>%
      leaflet() %>%
      addProviderTiles(provider = "CartoDB.Positron", group = "Light Map") %>%
      setView(lat = 41.8781, lng = -87.7, zoom = 10) %>%
      addPolygons(label = ~ paste(format(trips, big.mark = ","), "Trips Started"),
                  stroke = TRUE,
                  highlightOptions = highlightOptions(color = "white",
                                                      weight = 5,
                                                      bringToFront = TRUE),
                  weight = 1,
                  smoothFactor = 0,
                  fillOpacity = 0.7,
                  color = ~ pal_trips(trips))

    map_start_tracts
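A quick note on why a quantile palette suits this data: trip counts are heavily skewed, with a handful of tracts around the Loop and O’Hare dwarfing everything else. The base R sketch below uses made-up trip counts to compare decile bins with equal-width bins. It is meant to approximate the idea behind `colorQuantile(n = 10)`, not to reproduce Leaflet's internals exactly.

```r
# Made-up, heavily skewed trip counts for ten tracts
trip_counts <- c(5, 12, 40, 90, 150, 400, 1200, 5000, 20000, 90000)

# Decile bins: breakpoints sit at the 0%, 10%, ..., 100% quantiles
decile_breaks <- quantile(trip_counts, probs = seq(0, 1, length.out = 11))
decile_bin <- cut(trip_counts, breaks = decile_breaks,
                  include.lowest = TRUE, labels = FALSE)

# Equal-width bins: ten intervals of the same width across the range
equal_width_bin <- cut(trip_counts, breaks = 10, labels = FALSE)

print(decile_bin)       # each tract lands in its own bin
print(equal_width_bin)  # most tracts pile into the first bin
```

With equal-width bins nearly the whole map would be a single color, which is why the quantile version makes the differences between lower-volume tracts visible.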