Iteratively Generating Data Visualization Code with ChatGPT

On November 30th, 2022, OpenAI released ChatGPT for public use - changing, with the press of a button, the lives of high school students forever 😂

While the OpenAI blog post explains the mechanics of the model better than I ever could, the important thing for readers here to understand is that ChatGPT essentially functions as a more knowledgeable Google - one you can make written requests to and receive back fully fleshed-out answers, which can then be followed up on conversationally. To that point - Google has allegedly held internal meetings describing ChatGPT as a “code red” for their search business, with executives concerned that the interactivity and simplicity of ChatGPT could replace many of the queries (and therefore, ad impressions) currently run on Google Search.

If you look at some of the silly examples I’ve embedded below, you can start to imagine how versatile the model is - I highly encourage everyone to make an account and mess around if you haven’t done so yet!

ChatGPT - Skiing Haiku

ChatGPT - Skiing Haiku 2

ChatGPT - Prime Rib Recipe

Data Visualization Application

While ChatGPT excels at conversation and generating natural language responses to prompts, it has another incredible tool in its kit: ChatGPT can code.

Given that this is predominantly a data visualization blog, I thought it would be really fun to test out ChatGPT’s code generation capabilities - focusing on three primary questions:

  • Can ChatGPT write ggplot2 code which seems to capture the semantic meaning requested in the prompt?
  • Does the generated code actually run and build the charts requested?
  • Can ChatGPT translate code between R with ggplot2 and Python with Seaborn?

Take a look below as I step through some examples, focused on answering the questions above - I’ve tried to capture my exact prompt, ChatGPT’s reply, and then the materialized chart for each!

Example 1: Simple Line Chart

Prompt

“Create a line chart with two data series trended over time, using ggplot2”

Response

To create a line chart with two data series trended over time using ggplot2, you can use the following code:

library(ggplot2)

# Create a data frame with the two data series
df <- data.frame(
  time = 1:10,
  series1 = c(1, 2, 3, 4, 5, 4, 3, 2, 1, 0),
  series2 = c(0, 1, 2, 3, 4, 5, 4, 3, 2, 1)
)

# Create the line chart
ggplot(df, aes(x = time)) +
  geom_line(aes(y = series1, color = "Series 1")) +
  geom_line(aes(y = series2, color = "Series 2")) +
  labs(title = "Line Chart with Two Data Series",
       x = "Time",
       y = "Value",
       color = "Series")

This code creates a data frame df with two data series series1 and series2, which have values for 10 time points. Then, it uses ggplot2 to create a line chart with time on the x-axis, and series1 and series2 on the y-axis. The lines for the two data series are colored differently to make them easily distinguishable. The chart also has labels for the title, x-axis, y-axis, and color legend.

Visual

ChatGPT ggplot Prompt 1

Example 2: Formatted Line Chart

Prompt

“Can you add a subtitle, make the title font more noticeable, and add dots on each point on the line?”

Response

Yes, you can add a subtitle, make the title font more noticeable, and add dots on each point on the line in the ggplot2 line chart by modifying the code as follows:

library(ggplot2)

# Create a data frame with the two data series
df <- data.frame(
  time = 1:10,
  series1 = c(1, 2, 3, 4, 5, 4, 3, 2, 1, 0),
  series2 = c(0, 1, 2, 3, 4, 5, 4, 3, 2, 1)
)

# Create the line chart
ggplot(df, aes(x = time)) +
  geom_line(aes(y = series1, color = "Series 1")) +
  geom_line(aes(y = series2, color = "Series 2")) +
  geom_point(aes(y = series1), size = 3) +
  geom_point(aes(y = series2), size = 3) +
  labs(title = "Line Chart with Two Data Series",
       subtitle = "Subtitle",
       x = "Time",
       y = "Value",
       color = "Series") +
  theme(plot.title = element_text(size = 20, face = "bold"))

In this modified code, the ggplot() function is used to create the line chart, and geom_point() is used to add dots on each point on the two data series. The labs() function is used to add a subtitle to the chart, and the theme() function is used to make the title font more noticeable by increasing its size and setting it to bold.

Visual

ChatGPT ggplot Prompt 2

Example 3: Formatted like The Economist

Prompt

“Can you format the chart in the same style as the Economist magazine?”

Response

library(ggplot2)

# Create a data frame with the two data series
df <- data.frame(
  time = 1:10,
  series1 = c(1, 2, 3, 4, 5, 4, 3, 2, 1, 0),
  series2 = c(0, 1, 2, 3, 4, 5, 4, 3, 2, 1)
)

# Create the line chart
ggplot(df, aes(x = time)) +
  geom_line(aes(y = series1, color = "Series 1")) +
  geom_line(aes(y = series2, color = "Series 2")) +
  geom_point(aes(y = series1), size = 3) +
  geom_point(aes(y = series2), size = 3) +
  labs(title = "Line Chart with Two Data Series",
       subtitle = "Subtitle",
       x = "Time",
       y = "Value",
       color = "Series") +
  theme_economist()

In this modified code, the theme_economist() function is used to apply the Economist magazine style to the line chart. This function is part of the ggthemes package, so you will need to install and load the package before using it. You can do this by running the following code:

install.packages("ggthemes")
library(ggthemes)

After installing and loading the ggthemes package, you can use the theme_economist() function to apply the Economist magazine style to your line chart.

Visual

ChatGPT ggplot Prompt 3

Example 4: Translating Line Chart Code from R Using ggplot2 to Python Using Seaborn

Prompt

“Here is code to create a chart using ggplot2

library(ggplot2)

# Create a data frame with the two data series
df <- data.frame(
  time = 1:10,
  series1 = c(1, 2, 3, 4, 5, 4, 3, 2, 1, 0),
  series2 = c(0, 1, 2, 3, 4, 5, 4, 3, 2, 1)
)

# Create the line chart
ggplot(df, aes(x = time)) +
  geom_line(aes(y = series1, color = "Series 1")) +
  geom_line(aes(y = series2, color = "Series 2")) +
  geom_point(aes(y = series1), size = 3) +
  geom_point(aes(y = series2), size = 3) +
  labs(title = "Line Chart with Two Data Series",
       subtitle = "Subtitle",
       x = "Time",
       y = "Value",
       color = "Series") +
  theme(plot.title = element_text(size = 20, face = "bold"))

Can you help me make the same chart, but in Python, using Seaborn?”

Response

Yes, you can create a line chart with two data series using Seaborn in Python with the following code:

import seaborn as sns
import matplotlib.pyplot as plt

# Create data frame with two data series
data = {
  'time': list(range(1, 11)),
  'series1': [1, 2, 3, 4, 5, 4, 3, 2, 1, 0],
  'series2': [0, 1, 2, 3, 4, 5, 4, 3, 2, 1]
}
df = pd.DataFrame(data)

# Create line chart
sns.lineplot(x = 'time', y = 'series1', data = df, color = 'blue')
sns.lineplot(x = 'time', y = 'series2', data = df, color = 'red')

# Add title and subtitle
plt.title("Line Chart with Two Data Series", fontsize = 20, fontweight = 'bold')
plt.suptitle("Subtitle", fontsize = 12)

# Add axis labels
plt.xlabel("Time")
plt.ylabel("Value")

# Show plot
plt.show()

This code creates a line chart with two data series, series1 and series2, trended over time. The sns.lineplot() function is used to create the line chart and specify the variables to use for the x and y axes. The plt.title() and plt.suptitle() functions are used to add a title and subtitle to the chart. The plt.xlabel() and plt.ylabel() functions are used to add labels to the x and y axes. The plt.show() function is used to display the chart. You can customize the chart further by adjusting the font sizes, colors, and other options.

Fixed Response

ChatGPT didn’t import the pandas package in the header, despite calling it later to instantiate the dataframe, so the code as written fails with a NameError at the pd.DataFrame call. Fortunately, this is an easy fix - we just insert the import pandas as pd line.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Create data frame with two data series
data = {
  'time': list(range(1, 11)),
  'series1': [1, 2, 3, 4, 5, 4, 3, 2, 1, 0],
  'series2': [0, 1, 2, 3, 4, 5, 4, 3, 2, 1]
}
df = pd.DataFrame(data)

# Create line chart
sns.lineplot(x = 'time', y = 'series1', data = df, color = 'blue')
sns.lineplot(x = 'time', y = 'series2', data = df, color = 'red')

# Add title and subtitle
plt.title("Line Chart with Two Data Series", fontsize = 20, fontweight = 'bold')
plt.suptitle("Subtitle", fontsize = 12)

# Add axis labels
plt.xlabel("Time")
plt.ylabel("Value")

# Show plot
plt.show()

Visual

Example 5: Creating Heatmaps

Prompt

“Using the average temperature per week in Los Angeles, can you make a heatmap to present the data using ggplot2?”

Response

Yes, you can create a heatmap to present the average temperature per week in Los Angeles using ggplot2 with the following code:

library(ggplot2)

# Create data frame with temperature data
data <- data.frame(
  week = c("Week 1", "Week 2", "Week 3", "Week 4"),
  Monday = c(70, 75, 80, 85),
  Tuesday = c(71, 76, 81, 86),
  Wednesday = c(72, 77, 82, 87),
  Thursday = c(73, 78, 83, 88),
  Friday = c(74, 79, 84, 89),
  Saturday = c(75, 80, 85, 90),
  Sunday = c(76, 81, 86, 91)
)

# Create heatmap
ggplot(data, aes(x = week, y = weekday)) + 
  geom_tile(aes(fill = temp)) + 
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Average Temperature per Week in Los Angeles",
       x = "Week",
       y = "Weekday",
       fill = "Temperature (F)") +
  theme(plot.title = element_text(size = 20, face = "bold"))

This code creates a heatmap that shows the average temperature per week in Los Angeles. The ggplot() function is used to create the chart and specify the data frame to use. The geom_tile() function is used to add tiles to the chart, and the scale_fill_gradient() function is used to specify a color gradient to use for the tiles. The labs() function is used to add labels to the x and y axes and the fill legend. The theme() function is used to specify the font size and boldness of the title. You can adjust these options to customize the appearance of your heatmap.

Fixed Response

The code generated by ChatGPT seems to assume the data has already been converted from “wide” to “long” format. Unfortunately, it has not - the ggplot code looks for weekday and temp fields, but the data frame still has separate Monday, Tuesday, etc. columns.

This is still an easy fix for us though - we simply melt the weekday columns (using melt() from the reshape2 package) into the weekday and temp fields that ggplot is looking for, and then things look okay!

library(ggplot2)
library(reshape2)

# Create data frame with temperature data
data <- data.frame(
  week = c("Week 1", "Week 2", "Week 3", "Week 4"),
  Monday = c(70, 75, 80, 85),
  Tuesday = c(71, 76, 81, 86),
  Wednesday = c(72, 77, 82, 87),
  Thursday = c(73, 78, 83, 88),
  Friday = c(74, 79, 84, 89),
  Saturday = c(75, 80, 85, 90),
  Sunday = c(76, 81, 86, 91)
)

data <- melt(data, id.vars = "week", variable.name = "weekday", value.name = "temp")

# Create heatmap
ggplot(data, aes(x = week, y = weekday)) + 
  geom_tile(aes(fill = temp)) + 
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Average Temperature per Week in Los Angeles",
       x = "Week",
       y = "Weekday",
       fill = "Temperature (F)") +
  theme(plot.title = element_text(size = 20, face = "bold"))
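
For readers on the Python side, the same wide-to-long reshaping can be sketched with pandas. This is an illustrative equivalent of the melt() call above rather than anything ChatGPT produced; the column names simply mirror the R example, and the temperatures are the same made-up values.

```python
import pandas as pd

# Wide format: one row per week, one column per weekday (same made-up data as above)
wide_df = pd.DataFrame({
    "week": ["Week 1", "Week 2", "Week 3", "Week 4"],
    "Monday": [70, 75, 80, 85],
    "Tuesday": [71, 76, 81, 86],
    "Wednesday": [72, 77, 82, 87],
    "Thursday": [73, 78, 83, 88],
    "Friday": [74, 79, 84, 89],
    "Saturday": [75, 80, 85, 90],
    "Sunday": [76, 81, 86, 91],
})

# Long format: one row per (week, weekday) pair, with the value in "temp"
long_df = wide_df.melt(id_vars="week", var_name="weekday", value_name="temp")

print(long_df.shape)  # (28, 3): 4 weeks x 7 weekdays
print(long_df.head())
```

The long layout is what ggplot2’s geom_tile() wants. Seaborn’s heatmap() function, by contrast, takes a wide matrix, so a Seaborn translation of this chart would likely skip the melt step.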

Visual

ChatGPT ggplot Prompt 5

Example 6: Generating Maps

Prompt

“Can you create a map of the United States using ggplot2 which shows the results of the 2016 Presidential Election by state, with colors representing the margin of victory in the popular vote?”

Response

Yes, you can create a map of the United States showing the results of the 2016 Presidential Election by state using ggplot2 with the following code:

library(ggplot2)
library(maps)
library(mapdata)

# Load election data
election_data <- read.csv("2016_election_data.csv")

# Create map of the US
ggplot() +
  geom_map(data = map_data("state"),
           aes(x = long, y = lat, map_id = region),
           fill = "white", color = "gray") +
  # Add election data
  geom_map(data = election_data,
           aes(map_id = state, fill = margin),
           color = "black", size = 0.25) +
  # Add legend
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "2016 Presidential Election Results",
       subtitle = "Margin of Victory in the Popular Vote",
       fill = "Margin of Victory") +
  theme(plot.title = element_text(size = 20, face = "bold"))

This code creates a map of the United States and adds election data for each state to show the margin of victory in the popular vote. The ggplot() function is used to create the map, and the geom_map() function is used to add the state boundaries to the map. The scale_fill_gradient() function is used to specify a color gradient to use for the state fill colors, and the labs() function is used to add labels to the legend and title. The theme() function is used to specify the font size and boldness of the title. You can adjust these options to customize the appearance of your map. Note that you will need to replace the 2016_election_data.csv file with actual election data to create the map.

Visual

Unfortunately, I spent some time messing around with ChatGPT and couldn’t get it to generate any mapping code that didn’t rely on some sort of file input. As such, I don’t have a map visual to post here - but I will say that the layers added to the graphic do seem to make sense on paper, and appear to be aimed at a state-level map colored by margin of victory, with a blue-to-red gradient. Since the code was never run against real data, though, it remains untested (geom_map(), for instance, normally also expects a map argument, so further fixes may be needed).

Some Takeaways

  • ChatGPT does seem to grasp how ggplot’s Grammar of Graphics works in the code that it writes: it adds layers, customized for each part of the chart, atop a base object, just like a human programmer would
  • ChatGPT’s ability to recommend the packages that need to be installed to run its code is really neat - it suggests some understanding of how the different code modules and commands come together at runtime
  • The ability to quickly translate ggplot code in R into Seaborn code in Python, while preserving the semantic meaning of the chart, is amazing - and will make switching between different development environments/toolsets so much easier
  • For data visualizations particularly reliant on the underlying data, ChatGPT tends to assume that the user intends to import data from a file, and focuses primarily on chart/map creation, rather than also building out the data acquisition code
    • If I had to guess, this behavior is driven by the model having been trained, at least in part, on Stack Overflow data
    • As in, many submissions from users to Stack Overflow contain snippets of code, not entire files/notebooks, and therefore ChatGPT may have learned to code in more of a “snippet” style, which is briefer and makes more assumptions about what comes before and after the snippets in question
  • Moving forward, I think the killer data visualization application of ChatGPT will be quick prototyping - it’ll be super useful for quickly creating chart “skeletons” that can easily be customized and built on top of!
    • e.g. “Using Matplotlib, build me a 4 by 2 faceted line chart with time on the x-axis and score on the y-axis, in Fivethirtyeight style”
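
As a rough idea of what such a skeleton might look like, here is a hand-written sketch for the prompt above (not ChatGPT output). The group names and score values are placeholder data invented purely so the code runs; the “fivethirtyeight” style sheet ships with Matplotlib.

```python
import matplotlib.pyplot as plt

# Placeholder data: eight hypothetical groups, each with a short made-up score series
time = list(range(1, 11))
scores = {f"Group {i + 1}": [(t * (i + 1)) % 7 for t in time] for i in range(8)}

# Apply Matplotlib's built-in FiveThirtyEight-inspired style sheet
plt.style.use("fivethirtyeight")

# 4 rows by 2 columns of panels that share both axes
fig, axes = plt.subplots(nrows=4, ncols=2, sharex=True, sharey=True, figsize=(8, 10))

# One line chart per panel, titled with its group name
for ax, (name, values) in zip(axes.flat, scores.items()):
    ax.plot(time, values)
    ax.set_title(name, fontsize=10)

# Shared axis labels for the whole figure (requires Matplotlib 3.4 or newer)
fig.supxlabel("Time")
fig.supylabel("Score")
fig.tight_layout()

# Call plt.show() here to display the chart interactively
```

From here, swapping in real data and tweaking titles or colors is the kind of customization the skeleton is meant to make quick.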
