Building a Machine Learning Model to Predict Plane Take Off Weight for the Opensky PRC Data Challenge

The past few months I’ve used my spare blog energy to work on a fun project that I’d like to loop everyone into!

On August 1st I received an email from the OpenSky listserv (I became a member while building) my Planes Overhead app for the Tidbyt) announcing a data science challenge, focused on predicting plane take off weights. This challenge felt precision targeted for me, as a great lover of planes and machine learning dilettante!

Calling all Data Scientists: Compete in the Eurocontrol PRC Data Challenge!

Together with Eurocontrol’s Performance Review Committee we are announcing a new aviation data challenge! Predict the actual take off weights of aircraft correctly based on their OpenSky data - and win a cumulative 5000 EUR!

I wrapped up my final submission to the challenge a few weeks ago - I don’t think I’ll be winning the five thousand euros, but I had an amazing time building out a (pretty decent!) predictive model and want to give the whole effort some closure with a blog writeup!

The Challenge

The challenge has a great website with documentation on all facets of the competition - I’ve pasted the most important details below as a quick overview:

Goal

The Performance Review Commission (PRC) Data Challenge aims at engaging with data scientists, even without Aviation background, to create teams and compete to build an open Machine Learning (ML) model able to accurately infer the Actual TakeOff Weight (ATOW) of a flown flight.

ATOW is an important input for models estimating fuel burnt and derived gaseous emissions. The current lack of openly available ATOW is typically compensated by assuming it to be equal to a certain percentage of the Maximum TakeOff Weight (MTOW) for the relevant aircraft model. With this challenge we aim at making available a better way to estimate ATOW, i.e. an Estimated TakeOff Weight (ETOW).

The PRC thinks that there is a great pool of Data Scientists that could help to define the open model of the Challenge, but above that, it is the importance to have a white box approach to the way the model is being built. This transparency is a way to build trust, reproducibility and eventually evolve a collaboration to improve the understanding (and the reduction) of the impact of aviation to the environment.

Dataset Details

The master dataset for this challenge was tabular data on 369k flights in Europe in 2022. This dataset enumerated all of the key datapoints for a given flight, such as aircraft type, origin and destination airports, amount of time flown, and known takeoff weight.

That dataset was supplemented with trajectory information from the OpenSky Network, captured every few seconds along the path of each flight, snapshotting essential information such as altitude, groundspeed, rate_of_climb, and current location.

Logging that data every couple seconds for each of the 369k flights in the dataset created a colossal amount of data - when all was said and done, the provided trajectories ended up taking up 306.4 GB on my hard drive in the form of daily Parquet files!

That volume was no joke and meant that it wouldn’t be possible to just “import” the trajectory data into some model and get any insights - you’d crash the computer instantly. Instead, I’d need to figure out what information I would want to extract from the trajectories, and then work to process the files down into a smaller, more wieldy dataset.

Flight List Description

Trajectory Description

Processing the Trajectory Data

My Personal Computer

I performed all the compute for this project using my personal computer, a tiny Linux box tucked under my computer monitor. While peppy enough to make a few charts for my blog, it’s not exactly what you’d order to work on a “big data” project like this!

As such, the 306 GB of trajectory data was definitely daunting… but necessity is the mother of invention, and after some tinkering, I developed a working pipeline to process the trajectory data! The job took many hours to run (best done overnight, to be honest with you), but eventually distilled all that trajectory information down into a much more manageable 1.1GB dataset that I could then pull into my model.

Workflow

Since the trajectory data was stored in daily parquet files, I built my processing jobs to process a single day of data at a time, creating metrics at the flight_id level and outputting a new csv file with the output. That logic was contained inside my batch_process_trajectories.R file.

Some of the flight_id level features generated in the stage:

  • time spent in different buckets of vertical_rate, capturing seconds spent in climb, descent, and level flight
  • summary statistics like mean, iqr, min, and max for altitude and groundspeed
  • the great_circle_distance flown
  • time spent in different phases of wind, capturing time spent flying with headwinds or tailwinds

A simple Bash script called check_and_process.sh orchestrated all the job runs - it would loop over all of the files in the directory of parquet trajectory files, and if there was not already a csv output for that date, it would trigger a run of the job. Each run took about two minutes - times 365 days leaves us with twelve hours of continuous processing to crunch all these files. Poor tiny computer! Of course, I could have spun up a large EC2 instance and brute-forced the solution much faster, but the scrappy underdog in me had fun figuring out how to process the information I needed within the bounds of what was possible on my hardware.

Finally, I had a small R script called combine_trajectories.R to stitch all of the small daily csv files together into a single large (but not too large!) csv suitable for use in my model.

Workflow Files

  • batch_process_trajectories.R: read in a single .parquet trajectory file and output a .csv at the flight_id level with summarized details
  • check_and_process.sh: Bash script which checks to see all the .parquet files which have yet to be processed (ie. have a .csv output in the directory) and triggers the runs on outstanding files
  • combined_trajectories.Rmd: knits together the directory of daily .csvs into a single dataframe

Exploring the Base Data

With the trajectory data processed and ready to use later, I was excited to start exploring the primary dataset!

Anytime someone tosses me a new dataset with instructions to go build a predictive model, I need to do a few rounds of exploration to get a better sense of what I’m dealing with. As a visual learner, I find that getting summary statistics and then charting visualizations of the data is the quickest way to grasp how the data is laid out and what dimensions jump out.

In this case, I built a bunch of quick-and-dirty graphics which sought to explore how tow varied as a function of different variables. This helped give an immediate sense as to:

  • the volume and variance of tow as a function of different dimensions, such as aircraft_type
  • the relationship between tow and flown_distance when normalized by aircraft_type
  • how the tow and flown_distance relationship sees notable clustering effects when also sliced by airline

Violin Plot of TOW by Aircraft Type

This plot confirmed that there were a lot of distinct aircraft types in the data and also a meaningful amount of variance in the TOWs for each aircraft type.

Violin Plot of TOW by Aircraft Type

ggplot(challenge_set, aes(x = "", y = tow)) +
  geom_violin(fill = "lightblue") +
  geom_boxplot(width = 0.1, fill = "grey") + 
  facet_wrap(~aircraft_type, scales = "free") + 
  labs(title = "Violin Plot of Take-Off Weight (TOW)", y = "TOW") +
  theme_minimal()

Scatterplot of TOW vs Flown Distance by Aircraft Type

TOW generally scaled upwards as a function of distance flown - no surprise there - but had very different slopes depending on the aircraft type in question.

Scatterplot of TOW vs Flown Distance by Aircraft Type

ggplot(challenge_set, aes(x = flown_distance, y = tow, color = pct_of_range)) +
  geom_point() +
  facet_wrap(~ aircraft_type) + 
  labs(title = "Scatter Plot of TOW vs Flown Distance", x = "Flown Distance", y = "TOW") +
  theme_minimal() +
  geom_hline(aes(yintercept = mtow), linetype = "dashed", color = "red") +
  geom_hline(aes(yintercept = oew), linetype = "dashed", color = "green") +
  geom_vline(aes(xintercept = range), linetype = "dashed", color = "blue") + 
  geom_smooth(method = "lm") 

Scatterplot of TOW vs Flown Distance by Aircraft Type and Airline

Looking the TOW vs Flown Distance relationship for four notable widebody models (B772, B77W, B788, and B789) shows there is also a meaningful difference in usage between different airlines.

Scatterplot of TOW vs Flown Distance by Aircraft Type and Airline

challenge_set %>% 
  filter(aircraft_type %in% c("B772", "B77W", "B788", "B789")) %>% 
  ggplot(aes(x = flown_distance, y = tow, color = airline)) +
    geom_point() +
    facet_wrap(~ aircraft_type) + 
    labs(title = "Scatter Plot of TOW vs Flown Distance", x = "Flown Distance", y = "TOW") +
    theme_minimal() + 
    geom_hline(aes(yintercept = mtow), linetype = "dashed", color = "red") +
    geom_hline(aes(yintercept = oew), linetype = "dashed", color = "green") +
    geom_vline(aes(xintercept = range), linetype = "dashed", color = "blue")

Building a Predictive Model

With the trajectory data processed into usable form and a few peeks taken at the data, I was finally ready to start building models!

Model Selection

Early on, I decided I would use a random forest to handle this regression task (estimating tow) - for a few reasons:

  1. Random forests aren’t black boxes: you can extract feature importance to learn about what features are particularly influential in your predictions
  2. Random forests handle a mix of categorical and continuous features with ease: they will naturally pick up on the distinct categories of airline in the data, without having to create dummy variables
  3. Random forests are good at capturing nonlinear relationships between features: I thought we might see some on the margins of airplane range, flown distance, and tow, so I wanted something which would find that signal rather than obscure it
  4. Random forests are hard to overfit: because they are an ensemble of decision trees, where a random set of features is in use at each step, preventing any one pattern from dominating the model

Random forests aren’t perfect though - key drawbacks are:

  1. Random forests are slow to compute with large feature sets, since the complexity of the underlying decision trees scales upwards with the number of features
  2. Random forests don’t output coefficients nor do they unpack multicollinear effects: meaning that while the feature importances are helpful, they don’t show where the root causal effects might actually emerge, and correlated variables can increase total model accuracy, but distort the relative contributions of similar features

I did play with XGBoost for a second as an alternative to random forests, but didn’t get meaningfully better performance and so stuck with the random forest approach.

One note on implementation is that I had significant memory utilization issues by using the default implementation of randomForest in R - my model would continually max out the 16 GB of RAM on my computer and crash when being trained.

I was able to work around this by using ranger, an alternative implementation in R, which I was able to tune to use more CPU threads and less memory - making it feasible to do the model training on my tiny desktop.

Feature Engineering

In my experience, model selection (and tuning) is important, but choosing the right features to put into your model is even more important! A neural net or decision tree or even linear regression model is going to be pretty accurate if you are able to segment and summarize your data with strong features - but even the most robust models will fail to pick up on patterns if you haven’t done the work to carve out valuable features!

There were obviously a bunch of valuable features included right off the bat in the provided data: things like aircraft_type, flown_distance, and departure_airport - I piped those features straight into the model.

To get more accuracy I knew I would need to transform those features in a variety of ways, to try to “soak up” all the information they possessed about tow

Here’s a list of what I ended up trying!

  1. Domain-specific Transformations: I generated new fields like flown_velocity from the existing flown_distance and departure/arrival times so the model could interpret the concept of “speed” in conjunction with the total flight time and distance
  2. Continuous Variable Transformations: For most continuous features, I created log-transformations to reduce the skewness of the features. As needed, I also squared the features and took the square root of the features, so that non-linear relationships could emerge (ie. what if flown_distance for a particular aircraft_type starts having a significant effect on tow as it approaches its maximum range?)
  3. Categorizing and Bucketing: I grouped continuous features into categorical bins, like converting departure_time into categories like 'morning', 'afternoon', 'evening', or segmenting flown_distance into short-haul, medium-haul, and long-haul flights. This helped capture hidden patterns in discrete segments of the data
  4. Feature Interactions: I created interaction terms like aircraft_type * departure_airport and flown_velocity * temperature to capture how combinations of features impact tow. These interactions often surfaced hidden relationships that individual features could not capture alone
  5. Model Stacking: I performed a quick-and-dirty linear regression using just aircraft_model and flown_distance to predict tow for each flight_id, and then fed those predictions into my primary model as a new feature! This helped capture some small effects in the initial regression which the random forest could then build on
  6. Dimensional Clustering: I used k-means clustering to develop a set of five clusters which grouped together similar aircraft_types. This allowed the model to better general several distinct types of aircraft into a broader gestalt (ie. “large short-haul planes”, “medium long-haul planes”, “regional jets”) that could be used to analyze their collective behavior
  7. Use of External Data: I pulled in publicly available data about airports to capture traits like their elevation, and fetched information about each aircraft_type’s maximum take-off weight (MTOW), original equipment weight (OEW), and range. This helped define the parameter space in which the model was making predictions
  8. Collapsing Rare Categories: I found that the data had a few outlier aircraft_type and airlines which only had a handful of records. To not overfit the model to those rare categories, I instead collapsed them all down into a new “rare” category for the dimension in question, which helped the model be a bit more robust
  9. Imputing Missing Data: For the handful of records which lacked trajectory data, or had values which didn’t make sense (ie. negative altitudes), I tried to “intelligently impute” what those values should have been, by looking at the other values observed for that flight or typical flights by that aircraft_type

K-means Clustering of Aircraft Type

Model Training and Evaluation

Training the model was a pretty simple process - I took 80% of the fully featurized dataset used it to train the random forest, reserving the “unseen” 20% to make predictions on later, to measure the accuracy.

Accuracy was measured with a battery of metrics, but a primary focus on:

  • RMSE: root mean squared error - the target metric for the competition, this reflected “how off” my residuals were from the correct tow in units of kilograms
  • MAPE: mean absolute percent error - a supplemental metric, this gave me a sense of what percent error I was seeing on average, a metrics that’s easier to compare when the data is sliced by different dimensions than RMSE

Slice Data into Train and Test Sets

# Remove submission set rows from challenge_set
challenge_set <- challenge_set %>% filter(!is.na(tow))

# Test and Train split
sample_index <- sample(1:nrow(challenge_set), 0.8 * nrow(challenge_set))
train_data <- challenge_set[sample_index, ]
test_data <- challenge_set[-sample_index, ]

Train Random Forest

# Train random forest regression model
rf_model <- ranger(
  formula, 
  data = train_data, 
  importance = 'impurity', 
  num.trees = 500, 
  min.node.size = 5,
  save.memory = TRUE,
  replace = TRUE,
  sample.fraction = 0.7
)

# View the model summary
print(rf_model)

Evaluate Accuracy

# Make predictions on the test set
predictions <- predict(rf_model, data = test_data)$predictions

# Calculate MAPE
mape <- mean(abs((test_data$tow - predictions) / test_data$tow)) * 100
print(paste("MAPE: ", mape))

# Calculate RMSE
rmse <- sqrt(mse)
print(paste("Root Mean Squared Error: ", rmse))

Understanding Feature Importance

Every time the model ran, it would 1) serialize itself to a file, and 2) generate an chart showing the relative importance of each feature in the model. Reviewing this each round of modeling allowed me to see which features seemed to be most useful, or not-useful at all - giving me a reason to tweak or expand on them in a future iteration of the model.

Here’s an early example, before I started interacting all the features and creating a bit of an explosion in total feature volume that makes the later charts very long to read:

Feature Importance Example

# Grab the feature importances
importance_data <- as.data.frame(rf_model$variable.importance)

# Process the importance data
importance_data$Variable <- rownames(importance_data)
colnames(importance_data) <- c("Importance", "Variable")
importance_data <- importance_data[order(importance_data$Importance, decreasing = TRUE),]

# Build a barplot of importances
ggplot(importance_data, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  xlab("Variables") +
  ylab("Importance") +
  ggtitle("Feature Importance from Random Forest Model") +
  labs(subtitle = paste("Root Mean Squared Error: ", rmse)) +
  theme_minimal()

# Export an image
ggsave(paste("training_results_", Sys.time(), ".png", sep = ""), path = "importances", height = 25, width = 8, units = "in", dpi = 300, bg = "white")

Challenge Results

Despite my best efforts and several weeks of iterative model tuning and feature engineering, I was never able to break into the “elite” tier of models submitted in the challenge! Ultimately, I finished in 13th place with a very respectable RMSE of 2,695 and a MAPE of less than 3% - solid figures, but a tier behind the silver and bronze finishers at roughly RMSE of roughly 2,250 and the winner with a borderline shocking RMSE of 1,561.

Final Challenge Rankings

I am planning on writing a follow-up post to this one where I break down some of the approaches that the winning teams used in their code - things I might have missed or could have optimized better. Stay tuned for that follow-up post in a few days, and I hope that in the meantime you enjoyed this walkthrough of my model!

Fun Takeaways

  • It took me 18 submissions to whittle my model RMSE down from about 3,500 to 2,600
  • The final random forest model ended up having 302 independent features!
  • interaction_aircraft_family_distance_category was pretty stable as the most important feature for most of the time training
  • At the end, after implementing stacking, the lm_predicted_tow_squared feature, sourced from the linear regression of the data, was the single most important feature
  • log features were typically more important than the original numbers
  • The A319 was the aircraft_type I was able to predict most accurately, with MAPE of 0.50%. The AT76 was the least accurate with MAPE of 3.38%
  • The B77W was the heaviest aircraft_type in the data with average tow of 264k kilograms, with the AT76 lowest at 20k
  • MAPE generally went down as a function of flown_distance - tow predictions tended to be slightly less accurate on shorter flights
  • I was surprised that time features were not very useful at all - I expected features like month of year would have been very important, as they capture some amount of information about passenger loads and seasonal weather, but none of the month/week/day/time features I engineered seemed particularly important to the model

Code Reference

Process Trajectories

check_and_process.sh

#!/bin/bash

# Define directories
TRAJECTORIES_DIR="/home/conor/atow_predictions/trajectories"
PROCESSED_DIR="/home/conor/atow_predictions/processed_trajectories_v5"
R_SCRIPT_PATH="/home/conor/atow_predictions/batch_process_trajectories.R"

# Function to calculate the day of the year from YYYY-MM-DD
get_day_of_year() {
  date -d "$1" +%j
}

# Loop through all .parquet files in the trajectories directory
for parquet_file in "$TRAJECTORIES_DIR"/*.parquet
do
  # Extract the base filename (without extension)
  filename=$(basename -- "$parquet_file" .parquet)
  
  # Full path to the expected CSV file
  expected_csv="$PROCESSED_DIR/$filename.csv"
  
  # Print the filenames being checked for debugging
  echo "Checking if $expected_csv exists..."
  
  # Check if the corresponding .csv file exists in the processed directory
  if [ ! -f "$expected_csv" ]; then
    echo "$filename.csv is missing, processing $filename.parquet"
    
    # Calculate the day of the year from the filename (assumes format YYYY-MM-DD)
    day_of_year=$(get_day_of_year "$filename")
    
    # Run the R script with the day of the year as the parameter
    Rscript "$R_SCRIPT_PATH" "$day_of_year"
  else
    echo "$filename.csv already exists, skipping"
  fi
done

batch_process_trajectories.R

library(arrow)
library(tidyverse)
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
library(e1071)


# Haversine formula function
haversine_distance <- function(lat1, lon1, lat2, lon2) {
  # Convert degrees to radians
  lat1 <- lat1 * pi / 180
  lon1 <- lon1 * pi / 180
  lat2 <- lat2 * pi / 180
  lon2 <- lon2 * pi / 180
  
  # Haversine formula
  dlat <- lat2 - lat1
  dlon <- lon2 - lon1
  a <- sin(dlat / 2)^2 + cos(lat1) * cos(lat2) * sin(dlon / 2)^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  r <- 6371  # Radius of Earth in kilometers
  distance <- r * c
  
  return(distance)  # Return distance in KM
}

# https://sgichuki.github.io/Atmo/
windDir <-function(u,v){
  (270-atan2(u,v)*180/pi)%%360 
}

windSpd <-function(u,v){
  sqrt(u^2+v^2)
}

# Function to process each file
process_file <- function(file_path) {
  # Open the dataset
  trajectory_df <- open_dataset(file_path)
  
  # Select only necessary columns to reduce memory load
  selected_data <- trajectory_df %>%
    select(flight_id, timestamp, groundspeed, altitude, latitude, longitude, vertical_rate, u_component_of_wind, v_component_of_wind, track, specific_humidity) %>%
    collect()  # Pull the data into R memory
  
  # Define speed and altitude bins after collecting
  vertical_rate_bins <- c(-Inf, -2000, -1000, -300, 300, 1000, 2000, Inf)  # Define bins for descent, level, and climb stages
  
  # Process and summarize the data in R
  time_spent_data <- selected_data %>%
    arrange(timestamp) %>%
    group_by(flight_id) %>%
    mutate(
      climb_stage = cut(vertical_rate, breaks = vertical_rate_bins, labels = FALSE),
      time_diff = as.numeric(difftime(lead(timestamp), timestamp, units = "secs"))
    ) %>%
    filter(!is.na(time_diff)) %>%
    group_by(flight_id, climb_stage) %>%
    summarise(total_time = sum(time_diff, na.rm = TRUE)) %>%
    ungroup()
  
  # Calculate total time for each flight
  total_time_per_flight <- time_spent_data %>%
    group_by(flight_id) %>%
    summarise(total_time_flight = sum(total_time, na.rm = TRUE))
  
  # Merge the total time back into the original dataset
  climb_time_spent <- time_spent_data %>%
    left_join(total_time_per_flight, by = "flight_id") %>%
    mutate(percentage_time = (total_time / total_time_flight) * 100)
  
  # Pivot to get time spent in each speed-altitude-climb category as columns
  time_spent_wide <- climb_time_spent %>%
    mutate(climb_category = paste0("climb_", climb_stage)) %>%
    select(flight_id, climb_category, total_time, percentage_time) %>%
    pivot_wider(
      names_from = climb_category, 
      values_from = c(total_time, percentage_time), 
      values_fill = 0
    )
  
  # Create a numeric timestamp variable
  selected_data$timestamp_numeric <- as.numeric(as.POSIXct(selected_data$timestamp))
  
  # Calculate wind speed and direction
  selected_data$wind_speed <- sqrt(selected_data$u_component_of_wind^2 + selected_data$v_component_of_wind^2)
  selected_data$wind_direction <- atan2(selected_data$v_component_of_wind, selected_data$u_component_of_wind) * (180 / pi)
  selected_data$wind_direction <- ifelse(selected_data$wind_direction < 0, selected_data$wind_direction + 360, selected_data$wind_direction)
  # Calculate relative wind angle (track vs wind)
  selected_data$relative_wind_angle <- (selected_data$track - selected_data$wind_direction + 180) %% 360 - 180
  # Calculate headwind/tailwind
  selected_data$headwind_tailwind <- selected_data$wind_speed * cos(selected_data$relative_wind_angle * (pi / 180))
  
  # Additional summary metrics per flight_id
  additional_metrics <- selected_data %>%
    group_by(flight_id) %>%
    summarise(
      count_trajectory_observations = n(),
      # Distance
      great_circle_distance = haversine_distance(first(latitude), first(longitude), last(latitude), last(longitude)),
      median_track = median(track, na.rm = TRUE),
      # Altitude    
      max_altitude = quantile(altitude, probs = 0.99, na.rm = TRUE),
      median_altitude = median(altitude, na.rm = TRUE),
      p25_altitude = quantile(altitude, probs = 0.25, na.rm = TRUE),
      p75_altitude = quantile(altitude, probs = 0.75, na.rm = TRUE),
      iqr_altitude = IQR(altitude, na.rm = TRUE),
      sd_altitude = sd(altitude, na.rm = TRUE),
      # Groundspeed
      max_groundspeed = quantile(groundspeed, probs = 0.99, na.rm = TRUE),
      median_groundspeed = median(groundspeed, na.rm = TRUE),
      p25_groundspeed = quantile(groundspeed, probs = 0.25, na.rm = TRUE),
      p75_groundspeed = quantile(groundspeed, probs = 0.75, na.rm = TRUE),
      iqr_groundspeed = IQR(groundspeed, na.rm = TRUE),
      sd_groundspeed = sd(groundspeed, na.rm = TRUE),
      # Wind
      avg_headwind_tailwind = mean(headwind_tailwind, na.rm = TRUE),
      # Ascent Features
      max_vertical_rate = max(vertical_rate[vertical_rate > 0], na.rm = TRUE),   # Maximum climb rate
      avg_positive_vertical_rate = mean(vertical_rate[vertical_rate > 0], na.rm = TRUE),  # Average climb rate
      total_time_ascent = sum(diff(timestamp[vertical_rate > 300]), na.rm = TRUE), # Total time in ascent
      cumulative_ascent = sum(vertical_rate[vertical_rate > 300] * diff(timestamp_numeric[vertical_rate > 300]), na.rm = TRUE),
      # Descent Features
      min_vertical_rate = max(vertical_rate[vertical_rate < 0], na.rm = TRUE),   # Maximum descent rate
      avg_negative_vertical_rate = mean(vertical_rate[vertical_rate < 0], na.rm = TRUE),  # Average descent rate
      total_time_descent = sum(diff(timestamp[vertical_rate < 300]), na.rm = TRUE), # Total time in ascent
      cumulative_descent = sum(vertical_rate[vertical_rate < 300] * diff(timestamp_numeric[vertical_rate < 300]), na.rm = TRUE),
      # other
      avg_humidity = mean(specific_humidity, na.rm = TRUE),
      # Skew
      altitude_skewness = skewness(altitude, na.rm = TRUE),
      altitude_kurtosis = kurtosis(altitude, na.rm = TRUE),
      groundspeed_skewness = skewness(groundspeed, na.rm = TRUE),
      groundspeed_kurtosis = kurtosis(groundspeed, na.rm = TRUE)
    )
  
  # Define flight phases by altitude change and vertical rate
  selected_data <- selected_data %>%
    mutate(
      phase = case_when(
        altitude < 10000 & vertical_rate > 300 ~ "climb",
        altitude > 10000 & vertical_rate < 300 & vertical_rate > -300 ~ "cruise",
        altitude < 10000 & vertical_rate < -300 ~ "descent",
        TRUE ~ "other"
      )
    )
  
  # Summarise statistics for each phase
  phase_summary <- selected_data %>%
    group_by(flight_id, phase) %>%
    summarise(
      avg_altitude = mean(altitude, na.rm = TRUE),
      avg_groundspeed = mean(groundspeed, na.rm = TRUE),
      total_time_phase = sum(as.numeric(difftime(lead(timestamp), timestamp, units = "secs")), na.rm = TRUE)
    ) %>%
    pivot_wider(
      names_from = phase,
      values_from = c(avg_altitude, avg_groundspeed, total_time_phase),
      values_fill = 0
    )
  
  # Calculate rate of change of groundspeed
  selected_data <- selected_data %>%
    arrange(timestamp) %>%
    group_by(flight_id) %>%
    mutate(
      speed_diff = lead(groundspeed) - groundspeed,
      time_diff = as.numeric(difftime(lead(timestamp), timestamp, units = "secs")),
      acceleration = speed_diff / time_diff
    )
  
  # Summarize max/min acceleration and deceleration
  acceleration_summary <- selected_data %>%
    group_by(flight_id) %>%
    summarise(
      max_acceleration = max(acceleration, na.rm = TRUE),
      min_acceleration = min(acceleration, na.rm = TRUE)
    )
  
  # Calculate altitude IQR for each phase
  altitude_iqr_per_phase <- selected_data %>%
    group_by(flight_id, phase) %>%
    summarise(altitude_iqr = IQR(altitude, na.rm = TRUE)) %>%
    pivot_wider(
      names_from = phase,
      values_from = altitude_iqr,
      values_fill = 0,
      names_prefix = "altitude_iqr_"
    )

  # Fit a spline for altitude vs time for each flight
  # Extract features from the spline (e.g., curvature, inflection points)
  # spline_features <- selected_data %>%
  #   group_by(flight_id) %>%
  #   do(spline_model = smooth.spline(.$timestamp_numeric, .$altitude, spar = 0.5)) %>%
  #   reframe(
  #     flight_id,
  #     curvature = sum(abs(diff(spline_model$y, differences = 2))),  # Approx curvature
  #     num_inflection_points = sum(diff(sign(diff(spline_model$y))) != 0)  # Inflection points
  #   )
  
  # # Apply Fourier Transform to altitude vs time
  # # Extract the dominant frequency component
  # fourier_features <- selected_data %>%
  #   group_by(flight_id) %>%
  #   summarise(
  #     fft_altitude = fft(altitude),
  #     fft_groundspeed = fft(groundspeed),
  #   ) %>% 
  #   summarise(
  #     dominant_freq_altitude = which.max(Mod(fft_altitude)),
  #     dominant_freq_groundspeed = which.max(Mod(fft_groundspeed))
  #   ) %>%
  #   mutate(
  #     dominant_freq_altitude = coalesce(dominant_freq_altitude, 0),  # Replace NA with 0
  #     dominant_freq_groundspeed = coalesce(dominant_freq_groundspeed, 0)  # Replace NA with 0
  #   )
  
  # Define altitude and groundspeed bins
  altitude_bins <- c(0, 5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, Inf)
  groundspeed_bins <- c(0, 100, 200, 300, 400, 500, 600, 700, Inf)
  
  # Create histogram features for cruise phase
  histogram_features <- selected_data %>%
    filter(phase == "cruise") %>%
    group_by(flight_id) %>%
    summarise(
      altitude_hist = list(hist(altitude, breaks = altitude_bins, plot = FALSE)$counts),
      groundspeed_hist = list(hist(groundspeed, breaks = groundspeed_bins, plot = FALSE)$counts)
    ) %>%
    unnest_wider(altitude_hist, names_sep = "_alt_bin_") %>%
    unnest_wider(groundspeed_hist, names_sep = "_speed_bin_")
  
  # Calculate the rate of phase transitions
  phase_transitions <- selected_data %>%
    group_by(flight_id) %>%
    mutate(phase_shift = lag(phase) != phase) %>%
    summarise(total_phase_transitions = sum(phase_shift, na.rm = TRUE))
  
  
  # Early and late segment summaries
  segment_summary <- selected_data %>%
    group_by(flight_id) %>%
    filter(timestamp_numeric <= min(timestamp_numeric) + 600 | timestamp_numeric >= max(timestamp_numeric) - 600) %>%
    mutate(segment = if_else(timestamp_numeric <= min(timestamp_numeric) + 600, "departure", "arrival")) %>%
    group_by(flight_id, segment) %>%
    summarise(
      avg_vertical_rate_segment = mean(vertical_rate, na.rm = TRUE),
      avg_groundspeed_segment = mean(groundspeed, na.rm = TRUE),
      avg_altitude_segment = mean(altitude, na.rm = TRUE)
    ) %>%
    pivot_wider(names_from = segment, values_from = c(avg_vertical_rate_segment, avg_groundspeed_segment, avg_altitude_segment), values_fill = 0)
  
  # Headwind and crosswind for each flight phase
  wind_phase_summary <- selected_data %>%
    group_by(flight_id, phase) %>%
    summarise(
      avg_headwind = mean(headwind_tailwind, na.rm = TRUE),
      avg_crosswind = mean(wind_speed * sin(relative_wind_angle * (pi / 180)), na.rm = TRUE)
    ) %>%
    pivot_wider(names_from = phase, values_from = c(avg_headwind, avg_crosswind), values_fill = 0)
  
  # Power indicators during climb and descent
  power_indicators <- selected_data %>%
    filter(phase %in% c("climb", "descent")) %>%
    group_by(flight_id, phase) %>%
    summarise(
      avg_power_ratio = mean(vertical_rate / groundspeed, na.rm = TRUE)  # Vertical rate-to-groundspeed ratio
    ) %>%
    pivot_wider(names_from = phase, values_from = avg_power_ratio, values_fill = 0, names_prefix = "power_ratio_")
  
  # Effective climb length in terms of time and distance
  climb_length <- selected_data %>%
    filter(phase == "climb") %>%
    group_by(flight_id) %>%
    summarise(
      climb_duration = sum(as.numeric(difftime(lead(timestamp), timestamp, units = "secs")), na.rm = TRUE),
      climb_distance = sum(haversine_distance(latitude, longitude, lead(latitude), lead(longitude)), na.rm = TRUE)
    )
  
  # Merge all feature sets together into the final summary
  final_summary <- time_spent_wide %>%
    left_join(additional_metrics, by = "flight_id") %>%
    left_join(phase_summary, by = "flight_id") %>%
    left_join(acceleration_summary, by = "flight_id") %>%
    left_join(altitude_iqr_per_phase, by = "flight_id") %>%
    left_join(wind_phase_summary, by = "flight_id") %>% 
    left_join(histogram_features, by = "flight_id") %>%
    #left_join(spline_features, by = "flight_id") %>%
    #left_join(fourier_features, by = "flight_id") %>%
    left_join(phase_transitions, by = "flight_id") %>% 
    left_join(wind_phase_summary, by = "flight_id") %>%
    left_join(segment_summary, by = "flight_id") %>%
    left_join(power_indicators, by = "flight_id") %>%
    left_join(climb_length, by = "flight_id")
  
  # Create CSV file name based on parquet file name
  csv_file_name <- gsub("\\.parquet$", ".csv", basename(file_path))
  
  # Write final_summary to CSV
  write_csv(final_summary, file.path("processed_trajectories_v5", csv_file_name))
  
  # Cleanup
  rm(list = ls())
  gc()
  
}

# Capture command-line arguments
args <- commandArgs(trailingOnly = TRUE)

# Extract the first argument
start_index <- as.numeric(args[1])
end_index <- start_index# + 1

# Print the parameter (or use it in your script)
cat("start_index:", start_index, "\t", "end_index:", end_index)

# Get a list of files to process
file_list <- list.files(path = "trajectories/", pattern = "*.parquet", full.names = TRUE)

# Iterate over files
walk(file_list[start_index:end_index], process_file)

combine_trajectories.R

library(tidyverse)
library(lubridate)
library(readr)

# Define the path to the directory containing CSV files
csv_directory <- "processed_trajectories_v5/"
file_list <- list.files(path = csv_directory, pattern = "*.csv", full.names = TRUE)

# Function to read and clean each file
read_and_clean_file <- function(file) {
  data <- read_csv(file, col_types = cols(.default = "c"))  # Read all columns as characters
  # Perform type conversion for specific columns
  data <- data %>%
    mutate(
      across(everything(), as.numeric),  # Try to convert all columns to numeric
      .names = "clean_{col}"
    ) %>%
    mutate(across(where(is.character), as.character))  # Convert remaining columns back to character if needed
  return(data)
}

# Apply the function to each file and combine results
combined_df <- file_list %>%
  lapply(read_and_clean_file) %>%
  bind_rows()  # Combine all data into a single DataFrame

write_csv(combined_df, "combined_trajectories_v5.csv")

Feature Engineering

Aircraft Type Clustering

For the first time:

aircraft_type_summary <- challenge_set %>%
  group_by(aircraft_type) %>%
  summarize(
    count = n(),
    distinct_operators = n_distinct(airline),
    median_flown_distance = median(flown_distance),
    min_flown_distance = min(flown_distance),
    max_flown_distance = max(flown_distance),
    median_flown_velocity = median(flight_duration) / median(flown_distance),
    max_flown_velocity = max(flight_duration / flown_distance),
    median_tow = median(tow, na.rm = TRUE),
    p25_tow = quantile(tow, 0.25, na.rm = TRUE)
  )

# do kmeans
set.seed(123)
kmeans_result <- aircraft_type_summary %>%
  select(count, distinct_operators, median_flown_distance, median_flown_velocity, median_tow) %>%
  scale() %>%
  kmeans(centers = 5)

# Add the cluster labels to the summarized data
aircraft_type_summary$cluster = kmeans_result$cluster

# Write to file for use later
aircraft_type_summary %>%
  select("aircraft_type", "cluster", "median_tow", "median_flown_distance", "count") %>%
  arrange(cluster) %>%
  write_csv("aircraft_type_clusters_kmeans.csv")

# Plot clusters
ggplot(aircraft_type_summary, aes(x = median_flown_distance, y = median_tow, size = log(count), color = factor(cluster))) +
  geom_point() +
  theme_minimal() +
  labs(
    title = "K-means Clustering of Aircraft Type",
    x = "Median Flown Distance",
    y = "Median TOW"
  )

On future runs:

aircraft_type_clusters_kmeans <- read_csv("aircraft_type_clusters_kmeans.csv") %>%
  select("aircraft_type", "cluster") %>% 
  rename("aircraft_type_cluster" = "cluster")

aircraft_type_clusters_kmeans$aircraft_type_cluster = as.factor(aircraft_type_clusters_kmeans$aircraft_type_cluster)

challenge_set <- merge(x = challenge_set, y = aircraft_type_clusters_kmeans, by = "aircraft_type", all.x = TRUE)

Linear Regression

# Fit linear regression models by aircraft_type and get predicted values
challenge_set <- challenge_set %>%
  group_by(aircraft_type) %>%
  do({
    model <- lm(tow ~ flown_distance, data = na.omit(.))
    data.frame(., lm_predicted_tow = predict(model, newdata = .))
  }) %>%
  ungroup()

# Fit linear regression models by aircraft_type and extract R²
r_squared_values <- challenge_set %>%
  group_by(aircraft_type) %>%
  do(model = lm(tow ~ flown_distance, data = na.omit(.))) %>%
  summarise(aircraft_type,
            lm_predicted_tow_r_squared = summary(model)$r.squared
            )

challenge_set <- merge(x = challenge_set, y = r_squared_values, by = "aircraft_type", all.x = TRUE)

Feature Transformations

# Loop over each column and collapse rare categories
for (column in columns_to_collapse_prop) {
  challenge_set <- collapse_rare_categories_by_prop(challenge_set, column, prop_threshold = 0.0025, new_category = "Rare")
}

columns_to_collapse_abs <- c("adep", "ades", "country_code_ades", "country_code_adep")

# Loop over each column and collapse rare categories
for (column in columns_to_collapse_abs) {
  challenge_set <- collapse_rare_categories_by_count(challenge_set, column, min_count = 3, new_category = "Rare")
}
challenge_set <- challenge_set %>% 
  mutate(
    offblock_time_of_day = case_when(
      hour(actual_offblock_time) %in% c(6,7,8,9,10,11) ~ 'Morning',
      hour(actual_offblock_time) %in% c(12,13,14,15,16,17) ~ 'Afternoon',
      hour(actual_offblock_time) %in% c(18,19,20,21,22,23) ~ 'Evening',
      hour(actual_offblock_time) %in% c(0,1,2,3,4,5) ~ 'Night',
      .default = 'Unknown'
    ),
    offblock_is_peak = case_when(
      hour(actual_offblock_time) %in% c(7,8,17,18) ~ 'Peak',
      .default = 'Off Peak'
    ),
    arrival_time_of_day = case_when(
      hour(arrival_time) %in% c(6,7,8,9,10,11) ~ 'Morning',
      hour(arrival_time) %in% c(12,13,14,15,16,17) ~ 'Afternoon',
      hour(arrival_time) %in% c(18,19,20,21,22,23) ~ 'Evening',
      hour(arrival_time) %in% c(0,1,2,3,4,5) ~ 'Night',
      .default = 'Unknown'
    ),
    arrival_is_peak = case_when(
      hour(arrival_time) %in% c(7,8,17,18) ~ 'Peak',
      .default = 'Off Peak'
    ),
    day_of_week = wday(date, label = TRUE),
    is_weekend = if_else(wday(date, label = TRUE) %in% c('Sat', 'Sun'), TRUE, FALSE),
    week_of_year = as.factor(isoweek(date)),
    month_of_year = month(date, label = TRUE),
    is_domestic = if_else(country_code_adep == country_code_ades, TRUE, FALSE),
    flown_velocity = flown_distance / flight_duration,
    aircraft_family = substr(aircraft_type, 1, 3),
    distance_category = cut(flown_distance, breaks = c(0, 1000, 2700, Inf), labels = c("short", "medium", "long")),
    distance_bucket = coalesce(cut(flown_distance, 
                          breaks = seq(0, quantile(flown_distance, 0.999, na.rm = TRUE) + 500, by = 500), 
                          right = FALSE, 
                          include.lowest = TRUE), 'other'),
    flown_velocity_bucket = cut(flown_distance / flight_duration, breaks = seq(0, max(flown_distance / flight_duration, na.rm = TRUE) + 1, by = 0.5), right = FALSE, include.lowest = TRUE),
    relative_distance_adep_ades = cut(flown_distance / ave(flown_distance, adep, ades, FUN = mean), breaks = 10),
    relative_distance_aircraft_type = flown_distance / ave(flown_distance, aircraft_type, FUN = mean),
    relative_flown_velocity_aircraft_type = (flown_distance / flight_duration) / ave((flown_distance / flight_duration), aircraft_type, FUN = mean),
    relative_distance_adep_ades_aircraft_type = cut(flown_distance / ave(flown_distance, adep, ades, aircraft_type, FUN = mean), breaks = 10),
    relative_distance_airline = flown_distance / ave(flown_distance, airline, FUN = mean),
    relative_distance_aircraft_type_airline = flown_distance / ave(flown_distance, aircraft_type, airline, FUN = mean),
    cruise_duration = pmax(flight_duration - 30, 0),
    flight_minus_taxiout_duration_bucket = coalesce(cut(pmax(flight_duration - taxiout_time, 0), 
                                               breaks = seq(0, quantile(pmax(flight_duration - taxiout_time, 0), 0.999, na.rm = TRUE) + 30, by = 30),  # Adjust flight_minus_taxiout_duration_bucket based on 99.9th percentile
                                               right = FALSE, 
                                               include.lowest = TRUE), 'other'),
    departure_traffic = ave(flight_id, adep, date, FUN = length),
    arrival_traffic = ave(flight_id, ades, date, FUN = length),
    departure_traffic_squared = departure_traffic ^ 2,
    arrival_traffic_squared = arrival_traffic ^ 2,
    departure_traffic_bucket = cut(departure_traffic, breaks = 10),
    arrival_traffic_bucket = cut(arrival_traffic, breaks = 10),
    product_departure_traffic_arrival_traffic = departure_traffic * arrival_traffic,
    log_flight_duration = log1p(flight_duration),
    log_flown_distance = log1p(flown_distance),
    log_flown_distance_bucket = cut(log1p(flown_distance), breaks = 10),
    log_flown_velocity = log1p(flown_distance / flight_duration),
    flown_distance_squared = flown_distance ^ 2,
    flown_distance_sqrt = sqrt(flown_distance),
    flown_velocity_squared = (flown_distance / flight_duration) ^ 2,
    flown_velocity_sqrt = sqrt(flown_distance / flight_duration),
    taxiout_time_bucket = cut(taxiout_time, breaks = 10),
    taxiout_time_squared = taxiout_time ^ 2,
    taxiout_to_flight_duration_ratio = taxiout_time / flight_duration,
    taxiout_to_flight_duration_sqrt = sqrt(taxiout_time / flight_duration),
    log_great_circle_distance_adep_ades = log1p(great_circle_distance_adep_ades),
    great_circle_distance_squared = great_circle_distance_adep_ades ^ 2,
    great_circle_distance_sqrt = sqrt(great_circle_distance_adep_ades),
    log_median_altitude = log1p(median_altitude),
    median_altitude_squared = median_altitude ^ 2,
    log_median_groundspeed = log1p(median_groundspeed),
    altitude_range = p75_altitude - p25_altitude,
    groundspeed_range = p75_groundspeed - p25_groundspeed,
    altitude_range_bucket = cut(altitude_range, breaks = 10),
    groundspeed_range_bucket = cut(groundspeed_range, breaks = 10),
    flown_distance_over_great_circle = flown_distance / great_circle_distance_adep_ades,
    great_circle_distance_bucket = coalesce(cut(great_circle_distance_adep_ades, 
                                       breaks = seq(0, quantile(great_circle_distance_adep_ades, 0.999, na.rm = TRUE) + 500, by = 500),  # Adjust great_circle_distance_bucket based on 99.9th percentile
                                       right = FALSE, 
                                       include.lowest = TRUE), 'other'), 
    total_time_ascent_squared = total_time_ascent ^ 2,
    total_time_ascent_bucket = cut(total_time_ascent, breaks = 10),
    total_time_descent_squared = total_time_descent ^ 2,
    total_time_descent_bucket = cut(total_time_descent, breaks = 10),
    pct_total_time_ascent = total_time_ascent / flight_duration,
    pct_total_time_descent = total_time_descent / flight_duration,
    mtow_oew_diff = mtow - oew,
    pct_max_flown_distance_by_aircraft_type = cut((flown_distance / ave(flown_distance, aircraft_type, FUN = max)), breaks = 10),
    headwind_tailwind_bucket = cut(avg_headwind_tailwind, breaks = 10),
    avg_humidity_sqrt = sqrt(avg_humidity),
    avg_humidity_bucket = cut(avg_humidity_sqrt, breaks = 10),
    flown_distance_pct_range = flown_distance / range,
    predicted_tow_category = cut(lm_predicted_tow, breaks = quantile(lm_predicted_tow, probs = seq(0, 1, 0.1)), include.lowest = TRUE),
    lm_predicted_tow_squared = lm_predicted_tow^2,
    lm_predicted_tow_cubed = lm_predicted_tow^3,
    log_lm_predicted_tow = log(lm_predicted_tow + 1),
    r_squared_category = cut(lm_predicted_tow_r_squared,
                                         breaks = c(-Inf, 0, 0.5, 0.7, 0.9, 1),
                                         labels = c("Very Poor", "Poor", "Average", "Good", "Excellent"),
                                         include.lowest = TRUE)

  )

Interaction Variables

# aircraft type cluster
challenge_set$interaction_aircraft_type_cluster_log_great_circle_distance_adep_ades <- interaction(challenge_set$aircraft_type_cluster, challenge_set$log_great_circle_distance_adep_ades)
challenge_set$interaction_aircraft_type_cluster_log_flown_distance <- interaction(challenge_set$aircraft_type_cluster, challenge_set$log_flown_distance)
challenge_set$interaction_aircraft_type_cluster_log_flight_duration <- interaction(challenge_set$aircraft_type_cluster, challenge_set$log_flight_duration)
challenge_set$interaction_aircraft_type_cluster_flown_distance_squared <- interaction(challenge_set$aircraft_type_cluster, challenge_set$flown_distance_squared)
challenge_set$interaction_aircraft_type_cluster_cruise_duration <- interaction(challenge_set$aircraft_type_cluster, challenge_set$cruise_duration)
challenge_set$interaction_aircraft_type_cluster_total_time_ascent <- interaction(challenge_set$aircraft_type_cluster, challenge_set$total_time_ascent)
challenge_set$interaction_aircraft_type_cluster_airline <- interaction(challenge_set$airline, challenge_set$aircraft_type_cluster)
challenge_set$interaction_aircraft_type_cluster_adep <- interaction(challenge_set$adep, challenge_set$aircraft_type_cluster)
challenge_set$interaction_aircraft_type_cluster_ades <- interaction(challenge_set$ades, challenge_set$aircraft_type_cluster)
challenge_set$interaction_aircraft_type_cluster_distance_bucket <- interaction(challenge_set$distance_bucket, challenge_set$aircraft_type_cluster)
challenge_set$interaction_aircraft_type_cluster_flown_velocity_bucket <- interaction(challenge_set$aircraft_type_cluster, challenge_set$flown_velocity_bucket)
challenge_set$interaction_aircraft_type_cluster_median_altitude <- interaction(challenge_set$aircraft_type_cluster, challenge_set$median_altitude)
challenge_set$interaction_aircraft_type_cluster_log_median_altitude <- interaction(challenge_set$aircraft_type_cluster, challenge_set$log_median_altitude)
challenge_set$interaction_aircraft_type_cluster_p25_altitude <- interaction(challenge_set$aircraft_type_cluster, challenge_set$p25_altitude)
challenge_set$interaction_aircraft_type_cluster_max_altitude <- interaction(challenge_set$aircraft_type_cluster, challenge_set$max_altitude)
challenge_set$interaction_aircraft_type_cluster_total_time_ascent <- interaction(challenge_set$total_time_ascent, challenge_set$aircraft_type_cluster)
challenge_set$interaction_aircraft_type_cluster_total_time_descent <- interaction(challenge_set$total_time_descent, challenge_set$aircraft_type_cluster)
challenge_set$interaction_aircraft_type_cluster_total_time_climb_4 <- interaction(challenge_set$total_time_climb_4, challenge_set$aircraft_type_cluster)
challenge_set$interaction_aircraft_type_cluster_percentage_time_climb_4 <- interaction(challenge_set$percentage_time_climb_4, challenge_set$aircraft_type_cluster)
challenge_set$interaction_aircraft_type_cluster_month_of_year <- interaction(challenge_set$aircraft_type_cluster, challenge_set$month_of_year)
challenge_set$interaction_aircraft_type_cluster_country_code_ades <- interaction(challenge_set$aircraft_type_cluster, challenge_set$country_code_ades)
challenge_set$interaction_aircraft_type_cluster_log_flown_velocity <- interaction(challenge_set$aircraft_type_cluster, challenge_set$log_flown_velocity)
challenge_set$interaction_aircraft_family_distance_category <- interaction(challenge_set$aircraft_family, challenge_set$distance_category)
challenge_set$interaction_aircraft_type_cluster_product_departure_traffic_arrival_traffic <- interaction(challenge_set$aircraft_type_cluster, challenge_set$product_departure_traffic_arrival_traffic)
challenge_set$interaction_aircraft_type_distance_bucket_airline <- interaction(challenge_set$distance_bucket, challenge_set$aircraft_type, challenge_set$aircraft_type)

# aircraft type
challenge_set$interaction_aircraft_type_distance_bucket <- interaction(challenge_set$distance_bucket, challenge_set$aircraft_type)
challenge_set$interaction_aircraft_type_flight_minus_taxiout_duration_bucket <- interaction(challenge_set$flight_minus_taxiout_duration_bucket, challenge_set$aircraft_type)
challenge_set$interaction_aircraft_type_flown_velocity_bucket <- interaction(challenge_set$aircraft_type, challenge_set$flown_velocity_bucket)
challenge_set$interaction_aircraft_type_median_altitude <- interaction(challenge_set$aircraft_type, challenge_set$median_altitude)
challenge_set$interaction_aircraft_type_offblock_time_of_day <- interaction(challenge_set$offblock_time_of_day, challenge_set$aircraft_type)
challenge_set$interaction_aircraft_type_month_of_year <- interaction(challenge_set$aircraft_type, challenge_set$month_of_year)
challenge_set$interaction_aircraft_type_offblock_is_peak <- interaction(challenge_set$aircraft_type, challenge_set$offblock_is_peak)
challenge_set$interaction_aircraft_type_pct_max_distance_flown_by_aircraft_type <- interaction(challenge_set$aircraft_type, challenge_set$pct_max_flown_distance_by_aircraft_type)
challenge_set$interaction_aircraft_type_altitude_range_bucket <- interaction(challenge_set$aircraft_type, challenge_set$altitude_range_bucket)
challenge_set$interaction_aircraft_type_groundspeed_range_bucket <- interaction(challenge_set$aircraft_type, challenge_set$groundspeed_range_bucket)
challenge_set$interaction_aircraft_type_elevation_ft_adep <- interaction(challenge_set$aircraft_type, challenge_set$elevation_ft_adep)
challenge_set$interaction_aircraft_type_country_code_adep <- interaction(challenge_set$aircraft_type, challenge_set$country_code_adep)
challenge_set$interaction_aircraft_type_country_code_ades <- interaction(challenge_set$aircraft_type, challenge_set$country_code_ades)
challenge_set$interaction_aircraft_type_airline <- interaction(challenge_set$aircraft_type, challenge_set$airline)
challenge_set$interaction_aircraft_type_adep <- interaction(challenge_set$aircraft_type, challenge_set$adep)
challenge_set$interaction_aircraft_type_ades <- interaction(challenge_set$aircraft_type, challenge_set$ades)
challenge_set$interaction_aircraft_type_aircraft_family <- interaction(challenge_set$aircraft_type, challenge_set$aircraft_family)
challenge_set$interaction_aircraft_type_taxiout_time_bucket <- interaction(challenge_set$aircraft_type, challenge_set$taxiout_time_bucket)
challenge_set$interaction_aircraft_type_month_of_year <- interaction(challenge_set$aircraft_type, challenge_set$month_of_year)
challenge_set$interaction_aircraft_type_offblock_time_of_day <- interaction(challenge_set$aircraft_type, challenge_set$offblock_time_of_day)
challenge_set$interaction_aircraft_type_offblock_is_peak <- interaction(challenge_set$aircraft_type, challenge_set$offblock_is_peak)
challenge_set$interaction_aircraft_type_total_time_ascent_bucket <- interaction(challenge_set$aircraft_type, challenge_set$total_time_ascent_bucket)
challenge_set$interaction_aircraft_type_total_time_descent_bucket <- interaction(challenge_set$aircraft_type, challenge_set$total_time_descent_bucket)
challenge_set$interaction_aircraft_type_departure_traffic_bucket <- interaction(challenge_set$aircraft_type, challenge_set$departure_traffic_bucket)
challenge_set$interaction_aircraft_type_distance_category <- interaction(challenge_set$aircraft_type, challenge_set$distance_category)
challenge_set$interaction_aircraft_type_product_departure_traffic_arrival_traffic <- interaction(challenge_set$aircraft_type, challenge_set$product_departure_traffic_arrival_traffic)
challenge_set$interaction_aircraft_type_predicted_tow <- interaction(challenge_set$aircraft_type, challenge_set$lm_predicted_tow)


# airline
interaction_airline_flight_duration <- interaction(challenge_set$airline, challenge_set$flight_duration)
interaction_airline_great_circle_distance <- interaction(challenge_set$airline, challenge_set$great_circle_distance_adep_ades)
interaction_airline_flown_distance <- interaction(challenge_set$airline, challenge_set$flown_distance)
challenge_set$interaction_airline_aircraft_type <- interaction(challenge_set$airline, challenge_set$aircraft_type)
challenge_set$interaction_airline_distance_bucket <- interaction(challenge_set$distance_bucket, challenge_set$airline)
challenge_set$interaction_airline_distance_category <- interaction(challenge_set$distance_category, challenge_set$airline)
challenge_set$interaction_airline_flight_minus_taxiout_duration_bucket <- interaction(challenge_set$flight_minus_taxiout_duration_bucket, challenge_set$airline)
challenge_set$interaction_airline_offblock_time_of_day <- interaction(challenge_set$distance_bucket, challenge_set$offblock_time_of_day)
challenge_set$interaction_airline_product_departure_traffic_arrival_traffic <- interaction(challenge_set$airline, challenge_set$product_departure_traffic_arrival_traffic)
interaction_airline_offblock_is_peak <- interaction(challenge_set$airline, challenge_set$offblock_is_peak)
interaction_airline_month_of_year <- interaction(challenge_set$airline, challenge_set$month_of_year)
interaction_airline_day_of_week <- interaction(challenge_set$airline, challenge_set$day_of_week)
interaction_airline_distance_category <- interaction(challenge_set$airline, challenge_set$distance_category)
interaction_airline_country_code_adep <- interaction(challenge_set$airline, challenge_set$country_code_adep)
interaction_airline_country_code_ades <- interaction(challenge_set$airline, challenge_set$country_code_ades)
interaction_airline_departure_traffic <- interaction(challenge_set$airline, challenge_set$departure_traffic)
interaction_airline_taxiout_time <- interaction(challenge_set$airline, challenge_set$taxiout_time)
interaction_airline_total_time_ascent <- interaction(challenge_set$airline, challenge_set$total_time_ascent)
interaction_airline_flight_minus_taxiout_duration_bucket <- interaction(challenge_set$airline, challenge_set$flight_minus_taxiout_duration_bucket)
interaction_airline_flown_velocity_bucket <- interaction(challenge_set$airline, challenge_set$flown_velocity_bucket)
interaction_airline_altitude_range_bucket <- interaction(challenge_set$airline, challenge_set$altitude_range_bucket)
interaction_airline_avg_humidity_bucket <- interaction(challenge_set$airline, challenge_set$avg_humidity_bucket)
interaction_airline_distance_bucket <- interaction(challenge_set$airline, challenge_set$distance_bucket)
interaction_airline_headwind_tailwind_bucket <- interaction(challenge_set$airline, challenge_set$headwind_tailwind_bucket)


# adep 
challenge_set$interaction_adep_airline <- interaction(challenge_set$adep, challenge_set$airline)
challenge_set$interaction_adep_aircraft_type <- interaction(challenge_set$adep, challenge_set$aircraft_type)
challenge_set$interaction_adep_month_of_year <- interaction(challenge_set$adep, challenge_set$month_of_year)
challenge_set$interaction_adep_ades <- interaction(challenge_set$adep, challenge_set$ades)
challenge_set$interaction_adep_ades_airline <- interaction(challenge_set$interaction_adep_ades, challenge_set$airline)
challenge_set$interaction_adep_distance_bucket <- interaction(challenge_set$adep, challenge_set$distance_bucket)
challenge_set$interaction_adep_flown_distance_squared <- interaction(challenge_set$adep, challenge_set$flown_distance_squared)

# ades
challenge_set$interaction_ades_airline <- interaction(challenge_set$ades, challenge_set$airline)
challenge_set$interaction_ades_aircraft_type <- interaction(challenge_set$ades, challenge_set$aircraft_type)
challenge_set$interaction_ades_month_of_year <- interaction(challenge_set$ades, challenge_set$month_of_year)
challenge_set$interaction_country_code_ades_country_code_adep <- interaction(challenge_set$country_code_ades, challenge_set$country_code_adep)
challenge_set$interaction_country_code_ades_flown_distance_bucket <- interaction(challenge_set$country_code_ades, challenge_set$distance_bucket)
challenge_set$interaction_country_code_ades_flown_distance_squared <- interaction(challenge_set$country_code_ades, challenge_set$flown_distance_squared)

# time
challenge_set$interaction_day_of_week_departure_traffic <- interaction(challenge_set$day_of_week, challenge_set$departure_traffic)
challenge_set$interaction_month_of_year_aircraft_type_cluster_flown_distance_bucket <- interaction(challenge_set$month_of_year, challenge_set$aircraft_type_cluster, challenge_set$distance_bucket)
challenge_set$interaction_distance_bucket_offblock_time_of_day <- interaction(challenge_set$distance_bucket, challenge_set$offblock_time_of_day)

# great_circle_distance_bucket
challenge_set$interaction_great_circle_distance_bucket_aircraft_type_cluster <- interaction(challenge_set$great_circle_distance_bucket, challenge_set$aircraft_type_cluster)
challenge_set$interaction_great_circle_distance_bucket_aircraft_type <- interaction(challenge_set$great_circle_distance_bucket, challenge_set$aircraft_type)
challenge_set$interaction_great_circle_distance_bucket_flown_velocity_bucket <- interaction(challenge_set$great_circle_distance_bucket, challenge_set$flown_velocity_bucket)
challenge_set$interaction_great_circle_distance_bucket_country_code_ades <- interaction(challenge_set$great_circle_distance_bucket, challenge_set$country_code_ades)
challenge_set$interaction_great_circle_distance_bucket_median_altitude <- interaction(challenge_set$great_circle_distance_bucket, challenge_set$median_altitude)

# continuous variable interactions
challenge_set$interaction_log_great_circle_distance_adep_ades_log_median_altitude <- interaction(challenge_set$log_great_circle_distance_adep_ades, challenge_set$log_median_altitude)
challenge_set$interaction_distance_category_median_groundspeed <- interaction(challenge_set$distance_category, challenge_set$median_groundspeed)
challenge_set$interaction_total_time_ascent_aircraft_type <- interaction(challenge_set$total_time_ascent, challenge_set$aircraft_type)
challenge_set$interaction_total_time_descent_aircraft_type <- interaction(challenge_set$total_time_descent, challenge_set$aircraft_type)
challenge_set$interaction_altitude_range_bucket_distance_bucket <- interaction(challenge_set$altitude_range_bucket, challenge_set$distance_bucket)
challenge_set$interaction_groundspeed_range_distance_bucket <- interaction(challenge_set$groundspeed_range_bucket, challenge_set$distance_bucket)

# Catchall
challenge_set$interaction_taxiout_time_flown_distance <- interaction(challenge_set$taxiout_time, challenge_set$flown_distance)
challenge_set$interaction_taxiout_time_great_circle_distance_adep_ades <- interaction(challenge_set$taxiout_time, challenge_set$great_circle_distance_adep_ades)
challenge_set$interaction_log_flown_distance_offblock_time_of_day <- interaction(challenge_set$log_flown_distance, challenge_set$offblock_time_of_day)
challenge_set$interaction_total_time_ascent_offblock_is_peak <- interaction(challenge_set$total_time_ascent, challenge_set$offblock_is_peak)
challenge_set$interaction_month_of_year_log_flown_distance <- interaction(challenge_set$month_of_year, challenge_set$log_flown_distance)
challenge_set$interaction_headwind_tailwind_bucket_altitude_range_bucket <- interaction(challenge_set$headwind_tailwind_bucket, challenge_set$altitude_range_bucket)
challenge_set$interaction_avg_humidity_bucket_total_time_ascent_bucket <- interaction(challenge_set$avg_humidity_bucket, challenge_set$total_time_ascent_bucket)
challenge_set$interaction_oew_flown_distance <- interaction(challenge_set$oew, challenge_set$flown_distance)
challenge_set$interaction_mtw_oew_diff_distance_bucket <- interaction(challenge_set$mtow_oew_diff, challenge_set$distance_bucket)
challenge_set$interaction_country_code_adep_elevation_ft_adep <- interaction(challenge_set$country_code_adep, challenge_set$elevation_ft_adep)
challenge_set$interaction_country_code_ades_elevation_ft_ades <- interaction(challenge_set$country_code_ades, challenge_set$elevation_ft_ades)
challenge_set$interaction_departure_traffic_distance_category <- interaction(challenge_set$departure_traffic, challenge_set$distance_category)
challenge_set$interaction_log_flown_distance_bucket_altitude_range_bucket <- interaction(challenge_set$log_flown_distance_bucket, challenge_set$altitude_range_bucket)
challenge_set$interaction_offblock_time_of_day_log_flown_distance_bucket <- interaction(challenge_set$offblock_time_of_day, challenge_set$log_flown_distance_bucket)
challenge_set$interaction_flight_duration_day_of_week <- interaction(challenge_set$flight_duration, challenge_set$day_of_week)
challenge_set$interaction_flight_duration_is_weekend <- interaction(challenge_set$flight_duration, challenge_set$is_weekend)
challenge_set$interaction_offblock_is_peak_month_of_year <- interaction(challenge_set$offblock_is_peak, challenge_set$month_of_year)
challenge_set$interaction_day_of_week_arrival_time_of_day <- interaction(challenge_set$day_of_week, challenge_set$arrival_time_of_day)
challenge_set$interaction_week_of_year_offblock_time_of_day <- interaction(challenge_set$week_of_year, challenge_set$offblock_time_of_day)
challenge_set$interaction_total_time_ascent_is_weekend <- interaction(challenge_set$total_time_ascent, challenge_set$is_weekend)
challenge_set$interaction_month_of_year_percentage_time_climb_4 <- interaction(challenge_set$month_of_year, challenge_set$percentage_time_climb_4)
challenge_set$interaction_avg_humidity_bucket_flight_duration <- interaction(challenge_set$avg_humidity_bucket, challenge_set$flight_duration)
challenge_set$interaction_departure_traffic_bucket_arrival_traffic_bucket <- interaction(challenge_set$departure_traffic_bucket, challenge_set$arrival_traffic_bucket)
challenge_set$interaction_flight_duration_predicted_tow <- challenge_set$flight_duration * challenge_set$lm_predicted_tow
challenge_set$interaction_flown_distance_predicted_tow <- challenge_set$flown_distance * challenge_set$lm_predicted_tow
challenge_set$interaction_flown_distance_r_squared <- challenge_set$flown_distance * challenge_set$lm_predicted_tow_r_squared
challenge_set$interaction_taxiout_time_r_squared <- challenge_set$taxiout_time * challenge_set$lm_predicted_tow_r_squared
challenge_set$interaction_aircraft_type_r_squared <- interaction(challenge_set$aircraft_type, challenge_set$lm_predicted_tow_r_squared)
challenge_set$interaction_r_squared_category_predicted_tow_category <- interaction(challenge_set$r_squared_category, challenge_set$predicted_tow_category)

Model Training and Evaluation

Divide Known Data into Test and Train Sets

# Remove submission set rows from challenge_set
challenge_set <- challenge_set %>% filter(!is.na(tow))

# Test and Train split
sample_index <- sample(1:nrow(challenge_set), 0.8 * nrow(challenge_set))
train_data <- challenge_set[sample_index, ]
test_data <- challenge_set[-sample_index, ]

Train Random Forest

# Train random forest regression model
rf_model <- ranger(
  formula, 
  data = train_data, 
  importance = 'impurity', 
  num.trees = 500, 
  min.node.size = 5,
  save.memory = TRUE,
  replace = TRUE,
  sample.fraction = 0.7
)

# View the model summary
print(rf_model)

Generate Predictions and Measure Accuracy

# Make predictions on the test set
predictions <- predict(rf_model, data = test_data)$predictions

# Calculate MAPE
mape <- mean(abs((test_data$tow - predictions) / test_data$tow)) * 100
print(paste("MAPE: ", mape))

# Calculate MAE
mae <- mean(predictions - test_data$tow)
print(paste("Mean Absolute Error: ", mae))

# Calculate MSE
mse <- mean((predictions - test_data$tow)^2)
print(paste("Mean Squared Error: ", mse))

# Calculate RMSE
rmse <- sqrt(mse)
print(paste("Root Mean Squared Error: ", rmse))

# Calculate R2
r_squared <- 1 - (sum((predictions - test_data$tow)^2) / sum((mean(train_data$tow) - test_data$tow)^2))
print(paste("R-squared: ", r_squared))