Analyzing and Predicting Delays for the NYC Ferry System

2026 MUSA Practicum Project Supporting the NYC Economic Development Corportation

Author

Yiming ‘Ming’ Cao, Sujan Kakumanu, Tyler Maynard, Guangze ‘Simon’ Sun

Code
# Set options
options(scipen = 999, dplyr.summarise.inform = FALSE)

# Packages
if(!require(pacman)){install.packages("pacman"); library(pacman, quietly = T)}
p_load(tidyverse, sf, viridis, lubridate, kableExtra, gridExtra, ggthemes, ggspatial, leaflet, spatstat, spdep, htmlwidgets, concaveman, patchwork, yardstick, arrow, broom, car, scales)

1 Use Case

NYC Ferry launched on May 1st, 2017 with a broad goal: connect underserved communities, complement existing transit, and support growing neighborhoods. That mission is reflected in how the system is designed — single-seat rides are possible between communities as far apart as the Bronx and the Rockaways. Since launch, the system has crossed 50 million riders and today spans 5 routes operated by 38 custom-built vessels across New York Harbor.

This project was done in partnership with the NYC Economic Development Corporation, the operator of NYC Ferry, to support on-time performance through better operational forecasting. The use case centers on an operational manager who needs to allocate resources — upland agents, sweep boats, vessel assignments — ahead of anticipated delays.
This breaks down into two key deliverables. First, an associative model exploring the historical contributing factors to delays between landings. Second, a predictive model that estimates the probability of an excessive delay (>10 minutes) between landings up to 48 hours ahead of a scheduled departure. The latter feeds into a web-based application designed for operational use.

2 Exploratory Data Analysis

2.1 Ridership and Trip Data

By: Guangze ‘Simon’ Sun and Yiming ‘Ming’ Cao

Definition of Delay

Arrival Time Variance (minutes): Actual Arrival − Scheduled Arrival

Thresholds:

  • Delayed: > 5 minutes
  • Excessively delayed: > 10 minutes

Rationale: Based on NYC Ferry operational definitions. The >5 minute threshold is used for stop-level late-rate analysis throughout this section, reflecting stable classification with less noise than stricter cutoffs.


Code
tz_local <- "America/New_York"

ridership_files <- c(
  "../exploratory_code/ridership_eda/NYCF_ridership_details_2024.01_06.csv",
  "../exploratory_code/ridership_eda/NYCF_ridership_details_2024.07_12.csv",
  "../exploratory_code/ridership_eda/NYCF_ridership_details_2025.01_06.csv"
)
stop_files <- c(
  "../exploratory_code/ridership_eda/NYCF_stop_data_2024.csv",
  "../exploratory_code/ridership_eda/NYCF_stop_data_2025.01_06.csv"
)

2.1.1 Overview

2.1.1.1 Scope & sample definition

This notebook performs exploratory data analysis (EDA) on compiled NYC Ferry stop-level operations data.

All downstream analysis uses records from 2024-01-01 to 2025-06-30 (inclusive), and excludes 2024-04-25 for route SB due to a known schedule mismatch issue. Passenger activity is measured using boarding_wake and alightings_wake throughout.

2.1.1.2 What this notebook produces

The notebook constructs a cleaned stop-level table, derives operational and reliability metrics (dwell, segment “water time,” delay, and late/early flags), and generates summary tables and plots across routes, months, and hours.

The final analysis sample is stored as dat (stop-level), which is also the only dataset exported at the end for downstream modeling and reporting.

2.1.2 Data Preparation

2.1.2.1 Import Data

The following code loads the compiled stop-level dataset and the ridership details dataset. Ridership details are used only to obtain vessel capacity for computing load factor.

Code
standardize_cols_ridership <- function(df) {
  nm <- names(df)
  nm <- ifelse(nm == "Original_On",  "Original_ON",  nm)
  nm <- ifelse(nm == "Original_Off", "Original_OFF", nm)
  nm <- ifelse(nm == "WheelChair_On",  "WheelChair_ON",  nm)
  nm <- ifelse(nm == "WheelChair_Off", "WheelChair_OFF", nm)
  names(df) <- nm
  df
}

ridership <- purrr::map_dfr(
  ridership_files,
  ~ readr::read_csv(.x, show_col_types = FALSE) %>%
    standardize_cols_ridership()
)

stops <- purrr::map_dfr(
  stop_files,
  ~ readr::read_csv(.x, show_col_types = FALSE)
)

glimpse(ridership)
Rows: 961,890
Columns: 15
$ Date1           <chr> "1/1/2024", "1/1/2024", "1/1/2024", "1/1/2024", "1/1/2…
$ Route           <chr> "ER", "ER", "ER", "ER", "ER", "ER", "ER", "ER", "ER", …
$ Direction       <chr> "NB", "NB", "NB", "NB", "NB", "NB", "NB", "SB", "SB", …
$ Trip_Pattern    <chr> "er_p11_gp_hps", "er_p11_gp_hps", "er_p11_gp_hps", "er…
$ Trip_ID         <dbl> 1101, 1101, 1101, 1101, 1101, 1101, 1101, 1102, 1102, …
$ Vessel_Id       <chr> "H201", "H201", "H201", "H201", "H201", "H201", "H201"…
$ Vessel_Capacity <dbl> 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150,…
$ Stop            <chr> "Hunters Point South", "Wall St/Pier 11", "Dumbo/Fulto…
$ Stop_Sequence   <dbl> NA, 1, 2, 3, 4, 5, 6, NA, 1, 2, 3, 4, 5, 6, NA, 1, 2, …
$ Departure_Time  <time> 10:53:00, 10:05:00, 10:13:00, 10:25:00, 10:33:00, 10:…
$ Original_ON     <dbl> 0, 8, 3, 1, 10, 16, 0, 0, 25, 37, 14, 13, 14, 12, 0, 2…
$ Original_OFF    <dbl> 0, 0, 3, 2, 2, 4, 27, 35, 0, 13, 2, 15, 6, 44, 15, 0, …
$ Left_Behind     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ WheelChair_ON   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ WheelChair_OFF  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Code
glimpse(stops)
Rows: 753,951
Columns: 27
$ Row_Number             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
$ `Sort Key`             <chr> "2024-01-011101ER  1", "2024-01-011101ER2", "20…
$ vehicle_id             <chr> "H201", "H201", "H201", "H201", "H201", "H201",…
$ trip_date              <chr> "1/1/2024", "1/1/2024", "1/1/2024", "1/1/2024",…
$ trip_no                <dbl> 1101, 1101, 1101, 1101, 1101, 1101, 1101, 1102,…
$ `Block Id`             <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,…
$ route_no               <chr> "ER", "ER", "ER", "ER", "ER", "ER", "ER", "ER",…
$ route_direction        <chr> "NB", "NB", "NB", "NB", "NB", "NB", "NB", "SB",…
$ route_pattern          <chr> "er_p11_gp_hps", "er_p11_gp_hps", "er_p11_gp_hp…
$ route_name             <chr> "East River", "East River", "East River", "East…
$ stop_name              <chr> "Wall St/Pier 11", "Dumbo/Fulton Ferry", "South…
$ scheduled_departure_dt <chr> "1/1/2024 10:05", "1/1/2024 10:13", "1/1/2024 1…
$ scheduled_arrival_dt   <chr> NA, "1/1/2024 10:13", "1/1/2024 10:25", "1/1/20…
$ actual_arrival_dt      <chr> NA, "1/1/2024 10:10", "1/1/2024 10:24", "1/1/20…
$ actual_departure_dt    <chr> "1/1/2024 10:06", "1/1/2024 10:15", "1/1/2024 1…
$ `Stop Sequence`        <dbl> 0, 1, 2, 3, 4, 5, 6, 2, 4, 5, 0, 1, 3, 6, 0, 5,…
$ boarding_wake          <dbl> 8, 3, 1, 10, 16, 0, 0, 14, 14, 12, 25, 37, 13, …
$ alightings_wake        <dbl> 0, 3, 2, 2, 4, 27, 0, 2, 6, 44, 0, 13, 15, 35, …
$ alightings             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ boardings              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lbr                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `Dwell Time`           <time> 00:04:49, 00:02:35, 00:03:06, 00:03:25, 00:03:…
$ dept_variance_flag     <chr> "On_Time", "On_Time", "On_Time", "On_Time", "On…
$ DepartureTimeVariance  <time> 00:01:16, 00:02:10, 00:02:06, 00:03:26, 00:04:…
$ ArrivalTimeVariance    <time>       NA, 00:02:39, 00:00:29, 00:00:20, 00:00:…
$ stop_id                <dbl> 87, 20, 8, 19, 18, 17, 4, 18, 8, 20, 4, 17, 19,…
$ ...27                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

2.1.2.2 Clean stops & build basic features

This step standardizes key fields and column names, cleans Sort Key, derives weekday/weekend indicators, and computes an onboard count per stop using cumulative boarding_wake minus alightings_wake within each trip ordered by Stop_Sequence.

Code
stops <- stops %>%
  mutate(`Sort Key` = str_replace_all(`Sort Key`, "\\s+", "")) %>%
  rename_with(~ str_replace_all(.x, "\\s+", "_")) %>%
  mutate(
    trip_no = as.character(trip_no),
    route_no = as.character(route_no),
    route_direction = as.character(route_direction),
    route_pattern = as.character(route_pattern),
    stop_name = as.character(stop_name),
    Stop_Sequence = as.integer(Stop_Sequence),
    sort_key_adj = paste0(str_sub(Sort_Key, 1, -2), Stop_Sequence),
    trip_date = mdy(trip_date),
    weekday = wday(trip_date, label = TRUE, abbr = FALSE),
    is_weekday = !wday(trip_date) %in% c(1, 7),
    lbr = coalesce(lbr, 0)
  ) %>%
  group_by(trip_date, trip_no, route_no, route_direction) %>%
  arrange(Stop_Sequence, .by_group = TRUE) %>%
  mutate(
    onboard = cumsum(coalesce(boarding_wake, 0)) - cumsum(coalesce(alightings_wake, 0))
  ) %>%
  ungroup() %>%
  select(-any_of(c("Row_Number", "Sort_Key", "route_name", "...27")))

2.1.2.3 Parse datetimes (NY local time)

Stop timestamps appear in different string formats across years. We parse both formats into POSIXct and keep them in America/New_York so that time-of-day patterns reflect local operations.

Code
parse_dt_nycf <- function(x, tz_local) {
  x_chr <- as.character(x)
  x_chr <- str_trim(na_if(x_chr, ""))

  has_ampm <- str_detect(x_chr, regex("\\b(am|pm)\\b", ignore_case = TRUE))
  has_ampm[is.na(has_ampm)] <- FALSE

  out <- as.POSIXct(rep(NA, length(x_chr)), tz = tz_local)

  idx_2024 <- !has_ampm & !is.na(x_chr)
  idx_2025 <-  has_ampm & !is.na(x_chr)

  out[idx_2024] <- mdy_hm(x_chr[idx_2024], tz = tz_local, quiet = TRUE)
  out[idx_2025] <- parse_date_time(
    x_chr[idx_2025],
    orders = c("mdY IMSp", "mdY IMp"),
    tz = tz_local,
    quiet = TRUE
  )

  out
}

stops <- stops %>%
  mutate(
    scheduled_arrival_dt   = parse_dt_nycf(scheduled_arrival_dt, tz_local),
    actual_arrival_dt      = parse_dt_nycf(actual_arrival_dt, tz_local),
    scheduled_departure_dt = parse_dt_nycf(scheduled_departure_dt, tz_local),
    actual_departure_dt    = parse_dt_nycf(actual_departure_dt, tz_local)
  )

2.1.2.4 Enforce last-stop departure as missing

For each trip, the final stop should not have a meaningful departure timestamp. We explicitly set actual_departure_dt to missing for last stops to avoid incorrect dwell and segment calculations.

Code
stops <- stops %>%
  arrange(trip_date, trip_no, route_no, route_direction, Stop_Sequence) %>%
  mutate(
    actual_departure_dt = if_else(
      row_number() == n(),
      as.POSIXct(NA, tz = tz_local),
      actual_departure_dt
    ),
    .by = c(trip_date, trip_no, route_no, route_direction)
  )

2.1.2.5 Compute delay / segment / dwell metrics

We compute arrival/departure delay relative to schedule, “water travel” time between consecutive stops (previous departure to current arrival), and dwell time at stops (arrival to departure). These derived metrics support distribution checks and late-rate analysis.

Code
stops <- stops %>%
  select(-any_of("dept_variance_flag")) %>%
  mutate(
    arr_diff = as.numeric(difftime(actual_arrival_dt, scheduled_arrival_dt, units = "mins")),
    dep_diff = as.numeric(difftime(actual_departure_dt, scheduled_departure_dt, units = "mins"))
  ) %>%
  arrange(trip_date, trip_no, route_no, route_direction, Stop_Sequence) %>%
  mutate(
    sched_water_min = as.numeric(difftime(scheduled_arrival_dt, lag(scheduled_departure_dt), units = "mins")),
    act_water_min   = as.numeric(difftime(actual_arrival_dt,    lag(actual_departure_dt),    units = "mins")),
    water_diff      = act_water_min - sched_water_min,
    .by = c(trip_date, trip_no, route_no, route_direction)
  ) %>%
  mutate(
    act_dwell_min = as.numeric(difftime(actual_departure_dt, actual_arrival_dt, units = "mins"))
  )

2.1.2.6 Define lateness / earliness flags

We define Late arrival as arr_diff > 5 minutes and Early departure as dep_diff < -1 minute. These thresholds are used consistently throughout the reliability summaries and plots.

Code
stops <- stops %>%
  mutate(
    arr_status = case_when(
      is.na(arr_diff) ~ NA_character_,
      arr_diff > 5    ~ "Late",
      TRUE            ~ "Not_Late"
    ),
    dep_status = case_when(
      is.na(dep_diff) ~ NA_character_,
      dep_diff < -1   ~ "Early",
      TRUE            ~ "Not_Early"
    )
  )

2.1.2.7 Join vessel capacity and compute load factor

We join Vessel_Capacity using four keys (trip_date, route_pattern, trip_no, stop_name) and compute load_factor = onboard / Vessel_Capacity.

Code
ridership_capacity_lookup <- ridership %>%
  transmute(
    trip_date     = mdy(Date1),
    route_pattern = as.character(Trip_Pattern),
    trip_no       = as.character(Trip_ID),
    stop_name     = as.character(Stop),
    Vessel_Capacity
  ) %>%
  filter(!is.na(Vessel_Capacity)) %>%
  distinct(trip_date, route_pattern, trip_no, stop_name, Vessel_Capacity)

stops <- stops %>%
  left_join(
    ridership_capacity_lookup,
    by = c(
      "trip_date" = "trip_date",
      "route_pattern" = "route_pattern",
      "trip_no" = "trip_no",
      "stop_name" = "stop_name"
    )
  ) %>%
  mutate(load_factor = onboard / Vessel_Capacity)

2.1.2.8 Define analysis sample

We construct the final analysis table dat by applying the study period filter and removing the known problematic day/route subset. All EDA that follows is based on dat unless explicitly noted.

Code
dat0 <- stops %>%
  filter(trip_date >= ymd("2024-01-01"),
         trip_date <= ymd("2025-06-30"))

removed_0425_sb <- dat0 %>%
  filter(trip_date == ymd("2024-04-25") & route_no == "SB")

dat <- dat0 %>%
  filter(!(trip_date == ymd("2024-04-25") & route_no == "SB"))

tibble(
  min_trip_date = min(dat$trip_date, na.rm = TRUE),
  max_trip_date = max(dat$trip_date, na.rm = TRUE),
  n_rows_in_range_before_drop = nrow(dat0),
  n_rows_removed_2024_04_25_SB = nrow(removed_0425_sb),
  n_rows_final = nrow(dat)
)
# A tibble: 1 × 5
  min_trip_date max_trip_date n_rows_in_range_before_drop n_rows_removed_2024_…¹
  <date>        <date>                              <int>                  <int>
1 2024-01-01    2025-06-30                         753951                    366
# ℹ abbreviated name: ¹​n_rows_removed_2024_04_25_SB
# ℹ 1 more variable: n_rows_final <int>

2.1.3 EDA Tables

2.1.3.1 Derived tables

We create two helper tables to support EDA at different aggregation levels:

  • trip_tbl summarizes one row per trip (useful for route/month/trip-start timing summaries).
  • dist_dat keeps stop-level records with convenience fields (e.g., stop_volume, nonnegative onboard/load factor) for distributions and late-rate calculations.
Code
trip_tbl <- dat %>%
  mutate(trip_key = paste(trip_date, route_no, route_direction, trip_no, sep = "|")) %>%
  group_by(trip_key, trip_date, route_no, route_direction) %>%
  arrange(Stop_Sequence, .by_group = TRUE) %>%
  summarise(
    n_stops = n_distinct(stop_id),
    trip_boardings  = sum(boarding_wake, na.rm = TRUE),
    trip_alightings = sum(alightings_wake, na.rm = TRUE),
    trip_total_events = sum(boarding_wake + alightings_wake, na.rm = TRUE),
    trip_start_departure = first(actual_departure_dt),
    is_weekday = first(is_weekday),
    month = floor_date(first(trip_date), "month"),
    .groups = "drop"
  ) %>%
  mutate(
    day_type = if_else(is_weekday, "Weekday", "Weekend"),
    dep_hour = hour(trip_start_departure)
  )

dist_dat <- dat %>%
  mutate(
    boarding_wake = coalesce(boarding_wake, 0),
    alightings_wake = coalesce(alightings_wake, 0),
    stop_volume = boarding_wake + alightings_wake,
    dwell_min = act_dwell_min,
    onboard_nonneg = pmax(onboard, 0),
    load_factor_nonneg = onboard_nonneg / Vessel_Capacity
  )

2.1.3.2 Route overview (Weekday vs Weekend)

Route-level summaries compare weekday and weekend patterns to distinguish commute-driven demand from leisure and discretionary travel. We report total trips, total boardings, and average boardings per trip by route.

Code
route_weekpart <- trip_tbl %>%
  group_by(route_no, day_type) %>%
  summarise(
    total_trips = n(),
    total_boardings = sum(trip_boardings, na.rm = TRUE),
    .groups = "drop"
  )

route_avg_weekpart <- trip_tbl %>%
  group_by(route_no, day_type) %>%
  summarise(
    avg_boardings_per_trip = mean(trip_boardings, na.rm = TRUE),
    n_trips = n(),
    .groups = "drop"
  )

route_summary_by_stops <- trip_tbl %>%
  group_by(route_no, route_direction, n_stops) %>%
  summarise(
    n_trips = n(),
    total_boardings = sum(trip_boardings, na.rm = TRUE),
    avg_boardings_per_trip = total_boardings / n_trips,
    .groups = "drop"
  ) %>%
  arrange(route_no, route_direction, n_stops)

route_weekpart %>% arrange(desc(total_boardings))
# A tibble: 16 × 4
   route_no day_type total_trips total_boardings
   <chr>    <chr>          <int>           <dbl>
 1 ER       Weekday        24863         2730172
 2 AS       Weekday        15047         1533359
 3 ER       Weekend         7541         1302644
 4 SV       Weekday        14933          919075
 5 SB       Weekday        18874          709380
 6 AS       Weekend         4732          692798
 7 RW       Weekday        11950          655102
 8 SG       Weekday        19744          653251
 9 SB       Weekend         5189          482218
10 RW       Weekend         4582          334244
11 SV       Weekend         4617          326769
12 SG       Weekend         6534          253360
13 GI       Weekend         1292           49853
14 GI       Weekday          444           22963
15 RR       Weekend           84            8951
16 RR       Weekday            6             900
Code
route_summary_by_stops
# A tibble: 28 × 6
   route_no route_direction n_stops n_trips total_boardings
   <chr>    <chr>             <int>   <int>           <dbl>
 1 AS       NB                    7    9705         1014103
 2 AS       SB                    7   10074         1212054
 3 ER       NB                    7   16037         2012111
 4 ER       SB                    7   16367         2020705
 5 GI       NB                    2     897           30485
 6 GI       SB                    2     839           42331
 7 RR       NB                    3      45            4085
 8 RR       SB                    3      45            5766
 9 RW       NB                    3    8511          505836
10 RW       SB                    3    8021          483510
# ℹ 18 more rows
# ℹ 1 more variable: avg_boardings_per_trip <dbl>

2.1.3.3 Stop totals

Stop totals aggregate boardings and alightings across the analysis period to identify the highest-activity stops. These rankings help interpret which terminals or transfer points dominate system usage.

Code
stop_totals <- dat %>%
  group_by(stop_id, stop_name) %>%
  summarise(
    total_boardings = sum(boarding_wake, na.rm = TRUE),
    total_alightings = sum(alightings_wake, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_boardings))

stop_totals
# A tibble: 25 × 4
   stop_id stop_name           total_boardings total_alightings
     <dbl> <chr>                         <dbl>            <dbl>
 1      87 Wall St/Pier 11             2163272          2562409
 2      17 East 34th Street            1404634          1566502
 3      20 Dumbo/Fulton Ferry           990583           884817
 4      19 North Williamsburg           560850           520351
 5     113 East 90th St                 533628           489918
 6      88 Rockaway                     438062           435374
 7      18 Greenpoint                   431993           388197
 8       4 Hunters Point South          417259           397537
 9      89 Astoria                      414117           342461
10     137 St. George                   387301           331679
# ℹ 15 more rows

2.1.3.4 Monthly patterns

Monthly summaries provide a simple view of seasonality and demand shifts over time. We summarize trips and ridership by month, split by weekday vs weekend, and also compute average boardings per trip.

Code
monthly_weekpart <- trip_tbl %>%
  filter(month >= ymd("2024-01-01"), month <= ymd("2025-06-01")) %>%
  group_by(month, day_type) %>%
  summarise(
    total_trips = n(),
    total_boardings = sum(trip_boardings, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(month, day_type)

monthly_avg_weekpart <- trip_tbl %>%
  filter(month >= ymd("2024-01-01"), month <= ymd("2025-06-01")) %>%
  group_by(month, day_type) %>%
  summarise(
    avg_boardings_per_trip = mean(trip_boardings, na.rm = TRUE),
    n_trips = n(),
    .groups = "drop"
  ) %>%
  arrange(month, day_type)

monthly_weekpart
# A tibble: 36 × 4
   month      day_type total_trips total_boardings
   <date>     <chr>          <int>           <dbl>
 1 2024-01-01 Weekday         5912          254505
 2 2024-01-01 Weekend         1176           49751
 3 2024-02-01 Weekday         5547          250456
 4 2024-02-01 Weekend         1176           76538
 5 2024-03-01 Weekday         5671          307774
 6 2024-03-01 Weekend         1610          142854
 7 2024-04-01 Weekday         6015          392494
 8 2024-04-01 Weekend         1799          179812
 9 2024-05-01 Weekday         6290          485912
10 2024-05-01 Weekend         1982          233380
# ℹ 26 more rows
Code
monthly_avg_weekpart
# A tibble: 36 × 4
   month      day_type avg_boardings_per_trip n_trips
   <date>     <chr>                     <dbl>   <int>
 1 2024-01-01 Weekday                    43.0    5912
 2 2024-01-01 Weekend                    42.3    1176
 3 2024-02-01 Weekday                    45.2    5547
 4 2024-02-01 Weekend                    65.1    1176
 5 2024-03-01 Weekday                    54.3    5671
 6 2024-03-01 Weekend                    88.7    1610
 7 2024-04-01 Weekday                    65.3    6015
 8 2024-04-01 Weekend                   100.0    1799
 9 2024-05-01 Weekday                    77.3    6290
10 2024-05-01 Weekend                   118.     1982
# ℹ 26 more rows

2.1.3.5 Time-of-day patterns

Hourly patterns are summarized in two ways: (1) trip-level departure hour, and (2) stop-level departure hour. This helps distinguish “when trips start” from “when stop activity occurs” across the day.

Code
hour_weekpart <- trip_tbl %>%
  filter(!is.na(dep_hour)) %>%
  mutate(dep_hour = as.integer(dep_hour)) %>%
  group_by(dep_hour, day_type) %>%
  summarise(
    total_boardings = sum(trip_boardings, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(dep_hour, day_type)

hour_breaks <- sort(unique(hour_weekpart$dep_hour))

stop_hour_weekpart <- dat %>%
  filter(!is.na(actual_departure_dt)) %>%
  mutate(
    dep_hour = hour(actual_departure_dt),
    day_type = if_else(is_weekday, "Weekday", "Weekend")
  ) %>%
  filter(!is.na(dep_hour)) %>%
  group_by(dep_hour, day_type) %>%
  summarise(
    n_stop_departures = n(),
    total_boardings = sum(boarding_wake, na.rm = TRUE),
    avg_boardings_per_stop = total_boardings / n_stop_departures,
    .groups = "drop"
  ) %>%
  arrange(dep_hour, day_type)

stop_hour_breaks <- sort(unique(stop_hour_weekpart$dep_hour))

stop_hour_weekpart
# A tibble: 35 × 5
   dep_hour day_type n_stop_departures total_boardings avg_boardings_per_stop
      <int> <chr>                <int>           <dbl>                  <dbl>
 1        5 Weekday               3952           36452                   9.22
 2        6 Weekday              20969          150588                   7.18
 3        7 Weekday              34867          481665                  13.8 
 4        7 Weekend                226            2073                   9.17
 5        8 Weekday              35072          661987                  18.9 
 6        8 Weekend               2531           18147                   7.17
 7        9 Weekday              36180          405600                  11.2 
 8        9 Weekend               8745          107551                  12.3 
 9       10 Weekday              29757          310091                  10.4 
10       10 Weekend               9459          161616                  17.1 
# ℹ 25 more rows

2.1.3.6 Late rate patterns

Late rate is the primary reliability outcome in this EDA. We report overall shares of late arrivals and early departures, then summarize late rates by route, month, weekday vs weekend, and departure hour.

Code
arr_status_tab <- dist_dat %>%
  filter(!is.na(arr_status)) %>%
  count(arr_status, name = "n") %>%
  mutate(pct = n / sum(n)) %>%
  arrange(desc(n))

dep_status_tab <- dist_dat %>%
  filter(!is.na(dep_status)) %>%
  count(dep_status, name = "n") %>%
  mutate(pct = n / sum(n)) %>%
  arrange(desc(n))

arr_status_tab
# A tibble: 2 × 3
  arr_status      n    pct
  <chr>       <int>  <dbl>
1 Not_Late   558419 0.931 
2 Late        41247 0.0688
Code
dep_status_tab
# A tibble: 2 × 3
  dep_status      n     pct
  <chr>       <int>   <dbl>
1 Not_Early  599844 0.998  
2 Early        1155 0.00192
Code
late_dat <- dist_dat %>%
  filter(!is.na(arr_status)) %>%
  mutate(
    is_late = arr_status == "Late",
    month = floor_date(trip_date, "month"),
    day_type = if_else(is_weekday, "Weekday", "Weekend"),
    dep_hour = hour(actual_departure_dt)
  )

late_by_route <- late_dat %>%
  group_by(route_no) %>%
  summarise(n = n(), pct_late = mean(is_late), .groups = "drop") %>%
  arrange(desc(pct_late))

late_by_month <- late_dat %>%
  group_by(month, day_type) %>%
  summarise(n = n(), pct_late = mean(is_late), .groups = "drop") %>%
  arrange(month, day_type)

late_by_daytype <- late_dat %>%
  group_by(day_type) %>%
  summarise(n = n(), pct_late = mean(is_late), .groups = "drop")

late_by_hour <- late_dat %>%
  filter(!is.na(dep_hour)) %>%
  group_by(dep_hour, day_type) %>%
  summarise(n = n(), pct_late = mean(is_late), .groups = "drop") %>%
  arrange(dep_hour, day_type)

late_hour_breaks <- sort(unique(late_by_hour$dep_hour))

late_by_route %>% head(30)
# A tibble: 8 × 3
  route_no      n pct_late
  <chr>     <int>    <dbl>
1 RW        31870   0.185 
2 SV        95635   0.0900
3 RR          172   0.0872
4 GI         1622   0.0660
5 SB       111056   0.0648
6 ER       191089   0.0578
7 AS       116669   0.0526
8 SG        51553   0.0435
Code
late_by_month
# A tibble: 36 × 4
   month      day_type     n pct_late
   <date>     <chr>    <int>    <dbl>
 1 2024-01-01 Weekday  24647   0.0394
 2 2024-01-01 Weekend   4960   0.0756
 3 2024-02-01 Weekday  23016   0.0374
 4 2024-02-01 Weekend   5052   0.0954
 5 2024-03-01 Weekday  23804   0.0436
 6 2024-03-01 Weekend   7033   0.152 
 7 2024-04-01 Weekday  25555   0.0492
 8 2024-04-01 Weekend   8033   0.0919
 9 2024-05-01 Weekday  25981   0.0609
10 2024-05-01 Weekend   8548   0.0930
# ℹ 26 more rows
Code
late_by_daytype
# A tibble: 2 × 3
  day_type      n pct_late
  <chr>     <int>    <dbl>
1 Weekday  451367   0.0607
2 Weekend  148299   0.0934
Code
late_by_hour
# A tibble: 34 × 4
   dep_hour day_type     n pct_late
      <int> <chr>    <int>    <dbl>
 1        5 Weekday   2549   0.0118
 2        6 Weekday  14253   0.0149
 3        7 Weekday  25748   0.0116
 4        8 Weekday  26783   0.0286
 5        8 Weekend   1582   0.0107
 6        9 Weekday  27496   0.0293
 7        9 Weekend   6560   0.0194
 8       10 Weekday  24007   0.0276
 9       10 Weekend   6970   0.0191
10       11 Weekday  18792   0.0188
# ℹ 24 more rows

2.1.4 EDA Figures

2.1.4.1 Route-level figures

These figures visualize route-level differences in service volume and demand. Weekday vs weekend splits highlight routes that are more commute-oriented versus leisure-oriented.

Code
ggplot(route_weekpart,
       aes(x = reorder(route_no, total_boardings, FUN = sum),
           y = total_boardings, fill = day_type)) +
  geom_col(position = "stack") +
  coord_flip() +
  labs(x = "Route", y = "Total boardings", fill = NULL,
       title = "Total ridership by route (Weekday vs Weekend)")

Total boardings are highest on East River (ER) and Astoria (AS).

Code
ggplot(route_avg_weekpart,
       aes(x = reorder(route_no, avg_boardings_per_trip, FUN = max),
           y = avg_boardings_per_trip, fill = day_type)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.75) +
  coord_flip() +
  labs(x = "Route", y = "Avg boardings per trip", fill = NULL,
       title = "Average ridership per trip by route (Weekday vs Weekend)")

Average boardings per trip tend to be higher on weekends than weekdays. For the seasonal routes, weekday averages can appear high because weekday service days are holiday/peak-operation days.

2.1.4.2 Monthly figures

Monthly charts provide a high-level view of seasonality and temporal shifts in trips, total ridership, and average ridership per trip.

Code
ggplot(monthly_weekpart, aes(x = month, y = total_boardings, fill = day_type)) +
  geom_col(position = "stack") +
  scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m") +
  labs(x = NULL, y = "Total boardings", fill = NULL,
       title = "Monthly ridership (Weekday vs Weekend)") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Ridership varies much more strongly by season than trips do, especially on weekends.

2.1.4.3 Time-of-day figures

Time-of-day plots show how ridership and activity vary across hours.

Code
ggplot(hour_weekpart, aes(x = dep_hour, y = total_boardings, fill = day_type)) +
  geom_col(position = "stack") +
  scale_x_continuous(breaks = hour_breaks) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Total ridership by trip departure hour (Weekday vs Weekend)",
       x = "Departure hour", y = "Total boardings", fill = NULL)

On weekdays, stop activity shows morning and evening peaks consistent with commuting patterns. On weekends, higher activity is concentrated from midday through early evening.

2.1.4.4 Distributions

Distribution plots are used as diagnostics to understand skew, outliers, and reasonable ranges for modeling. These figures are intended to guide feature engineering and robustness checks.

Code
ggplot(dist_dat %>% filter(!is.na(stop_volume), stop_volume >= 0, stop_volume <= 350),
       aes(x = stop_volume)) +
  geom_histogram(bins = 50) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Distribution of stop activity (board_wake + alight_wake)",
       x = "stop_volume", y = "Count")

Stop activity is highly right-skewed: most stops have relatively low combined boardings/alightings, while a small number of high-activity stops drive the upper tail. The majority of observations are below ~50 per stop.

2.1.4.5 Late-rate figures

These figures visualize how late rate varies across routes, months, and hours.

Code
ggplot(late_by_route, aes(x = reorder(route_no, pct_late), y = pct_late)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Late rate by route", x = "Route", y = "Share Late (arr_status)")

Rockaway (RW) has the highest late rate by a wide margin relative to other routes. It is the longest route and more exposed to high seasonal peaks and variable marine conditions.

Code
ggplot(late_by_month, aes(x = month, y = pct_late, color = day_type)) +
  geom_line() +
  geom_point() +
  scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Late rate by month", x = NULL, y = "Share Late", color = NULL) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Late rates are generally higher in summer than winter, and higher on weekends than weekdays.

Code
ggplot(late_by_hour, aes(x = dep_hour, y = pct_late, color = day_type)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = late_hour_breaks) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Late rate by stop departure hour", x = "Hour", y = "Share Late", color = NULL)

Late rates tend to peak in the afternoon.

2.1.4.6 Late propagation proxy figures

To proxy delay propagation along a trip, we relate a stop’s lateness to conditions at the previous stop. We show late rates binned by previous-stop dwell time and by previous-stop passenger activity.

Code
prev_dat <- dist_dat %>%
  arrange(trip_date, route_no, route_direction, trip_no, Stop_Sequence) %>%
  group_by(trip_date, route_no, route_direction, trip_no) %>%
  mutate(
    prev_dwell_min   = lag(dwell_min),
    prev_stop_volume = lag(stop_volume)
  ) %>%
  ungroup() %>%
  filter(!is.na(arr_status),
         !is.na(prev_dwell_min), prev_dwell_min >= 0, prev_dwell_min <= 60,
         !is.na(prev_stop_volume), prev_stop_volume >= 0)
Code
prev_dwell_late_tbl <- prev_dat %>%
  mutate(
    prev_dwell_bin = cut(prev_dwell_min,
                         breaks = c(0, 1, 2, 3, 5, 8, 12, 16, Inf),
                         include.lowest = TRUE)
  ) %>%
  group_by(prev_dwell_bin) %>%
  summarise(n = n(), pct_late = mean(arr_status == "Late", na.rm = TRUE), .groups = "drop")

prev_dwell_late_tbl
# A tibble: 8 × 3
  prev_dwell_bin      n pct_late
  <fct>           <int>    <dbl>
1 [0,1]            2041   0.111 
2 (1,2]           26936   0.108 
3 (2,3]          117033   0.0843
4 (3,5]          239932   0.0633
5 (5,8]           68604   0.0814
6 (8,12]           4537   0.232 
7 (12,16]            98   0.612 
8 (16,Inf]           46   0.978 
Code
ggplot(prev_dwell_late_tbl, aes(x = prev_dwell_bin, y = pct_late)) +
  geom_col() +
  geom_text(aes(label = paste0("n=", scales::comma(n))), vjust = -0.3, size = 3) +
  scale_y_continuous(labels = scales::percent_format(), expand = expansion(mult = c(0, 0.12))) +
  labs(title = "Late rate by previous-stop dwell time bin",
       x = "Previous stop dwell time bin (min)", y = "Late rate (current stop)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Within typical dwell time ranges, the relationship between previous-stop dwell and current-stop late rate is relatively weak. Once previous-stop dwell becomes unusually long (e.g., > ~8 minutes), late rates increase sharply.

Code
prev_vol_late_tbl <- prev_dat %>%
  mutate(
    prev_vol_bin = cut(prev_stop_volume,
                       breaks = c(0, 10, 20, 30, 50, 100, 150, 200, Inf),
                       include.lowest = TRUE)
  ) %>%
  group_by(prev_vol_bin) %>%
  summarise(n = n(), pct_late = mean(arr_status == "Late", na.rm = TRUE), .groups = "drop")

prev_vol_late_tbl
# A tibble: 8 × 3
  prev_vol_bin      n pct_late
  <fct>         <int>    <dbl>
1 [0,10]       134481   0.0540
2 (10,20]      105747   0.0544
3 (20,30]       71734   0.0677
4 (30,50]       78063   0.0873
5 (50,100]      56688   0.129 
6 (100,150]     11025   0.220 
7 (150,200]      1332   0.335 
8 (200,Inf]       157   0.573 
Code
ggplot(prev_vol_late_tbl, aes(x = prev_vol_bin, y = pct_late)) +
  geom_col() +
  geom_text(aes(label = paste0("n=", scales::comma(n))), vjust = -0.3, size = 3) +
  scale_y_continuous(labels = scales::percent_format(), expand = expansion(mult = c(0, 0.12))) +
  labs(title = "Late rate by previous-stop activity bin (board_wake + alight_wake)",
       x = "Previous stop activity bin", y = "Late rate (current stop)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

When previous-stop activity is moderate, late rates change gradually; beyond higher activity levels (e.g., > ~50), late rates rise more noticeably.

Taken together, these patterns suggest that delay is driven less by commuter peaks and more by high-demand recreational periods, particularly on routes serving waterfront leisure destinations.

Code
export_dir <- "../exploratory_code/ridership_eda"
readr::write_csv(dat, file.path(export_dir, "dat_stops_2024-01-01_2025-06-30_drop0425SB.csv.gz"))
saveRDS(dat, file.path(export_dir, "dat_stops_2024-01-01_2025-06-30_drop0425SB.rds"))

2.1.5 Operational Analysis

By: Guangze ‘Simon’ Sun and Yiming ‘Ming’ Cao

This operational EDA complements the associative model by examining how ferry delay is recorded, decomposed, generated, and recovered. The four sections ask what issues are recorded in trip notes, how stop-level delay can be decomposed, where large delays are observed versus generated upstream, and whether inherited delay can be absorbed before the next trip begins.

Code
trip_files <- c(
  "../exploratory_code/ridership_eda/NYCF_trip_data_2024.csv",
  "../exploratory_code/ridership_eda/NYCF_trip_data_2025.01_06.csv"
)
decomp_file <- "../exploratory_code/modeling/stop_level_arrival_decomposition_table_corrected_second_stop.csv"

accent       <- "#1A9988"
gold         <- "#F2C14E"
coral        <- "#D95F59"
blue         <- "#7BA7C7"
muted_palette <- c(
  "Inherited previous-trip delay" = "#1A9988",
  "Carried within-trip delay"     = "#F2C14E",
  "Upstream landside delay"       = "#7BA7C7",
  "Upstream waterside delay"      = "#D95F59"
)
plot_theme <- theme_minimal(base_size = 14) +
  theme(
    plot.title    = element_text(face = "bold", size = 17),
    panel.grid.minor = element_blank(),
    legend.title  = element_blank(),
    axis.text     = element_text(color = "#333333")
  )

fmt_pct <- function(x) percent(x, accuracy = 0.1)
clean_kable <- function(x, caption = NULL, align = NULL) {
  if (is.null(align)) {
    align <- paste0(if_else(map_lgl(x, is.character), "l", "r"), collapse = "")
  }
  kable(x, caption = caption, align = align)
}

2.1.5.1 Text Analysis

Trip notes were extracted from raw NYC Ferry trip-level files and parsed into standardized categories. One trip can have multiple categories, so category shares are presence rates rather than mutually exclusive percentages. Results are descriptive associations, not causal effects.

Code
parse_note_segments <- function(trip_uid, note_text) {
  note_text <- ifelse(is.na(note_text), "", note_text)
  if (str_trim(note_text) == "") return(tibble(trip_uid = integer(), category = character(), description = character()))
  matches <- str_match_all(note_text, "\\[(\\d+)\\]\\s*([^:\\[]+):\\s*([^\\[]*)")[[1]]
  if (nrow(matches) == 0) return(tibble(trip_uid = trip_uid, category = "Unparsed", description = note_text))
  tibble(trip_uid = trip_uid, category = matches[, 3], description = matches[, 4])
}

clean_category <- function(x) {
  out <- str_squish(x) %>% str_replace_all("\\s+", " ") %>% str_to_title()
  dplyr::recode(out, "Uscg Restriction" = "USCG Restriction", "Nyc Ferry" = "NYC Ferry", "Ny Waterway" = "NY Waterway", .default = out)
}

raw_trips <- map_dfr(trip_files, ~ read_csv(.x, show_col_types = FALSE) %>% mutate(source_file = basename(.x))) %>%
  select(-matches("^\\.\\.\\."))

trips <- raw_trips %>%
  mutate(
    trip_uid = row_number(),
    route = if ("route_no" %in% names(.)) route_no else NA_character_,
    service_date = if ("trip_date" %in% names(.)) mdy(trip_date) else as.Date(NA),
    note_text = coalesce(notes, ""),
    scheduled_arrival_ts = mdy_hm(scheduled_arrival_dt, quiet = TRUE),
    actual_arrival_ts    = mdy_hm(actual_arrival_dt,    quiet = TRUE),
    arrival_delay_min_raw = as.numeric(difftime(actual_arrival_ts, scheduled_arrival_ts, units = "mins")),
    status_text = str_squish(str_c(coalesce(trip_status, ""), coalesce(trip_status_actual, ""), sep = " | ")),
    cancelled            = str_detect(str_to_lower(status_text), "trip not completed|cancel") | coalesce(`(16) Trip Cancelled`, FALSE),
    partial_completion   = str_detect(str_to_lower(status_text), "partial") | coalesce(`(17) Partially Completed`, FALSE),
    cancel_or_partial_completion = cancelled | partial_completion,
    completed_trip = !cancel_or_partial_completion & (str_detect(str_to_lower(status_text), "completed") | coalesce(`(10)  Completed`, FALSE)),
    valid_delay         = completed_trip & !is.na(arrival_delay_min_raw),
    delayed_5           = valid_delay & arrival_delay_min_raw > 5,
    excessive_delay_10  = valid_delay & arrival_delay_min_raw > 10
  )

note_segments <- map2_dfr(trips$trip_uid, trips$note_text, parse_note_segments) %>%
  mutate(category = clean_category(category), description = str_squish(description)) %>%
  filter(!is.na(category), category != "")

trip_category <- note_segments %>%
  distinct(trip_uid, category) %>%
  left_join(trips %>% select(trip_uid, route, service_date, note_text, arrival_delay_min_raw,
                             delayed_5, excessive_delay_10, cancel_or_partial_completion), by = "trip_uid") %>%
  filter(category != "Completed")

category_presence <- trip_category %>%
  group_by(category) %>%
  summarise(
    n_trips_with_category = n_distinct(trip_uid),
    share_all_trips = n_trips_with_category / nrow(trips),
    share_among_delayed_5 = n_distinct(trip_uid[delayed_5]) / sum(trips$delayed_5, na.rm = TRUE),
    share_among_excessive_delay_10 = n_distinct(trip_uid[excessive_delay_10]) / sum(trips$excessive_delay_10, na.rm = TRUE),
    share_among_cancel_or_partial_completion = n_distinct(trip_uid[cancel_or_partial_completion]) / sum(trips$cancel_or_partial_completion, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(pmax(share_among_excessive_delay_10, share_among_cancel_or_partial_completion, na.rm = TRUE)))

category_rates <- trip_category %>%
  group_by(category) %>%
  summarise(
    n_trips_with_category = n_distinct(trip_uid),
    delayed_5_rate = mean(delayed_5, na.rm = TRUE),
    excessive_delay_10_rate = mean(excessive_delay_10, na.rm = TRUE),
    cancel_or_partial_completion_rate = mean(cancel_or_partial_completion, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(pmax(excessive_delay_10_rate, cancel_or_partial_completion_rate, na.rm = TRUE)))

support_threshold <- if_else(sum(category_rates$n_trips_with_category >= 50) >= 8, 50, 20)

presence_display <- category_presence %>%
  slice_head(n = 12) %>%
  transmute(
    Category = category,
    `All trips` = fmt_pct(share_all_trips),
    `Delay >5` = fmt_pct(share_among_delayed_5),
    `Delay >10` = fmt_pct(share_among_excessive_delay_10),
    `Cancel / partial` = fmt_pct(share_among_cancel_or_partial_completion)
  )

rates_display <- category_rates %>%
  filter(n_trips_with_category >= support_threshold) %>%
  slice_head(n = 12) %>%
  transmute(
    Category = category,
    Trips = n_trips_with_category,
    `Delay >5 rate` = fmt_pct(delayed_5_rate),
    `Delay >10 rate` = fmt_pct(excessive_delay_10_rate),
    `Cancel / partial rate` = fmt_pct(cancel_or_partial_completion_rate)
  )
Code
clean_kable(presence_display, caption = "Category presence among outcome groups", align = "lrrrr")
Category presence among outcome groups
Category All trips Delay >5 Delay >10 Cancel / partial
Passenger Delays 10.8% 39.2% 62.3% 14.5%
Marine Traffic 11.7% 44.0% 34.0% 16.1%
Max Out 1.5% 5.2% 10.2% 0.8%
Partially Completed 0.8% 0.0% 0.0% 10.2%
Vessel Mechanical 0.8% 3.5% 7.7% 2.0%
Bypassed Landing 0.6% 0.0% 0.0% 7.5%
Landing Issue 2.2% 5.5% 6.7% 6.1%
Weather 0.4% 0.7% 1.7% 4.1%
Connectivity Issues 2.5% 2.4% 2.6% 3.7%
Trip Cancelled 0.3% 0.0% 0.0% 3.7%
USCG Restriction 0.2% 0.6% 1.8% 1.8%
Hny 0.3% 1.0% 1.6% 0.4%
Code
clean_kable(rates_display, caption = paste0("Outcome rates when category appears (support threshold n >= ", support_threshold, ")"), align = "lrrrr")
Outcome rates when category appears (support threshold n >= 50)
Category Trips Delay >5 rate Delay >10 rate Cancel / partial rate
Partially Completed 1160 0.0% 0.0% 100.0%
Trip Cancelled 426 0.0% 0.0% 100.0%
Bypassed Landing 881 0.3% 0.0% 96.9%
Weather 589 11.9% 8.0% 79.8%
USCG Restriction 347 19.3% 13.8% 60.2%
Emergency 80 25.0% 17.5% 38.8%
Landing Issue 3066 18.7% 5.8% 22.6%
Vessel Mechanical 1071 33.8% 19.4% 20.8%
Max Out 2167 24.9% 12.7% 4.1%
Connectivity Issues 3573 7.1% 2.0% 11.9%
Hny 392 25.3% 11.0% 11.7%
Marine Traffic 16381 27.9% 5.6% 11.2%

The first table shows which note categories appear most often among delayed or cancelled/partially completed trips. The second table flips the question: when a category appears, how often is that trip associated with delay or cancellation?

Passenger delays and marine traffic are the most visible delay-associated note categories, but they point to different mechanisms. Passenger delays become especially concentrated among delay >10 trips, while marine traffic is more prominent in the broader delay >5 group. Rare categories such as vessel mechanical issues, weather, and USCG restrictions appear less often overall but have high disruption rates when they occur, making them operationally important despite their low frequency.

2.1.5.2 Delay Decomposition

This is a median-based expected-time decomposition. A stop-level delay is decomposed into inherited previous-trip delay, carried within-trip delay, upstream landside delay, upstream waterside delay, and upstream wait when available. Vessel chains are reconstructed by vessel and operational sequence; consecutive trips by the same vessel are linked when they are operationally continuous. This is accounting/decomposition, not causal inference.

Code
decomp <- read_csv(decomp_file, show_col_types = FALSE)
component_cols <- c("inherited_prev_trip_delay", "carried_within_trip_delay", "upstream_landside_delay", "upstream_waterside_delay", "upstream_wait")
display_components <- c("inherited_prev_trip_delay", "carried_within_trip_delay", "upstream_landside_delay", "upstream_waterside_delay")
component_labels <- c(
  inherited_prev_trip_delay = "Inherited previous-trip delay",
  carried_within_trip_delay = "Carried within-trip delay",
  upstream_landside_delay   = "Upstream landside delay",
  upstream_waterside_delay  = "Upstream waterside delay",
  upstream_wait             = "Upstream wait"
)

eligible <- decomp %>%
  mutate(eligible_flag = eligible_arrival_decomp %in% c(TRUE, 1, "TRUE", "true", "1")) %>%
  filter(eligible_flag)

decomp_long <- eligible %>%
  mutate(`Delay >5` = arrival_delay_min > 5, `Delay >10` = arrival_delay_min > 10) %>%
  select(trip_key, stop_id, stop_sequence, `Delay >5`, `Delay >10`, all_of(component_cols)) %>%
  pivot_longer(all_of(component_cols), names_to = "component", values_to = "component_minutes") %>%
  mutate(Component = dplyr::recode(component, !!!component_labels), displayed = component %in% display_components)

decomp_groups <- bind_rows(
  decomp_long %>% filter(`Delay >5`)  %>% mutate(delay_group = "Delay >5"),
  decomp_long %>% filter(`Delay >10`) %>% mutate(delay_group = "Delay >10")
)

component_summary <- decomp_groups %>%
  group_by(delay_group, component, Component, displayed) %>%
  summarise(avg_min = mean(component_minutes, na.rm = TRUE),
            positive_avg_min = pmax(avg_min, 0), .groups = "drop") %>%
  group_by(delay_group) %>%
  mutate(share = if_else(displayed, positive_avg_min / sum(positive_avg_min[displayed], na.rm = TRUE), NA_real_)) %>%
  ungroup()

component_table_display <- component_summary %>%
  select(delay_group, Component, avg_min, share) %>%
  pivot_wider(names_from = delay_group, values_from = c(avg_min, share), names_glue = "{delay_group}_{.value}") %>%
  mutate(Component = factor(Component, levels = c("Inherited previous-trip delay", "Carried within-trip delay",
                                                   "Upstream landside delay", "Upstream waterside delay", "Upstream wait"))) %>%
  arrange(Component) %>%
  mutate(Component = as.character(Component)) %>%
  transmute(
    Component,
    `Delay >5 avg min`  = round(`Delay >5_avg_min`, 2),
    `Delay >5 share`    = if_else(is.na(`Delay >5_share`),  "—", fmt_pct(`Delay >5_share`)),
    `Delay >10 avg min` = round(`Delay >10_avg_min`, 2),
    `Delay >10 share`   = if_else(is.na(`Delay >10_share`), "—", fmt_pct(`Delay >10_share`))
  )

plot_components <- component_summary %>%
  filter(displayed) %>%
  mutate(Component = factor(Component, levels = rev(c("Inherited previous-trip delay", "Carried within-trip delay",
                                                       "Upstream landside delay", "Upstream waterside delay"))))

decomp_bar <- ggplot(plot_components, aes(x = delay_group, y = positive_avg_min, fill = Component)) +
  geom_col(width = 0.65) +
  scale_fill_manual(values = muted_palette) +
  labs(title = "Average Delay Components", x = NULL, y = "Average component minutes") +
  plot_theme
Code
print(decomp_bar)

Average component minutes among stop observations with delay greater than 5 or 10 minutes.
Code
clean_kable(component_table_display, caption = "Average delay components and displayed-component shares", align = "lrrrr")
Average delay components and displayed-component shares
Component Delay >5 avg min Delay >5 share Delay >10 avg min Delay >10 share
Inherited previous-trip delay 6.33 64.7% 11.16 68.4%
Carried within-trip delay 2.20 22.4% 3.45 21.1%
Upstream landside delay 0.56 5.7% 0.96 5.9%
Upstream waterside delay 0.71 7.2% 0.74 4.5%
Upstream wait 0.02 0.01

Average component minutes among stop observations with delay greater than 5 or 10 minutes.

The decomposition shows that most delayed stop observations are not mainly created at the current stop or segment. Inherited previous-trip delay is the largest component in both groups, accounting for 64.7% of displayed component minutes among delay >5 observations and 68.4% among delay >10 observations. Carried within-trip delay is the second-largest component, contributing about one-fifth of displayed delay in both groups. Newly generated upstream landside and waterside components are much smaller on average.

The severity difference is also important. Delay >10 observations have much larger inherited delay than delay >5 observations, while the shares of landside and waterside delay remain comparatively small. This suggests that severe delay is often the result of propagation through the vessel chain rather than a local event at the delayed stop.

2.1.5.3 Delay Genesis

This section traces delayed stop observations backward through the same vessel chain. Targets use arrival_delay_min > 5 and arrival_delay_min > 10. A major source is a single landside event greater than 5 minutes or a single waterside event greater than 5 minutes.

Code
stop_label      <- if ("stop_name" %in% names(decomp)) "stop_name" else "stop_id"
prev_stop_label <- if ("prev_stop_name" %in% names(decomp)) "prev_stop_name" else "prev_stop_id"
vessel_col      <- if ("vessel_id" %in% names(decomp)) "vessel_id" else if ("vessel" %in% names(decomp)) "vessel" else NA_character_
if (is.na(vessel_col)) stop("A vessel identifier is required for delay genesis tracing.")

timeline0 <- decomp %>%
  mutate(
    scheduled_ts  = suppressWarnings(as.POSIXct(scheduled_time_original, tz = "America/New_York")),
    eligible_flag = eligible_arrival_decomp %in% c(TRUE, 1, "TRUE", "true", "1")
  )

trip_bounds <- timeline0 %>%
  group_by(.data[[vessel_col]], trip_key) %>%
  summarise(trip_start_ts = min(scheduled_ts, na.rm = TRUE),
            trip_end_ts   = max(scheduled_ts, na.rm = TRUE), .groups = "drop") %>%
  arrange(.data[[vessel_col]], trip_start_ts, trip_end_ts, trip_key) %>%
  group_by(.data[[vessel_col]]) %>%
  mutate(
    prev_trip_end_ts = lag(trip_end_ts),
    trip_gap_min     = as.numeric(difftime(trip_start_ts, prev_trip_end_ts, units = "mins")),
    chain_break      = is.na(trip_gap_min) | trip_gap_min > 30,
    vessel_chain_id  = paste(.data[[vessel_col]], cumsum(chain_break), sep = "_")
  ) %>%
  ungroup() %>%
  select(.data[[vessel_col]], trip_key, vessel_chain_id, trip_start_ts, trip_gap_min)

timeline <- timeline0 %>%
  left_join(trip_bounds, by = c(vessel_col, "trip_key")) %>%
  mutate(
    positive_landside_delay  = pmax(replace_na(upstream_landside_delay,  0), 0),
    positive_waterside_delay = pmax(replace_na(upstream_waterside_delay, 0), 0),
    observed_landing  = coalesce(as.character(.data[[stop_label]]),      as.character(stop_id)),
    source_from_landing = coalesce(as.character(.data[[prev_stop_label]]), as.character(prev_stop_id), "Origin"),
    source_to_landing = observed_landing,
    observed_segment  = paste(source_from_landing, "->", observed_landing),
    source_segment    = paste(source_from_landing, "->", observed_landing),
    row_top_generated_delay = pmax(positive_landside_delay, positive_waterside_delay),
    row_top_source_type = case_when(
      row_top_generated_delay <= 0 ~ NA_character_,
      positive_landside_delay >= positive_waterside_delay ~ "Landside",
      TRUE ~ "Waterside"
    ),
    major_landside  = positive_landside_delay  > 5,
    major_waterside = positive_waterside_delay > 5
  ) %>%
  arrange(vessel_chain_id, scheduled_ts, trip_start_ts, trip_key, stop_sequence) %>%
  group_by(vessel_chain_id) %>%
  mutate(
    row_order_within_chain       = row_number(),
    cumulative_positive_generated = cumsum(positive_landside_delay + positive_waterside_delay),
    cumulative_major_sources      = cumsum(major_landside) + cumsum(major_waterside)
  ) %>%
  ungroup()

top_source_by_chain <- function(chain_df) {
  chain_df <- chain_df %>% arrange(row_order_within_chain)
  best_delay <- numeric(nrow(chain_df)); best_type <- character(nrow(chain_df))
  best_landing <- character(nrow(chain_df)); best_segment <- character(nrow(chain_df))
  current_best <- 0; current_type <- NA_character_; current_landing <- NA_character_; current_segment <- NA_character_
  for (i in seq_len(nrow(chain_df))) {
    if (!is.na(chain_df$row_top_generated_delay[i]) && chain_df$row_top_generated_delay[i] > current_best) {
      current_best    <- chain_df$row_top_generated_delay[i]
      current_type    <- chain_df$row_top_source_type[i]
      current_landing <- chain_df$source_from_landing[i]
      current_segment <- chain_df$source_segment[i]
    }
    best_delay[i] <- current_best; best_type[i] <- current_type
    best_landing[i] <- current_landing; best_segment[i] <- current_segment
  }
  chain_df %>% mutate(
    top_source_generated_minutes = best_delay,
    top_source_type    = na_if(best_type, ""),
    top_source_landing = na_if(best_landing, ""),
    top_source_segment = na_if(best_segment, "")
  )
}

timeline <- timeline %>% group_split(vessel_chain_id) %>% map_dfr(top_source_by_chain)

observed_landing_den  <- timeline %>% filter(eligible_flag) %>% count(observed_landing,    name = "Observed total observations")
observed_segment_den  <- timeline %>% filter(eligible_flag) %>% count(observed_segment,    name = "Observed total observations")
generated_landing_den <- timeline %>% filter(eligible_flag) %>% count(source_from_landing, name = "Generated total source observations")
generated_segment_den <- timeline %>% filter(eligible_flag) %>% count(source_segment,      name = "Generated total source observations")

target_base <- timeline %>%
  filter(eligible_flag, arrival_delay_min > 5) %>%
  transmute(
    target_id_base = row_number(),
    observed_landing, observed_segment, arrival_delay_min,
    cumulative_positive_generated,
    num_major_sources        = cumulative_major_sources,
    top_source_generated_minutes,
    top_source_type, top_source_landing, top_source_segment,
    top_source_share = if_else(cumulative_positive_generated > 0,
                               top_source_generated_minutes / cumulative_positive_generated, NA_real_)
  )

genesis_targets <- bind_rows(
  target_base %>% mutate(delay_group = "delay_gt5",  delay_group_label = "Delay >5"),
  target_base %>% filter(arrival_delay_min > 10) %>% mutate(delay_group = "delay_gt10", delay_group_label = "Delay >10")
) %>%
  mutate(
    target_id    = paste(delay_group, target_id_base, sep = "_"),
    genesis_type = case_when(
      cumulative_positive_generated <= 0 | is.na(cumulative_positive_generated) ~ "No traced positive source",
      num_major_sources == 1 & top_source_share >= 0.5 ~ "Single major",
      num_major_sources >= 2 ~ "Multiple major",
      TRUE ~ "Cumulative"
    )
  )

classified_targets <- genesis_targets %>% filter(genesis_type %in% c("Single major", "Multiple major", "Cumulative"))

genesis_type_summary <- classified_targets %>%
  count(delay_group, delay_group_label, genesis_type, name = "n_targets") %>%
  group_by(delay_group) %>%
  mutate(share_of_classified_targets = n_targets / sum(n_targets)) %>%
  ungroup()

genesis_type_bar <- ggplot(genesis_type_summary, aes(x = delay_group_label, y = share_of_classified_targets, fill = genesis_type)) +
  geom_col(width = 0.65) +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("Single major" = accent, "Multiple major" = gold, "Cumulative" = blue)) +
  labs(title = "Delay Genesis Type", x = NULL, y = "Share of classified delayed observations") +
  plot_theme

top_source_event_summary <- classified_targets %>%
  count(delay_group, delay_group_label, top_source_type, name = "n_targets") %>%
  group_by(delay_group) %>%
  mutate(share_of_targets = n_targets / sum(n_targets)) %>%
  ungroup()

top_source_event_plot <- ggplot(top_source_event_summary, aes(x = delay_group_label, y = share_of_targets, fill = top_source_type)) +
  geom_col(width = 0.65) +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("Landside" = accent, "Waterside" = coral)) +
  labs(title = "Top Source Event Type", x = NULL, y = "Share of classified delayed observations") +
  plot_theme

make_location_tables <- function(geography = c("landing", "segment")) {
  geography   <- match.arg(geography)
  obs_col     <- if (geography == "landing") "observed_landing" else "observed_segment"
  gen_col     <- if (geography == "landing") "top_source_landing" else "top_source_segment"
  obs_den     <- if (geography == "landing") observed_landing_den else observed_segment_den
  gen_den     <- if (geography == "landing") generated_landing_den else generated_segment_den
  obs_den_col <- if (geography == "landing") "observed_landing" else "observed_segment"
  gen_den_col <- if (geography == "landing") "source_from_landing" else "source_segment"

  observed <- genesis_targets %>%
    filter(delay_group == "delay_gt5") %>%
    count(Location = .data[[obs_col]], name = "Observed delayed count") %>%
    left_join(rename(obs_den, Location = all_of(obs_den_col)), by = "Location") %>%
    mutate(`Observed delay rate` = `Observed delayed count` / `Observed total observations`)

  generated <- classified_targets %>%
    filter(delay_group == "delay_gt5", !is.na(.data[[gen_col]])) %>%
    count(Location = .data[[gen_col]], name = "Generated linked delay count") %>%
    left_join(rename(gen_den, Location = all_of(gen_den_col)), by = "Location") %>%
    mutate(`Generated source-linked rate` = `Generated linked delay count` / `Generated total source observations`)

  obs_support <- if (sum(observed$`Observed total observations` >= 100, na.rm = TRUE) >= 5) 100 else 50
  gen_support <- if (sum(generated$`Generated total source observations` >= 100, na.rm = TRUE) >= 5) 100 else 50

  count_table <- full_join(
    observed  %>% arrange(desc(`Observed delayed count`))       %>% mutate(Rank = row_number()) %>% slice_head(n = 5) %>% select(Rank, `Observed delay location` = Location, `Observed delayed count`),
    generated %>% arrange(desc(`Generated linked delay count`)) %>% mutate(Rank = row_number()) %>% slice_head(n = 5) %>% select(Rank, `Generated source location` = Location, `Generated linked delay count`),
    by = "Rank"
  ) %>% arrange(Rank)

  rate_table <- full_join(
    observed  %>% filter(`Observed total observations`        >= obs_support) %>% arrange(desc(`Observed delay rate`),         desc(`Observed delayed count`))       %>% mutate(Rank = row_number()) %>% slice_head(n = 5) %>% select(Rank, `Observed delay location` = Location, `Observed delayed count`, `Observed total observations`, `Observed delay rate`),
    generated %>% filter(`Generated total source observations` >= gen_support) %>% arrange(desc(`Generated source-linked rate`), desc(`Generated linked delay count`)) %>% mutate(Rank = row_number()) %>% slice_head(n = 5) %>% select(Rank, `Generated source location` = Location, `Generated linked delay count`, `Generated total source observations`, `Generated source-linked rate`),
    by = "Rank"
  ) %>% arrange(Rank)

  list(count = count_table, rate = rate_table)
}

landing_gt5 <- make_location_tables("landing")
segment_gt5 <- make_location_tables("segment")
Code
print(genesis_type_bar)

Genesis types among classified delayed observations.

Most delayed observations are classified as cumulative rather than being explained by one large source event. This is especially true for the broader delay >5 group. For delay >10 observations, the share of single-major and multiple-major cases increases, indicating that severe delays are more likely to involve at least one large upstream landside or waterside increment.

Code
print(top_source_event_plot)

Top source event type for classified delayed observations.

For delay >5 observations, top source events are roughly balanced, with waterside sources slightly more prominent. For delay >10 observations, the split shifts more toward landside sources, suggesting that the most severe traced delays may be more connected to landing, dwell, boarding, or turnaround-related processes at the upstream source location.

Observed versus generated locations. The observed-location tables show where delays appear if we only look at delayed arrivals. The generated-source tables show where the top traced source events occur after applying the decomposition logic.

Code
format_rate_table <- function(x) {
  x %>% mutate(
    `Observed delay rate`       = fmt_pct(`Observed delay rate`),
    `Generated source-linked rate` = fmt_pct(`Generated source-linked rate`)
  )
}

clean_kable(landing_gt5$count %>% rename(`Observed delay landing` = `Observed delay location`, `Generated source landing` = `Generated source location`), caption = "Delay >5 landing comparison by count", align = "rlrlr")
Delay >5 landing comparison by count
Rank Observed delay landing Observed delayed count Generated source landing Generated linked delay count
1 Wall St/Pier 11 5565 Wall St/Pier 11 9898
2 East 34th Street 5068 East 34th Street 3448
3 Dumbo/Fulton Ferry 3330 Dumbo/Fulton Ferry 2851
4 Sunset Park/BAT 2774 East 90th St 2588
5 Greenpoint 2390 Hunters Point South 2038
Code
clean_kable(format_rate_table(landing_gt5$rate) %>% rename(`Observed delay landing` = `Observed delay location`, `Generated source landing` = `Generated source location`), caption = "Delay >5 landing comparison by local rate", align = "rlrrlrrr")
Delay >5 landing comparison by local rate
Rank Observed delay landing Observed delayed count Observed total observations Observed delay rate Generated source landing Generated linked delay count Generated total source observations Generated source-linked rate
1 Rockaway 869 7693 11.3% Rockaway 2023 8173 24.8%
2 Ferry Point Park 1042 9265 11.2% Corlears Hook 1688 8445 20.0%
3 Soundview 1929 18883 10.2% Wall St/Pier 11 9898 63024 15.7%
4 Corlears Hook 853 8584 9.9% Hunters Point South 2038 15911 12.8%
5 Sunset Park/BAT 2774 29165 9.5% Bay Ridge 860 8364 10.3%
Code
clean_kable(segment_gt5$count %>% rename(`Observed delay segment` = `Observed delay location`, `Generated source segment` = `Generated source location`), caption = "Delay >5 segment comparison by count", align = "rlrlr")
Delay >5 segment comparison by count
Rank Observed delay segment Observed delayed count Generated source segment Generated linked delay count
1 Wall St/Pier 11 -> Dumbo/Fulton Ferry 1810 Wall St/Pier 11 -> Dumbo/Fulton Ferry 4423
2 East 34th Street -> Greenpoint 1577 Wall St/Pier 11 -> Stuyvesant Cove 2694
3 Dumbo/Fulton Ferry -> Wall St/Pier 11 1570 Hunters Point South -> East 34th Street 2038
4 Greenpoint -> North Williamsburg 1426 Rockaway -> Sunset Park/BAT 2020
5 North Williamsburg -> South Williamsburg 1306 Corlears Hook -> Dumbo/Fulton Ferry 1688
Code
clean_kable(format_rate_table(segment_gt5$rate) %>% rename(`Observed delay segment` = `Observed delay location`, `Generated source segment` = `Generated source location`), caption = "Delay >5 segment comparison by local rate", align = "rlrrlrrr")
Delay >5 segment comparison by local rate
Rank Observed delay segment Observed delayed count Observed total observations Observed delay rate Generated source segment Generated linked delay count Generated total source observations Generated source-linked rate
1 Sunset Park/BAT -> Wall St/Pier 11 1066 8148 13.1% Wall St/Pier 11 -> Stuyvesant Cove 2694 9474 28.4%
2 East 90th St -> Soundview 1170 9498 12.3% Bay Ridge -> Gov. Island/Yankee Pier 73 265 27.5%
3 Rockaway -> Sunset Park/BAT 961 8133 11.8% Rockaway -> Sunset Park/BAT 2020 8133 24.8%
4 Sunset Park/BAT -> Red Hook/Atlantic Basin 461 4009 11.5% Corlears Hook -> Dumbo/Fulton Ferry 1688 8445 20.0%
5 Soundview -> Ferry Point Park 1042 9265 11.2% Wall St/Pier 11 -> Dumbo/Fulton Ferry 4423 24237 18.2%

The count-based location tables show that the places where delay is observed are not always the same as the places where delay is generated. Wall St/Pier 11 and East 34th Street are major observed delay locations by count, given their central role in the network. However, the generated-source rankings shift attention upstream to locations such as East 90th St and Hunters Point South. The rate-based tables identify locations that are delay-prone relative to their own activity volume — useful for diagnosing reliability risk at a specific landing or segment.

2.1.5.4 Delay Recovery

This section focuses only on trip pairs where the next trip inherits more than 5 minutes of delay. Origin recovery means the next trip departs within 1 minute of the median-based expected departure time. Scheduled slack is recomputed as current trip scheduled origin time minus the previous trip terminal scheduled time for the same vessel.

Code
terminal_rows <- decomp %>%
  group_by(trip_key) %>%
  slice_max(order_by = stop_sequence, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  transmute(trip_key, terminal_scheduled_time_original = scheduled_time_original)

origin_context <- decomp %>%
  group_by(trip_key) %>%
  slice_min(order_by = stop_sequence, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  transmute(
    trip_key, trip_date = trip_date_et, route_id, direction, vessel_id,
    schedule_daytype, schedule_season,
    origin_scheduled_time_original = prev_stop_scheduled_time_original,
    origin_departure_delay_total, inherited_prev_trip_delay
  )

recovery_pairs_recomputed <- origin_context %>%
  left_join(terminal_rows, by = "trip_key") %>%
  arrange(vessel_id, origin_scheduled_time_original, terminal_scheduled_time_original, trip_key) %>%
  group_by(vessel_id) %>%
  mutate(
    previous_terminal_scheduled_time_original = lag(terminal_scheduled_time_original),
    planned_turnaround_gap_min = as.numeric(difftime(origin_scheduled_time_original,
                                                      previous_terminal_scheduled_time_original, units = "mins"))
  ) %>%
  ungroup() %>%
  mutate(
    inherited_prev_trip_delay     = pmax(as.numeric(inherited_prev_trip_delay), 0, na.rm = FALSE),
    origin_departure_delay_total  = as.numeric(origin_departure_delay_total),
    origin_recovered_1min         = origin_departure_delay_total <= 1,
    origin_recovered_minutes      = inherited_prev_trip_delay - origin_departure_delay_total
  )

inherit_levels <- c("5–10", "10–15", "15–20", "20–25", "25–30", "30+")
slack_levels   <- c("5–10", "10–15", "15–20", "20–25", "25–30")
mask_lookup <- tibble(
  inherited_delay_bin = factor(inherit_levels, levels = inherit_levels),
  inherited_floor     = c(5, 10, 15, 20, 25, 30)
) %>%
  crossing(tibble(slack_bin = factor(slack_levels, levels = slack_levels), slack_ceiling = c(10, 15, 20, 25, 30))) %>%
  mutate(display_cell = inherited_floor < slack_ceiling)

recovery_base <- recovery_pairs_recomputed %>%
  mutate(
    inherited_delay = inherited_prev_trip_delay,
    slack           = planned_turnaround_gap_min,
    origin_recovered = origin_recovered_1min,
    inherited_delay_bin = case_when(
      inherited_delay <= 10 ~ "5–10",  inherited_delay <= 15 ~ "10–15",
      inherited_delay <= 20 ~ "15–20", inherited_delay <= 25 ~ "20–25",
      inherited_delay <= 30 ~ "25–30", TRUE ~ "30+"
    ),
    slack_bin = case_when(
      slack <= 10 ~ "5–10",  slack <= 15 ~ "10–15",
      slack <= 20 ~ "15–20", slack <= 25 ~ "20–25",
      slack <= 30 ~ "25–30", TRUE ~ NA_character_
    )
  ) %>%
  filter(inherited_delay > 5, !is.na(slack), slack >= 5, slack <= 30, !is.na(origin_recovered))

recovery_heatmap_data <- recovery_base %>%
  mutate(inherited_delay_bin = factor(inherited_delay_bin, levels = inherit_levels),
         slack_bin           = factor(slack_bin,           levels = slack_levels)) %>%
  group_by(inherited_delay_bin, slack_bin) %>%
  summarise(
    n = n(),
    origin_recovery_rate          = mean(origin_recovered,         na.rm = TRUE),
    median_inherited_delay        = median(inherited_delay,        na.rm = TRUE),
    median_slack                  = median(slack,                  na.rm = TRUE),
    median_origin_departure_delay = median(origin_departure_delay_total, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  left_join(mask_lookup, by = c("inherited_delay_bin", "slack_bin")) %>%
  filter(display_cell)

recovery_heatmap <- ggplot(recovery_heatmap_data,
                           aes(x = slack_bin, y = inherited_delay_bin, fill = origin_recovery_rate)) +
  geom_tile(color = "white", linewidth = 0.7) +
  geom_text(aes(label = paste0(percent(origin_recovery_rate, accuracy = 1), "\n", "n=", n)), size = 3.3) +
  scale_fill_gradient(low = "#F6DCA8", high = accent, labels = percent_format(), limits = c(0, 1)) +
  labs(title = "Origin Recovery by Inherited Delay and Slack",
       x = "Scheduled slack (minutes)", y = "Inherited delay (minutes)", fill = "Recovered") +
  plot_theme

scatter_data <- recovery_pairs_recomputed %>%
  mutate(inherited_delay  = inherited_prev_trip_delay,
         slack            = planned_turnaround_gap_min,
         origin_recovered = origin_recovered_1min) %>%
  filter(inherited_delay > 5, !is.na(slack), slack >= 5, slack <= 30,
         inherited_delay <= 40, !is.na(origin_recovered))

recovery_scatter <- ggplot(scatter_data, aes(x = inherited_delay, y = slack, color = origin_recovered)) +
  geom_point(alpha = 0.3, size = 1.2) +
  coord_cartesian(ylim = c(5, 30), xlim = c(5, 40)) +
  scale_color_manual(values = c(`TRUE` = accent, `FALSE` = coral),
                     labels = c(`TRUE` = "Recovered", `FALSE` = "Not recovered")) +
  labs(title = "Origin Recovery Scatter",
       x = "Inherited delay (minutes)", y = "Scheduled slack (minutes)", color = NULL) +
  plot_theme
Code
print(recovery_heatmap)

Origin recovery among trips inheriting more than 5 minutes of delay.
Code
print(recovery_scatter)

Origin recovery among trips inheriting more than 5 minutes of delay.

The recovery heatmap shows a clear interaction between inherited delay and scheduled slack. When inherited delay is only 5–10 minutes, recovery becomes much more likely as slack increases: recovery is 13% with 5–10 minutes of slack, 31% with 10–15 minutes, and reaches 70–83% once slack is above 15 minutes. The scatter plot confirms the heatmap pattern: recovered trips cluster where inherited delay is small and scheduled slack is large.

2.1.5.5 Conclusion

Together, these EDA modules show that ferry delay is operationally structured. Trip notes identify recurring disruption categories. The decomposition shows that delayed stop observations are dominated by inherited and carried delay rather than only newly generated local delay. Genesis tracing further shows that the place where delay is observed is not always the place where delay is generated; upstream landings and segments can create delay that is recorded later downstream. Finally, the recovery analysis shows that scheduled slack can absorb moderate inherited delay, but recovery becomes unlikely when inherited delay is large or when slack is only marginal.

2.2 Automatic Identification System (AIS) Data

By: Sujan Kakumanu

2.2.1 Background

At a high level, AIS (Automatic Identification System) allows large marine vehicles to report their status (location, speed, heading, voyage data, etc). This is used to communicate with other vessels and authorities on-shore (traffic management, for example).

Depending on the size of the vessel, data can be transmitted as frequently as every 2 seconds. As a result, the provided data sets are incredibly large. The analysis code below focuses on one CSV (AIS from June 2025 in NY Harbor). This file is still too large, and will not be uploaded alongside this script.

Sources:

AIS Data Dictionary: https://coast.noaa.gov/data/marinecadastre/ais/data-dictionary.pdf Vessel type codes: https://coast.noaa.gov/data/marinecadastre/ais/VesselTypeCodes2018.pdf US Coast Guard AIS Description: https://www.navcen.uscg.gov/automatic-identification-system-overview

Columns of Interest (grouped by loose categories):

  • Vessel features:
    • mmsi: a unique identifying integer
    • vessel_name: name of the ship, seems to include the operator name
    • vessel_type: A code that defines vessel type, cargo type, etc. See link above.
    • length; width: in meters
  • Travel features:
    • sog: speed in knots
    • cog: course in degrees (course is the actual direction of the vessel, taking into account wind and current)
    • base_date_time: UTC date time
    • heading: true heading in degrees (the direction of the bow w.r.t true north)
    • latitude; longitude: in decimal degree

2.2.2 Distributions of Select Variables

For this initial exploration, I want to trim the data down to a specific time frame to help with analysis speed. I picked a Wednesday in June to hopefully get a typical working day of data in the harbor.

Code
# Loading the data. Trimming it to Wednesday June 4th for easier analysis at this point.
ais_06_2025 <- read.csv("AIS_2025.06.csv")

ais_06_16_2025_trimmed <- ais_06_2025 %>%
  mutate(base_date_time = ymd_hms(base_date_time)) %>%
  filter(date(base_date_time) == ymd("2025-06-16"))

write.csv(ais_06_16_2025_trimmed, "AIS_2025_06_16_filtered.csv", row.names = FALSE)
Code
ais_06_16_2025_trimmed <- read.csv("../exploratory_code/ais-eda/AIS_2025_06_16_filtered.csv")

Frequency of Vessel Types Below is a visual of how many pings certain vessel types are sending.

Code
ggplot(ais_06_16_2025_trimmed, aes(x = as.factor(vessel_type))) +
  geom_bar(fill = "#1A9988") +
  labs(title = "Frequency of Vessel Types",
       subtitle = "June 16, 2025 - NY Harbor",
       x = "Vessel Type Code",
       y = "Number of Pings") +
  theme_minimal()

Most Common Vessel codes:

  • 31: Tug and Tow - Towing
  • 37: Pleasure craft
  • 52: Tug and Tow - Tugging
  • 60: All ships of passenger type

Vessel Size

Code
ggplot(ais_06_16_2025_trimmed, aes(x = length)) +
  geom_histogram(binwidth = 10, fill = "#1A9988", color = "white") +
  labs(title = "Distribution of Vessel Lengths",
       x = "Length (meters)",
       y = "Count") +
  theme_minimal()

Traffic Volume

Code
# Note that I ignore docking/docked ships. Noticed a lot of zeros
ais_06_16_2025_trimmed %>%
  filter(sog > 0.5) %>% 
  ggplot(aes(x = sog)) +
  geom_histogram(binwidth = 1, fill = "#1A9988", color = "white") +
  labs(title = "Speed Distribution of Moving Vessels",
       subtitle = "Excludes stationary vessels (SOG > 0.5 knots)",
       x = "Speed Over Ground (knots)",
       y = "Number of Pings") +
  theme_minimal()

2.2.3 Hexbin Visualizations

We can aggregate traffic to hexbin cells to understand traffic patterns. Instead of counting pings (raw transmissions) per hexbin, we count unique vessels (distinct MMSIs). This matters because a slow-moving tanker anchored for hours generates far more pings than a fast tug crossing the harbor — ping count reflects dwell time, not traffic volume. Vessel count is a more honest measure of how busy a given area of the harbor actually is.

Data Loading Same June 16–22 subset as before.

Code
ais <- read.csv("../exploratory_code/ais-eda/AIS_2025_06_16_filtered.csv") %>%
  mutate(
    base_date_time = ymd_hms(base_date_time),
    date = as.Date(base_date_time),
    hour = hour(base_date_time)
  )

We created a geojson containing a stable hexbin grid to be used for our project.

Code
ais_sf <- ais %>%
  filter(sog > 0.5) %>%
  drop_na(longitude, latitude) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(32118)

fishnet <- st_read("../exploratory_code/hex_grid_100m.geojson") %>%
  st_transform(32118) %>%
  mutate(grid_id = row_number())
Reading layer `hex_grid_100m' from data source 
  `/Users/sujankakumanu/Documents/musa-8010-nycedc/exploratory_code/hex_grid_100m.geojson' 
  using driver `GeoJSON'
Simple feature collection with 70742 features and 0 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 293123.6 ymin: 43240.14 xmax: 314873.6 ymax: 71501.44
Projected CRS: NAD83 / New York Long Island

Aggregating to Vessel Count

We filter to moving vessels (SOG > 0.5 knots) to exclude anchored or docked ships. Count distinct MMSIs per cell per hour — not raw pings.

Code
traffic <- fishnet %>%
  st_join(ais_sf) %>%
  filter(!is.na(mmsi)) %>%
  group_by(grid_id, date, hour) %>%
  summarise(
    vessel_count = n_distinct(mmsi),
    ping_count = n(),
    .groups = "drop"
  )

Ping Count vs. Vessel Count This is a quick sanity check to make sure we’re on the right path. Most hexbin cells have more pings than unique vessels, which is exactly what we expect. Vessels passing through a cell generate multiple pings.

Code
traffic %>%
  st_drop_geometry() %>%
  group_by(grid_id) %>%
  summarise(
    total_pings = sum(ping_count),
    total_vessels = sum(vessel_count),
    .groups = "drop"
  ) %>%
  filter(total_pings > 0) %>%
  ggplot(aes(x = total_pings, y = total_vessels)) +
  geom_point(alpha = 0.3, color = "#1A9988") +
  geom_abline(linetype = "dashed", color = "#D95F59") +
  labs(
    title = "Ping Count vs. Vessel Count per Hexbin Cell",
    x = "Total Pings",
    y = "Total Unique Vessels"
  ) +
  theme_minimal()

2.2.3.1 Harbor-Wide Vessel Density

Aggregated across the full day to show overall traffic patterns in the harbor. Note the use of a log scale — vessel counts vary dramatically across cells, and a log transformation allows us to see variation in both busy and quiet areas. Log10 is used for better interpretability.

Code
traffic_day <- traffic %>%
  st_drop_geometry() %>%
  group_by(grid_id) %>%
  summarise(vessel_count = sum(vessel_count), .groups = "drop")

fishnet %>%
  left_join(traffic_day, by = "grid_id") %>%
  ggplot() +
  geom_sf(aes(fill = vessel_count), color = NA) +
  scale_fill_viridis_c(option = "inferno", trans = "log10", na.value = "grey90") +
  labs(
    title = "Harbor-Wide Vessel Density (Unique Vessels)",
    subtitle = "NY Harbor — June 16, 2025",
    fill = "Vessel Count (Log10)"
  ) +
  theme_minimal() +
  theme(axis.text = element_blank(), panel.grid = element_blank())

Here, we can see some clear patterns that are validated by our domain knowledge. The East River shows a lot of traffic in a tight corridor. Same with the Hudson River, flowing into the area north of the Staten Island.

2.2.3.2 Sample Stop-Level Traffic Exploration

Just to explore, I take 3 major ferry stops (hunters point south isn’t “major”, but just an interesting one): Question: how does vessel traffic in the surrounding area vary across the day? NOTE: Stops are hard coded here. This is only for exploration purposes. NOTE 2: We’re using 500m buffers to understand patterns further away from stops. The fishnet is still 200m.

Code
ferry_stops <- tibble(
  stop_name = c("Wall St / Pier 11", "East 34th St", "Hunters Point South"),
  longitude = c(-74.00625, -73.97095, -73.96137),
  latitude  = c(40.70322,  40.74409,  40.74185)
) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(32118)

stop_buffers <- ferry_stops %>%
  mutate(geometry = st_buffer(geometry, dist = 500))

stop_cells <- fishnet %>%
  st_join(stop_buffers) %>%
  filter(!is.na(stop_name)) %>%
  st_drop_geometry() %>%
  select(grid_id, stop_name)

stop_traffic <- traffic %>%
  st_drop_geometry() %>%
  inner_join(stop_cells, by = "grid_id") %>%
  group_by(stop_name, date, hour) %>%
  summarise(vessel_count = sum(vessel_count), .groups = "drop") %>%
  mutate(
    date = as.Date(date),
    datetime = ymd_h(paste(date, hour))
  )

ggplot(stop_traffic, aes(x = datetime, y = vessel_count, color = stop_name)) +
  geom_line() +
  geom_point(size = 1.5) +
  labs(
    title = "Vessel Traffic Near Ferry Stops",
    subtitle = "500m buffer — June 16, 2025",
    x = NULL,
    y = "Unique Vessels per Hour",
    color = "Stop"
  ) +
  scale_color_manual(values = c(
    "Wall St / Pier 11"    = "#1A9988",
    "East 34th St"         = "#D95F59",
    "Hunters Point South"  = "#7BA7C7"
  )) +
  theme_minimal() +
  theme(legend.position = "bottom")

2.2.3.3 In-Water Traffic Along Route Corridors

Beyond stop-level traffic, we want to look at traffic ferries encounter as they transit between stops. This is particularly valuable for understanding in-water delays.

The route geometry below is a straight-line placeholder between East 34th St and Hunters Point South. NOTE: This will be replaced with real SWIFTLY route geometry once that data is wrangled.

Code
# Placeholder: straight line East 34th St -> Hunters Point South
# To be replaced with SWIFTLY route geometry
route_placeholder <- st_sfc(
  st_linestring(
    matrix(c(-73.97095, 40.74409,
             -73.96137, 40.74185),
           ncol = 2, byrow = TRUE)
  ),
  crs = 4326
) %>%
  st_transform(32118) %>%
  st_sf() %>%
  mutate(route_id = "east_34th_to_hunters_point")

# Buffer 250m along the route corridor
route_buffer <- route_placeholder %>%
  st_buffer(dist = 250)

# Find fishnet cells intersecting the corridor
route_cells <- fishnet %>%
  st_join(route_buffer) %>%
  filter(!is.na(route_id)) %>%
  st_drop_geometry() %>%
  select(grid_id, route_id)

# Vessel count along corridor by hour
route_traffic <- traffic %>%
  st_drop_geometry() %>%
  inner_join(route_cells, by = "grid_id") %>%
  group_by(route_id, date, hour) %>%
  summarise(vessel_count = sum(vessel_count), .groups = "drop") %>%
  mutate(datetime = ymd_h(paste(date, hour)))

ggplot(route_traffic, aes(x = datetime, y = vessel_count)) +
  geom_line(color = "#1A9988") +
  geom_point(size = 1.5, color = "#1A9988") +
  labs(
    title = "Vessel Traffic Along Route Segment",
    subtitle = "Segment: East 34th St to Hunters Point South — 250m buffer, June 16, 2025",
    x = NULL,
    y = "Unique Vessels per Hour"
  ) +
  theme_minimal()

2.2.3.4 Sample Modeling Row

To make the feature concrete, the tables below show what joined observations will look like in the final modeling dataset. The first table shows the stop-level feature, the second shows the route corridor feature. In practice these will be joined to the stop-level delay data by stop, route, date, and hour.

Code
stop_traffic %>%
  filter(hour %in% c(8, 12, 17)) %>%
  select(stop_name, date, hour, vessel_count) %>%
  arrange(stop_name, hour) %>%
  knitr::kable(caption = "Sample: AIS Traffic Feature at Stop Level")
Sample: AIS Traffic Feature at Stop Level
stop_name date hour vessel_count
East 34th St 2025-06-16 8 8
East 34th St 2025-06-16 12 70
East 34th St 2025-06-16 17 48
Hunters Point South 2025-06-16 8 7
Hunters Point South 2025-06-16 12 27
Hunters Point South 2025-06-16 17 15
Wall St / Pier 11 2025-06-16 8 1
Wall St / Pier 11 2025-06-16 12 104
Wall St / Pier 11 2025-06-16 17 63
Code
route_traffic %>%
  filter(hour %in% c(8, 12, 17)) %>%
  select(route_id, date, hour, vessel_count) %>%
  arrange(hour) %>%
  knitr::kable(caption = "Sample: AIS Traffic Feature Along Route Corridor")
Sample: AIS Traffic Feature Along Route Corridor
route_id date hour vessel_count
east_34th_to_hunters_point 2025-06-16 8 8
east_34th_to_hunters_point 2025-06-16 12 73
east_34th_to_hunters_point 2025-06-16 17 46
Code
# 500m context buffer
crop_extent <- route_placeholder %>%
  st_buffer(500) %>%
  st_bbox()

corridor_traffic <- traffic %>%
  st_drop_geometry() %>%
  inner_join(route_cells, by = "grid_id") %>%
  filter(hour %in% c(8, 12, 17)) %>%
  group_by(grid_id, hour) %>%
  summarise(vessel_count = sum(vessel_count), .groups = "drop") %>%
  left_join(fishnet, by = "grid_id") %>%
  st_as_sf()

ggplot() +
  geom_sf(data = fishnet %>% st_crop(crop_extent), fill = "grey95", color = NA) +
  geom_sf(data = corridor_traffic, aes(fill = vessel_count), color = NA) +
  geom_sf(data = route_placeholder, color = "white", linewidth = 0.8) +
  scale_fill_viridis_c(option = "inferno", na.value = "grey80") +
  coord_sf(xlim = c(crop_extent["xmin"], crop_extent["xmax"]),
           ylim = c(crop_extent["ymin"], crop_extent["ymax"])) +
  facet_wrap(~hour, nrow = 1, labeller = as_labeller(c(
    "8"  = "08:00 AM",
    "12" = "12:00 PM",
    "17" = "05:00 PM"
  ))) +
  labs(
    title = "Vessel Traffic Along Route Corridor by Hour",
    subtitle = "East 34th St to Hunters Point South — 250m buffer, June 16, 2025",
    fill = "Vessel Count"
  ) +
  theme_minimal() +
  theme(axis.text = element_blank(), panel.grid = element_blank())

2.3 Swiftly

By: Tyler Maynard

2.3.1 Background

The Swiftly GPS dataset contains location data every 10-60 seconds from the NYC Ferry AVL system. Columns include vessel, route, trip, block, and coordinates. This will be especially useful in calculating the speed of the vehicle at various points in time, helping identify locations, routes, and trips where the system struggles with on-time performance.

Reference Week: June 16-22, 2025

For the sake of processing power, we selected a “sample week” to conduct EDA for Swiftly GPS and AIS data. Based on conversations with NYCEDC, ridership is much higher in the summer months, especially on weekends. Thus, we chose a week during the summer, also including a holiday to evaluate any unique characteristics (if any) during special service days.

Code
special_service <- data.frame(
  dates = as.Date(c("2025-06-19")))

2.3.2 Data Import

The Swiftly GPS data set is available by month in CSV files. The following chunk reads the June 2025 CSV file, filters to the dates included in the sample week, and parses out each record’s date and time into separate columns.

Code
# Original csv
jun2025_original <- read.csv("../exploratory_code/swiftly_eda/swiftly_gps.2025_06.csv")

# Removing extra 'data.' from column names, removing unneeded columns
jun2025 <- jun2025_original %>%
  rename_with(~ gsub("data.", "", .x)) %>%
  select(-(1:3), -(7))

# Converting to datetime
jun2025$time <- ymd_hms(jun2025$time)
jun2025$timeProcessed <- ymd_hms(jun2025$timeProcessed)

# Seeing how many unique values each column has
jun2025 %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(cols = everything()) %>%
  kable()
name value
time 1510826
lat 63986
lon 31883
ispredictable 2
isLayover 2
isWaitStop 2
timeProcessed 2740360
vehicleId 33
headsign 16
blockassignmentorigin 3
assignmentId 2166
assignmentTime 2834
blockAssignmentMethod 8
blockId 28
directionId 3
headwayMsec 1779585
previousVehicleId 34
previousVehicleSchedAdhMsec 611794
routeId 8
routeShortName 8
schedAdh 3212
schedAdhMsec 758901
scheduledHeadwayMsec 193
serviceId 7
serviceName 7
tripId 2169
tripShortName 395
unpredictablestatus.type 2
unpredictablestatus.details.blockId 27
unpredictablestatus.details.serviceDate 31
Code
# Parse dates and times
jun2025 <- jun2025 %>%
  mutate(
    date_time = ymd_hms(time, tz = "America/New_York"),
    processed_date_time = ymd_hms(timeProcessed, tz = "America/New_York"),

    date = as.Date(date_time),
    time = format(date_time, "%H:%M:%S"),
    hour = substr(time,1,2),

    dateProcessed = as.Date(processed_date_time),
    timeProcessed = format(processed_date_time, "%H:%M:%S"),

    dayOfWeek = if_else(wday(date, week_start = 1) %in% c(6,7), "Weekend", "Weekday"),
    dayOfWeek = case_when(
      date %in% special_service$dates ~ "Holiday",
      TRUE ~ dayOfWeek)) %>%
  select(date, time, dateProcessed, timeProcessed, dayOfWeek, headsign, tripShortName, everything(), date_time, processed_date_time)

jun2025_reference_week <- jun2025 %>%
  filter(date %in% c("2025-06-16","2025-06-17","2025-06-18","2025-06-19","2025-06-20","2025-06-21","2025-06-22")) %>%
  filter(!is.na(tripId)) %>%
  arrange(vehicleId, date_time)
str(jun2025_reference_week$time)
 chr [1:851238] "06:21:24" "06:21:29" "06:21:44" "06:21:59" "06:22:14" ...

2.3.3 Shapefiles

As a way to add more spatial context to this analysis, the landings, routes, hydrology, and boroughs were extracted from the NYC Ferry and NYC OpenData websites. The hydrology object was trimmed to only include hydrology contained within the bounding box created around the study area.

Code
landings <- read_sf("../exploratory_code/swiftly_eda/Shapefiles/Landings/landings.shp") %>%
  st_transform(32118)

landings_east_river <- landings %>%
  filter(!stop_id %in% c("37","39","43","45","46","48","49","50",
    "51","52","55","56","57","58","61","62",
    "16","88","103","104","105","142",
    "145","149","150","151")) %>%
  st_transform(32118)

routes <- read_sf("../exploratory_code/swiftly_eda/Shapefiles/Routes/routes.shp") %>%
  group_by(route_id) %>%
  summarise() %>%
  st_transform(32118)

boroughs <- read_sf("../exploratory_code/swiftly_eda/Shapefiles/Boroughs/boroughs.shp") %>%
  st_transform(32118)

bbox_east_river <- st_bbox(landings_east_river)
bbox_expanded_east_river <- bbox_east_river
bbox_expanded_east_river["xmin"] <- bbox_east_river["xmin"] - 15000
bbox_expanded_east_river["ymin"] <- bbox_east_river["ymin"] - 15000
bbox_expanded_east_river["xmax"] <- bbox_east_river["xmax"] + 15000
bbox_expanded_east_river["ymax"] <- bbox_east_river["ymax"] + 15000
bbox_poly_east_river <- st_as_sfc(bbox_expanded_east_river)

water <- read_sf("../exploratory_code/swiftly_eda/Shapefiles/Hydrology/WBDHU2.shp") %>%
  st_transform(32118) %>%
  summarise(geometry = st_union(geometry)) %>%
  st_intersection(st_transform(st_as_sfc(st_bbox(bbox_expanded_east_river)), 32118))
Code
bbox <- st_bbox(c(
  xmin = -74.30603027343751,
  ymin = 40.492915026895815,
  xmax = -73.64959716796876,
  ymax = 40.919739051062216
), crs = 4326)

bbox_area <- st_as_sfc(bbox) %>%
  st_transform(32118)

bbox_service_area <- boroughs %>%
  st_transform(32118) %>%
  st_bbox()

2.3.4 Data Cleaning

The following chunk creates sf objects for each data point and computes new fields for feature engineering. Features created include:

  • Knots: based on the time and distance of the previous node of the same trip.
  • Key identifiers: for different analyses, generally including date, trip, and route.
  • OTP: classifies on-time performance at each node based on the schedule adherence column (schedAdhMsec).
  • Excessively Late: a binary field indicating whether a record is more than 10 minutes past the expected time.
  • First Record Late: indicates if the trip starts five minutes past its scheduled time.
  • OTP Transition: identifies trips where the OTP changes from one status to another (e.g., on-time to late).

Of the 851,238 data points within this sample week, 25,558 were identified as excessively late — approximately three percent of all records.

Code
pts <- st_as_sf(jun2025_reference_week, coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
  st_transform(32118)

xy <- st_coordinates(pts)

stopifnot(nrow(jun2025_reference_week) == nrow(xy))

jun2025_reference_week <- jun2025_reference_week %>%
  mutate(x = xy[, 1], y = xy[, 2]) %>%
  group_by(date, blockId, routeId, tripId, directionId, dayOfWeek) %>%
  arrange(date_time, .by_group = TRUE) %>%
  mutate(
    directionId = case_when(
      directionId == 1 ~ "NB",
      TRUE ~ "SB"
    ),
    isFirst = row_number() == 1,
    isSecond = row_number() == 2,
    dist_to_prev_m = round(sqrt((x - lag(x))^2 + (y - lag(y))^2), 2),
    time_diff_sec  = as.numeric(difftime(date_time, lag(date_time), units = "secs")),
    m_per_sec      = if_else(is.na(time_diff_sec) | time_diff_sec <= 0,
                             0,
                             round(dist_to_prev_m / time_diff_sec, 2)),
    knots = m_per_sec * 1.94384,
    date_route_trip_direction_vehicle_dow = paste(date, routeId, tripId, directionId, vehicleId, dayOfWeek, sep = " "),
    key_identifier = paste(date, blockId, routeId, tripId, directionId, dayOfWeek, sep = " "),
    dailyTrip = paste(dayOfWeek, routeId, tripId, sep = " "),
    countDailyTrip = n(),
    OTP = case_when(
      schedAdhMsec < -60000 ~ "Early",
      schedAdhMsec > 600000 ~ "Excessively Late",
      schedAdhMsec > 300000 ~ "Late",
      TRUE ~ "Ontime"
    ),
    excessivelyLate = case_when(
      OTP == "Excessively Late" ~ 1,
      TRUE ~ 0
    ),
    timeOfDay = as.integer(hour(date_time)),
    firstRecordLate = if_else(isSecond == 1 & (OTP == "Late" | OTP == "Excessively Late"), "Yes", "No"),
    OTP_prev = lag(OTP),
    OTP_changed = coalesce(OTP != OTP_prev, FALSE),
    OTP_transition = case_when(
      isFirst ~ "Start",
      is.na(OTP_prev) ~ "Start",
      OTP == OTP_prev ~ "No Change",
      TRUE ~ paste0(OTP_prev, " -> ", OTP)
    )
  ) %>%
  ungroup() %>%
  select(-x, -y, -m_per_sec) %>%
  select(dailyTrip, key_identifier, date_route_trip_direction_vehicle_dow, date, time, dateProcessed, timeProcessed, OTP, OTP_transition, excessivelyLate, timeOfDay, schedAdh, dayOfWeek,
         dist_to_prev_m, time_diff_sec, knots, headsign, tripShortName,
         everything(), date_time, processed_date_time)

summary_table <- data.frame(
  Statistic = names(summary(jun2025_reference_week$knots)),
  Value     = as.numeric(summary(jun2025_reference_week$knots)),
  row.names = NULL
)

jun2025_reference_week_sf <- st_as_sf(jun2025_reference_week, coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
  st_transform(32118)

kable(summary_table, digits = 2, caption = "Summary of Knots") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  ) %>%
  column_spec(1:2, width = "150px")
Summary of Knots
Statistic Value
Min. 0.00
1st Qu. 3.11
Median 10.77
Mean 11.92
3rd Qu. 20.57
Max. 4308.46
Code
jun2025_reference_week %>%
  count(excessivelyLate, sort = TRUE)
# A tibble: 2 × 2
  excessivelyLate      n
            <dbl>  <int>
1               0 825681
2               1  25557
Code
jun2025_reference_week_excessively_lates <- jun2025_reference_week %>%
  filter(OTP == "Excessively Late")

jun2025_reference_week_excessively_lates_sf <- st_as_sf(jun2025_reference_week_excessively_lates, coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
  st_transform(32118)

hull <- concaveman(jun2025_reference_week_excessively_lates_sf)
bbox_hull <- st_bbox(hull)

ggplot() +
  geom_sf(data = boroughs, color = "white") +
  geom_sf(data = hull, color = "#D95F59", fill = NA, size = 3) +
  coord_sf(
    xlim = c(bbox_east_river["xmin"] - 10000, bbox_east_river["xmax"] + 10000),
    ylim = c(bbox_east_river["ymin"] - 10000, bbox_east_river["ymax"] + 10000),
    expand = FALSE
  ) +
  labs(title = "Excessively Late 'Zone' - Concave Polygon") +
  theme_solid() +
  theme(
    plot.title = element_text(size = 20),
    plot.title.position = "plot"
  )

Code
jun2025_reference_week %>%
  count(routeId)
# A tibble: 7 × 2
  routeId      n
  <chr>    <int>
1 AS      121703
2 ER      189153
3 GI        9921
4 RW      152398
5 SB      129553
6 SG      118302
7 SV      130208

2.3.5 OTP by Direction

This chunk computes on-time performance by direction (NB or SB), filtering to excessively late records. Northbound and southbound trips comprise roughly the same share of all trips; however, the higher number of excessively late southbound data points could be indicative of other issues.

Code
jun2025_reference_week_dir <- jun2025_reference_week %>%
  select(dailyTrip, directionId, OTP, routeId) %>%
  filter(OTP == "Excessively Late")

jun2025_reference_week_dir %>%
  count(directionId)
# A tibble: 2 × 2
  directionId     n
  <chr>       <int>
1 NB          12069
2 SB          13488
Code
jun2025_reference_week_dir %>%
  count(directionId, routeId)
# A tibble: 12 × 3
   directionId routeId     n
   <chr>       <chr>   <int>
 1 NB          AS         51
 2 NB          ER       2182
 3 NB          GI         58
 4 NB          RW       8300
 5 NB          SB        671
 6 NB          SG          5
 7 NB          SV        802
 8 SB          AS        472
 9 SB          ER       4569
10 SB          RW       6040
11 SB          SB        696
12 SB          SV       1711

2.3.6 Bar Charts

The bar charts below visualize key engineered features. The first chart shows the distribution of excessively late trips by the time spent excessively late (as a percentage of total data points per trip).

Three key takeaways from the second plot:

  1. Even though weekdays account for the majority of data points, they account for under half of excessively late data points. Weekend service is the only category where excessively late counts surpass non-excessively late counts.
  2. Nearly 80 percent of excessively late data points are attributed to the East River and Rockaway routes.
  3. The PM peak period (approximately 1pm to 10pm) accounts for the overwhelming majority of problematic data points.
Code
excessivelyLatePlotVariables <-
  jun2025_reference_week %>%
  as.data.frame() %>%
  select(excessivelyLate, dayOfWeek, routeId, timeOfDay, blockId, directionId, firstRecordLate) %>%
  gather(variable, value, -excessivelyLate) %>%
  filter(!is.na(value))

jun2025_reference_week_wide <- jun2025_reference_week %>%
  count(key_identifier, OTP) %>%
  pivot_wider(
    names_from = OTP,
    values_from = n,
    values_fill = 0
  ) %>%
  mutate(Total = rowSums(across(-key_identifier)))

jun2025_reference_week_wide
# A tibble: 2,133 × 6
   key_identifier                   Early Ontime  Late `Excessively Late` Total
   <chr>                            <int>  <int> <int>              <int> <dbl>
 1 2025-06-16 11 ER 1220 NB Weekday    72    129     0                  0   201
 2 2025-06-16 11 ER 1224 NB Weekday   191    189     0                  0   380
 3 2025-06-16 11 ER 1257 SB Weekday   260    210    28                  0   498
 4 2025-06-16 11 ER 4104 NB Weekday    86    309     0                  0   395
 5 2025-06-16 11 ER 4109 NB Weekday    24    223     0                  0   247
 6 2025-06-16 11 ER 4113 NB Weekday    52    330     0                  0   382
 7 2025-06-16 11 ER 4116 NB Weekday    19    265    25                  0   309
 8 2025-06-16 11 ER 4119 NB Weekday    85    217     0                  0   302
 9 2025-06-16 11 ER 4123 NB Weekday     0    116   166                 24   306
10 2025-06-16 11 ER 4128 NB Weekday     0     32    30                293   355
# ℹ 2,123 more rows
Code
jun2025_reference_week_percents <- jun2025_reference_week_wide %>%
  mutate(round(across(-c(key_identifier, Total), ~ . / Total * 100), 2)) %>%
  pivot_longer(
    cols = -c(key_identifier, Total),
    names_to = "OTP",
    values_to = "Percent"
  ) %>%
  filter(OTP == "Excessively Late", Percent > 0) %>%
  mutate(Percent_Rounded = round(Percent, 0))

mean_val <- mean(jun2025_reference_week_percents$Percent_Rounded)
median_val <- median(jun2025_reference_week_percents$Percent_Rounded)

ggplot() +
  geom_histogram(data = jun2025_reference_week_percents, color = "white", fill = "#1A9988", linewidth = 0.3, aes(x = Percent_Rounded), binwidth = 10) +
  geom_vline(xintercept = mean_val, color = "#D95F59", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = median_val, color = "#F2C14E", linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean_val, y = 15, label = "Mean", color = "#D95F59", angle = 90, vjust = -0.5) +
  annotate("text", x = median_val, y = 15, label = "Median", color = "#F2C14E", angle = 90, vjust = -0.5) +
  labs(
    title = "Distribution of Excessively Late Trips by Time Spent Excessively Late",
    x = "Percent",
    y = "Count"
  ) +
  theme_excel_new() +
  theme(
    axis.title = element_text(size = 14),
    plot.title = element_text(size = 20),
    plot.subtitle = element_text(size = 16),
    plot.title.position = "plot"
  )

Code
plot_df <- excessivelyLatePlotVariables %>%
  group_by(excessivelyLate, variable, value) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(excessivelyLate, variable) %>%
  mutate(prop = count / sum(count)) %>%
  ungroup()

ggplot(plot_df, aes(x = ifelse(variable == "timeOfDay", as.numeric(value), value),
                    y = prop, fill = as.factor(excessivelyLate))) +
  geom_col(position = "dodge") +
  facet_wrap(~ variable, scales = "free_x") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("0" = "#1A9988", "1" = "#D95F59"),
                    labels = c("Not Excessively Late", "Excessively Late"),
                    name = "") +
  labs(x = NULL, y = "Percent within group") +
  theme_excel_new()

Code
jun2025_reference_week <- jun2025_reference_week %>%
  mutate(day_route = paste(dayOfWeek, routeId, sep = " "))

speed_by_route <- jun2025_reference_week %>%
  group_by(day_route, dayOfWeek, routeId) %>%
  summarise(
    avgKnots = mean(knots, na.rm = TRUE),
    medKnots = median(knots, na.rm = TRUE),
    n = dplyr::n(),
    .groups = "drop"
  )

ggplot() +
  geom_col(data = speed_by_route, aes(x = day_route, y = avgKnots, fill = dayOfWeek)) +
  labs(
    title = "Average Speed by Route and Service Level",
    x = "Service Level",
    y = "Average Knots"
  ) +
  scale_fill_manual(
    values = c(
      "Holiday" = "#F2C14E",
      "Weekday" = "#1A9988",
      "Weekend" = "#D95F59"
    )
  ) +
  theme_excel_new() +
  theme(
    plot.title = element_text(size = 20),
    plot.subtitle = element_text(size = 16),
    plot.title.position = "plot",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

2.3.7 Landings Map

Code
ggplot() +
  geom_sf(data = boroughs, color = "white") +
  geom_sf(data = landings, color = "black", size = 3) +
  labs(title = "NYC Ferry Landings") +
  theme_solid() +
  theme(
    plot.title = element_text(size = 16),
    plot.title.position = "plot"
  )

2.3.8 Fishnet and Kernel Density

Since these datasets contain thousands of points per day, aggregation is the most efficient way to process and observe the data. A fishnet was created using a grid cell side of 250 meters. The figure represents the data for the sample week, excluding any grid cell with a density of 0.

Code
fishnet_swiftly <-
  st_make_grid(water, cellsize = 250) %>%
  st_sf() %>%
  mutate(uniqueID = rownames(.))

jun2025_net <- st_join(
  fishnet_swiftly, jun2025_reference_week_excessively_lates_sf, join = st_intersects) %>%
  group_by(uniqueID) %>%
  summarise(mean_speed = mean(knots, na.rm = TRUE),
            n = n(), .groups = "drop") %>%
  filter(!is.na(mean_speed))

jun2025_ppp <- as.ppp(st_coordinates(jun2025_reference_week_excessively_lates_sf), W = st_bbox(jun2025_net))
jun2025_KD <- density.ppp(jun2025_ppp, sigma = 1000)

as.data.frame(jun2025_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(jun2025_net)) %>%
  aggregate(., jun2025_net, mean) %>%
  ggplot() +
    geom_sf(aes(fill = value), color = NA, linewidth = 0) +
    geom_sf(data = landings, color = "black", size = 1, alpha = 0.5) +
    scale_fill_viridis(name = "Density") +
    labs(title = "Kernel Density of Excessively Late Data Points, 6/16/25 to 6/22/25") +
    theme_solid() +
    theme(
      plot.title = element_text(size = 16),
      plot.title.position = "plot"
    )

2.3.9 Mean Speed

The following figure shows the average speed of each data point in knots per grid cell, regardless of on-time performance designation.

Code
ggplot(jun2025_net) +
  geom_sf(data = boroughs, color = "white") +
  geom_sf(data = landings, color = "black", size = 2, alpha = 0.5) +
  geom_sf(aes(fill = mean_speed), color = NA) +
  coord_sf(
    xlim = c(bbox_hull["xmin"] - 100, bbox_hull["xmax"] + 100),
    ylim = c(bbox_hull["ymin"] - 200, bbox_hull["ymax"] + 200),
    expand = FALSE
  ) +
  scale_fill_viridis_c(option = "D", name = "Knots") +
  labs(title = "Mean Vessel Speed by Cell") +
  theme_solid() +
  theme(
    plot.title = element_text(size = 16),
    plot.title.position = "plot"
  )

2.3.10 Kernel Density Plots — Hourly

To analyze how OTP changes spatially throughout the PM peak, kernel densities of excessively late data points were mapped for each hour between 1pm and 10pm.

Code
make_hourly_kd <- function(hour_val) {
  net <- st_join(fishnet_swiftly, jun2025_reference_week_excessively_lates_sf, join = st_intersects) %>%
    group_by(uniqueID) %>%
    filter(hour == hour_val) %>%
    summarise(mean_speed = mean(knots, na.rm = TRUE), n = n(), .groups = "drop")
  kd <- density.ppp(as.ppp(st_coordinates(net), W = st_bbox(jun2025_net)), sigma = 1000)
  kd
}

kd_list <- list(
  '1PM'  = make_hourly_kd("13"),
  '2PM'  = make_hourly_kd("14"),
  '3PM'  = make_hourly_kd("15"),
  '4PM'  = make_hourly_kd("16"),
  '5PM'  = make_hourly_kd("17"),
  '6PM'  = make_hourly_kd("18"),
  '7PM'  = make_hourly_kd("19"),
  '8PM'  = make_hourly_kd("20"),
  '9PM'  = make_hourly_kd("21"),
  '10PM' = make_hourly_kd("22")
)

all_values <- unlist(lapply(kd_list, \(kd) as.data.frame(kd)$value))
global_limits <- range(all_values, na.rm = TRUE)

plots <- imap(kd_list, ~{
  sf_pts <- st_as_sf(as.data.frame(.x), coords = c("x", "y"), crs = st_crs(jun2025_net))
  agg <- aggregate(sf_pts["value"], jun2025_net, mean, na.rm = TRUE)

  ggplot() +
    geom_sf(data = boroughs, color = "white") +
    geom_sf(data = agg, aes(fill = value), color = NA, linewidth = 0) +
    coord_sf(
      xlim = c(bbox_hull["xmin"] - 100, bbox_hull["xmax"] + 100),
      ylim = c(bbox_hull["ymin"] - 100, bbox_hull["ymax"] + 100),
      expand = FALSE
    ) +
    scale_fill_viridis_c(name = "Kernel Density", limits = global_limits, oob = scales::squish) +
    labs(title = paste("KD -", .y)) +
    theme_solid() +
    theme(
      plot.title = element_text(size = 12),
      plot.title.position = "plot"
    )
})

wrap_plots(plots, ncol = 5) +
  plot_layout(guides = "collect") +
  plot_annotation(
    title = "Excessively Late Across TOD",
    caption = "Unified KD Scale, 250m Grid Cell Size"
  ) &
  theme(
    legend.position = "right",
    legend.box = "horizontal",
    plot.margin = margin(1, 1, 1, 1)
  )

One potentially useful consideration is to observe locations where trips change OTP status. The following figure includes only instances where OTP status changes to either late or excessively late.

Code
jun2025_reference_week_late_sf <- jun2025_reference_week_sf %>%
  filter(OTP_transition %in% c("Early -> Late", "Late -> Excessively Late",
                                "Ontime -> Late", "Ontime -> Excessively Late"))

ggplot() +
  geom_sf(data = boroughs, color = "white") +
  geom_sf(data = landings, color = "black", size = 2, alpha = 0.2) +
  geom_sf(data = jun2025_reference_week_late_sf,
          aes(color = OTP_transition), size = 2, alpha = 0.2) +
  coord_sf(
    xlim = c(bbox_east_river["xmin"] - 10000, bbox_east_river["xmax"] + 10000),
    ylim = c(bbox_east_river["ymin"] - 10000, bbox_east_river["ymax"] + 10000),
    expand = FALSE
  ) +
  labs(title = "OTP Transition Locations") +
  scale_color_manual(
    name = "Transition Type",
    values = c(
      "Early -> Late"               = "#7BA7C7",
      "Ontime -> Late"              = "#1A9988",
      "Ontime -> Excessively Late"  = "#F2C14E",
      "Late -> Excessively Late"    = "#D95F59"
    ),
    labels = c(
      "Early -> Late"               = "Early to Late",
      "Ontime -> Late"              = "Ontime to Late",
      "Ontime -> Excessively Late"  = "Ontime to Excessively Late",
      "Late -> Excessively Late"    = "Late to Excessively Late"
    )
  ) +
  facet_wrap(~OTP_transition) +
  theme_solid() +
  theme(
    plot.title = element_text(size = 20),
    plot.title.position = "plot"
  )

2.3.11 Leaflet Distribution

Code
landings_buffers_100ft <- st_buffer(landings, dist = 30.48)
counts <- lengths(st_intersects(landings_buffers_100ft, jun2025_reference_week_excessively_lates_sf))
landings_buffers_100ft$record_count <- counts

points_plot  <- st_transform(landings, 4326)
buffers_plot <- st_transform(landings_buffers_100ft, 4326)
routes_plot  <- st_transform(routes, 4326)
boroughs_plot <- st_transform(boroughs, 4326)

jun2025_reference_week_elate_sf <- jun2025_reference_week_sf %>%
  filter(OTP == "Excessively Late") %>%
  st_transform(4326)

jun2025_reference_week_not_elate_sf <- jun2025_reference_week_sf %>%
  filter(OTP != "Excessively Late") %>%
  st_transform(4326)

leaflet() %>%
  addProviderTiles("Esri.WorldImagery", group = "Imagery",
                   options = tileOptions(opacity = 0.6)) %>%
  addPolygons(data = boroughs_plot, color = "#828282", weight = 1, fillOpacity = 0.75, group = "Boroughs") %>%
  addPolylines(data = routes_plot, color = "#1A9988", weight = 2, opacity = 0.2, group = "Routes") %>%
  addPolygons(data = buffers_plot, color = "#F2C14E", weight = 10, fillOpacity = 0.75, group = "Landings",
              label = ~paste0("Records: ", record_count)) %>%
  addCircleMarkers(data = points_plot, radius = 3, color = "#F2C14E", opacity = 0.6, group = "Landings") %>%
  addCircleMarkers(data = jun2025_reference_week_elate_sf, radius = 0.5, color = "#D95F59",
    opacity = 0.6, fillOpacity = 0.4, group = "Excessively Late",
    label = ~paste0("Speed: ", round(knots, 2)),
    popup = ~paste0("<b>Speed:</b> ", round(knots, 2), " knots",
                    "<br><b>Route:</b> ", routeId,
                    "<br><b>Trip:</b> ", tripId,
                    "<br><b>Vessel:</b> ", vehicleId,
                    "<br><b>Time:</b> ", time)) %>%
  addCircleMarkers(data = jun2025_reference_week_not_elate_sf, radius = 0.5, color = "#7BA7C7",
    opacity = 0.6, fillOpacity = 0.4, group = "Not Excessively Late",
    label = ~paste0("Speed: ", round(knots, 2)),
    popup = ~paste0("<b>Speed:</b> ", round(knots, 2), " knots",
                    "<br><b>Route:</b> ", routeId,
                    "<br><b>Trip:</b> ", tripId,
                    "<br><b>Vessel:</b> ", vehicleId,
                    "<br><b>Time:</b> ", time)) %>%
  addLayersControl(
    overlayGroups = c("Boroughs", "Routes", "Landings", "Excessively Late", "Not Excessively Late"),
    baseGroups = c("Imagery"),
    options = layersControlOptions(collapsed = FALSE)
  )

This Leaflet map shows the data with an imagery layer for context. For the sake of processing power, the results are not shown by default — remove the eval: false chunk option to render it.

3 Modeling

3.1 Associative Model

By: Yiming ‘Ming’ Cao

3.1.1 Objective and Design

This section presents an associative model for explaining stop-level arrival lateness in the NYC Ferry system. The goal is interpretation rather than advance prediction. Instead of asking only whether a future trip will be delayed, this model asks which historical factors are associated with observed delay patterns.

The unit of analysis is a stop-level observation. The dependent variable is arrival lateness in minutes relative to the original schedule. Predictors are organized into five explanatory domains: structural context, environmental conditions, temporal patterns, operational propagation, and vessel characteristics.

This grouped design allows us to compare how different types of factors contribute to delay variation while keeping the final model interpretable for operational use.

Code
set.seed(7900)

input_parquet <- "../exploratory_code/modeling/associative/baseline_modeling_table_final_direct.parquet"
outcome_cont  <- "lateness_vs_original_schedule_min"

week_window_days <- 3

candidate_predictors <- c(
  "route_id", "direction", "stop_sequence", "prev_stop_sched_hour", "prev_stop_sched_hour_group",
  "schedule_season", "schedule_daytype", "vessel_trip_seq_in_chain", "vessel_capacity",
  "is_terminal_stop", "prev_stop_shared_stop_flag", "curr_stop_shared_stop_flag",
  "planned_turnaround_gap_min", "scheduled_headway_min", "expected_water_min",
  "prev_stop_sequence", "prev_stop_boardings", "prev_stop_alightings", "prev_stop_onboard",
  "prev_stop_load_factor", "env_temperature_c", "env_wind_speed_mps", "env_visibility_m",
  "env_precipitation_mm", "env_battery_water_level_anomaly", "env_narrows_current_residual",
  "env_battery_water_level_obs", "env_narrows_current_obs"
)
prior_delay_predictors <- c("inherited_prev_trip_delay", "carried_within_trip_delay")

3.1.2 Data Scope

The model uses the stop-level baseline modeling table as its input dataset. After loading the data, we confirmed the number of records and the availability of candidate predictors before fitting the associative models.

Code
if (!file.exists(input_parquet)) stop("Missing input parquet: ", input_parquet)
df_raw <- read_parquet(input_parquet) %>% as_tibble()

df <- df_raw %>%
  mutate(
    trip_date = as.Date(substr(as.character(trip_key), 1, 8), format = "%Y%m%d")
  )

predictors_no_prior   <- intersect(candidate_predictors, names(df))
predictors_with_prior <- unique(c(predictors_no_prior, intersect(prior_delay_predictors, names(df))))

tibble(
  input_rows              = nrow(df_raw),
  rows_after_load         = nrow(df),
  predictors_no_prior_n   = length(predictors_no_prior),
  predictors_with_prior_n = length(predictors_with_prior)
) %>% kable(caption = "Data Scope and Predictor Availability")
Data Scope and Predictor Availability
input_rows rows_after_load predictors_no_prior_n predictors_with_prior_n
613071 613071 28 30

3.1.3 Predictor Domains

To support interpretation, predictors in the associative model were organized into five explanatory domains. This grouped structure separates baseline route and stop context from environmental conditions, temporal patterns, operational propagation, and vessel-related characteristics. The goal is to interpret delay association at the domain level rather than focusing only on isolated individual coefficients.

3.1.3.1 Domain A: Structural Context

The structural domain captures baseline differences in how delay varies across the ferry network. These predictors describe route, stop position, and stop-context features that reflect how service is organized spatially and operationally. Variables in this domain include route and direction identifiers, stop sequence, terminal status, and shared-stop context. We interpret this domain as representing the structural baseline of delay rather than short-term changing conditions.

3.1.3.2 Domain B: Environmental Conditions

The environmental domain captures external natural conditions that may affect vessel movement and docking performance. These predictors include weather and water-related variables such as temperature, wind speed, visibility, precipitation, water-level, and current-related measures. This domain is especially important for the client because it helps distinguish external operating conditions from delay that is simply inherited from earlier parts of the trip.

3.1.3.3 Domain C: Temporal Patterns

The temporal domain captures recurring calendar and time-of-day structure in delay. These predictors describe when service occurs rather than where it occurs or what the vessel has already experienced. Variables in this domain include scheduled hour, scheduled hour group, day type, and season. We use this domain to represent regular service rhythms such as weekday versus weekend patterns, seasonal demand differences, and time-of-day effects.

3.1.3.4 Domain D: Operational Propagation

The operational propagation domain captures delay carried forward through the operating chain. These predictors reflect whether lateness has already accumulated earlier in the trip or in a prior linked trip on the same vessel. Variables in this domain include inherited previous-trip delay and carried within-trip delay. This domain is operationally meaningful, but it is interpreted as one component of the associative model rather than the sole basis for explaining delay.

3.1.3.5 Domain E: Vessel Characteristics

The vessel domain captures systematic differences associated with vessel capacity and operating role. These predictors help assess whether some delay patterns are associated with the characteristics of the vessel being used rather than only with route structure or environmental conditions. Variables in this domain include vessel capacity and related vessel-level operational identifiers retained in the final specification.

3.1.4 Data Preparation

Before model estimation, the stop-level modeling table was prepared to support a consistent and interpretable associative model. This section summarizes the key preparation steps: variable construction, missing-value treatment for environmental fields, and helper functions used for model fitting and summary extraction.

3.1.4.1 Data Cleaning and Variable Construction

The input table was converted into a stop-level analytic dataset. We created a trip date field from the trip identifier, checked which candidate predictors were available, and retained variables needed for the outcome and domain-based model structure.

3.1.4.2 Environmental Variable Imputation

Missing environmental values were imputed using a local time-slot rolling rule. The primary fill used observations from the same scheduled hour within a centered seven-day window around the trip date. If no value was available, the method fell back to the same scheduled hour group within the same centered window.

This approach keeps environmental predictors in the model without relying only on complete-case filtering. It also keeps the imputed values tied to nearby dates and comparable operating periods.

Code
build_week_fill_map <- function(day_slot_tbl, group_col, date_col = "trip_date", value_col = "slot_day_median", window_days = 3) {
  if (nrow(day_slot_tbl) == 0) return(tibble())
  split_tbl <- split(day_slot_tbl, day_slot_tbl[[group_col]], drop = TRUE)
  out <- lapply(split_tbl, function(gdf) {
    gdf <- gdf[order(gdf[[date_col]]), , drop = FALSE]
    d <- gdf[[date_col]]
    v <- gdf[[value_col]]
    week_fill <- vapply(seq_along(v), function(i) {
      idx <- abs(as.integer(d - d[i])) <= window_days
      val <- suppressWarnings(median(v[idx], na.rm = TRUE))
      ifelse(is.finite(val), val, NA_real_)
    }, numeric(1))
    gdf$week_fill <- week_fill
    gdf[, c(group_col, date_col, "week_fill"), drop = FALSE]
  })
  bind_rows(out)
}

impute_by_local_slot_week <- function(data, var_name, window_days = 3) {
  x <- data
  miss_before <- sum(is.na(x[[var_name]]))
  if (miss_before == 0) {
    return(list(data = x, summary = tibble(variable = var_name, n_missing_before = 0L, n_filled_primary = 0L, n_filled_fallback = 0L, n_missing_after = 0L)))
  }

  primary_daily <- x %>%
    filter(!is.na(.data[[var_name]]), !is.na(trip_date), !is.na(prev_stop_sched_hour)) %>%
    group_by(prev_stop_sched_hour, trip_date) %>%
    summarise(slot_day_median = median(.data[[var_name]], na.rm = TRUE), .groups = "drop")
  primary_fill <- build_week_fill_map(primary_daily, group_col = "prev_stop_sched_hour", value_col = "slot_day_median", window_days = window_days) %>%
    rename(fill_primary = week_fill)

  fallback_daily <- x %>%
    filter(!is.na(.data[[var_name]]), !is.na(trip_date), !is.na(prev_stop_sched_hour_group)) %>%
    group_by(prev_stop_sched_hour_group, trip_date) %>%
    summarise(slot_day_median = median(.data[[var_name]], na.rm = TRUE), .groups = "drop")
  fallback_fill <- build_week_fill_map(fallback_daily, group_col = "prev_stop_sched_hour_group", value_col = "slot_day_median", window_days = window_days) %>%
    rename(fill_fallback = week_fill)

  x <- x %>%
    left_join(primary_fill, by = c("prev_stop_sched_hour", "trip_date")) %>%
    left_join(fallback_fill, by = c("prev_stop_sched_hour_group", "trip_date"))

  was_missing      <- is.na(x[[var_name]])
  fill_primary_idx  <- was_missing & !is.na(x$fill_primary)
  fill_fallback_idx <- was_missing & is.na(x$fill_primary) & !is.na(x$fill_fallback)

  x[[var_name]][fill_primary_idx]  <- x$fill_primary[fill_primary_idx]
  x[[var_name]][fill_fallback_idx] <- x$fill_fallback[fill_fallback_idx]
  x <- x %>% select(-fill_primary, -fill_fallback)

  list(
    data = x,
    summary = tibble(
      variable          = var_name,
      n_missing_before  = miss_before,
      n_filled_primary  = sum(fill_primary_idx, na.rm = TRUE),
      n_filled_fallback = sum(fill_fallback_idx, na.rm = TRUE),
      n_missing_after   = sum(is.na(x[[var_name]]))
    )
  )
}

impute_vars <- intersect(
  c("env_temperature_c", "env_wind_speed_mps", "env_visibility_m",
    "env_precipitation_mm", "env_narrows_current_residual",
    "env_battery_water_level_obs", "env_narrows_current_obs"),
  names(df)
)

impute_summary_list <- list()
df_imp <- df
for (v in impute_vars) {
  res <- impute_by_local_slot_week(df_imp, v, window_days = week_window_days)
  df_imp <- res$data
  impute_summary_list[[length(impute_summary_list) + 1]] <- res$summary
}
imputation_summary <- bind_rows(impute_summary_list)
Code
imputation_summary %>%
  mutate(
    pct_missing_before = round(n_missing_before / nrow(df) * 100, 2),
    pct_missing_after  = round(n_missing_after  / nrow(df_imp) * 100, 2)
  ) %>%
  kable(caption = "Summary of Environmental Variable Imputation", digits = 2)
Summary of Environmental Variable Imputation
variable n_missing_before n_filled_primary n_filled_fallback n_missing_after pct_missing_before pct_missing_after
env_temperature_c 482 0 283 199 0.08 0.03
env_wind_speed_mps 1116 0 917 199 0.18 0.03
env_visibility_m 567 0 368 199 0.09 0.03
env_precipitation_mm 373 0 174 199 0.06 0.03
env_narrows_current_residual 52071 0 48636 3435 8.49 0.56
env_battery_water_level_obs 373 0 174 199 0.06 0.03
env_narrows_current_obs 52071 0 48636 3435 8.49 0.56

3.1.4.3 Modeling Utilities

A small set of reusable helper functions was created to prepare model inputs, fit linear models, extract model metrics, and generate standardized coefficient summaries.

Code
prepare_model_df <- function(data, outcome, predictors) {
  data %>% select(all_of(c(outcome, predictors))) %>% drop_na()
}

fit_linear <- function(data, model_name, outcome, predictors) {
  d   <- prepare_model_df(data, outcome, predictors)
  fml <- as.formula(paste(outcome, "~", paste(predictors, collapse = " + ")))
  fit <- lm(fml, data = d)
  pred <- predict(fit, newdata = d)
  rmse <- sqrt(mean((d[[outcome]] - pred)^2))
  g    <- glance(fit)

  list(
    metrics = tibble(
      model_name  = model_name, model_type = "linear",
      n = nrow(d), aic = g$AIC, bic = g$BIC,
      rsq = g$r.squared, adj_rsq = g$adj.r.squared, rmse = rmse,
      pseudo_rsq = NA_real_, accuracy = NA_real_,
      sensitivity = NA_real_, specificity = NA_real_, auc = NA_real_
    ),
    coefs        = tidy(fit) %>% mutate(model_name = model_name, model_type = "linear"),
    formula_text = paste(deparse(fml), collapse = "")
  )
}

standardize_for_lm <- function(data, outcome, predictors) {
  d       <- data %>% select(all_of(c(outcome, predictors))) %>% drop_na()
  num_vars <- names(d)[sapply(d, is.numeric)]
  d %>% mutate(across(all_of(num_vars), ~ as.numeric(scale(.x))))
}

fit_standardized_linear <- function(data, model_name, outcome, predictors) {
  d_std <- standardize_for_lm(data, outcome, predictors)
  fml   <- as.formula(paste(outcome, "~", paste(predictors, collapse = " + ")))
  fit   <- lm(fml, data = d_std)
  broom::tidy(fit) %>%
    filter(term != "(Intercept)") %>%
    mutate(model_name = model_name, abs_estimate = abs(estimate)) %>%
    arrange(desc(abs_estimate))
}

3.1.5 Variable Screening and Final Specification

This section explains how the final associative-model specification was selected from the broader predictor pool. The goal was not to include every available variable, but to keep interpretable coverage across the five explanatory domains while reducing redundancy among closely related predictors.

3.1.5.1 Initial Candidate Variables

The initial predictor pool was intentionally broad so that each explanatory domain was represented. Candidate variables included route and stop structure, environmental conditions, temporal indicators, operational propagation terms, and vessel/load-related characteristics. This broader pool provided the starting point for a smaller, more interpretable final specification.

Code
domain_a_predictors <- intersect(c(
  "route_id", "direction", "stop_sequence", "is_terminal_stop",
  "prev_stop_shared_stop_flag", "curr_stop_shared_stop_flag"
), names(df_imp))

domain_b_predictors <- intersect(c(
  "env_temperature_c", "env_wind_speed_mps", "env_visibility_m",
  "env_precipitation_mm", "env_narrows_current_residual",
  "env_battery_water_level_obs", "env_narrows_current_obs"
), names(df_imp))

domain_c_predictors <- intersect(c(
  "prev_stop_sched_hour", "schedule_season", "schedule_daytype"
), names(df_imp))

domain_d_predictors <- intersect(c(
  "inherited_prev_trip_delay", "carried_within_trip_delay"
), names(df_imp))

domain_e_predictors <- intersect(c(
  "vessel_capacity", "vessel_trip_seq_in_chain",
  "prev_stop_boardings", "prev_stop_alightings",
  "prev_stop_onboard", "prev_stop_load_factor"
), names(df_imp))

final_predictors <- unique(c(
  domain_a_predictors, domain_b_predictors,
  domain_c_predictors, domain_d_predictors, domain_e_predictors
))

3.1.5.2 Redundancy and Multicollinearity Checks

Because several candidate predictors measured closely related aspects of stop position, ridership, and operating context, we reviewed pairwise correlations among numeric predictors before fitting the final associative model. The goal was not to eliminate all correlation, but to identify clusters of variables that conveyed largely overlapping information and would make coefficient interpretation less clear.

Code
numeric_candidate_predictors <- intersect(
  candidate_predictors,
  names(df_imp)[sapply(df_imp, is.numeric)]
)

corr_vars   <- df_imp %>%
  select(all_of(numeric_candidate_predictors)) %>%
  select(where(~ !all(is.na(.))))

corr_matrix <- cor(corr_vars, use = "pairwise.complete.obs")

The figure below highlights selected variables reviewed during specification design. These variables are related to stop position and previous-stop load conditions, so they were checked carefully before finalizing the model.

Code
focus_vars <- c("stop_sequence", "prev_stop_sequence", "prev_stop_onboard", "prev_stop_load_factor")
focus_corr <- corr_matrix[focus_vars, focus_vars]

as.data.frame(as.table(focus_corr)) %>%
  rename(var1 = Var1, var2 = Var2, corr = Freq) %>%
  mutate(label = ifelse(var1 == var2, "", sprintf("%.2f", corr))) %>%
  ggplot(aes(x = var1, y = var2, fill = corr)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = label), size = 4) +
  scale_fill_gradient2(low = "#1A9988", mid = "white", high = "#EB5600", midpoint = 0, limits = c(-1, 1)) +
  coord_fixed() +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(size = 13, face = "bold", hjust = 0.5)
  ) +
  labs(title = "Correlation Among Selected Candidate Predictors", x = NULL, y = NULL, fill = "Correlation")

To support specification decisions more systematically, we also identified highly correlated numeric predictor pairs using an absolute correlation threshold of 0.7.

Code
as.data.frame(as.table(corr_matrix)) %>%
  rename(var1 = Var1, var2 = Var2, corr = Freq) %>%
  filter(var1 != var2) %>%
  mutate(abs_corr = abs(corr)) %>%
  filter(abs_corr >= 0.7) %>%
  rowwise() %>%
  mutate(pair_id = paste(sort(c(var1, var2)), collapse = " || ")) %>%
  ungroup() %>%
  distinct(pair_id, .keep_all = TRUE) %>%
  arrange(desc(abs_corr)) %>%
  select(var1, var2, corr, abs_corr) %>%
  head(10) %>%
  kable(caption = "Highly correlated candidate predictor pairs (|r| >= 0.7)")
Highly correlated candidate predictor pairs (|r| >= 0.7)
var1 var2 corr abs_corr
prev_stop_sequence stop_sequence 1.0000000 1.0000000
prev_stop_load_factor prev_stop_onboard 0.8965757 0.8965757

3.1.5.3 Final Variable Set

The final variable set was selected to preserve coverage across the five explanatory domains while avoiding unnecessary duplication. When multiple variables represented similar information, we prioritized predictors that were easier to interpret operationally and more directly connected to the domain being represented.

After reviewing correlation patterns, we removed redundant candidate variables and retained a reduced final specification for the combined associative model.

Code
final_predictors_trimmed <- setdiff(
  final_predictors,
  c("prev_stop_sequence", "prev_stop_sched_hour_group")
)

tibble(
  model        = "Final combined associative model",
  n_predictors = length(final_predictors_trimmed)
) %>% kable(caption = "Predictor count in the final combined associative specification")
Predictor count in the final combined associative specification
model n_predictors
Final combined associative model 24
Code
fit_vif_model <- function(data, outcome, predictors) {
  model_df <- data %>% select(all_of(unique(c(outcome, predictors)))) %>% drop_na()
  lm(as.formula(paste(outcome, "~", paste(predictors, collapse = " + "))), data = model_df)
}

extract_vif_table <- function(model, model_name) {
  vif_vals <- car::vif(model)
  if (is.matrix(vif_vals)) {
    out <- as.data.frame(vif_vals)
    out$term <- rownames(out)
    rownames(out) <- NULL
    out$model <- model_name
    out %>% rename(gvif = GVIF) %>%
      mutate(adjusted_vif = gvif^(1 / (2 * Df))) %>%
      select(model, term, Df, gvif, adjusted_vif)
  } else {
    tibble(model = model_name, term = names(vif_vals), vif = as.numeric(vif_vals)) %>%
      arrange(desc(vif))
  }
}

vif_model_final <- fit_vif_model(df_imp, outcome_cont, final_predictors_trimmed)
vif_table_final <- extract_vif_table(vif_model_final, "Final combined associative model")

We then checked variance inflation factors for the final combined specification. For multi-level categorical predictors, we report the adjusted GVIF, which is more comparable across terms with different degrees of freedom.

Code
if ("adjusted_vif" %in% names(vif_table_final)) {
  vif_table_final %>%
    arrange(desc(adjusted_vif)) %>%
    head(10) %>%
    mutate(gvif = round(gvif, 2), adjusted_vif = round(adjusted_vif, 2)) %>%
    kable(caption = "Top VIF results for the final combined associative specification")
} else {
  vif_table_final %>%
    arrange(desc(vif)) %>%
    head(10) %>%
    mutate(vif = round(vif, 2)) %>%
    kable(caption = "Top VIF results for the final combined associative specification")
}
Top VIF results for the final combined associative specification
model term Df gvif adjusted_vif
Final combined associative model prev_stop_onboard 1 11.55 3.40
Final combined associative model prev_stop_load_factor 1 11.30 3.36
Final combined associative model vessel_capacity 1 4.37 2.09
Final combined associative model env_temperature_c 1 3.73 1.93
Final combined associative model stop_sequence 1 2.29 1.51
Final combined associative model prev_stop_boardings 1 2.22 1.49
Final combined associative model is_terminal_stop 1 1.77 1.33
Final combined associative model prev_stop_alightings 1 1.54 1.24
Final combined associative model env_narrows_current_residual 1 1.54 1.24
Final combined associative model env_narrows_current_obs 1 1.51 1.23

3.1.6 Estimation Strategy

The associative model was estimated in two steps. First, we fit separate domain-specific linear models to summarize how much stop-level lateness was associated with each explanatory domain on its own. Second, we fit a combined associative model that included predictors from all five domains.

To make model comparisons more consistent, all models were estimated on the same analytic sample used for the final combined specification.

3.1.6.1 Domain-Specific Models

We estimated separate linear models for the five explanatory domains: structural context, environmental conditions, temporal patterns, operational propagation, and vessel/load characteristics. Each model used the same dependent variable: stop-level arrival lateness in minutes.

These models were not treated as competing final models. Instead, they were used to compare the relative explanatory signal of each domain when considered separately.

Code
df_model <- df_imp %>%
  drop_na(all_of(c(outcome_cont, final_predictors_trimmed)))

model_results <- list(
  domain_a_structural   = fit_linear(df_model, "domain_a_structural",   outcome_cont, domain_a_predictors),
  domain_b_environment  = fit_linear(df_model, "domain_b_environment",  outcome_cont, domain_b_predictors),
  domain_c_temporal     = fit_linear(df_model, "domain_c_temporal",     outcome_cont, domain_c_predictors),
  domain_d_operational  = fit_linear(df_model, "domain_d_operational",  outcome_cont, domain_d_predictors),
  domain_e_vessel       = fit_linear(df_model, "domain_e_vessel",       outcome_cont, domain_e_predictors),
  combined_associative  = fit_linear(df_model, "combined_associative",  outcome_cont, final_predictors_trimmed)
)

3.1.6.2 Combined Associative Model

The combined associative model uses the final predictor set across all five domains. This model serves as the main explanatory specification. It allows us to interpret individual coefficients while also discussing how different types of factors contribute to observed delay patterns within one framework.

3.1.6.3 Comparison Logic

Model comparison focused on interpretation rather than predictive model selection. We compared adjusted R-squared and RMSE across the domain-specific models and the combined model. The domain-specific models show how much delay variation is associated with each factor group on its own, while the combined model shows the explanatory power of the full specification.

3.1.7 Results: Explanatory Contribution by Domain

This section compares how much stop-level delay variation is explained by each explanatory domain. The goal is not to choose a predictive model, but to understand which types of factors are most strongly associated with observed lateness.

The domain-specific models show a clear hierarchy. Operational propagation explains the largest share of delay variation, followed by structural context. Environmental conditions, temporal patterns, and vessel/load characteristics explain smaller shares on their own, but they remain useful for interpretation because they describe monitorable operating conditions.

Code
model_comparison_summary <- bind_rows(lapply(model_results, `[[`, "metrics")) %>%
  mutate(
    domain = dplyr::recode(
      model_name,
      combined_associative  = "Combined Associative Model",
      domain_a_structural   = "Domain A: Structural Context",
      domain_b_environment  = "Domain B: Environmental Conditions",
      domain_c_temporal     = "Domain C: Temporal Patterns",
      domain_d_operational  = "Domain D: Operational Propagation",
      domain_e_vessel       = "Domain E: Vessel / Load Characteristics"
    )
  ) %>%
  arrange(match(domain, c(
    "Domain A: Structural Context", "Domain B: Environmental Conditions",
    "Domain C: Temporal Patterns",  "Domain D: Operational Propagation",
    "Domain E: Vessel / Load Characteristics", "Combined Associative Model"
  )))

model_comparison_summary %>%
  select(domain, n, rsq, adj_rsq, rmse, aic, bic) %>%
  mutate(across(c(rsq, adj_rsq, rmse), ~ round(.x, 4)),
         across(c(aic, bic), ~ round(.x, 0))) %>%
  kable(caption = "Explained Variation by Domain and Final Combined Associative Model")
Explained Variation by Domain and Final Combined Associative Model
domain n rsq adj_rsq rmse aic bic
Domain A: Structural Context 592720 0.1342 0.1342 3.6334 3211498 3211656
Domain B: Environmental Conditions 592720 0.0169 0.0169 3.8718 3286834 3286936
Domain C: Temporal Patterns 592720 0.0129 0.0129 3.8796 3289224 3289314
Domain D: Operational Propagation 592720 0.5865 0.5865 2.5109 2773416 2773461
Domain E: Vessel / Load Characteristics 592720 0.0783 0.0783 3.7489 3248611 3248701
Combined Associative Model 592720 0.7413 0.7412 1.9863 2495659 2496054
Code
model_comparison_summary %>%
  filter(domain != "Combined Associative Model") %>%
  mutate(domain = factor(domain, levels = domain), adj_rsq = round(adj_rsq, 4)) %>%
  ggplot(aes(x = domain, y = adj_rsq)) +
  geom_col(width = 0.65) +
  geom_text(aes(label = round(adj_rsq, 3)), vjust = -0.3, size = 3.5) +
  labs(title = "Explanatory Contribution by Domain", x = NULL, y = "Adjusted R-squared") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1), plot.title = element_text(face = "bold"))

3.1.7.1 Structural Context

The structural-context model explains a meaningful share of stop-level delay variation on its own. This indicates that route and stop configuration establish an important baseline pattern of delay across the system. In substantive terms, some portions of the ferry network appear structurally more delay-prone than others even before environmental or operational carry-forward effects are considered.

3.1.7.2 Environmental Conditions

The environmental model explains a smaller share of total variation on its own. This does not mean environmental signals are unimportant. Instead, it suggests that weather and water conditions are secondary to propagated delay and structural context, while still offering useful monitorable signals for operational awareness.

3.1.7.3 Temporal Patterns

The temporal model explains a modest share of variation, indicating that day type, scheduled hour, and season capture recurring service rhythms but do not dominate the delay process on their own. These results suggest that delay varies systematically across time, but those temporal patterns are weaker than operational propagation and structural context.

3.1.7.4 Operational Propagation

The operational-propagation model explains the largest share of variation among the individual domains. This confirms that delay accumulation within and across trips is the strongest single explanatory component in the associative model. In other words, stop-level delay is highly path-dependent: once lateness has entered the operating chain, a substantial portion of subsequent delay can be understood as propagated rather than newly generated at each stop.

3.1.7.5 Vessel / Load Characteristics

The vessel/load model explains the smallest share of variation among the five domains. This suggests that vessel capacity and load-related differences are present but relatively limited when considered in isolation. These factors are still useful as operational context, but they are secondary to route structure, external conditions, and propagated delay in explaining stop-level lateness.

3.1.8 Focused Interpretation of Key Predictors

The previous section compared explanatory contribution across domains. This section focuses on selected predictors used for coefficient interpretation and final presentation findings. These results are intended to make the model findings easier to interpret operationally, not to replace the full combined associative model.

3.1.8.1 Focused Interpretive Specification

In addition to the full combined associative model, we estimated a smaller interpretive specification using predictors selected for coefficient interpretation. This focused model includes prior delay, route context, seasonal and day-type patterns, time bins, selected environmental conditions, and stop sequence.

This model is used to support coefficient interpretation in the final presentation. It should be read as a simplified interpretive view of the larger associative framework.

Code
selected_predictors <- intersect(c(
  "inherited_prev_trip_delay", "route_id", "schedule_season",
  "prev_stop_sched_hour_group", "schedule_daytype",
  "env_temperature_c", "env_wind_speed_mps", "env_precipitation_mm",
  "env_battery_water_level_obs", "stop_sequence"
), names(df_model))

selected_model_df <- prepare_model_df(df_model, outcome_cont, selected_predictors)

selected_formula <- as.formula(paste(
  outcome_cont, "~",
  paste(selected_predictors, collapse = " + "),
  "+ prev_stop_sched_hour_group:schedule_daytype"
))

selected_fit <- lm(selected_formula, data = selected_model_df)

broom::glance(selected_fit) %>%
  transmute(
    model   = "Focused interpretive specification",
    n       = nobs(selected_fit),
    rsq     = round(r.squared, 4),
    adj_rsq = round(adj.r.squared, 4),
    rmse    = round(sqrt(mean(residuals(selected_fit)^2)), 4),
    aic     = round(AIC(selected_fit), 0),
    bic     = round(BIC(selected_fit), 0)
  ) %>%
  kable(caption = "Focused Interpretive Specification Fit")
Focused Interpretive Specification Fit
model n rsq adj_rsq rmse aic bic
Focused interpretive specification 592720 0.4112 0.4112 2.9963 2983010 2983326

3.1.8.2 Raw Coefficient Interpretation

To support operational interpretation, we report raw coefficients from the focused interpretive specification. These coefficients show the estimated change in arrival lateness, in minutes, associated with each predictor while holding other variables in the model constant.

For numeric predictors, the coefficient is interpreted per one-unit increase. For categorical predictors, coefficients are interpreted relative to the reference category.

Code
broom::tidy(selected_fit) %>%
  filter(term != "(Intercept)") %>%
  mutate(abs_estimate = abs(estimate)) %>%
  arrange(desc(abs_estimate)) %>%
  select(term, estimate, std.error, statistic, p.value) %>%
  mutate(across(c(estimate, std.error), ~ round(.x, 4)),
         statistic = round(statistic, 2),
         p.value   = round(p.value, 4)) %>%
  kable(caption = "Raw Coefficients for Focused Interpretive Specification")
Raw Coefficients for Focused Interpretive Specification
term estimate std.error statistic p.value
route_idRW 3.2060 0.0197 162.75 0.0000
route_idGI 1.9491 0.0762 25.58 0.0000
route_idRR -1.5515 0.2317 -6.70 0.0000
route_idSG 0.9711 0.0167 58.17 0.0000
inherited_prev_trip_delay 0.8122 0.0014 561.44 0.0000
route_idSV 0.8064 0.0132 61.07 0.0000
prev_stop_sched_hour_group5_8:schedule_daytypeWEEKEND -0.5679 0.0451 -12.59 0.0000
stop_sequence 0.5202 0.0025 208.73 0.0000
prev_stop_sched_hour_group15_17 0.4389 0.0155 28.26 0.0000
prev_stop_sched_hour_group5_8 0.4243 0.0156 27.17 0.0000
prev_stop_sched_hour_group9_11 -0.2685 0.0156 -17.24 0.0000
route_idER -0.2424 0.0112 -21.65 0.0000
schedule_seasonSUMMER_1 -0.2001 0.0156 -12.83 0.0000
schedule_seasonSUMMER_2 0.1921 0.0166 11.54 0.0000
prev_stop_sched_hour_group18_22 -0.1628 0.0152 -10.71 0.0000
env_battery_water_level_obs -0.1483 0.0080 -18.61 0.0000
prev_stop_sched_hour_group18_22:schedule_daytypeWEEKEND -0.1227 0.0252 -4.88 0.0000
prev_stop_sched_hour_group9_11:schedule_daytypeWEEKEND 0.0883 0.0265 3.34 0.0009
schedule_seasonSPRING -0.0846 0.0152 -5.56 0.0000
prev_stop_sched_hour_group15_17:schedule_daytypeWEEKEND 0.0525 0.0255 2.06 0.0397
env_temperature_c 0.0284 0.0008 36.36 0.0000
route_idSB -0.0190 0.0127 -1.50 0.1334
env_wind_speed_mps 0.0180 0.0017 10.65 0.0000
env_precipitation_mm -0.0157 0.0023 -6.83 0.0000
schedule_daytypeWEEKEND -0.0123 0.0184 -0.67 0.5026
schedule_seasonWINTER -0.0016 0.0169 -0.10 0.9231

3.1.8.3 Supporting Exploratory Patterns

The following exploratory plots provide visual context for the coefficient interpretation. They show how delay accumulates, how lateness varies across time bins and day type, and how selected environmental and route-level patterns relate to average lateness.

Code
tibble::tibble(
  component = factor(
    c("Inherited previous delay", "Carried within-trip delay", "New delay at current stop"),
    levels = c("Inherited previous delay", "Carried within-trip delay", "New delay at current stop")
  ),
  minutes = c(6.24, 1.85, 1.68)
) %>%
  ggplot(aes(x = "Excessive delay example", y = minutes, fill = component)) +
  geom_col(width = 0.45) +
  geom_text(aes(label = paste0(minutes, " min")),
            position = position_stack(vjust = 0.5),
            color = "white", fontface = "bold", size = 4) +
  scale_fill_manual(values = c(
    "Inherited previous delay"   = "#EB5600",
    "Carried within-trip delay"  = "#1A9988",
    "New delay at current stop"  = "#C7C7C7"
  )) +
  labs(title    = "Where Does Excessive Delay Come From?",
       subtitle = "Most excessive delay is inherited or carried forward, not newly created at the current stop",
       x = NULL, y = "Minutes of delay", fill = NULL) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        panel.grid.major.x = element_blank(), panel.grid.minor = element_blank(),
        legend.position = "bottom", plot.title = element_text(face = "bold"))

Code
df_model %>%
  group_by(route_id, stop_sequence) %>%
  summarise(avg_lateness = mean(.data[[outcome_cont]], na.rm = TRUE), n = n(), .groups = "drop") %>%
  filter(n >= 100) %>%
  ggplot(aes(x = stop_sequence, y = avg_lateness, color = route_id)) +
  geom_line(linewidth = 0.9) +
  geom_point(size = 1.8) +
  labs(title    = "Delay Accumulation Along Route",
       subtitle = "Average lateness tends to change as trips move through stop sequence",
       x = "Stop sequence", y = "Average lateness (minutes)", color = "Route") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"), legend.position = "bottom")

Code
lateness_col <- c(outcome_cont, "arr_lateness", "arr_late_min", "arrival_lateness",
                  "delay_min", "lateness")[
  c(outcome_cont, "arr_lateness", "arr_late_min", "arrival_lateness",
    "delay_min", "lateness") %in% names(df_model)][1]

if (is.na(lateness_col)) stop("Could not find the lateness column.")

time_delay_summary <- df_model %>%
  mutate(lateness_for_plot = .data[[lateness_col]]) %>%
  filter(!is.na(prev_stop_sched_hour_group), !is.na(schedule_daytype), !is.na(lateness_for_plot)) %>%
  group_by(prev_stop_sched_hour_group, schedule_daytype) %>%
  summarise(avg_lateness = mean(lateness_for_plot, na.rm = TRUE), n = n(), .groups = "drop") %>%
  mutate(prev_stop_sched_hour_group = factor(
    prev_stop_sched_hour_group, levels = c("5_8", "9_11", "12_14", "15_17", "18_22")
  ))
Code
ggplot(time_delay_summary, aes(x = prev_stop_sched_hour_group, y = avg_lateness, fill = schedule_daytype)) +
  geom_col(position = position_dodge(width = 0.75), width = 0.6) +
  scale_fill_manual(values = c("WEEKDAY" = "#EB5600", "WEEKEND" = "#1A9988")) +
  labs(title = "Average Lateness by Time Bin and Day Type",
       x = "Scheduled time bin", y = "Average lateness (minutes)", fill = "Day type") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"), legend.position = "bottom")

Code
temp_delay_summary <- df_model %>%
  filter(!is.na(env_temperature_c), !is.na(.data[[outcome_cont]])) %>%
  mutate(temp_bin = cut(env_temperature_c,
                        breaks = quantile(env_temperature_c, probs = seq(0, 1, 0.1), na.rm = TRUE),
                        include.lowest = TRUE)) %>%
  group_by(temp_bin) %>%
  summarise(avg_temperature = mean(env_temperature_c, na.rm = TRUE),
            avg_lateness    = mean(.data[[outcome_cont]], na.rm = TRUE),
            n = n(), .groups = "drop")
Code
ggplot(temp_delay_summary, aes(x = avg_temperature, y = avg_lateness)) +
  geom_point(color = "#1A9988", size = 2.5) +
  geom_smooth(method = "lm", se = FALSE, color = "#EB5600") +
  labs(title = "Temperature and Average Lateness",
       x = "Average temperature (°C)", y = "Average lateness (minutes)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Code
if ("expected_water_min" %in% names(df_model)) {
  df_model %>%
    group_by(route_id) %>%
    summarise(avg_delay              = mean(.data[[outcome_cont]], na.rm = TRUE),
              avg_expected_water_min = mean(expected_water_min, na.rm = TRUE),
              n = n(), .groups = "drop") %>%
    ggplot(aes(x = avg_expected_water_min, y = avg_delay, label = route_id)) +
    geom_point(size = 3) +
    geom_text(nudge_y = 0.15) +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = "Route Travel Time vs Average Lateness",
         x = "Average expected water travel time (minutes)\n(proxy for route length / exposure)",
         y = "Average lateness (minutes)") +
    theme_minimal()
}

3.1.8.4 Monitorable Signals Beyond Prior Delay

Although operational propagation explains the largest share of delay variation, smaller signals are still useful when they are monitorable. Environmental conditions, time patterns, and vessel/load context may not dominate the model, but they describe conditions that operators can check before or during service.

Code
domain_e_plot_predictors <- intersect(c(
  domain_e_predictors, "prev_stop_boardings", "prev_stop_alightings",
  "prev_stop_onboard", "prev_stop_load_factor"
), names(df_model))

bce_predictors       <- unique(c(domain_b_predictors, domain_c_predictors, domain_e_plot_predictors))
plot_model_predictors <- unique(c(domain_a_predictors, domain_b_predictors, domain_c_predictors,
                                   domain_d_predictors, domain_e_plot_predictors))
numeric_predictors   <- plot_model_predictors[sapply(df_model[plot_model_predictors], is.numeric)]

df_model_std <- df_model %>%
  mutate(across(all_of(numeric_predictors), ~ as.numeric(scale(.x))))

combined_std_fit <- lm(
  as.formula(paste(outcome_cont, "~", paste(plot_model_predictors, collapse = " + "))),
  data = df_model_std
)

label_lookup <- c(
  env_temperature_c              = "Temperature",
  env_wind_speed_mps             = "Wind speed",
  env_visibility_m               = "Visibility",
  env_precipitation_mm           = "Precipitation",
  env_battery_water_level_obs    = "Water level",
  env_battery_water_level_anomaly = "Water-level anomaly",
  env_narrows_current_obs        = "Current",
  env_narrows_current_residual   = "Current residual",
  prev_stop_sched_hour           = "Previous scheduled hour",
  prev_stop_sched_hour_group     = "Previous scheduled hour group",
  schedule_season                = "Schedule season",
  schedule_daytype               = "Schedule day type",
  vessel_capacity                = "Vessel capacity",
  vessel_trip_seq_in_chain       = "Vessel trip sequence",
  prev_stop_boardings            = "Boardings",
  prev_stop_alightings           = "Alightings",
  prev_stop_onboard              = "Onboard count",
  prev_stop_load_factor          = "Load factor"
)

get_base_var <- function(term, vars) {
  hits <- vars[vapply(vars, function(v) startsWith(term, v), logical(1))]
  if (length(hits) == 0) return(NA_character_)
  hits[which.max(nchar(hits))]
}

coef_plot_df <- broom::tidy(combined_std_fit) %>%
  filter(term != "(Intercept)") %>%
  rowwise() %>%
  mutate(variable = get_base_var(term, bce_predictors)) %>%
  ungroup() %>%
  filter(!is.na(variable)) %>%
  mutate(
    domain = case_when(
      variable %in% domain_b_predictors       ~ "Environmental Conditions",
      variable %in% domain_c_predictors       ~ "Temporal Patterns",
      variable %in% domain_e_plot_predictors  ~ "Demand and Vessel Context",
      TRUE ~ "Other"
    )
  ) %>%
  filter(domain != "Other",
         !str_detect(term, "^schedule_season"),
         !variable %in% c("prev_stop_load_factor", "prev_stop_onboard", "vessel_trip_seq_in_chain")) %>%
  mutate(
    clean_label = if_else(
      variable %in% names(label_lookup),
      unname(label_lookup[variable]),
      variable
    ),
    clean_label = case_when(
      variable == "schedule_daytype"           ~ str_replace(term, variable, "Schedule day type "),
      variable == "prev_stop_sched_hour_group" ~ str_replace(term, variable, "Previous scheduled hour "),
      TRUE ~ clean_label
    ),
    abs_estimate = abs(estimate)
  )

coef_plot_top <- bind_rows(
  coef_plot_df %>% filter(domain == "Environmental Conditions") %>% slice_max(abs_estimate, n = 4, with_ties = FALSE),
  coef_plot_df %>% filter(domain == "Temporal Patterns")        %>% slice_max(abs_estimate, n = 3, with_ties = FALSE),
  coef_plot_df %>% filter(domain == "Demand and Vessel Context") %>% slice_max(abs_estimate, n = 4, with_ties = FALSE)
) %>%
  distinct(term, .keep_all = TRUE) %>%
  arrange(abs_estimate)

ggplot(coef_plot_top, aes(x = reorder(clean_label, abs_estimate), y = abs_estimate, fill = domain)) +
  geom_col(width = 0.72) +
  coord_flip() +
  scale_fill_manual(values = c(
    "Environmental Conditions"  = "#1A9988",
    "Temporal Patterns"         = "#7A7A7A",
    "Demand and Vessel Context" = "#EB5600"
  )) +
  labs(title = "Monitorable Signals Beyond Prior Delay",
       x = NULL, y = "Absolute standardized coefficient", fill = NULL) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom", panel.grid.major.y = element_blank(),
        plot.title = element_text(face = "bold"))

3.1.8.5 Operational Interpretation

The focused coefficient results and supporting plots reinforce three operational takeaways.

First, prior delay remains the strongest signal. This means delay management needs to account for accumulation across stops and linked trips, not only conditions at the current stop.

Second, route and stop structure matter. Some parts of the system appear to have a less favorable baseline, meaning they may be more vulnerable to delay even before short-run conditions are considered.

Third, weather, water, time, and vessel/load patterns are smaller signals, but they are monitorable. They can help operators understand when conditions may be less favorable and where extra attention may be useful.

3.2 Predictive Model

By: Guangze ‘Simon’ Sun

3.2.1 Feature Table Construction

The prediction feature table consolidates schedule structure, stop context, lagged demand, vessel traffic (AIS), and environmental conditions into a single modeling dataset. Each row represents one stop-level trip observation. The table below lists the required predictors; the build chunk is set to eval: false since it writes output files and depends on the full source parquet.

Code
source_parquet <- "../exploratory_code/modeling/baseline_prediction_table_stop_level_v2.parquet"
out_csv     <- "../exploratory_code/modeling/final_predictive_feature_table.csv"
out_parquet <- "../exploratory_code/modeling/final_predictive_feature_table.parquet"
Code
feature_table <- arrow::read_parquet(source_parquet) %>% as_tibble()

required_predictors <- c(
  "alightings_lag_7d", "boardings_lag_7d", "curr_stop_shared_stop_flag",
  "curr_vessel_count_lag_7d", "direction", "expected_water_min",
  "is_terminal_stop", "planned_turnaround_gap_min", "prev_stop_sched_hour",
  "prev_stop_shared_stop_flag", "prev_vessel_count_lag_7d", "schedule_daytype",
  "schedule_season", "scheduled_headway_min", "stop_sequence", "vessel_capacity",
  "vessel_trip_seq_in_chain", "env_battery_water_level_obs", "env_narrows_current_obs",
  "env_precipitation_mm", "env_temperature_c", "env_visibility_m",
  "env_wind_speed_mps", "from_stop_id", "to_stop_id", "segment_hourly_vessel_count_all"
)

required_columns <- c("trip_date", "lateness_vs_original_schedule_min", required_predictors)
missing_columns  <- setdiff(required_columns, names(feature_table))
if (length(missing_columns) > 0) stop("Missing required columns: ", paste(missing_columns, collapse = ", "))
stopifnot(nrow(feature_table) > 0)
stopifnot(anyDuplicated(names(feature_table)) == 0)
stopifnot("segment_hourly_vessel_count_all" %in% names(feature_table))
stopifnot("prev_stop_sched_hour" %in% names(feature_table))

readr::write_csv(feature_table, out_csv)
arrow::write_parquet(feature_table, out_parquet)
stopifnot(file.exists(out_csv), file.exists(out_parquet))

3.2.2 Baseline Model-Family Comparison

We used a chronological train / validation / test split. The most recent 30 days were held out as the test set, the previous 30 days were used as validation, and earlier observations were used for training. This matches the operational setting where future trips are predicted from historical data.

The baseline screen compared five model families: Logistic Regression, SVM, DNN, Random Forest, and XGBoost. SVM used a capped 20,000-row stratified training sample, so it should be interpreted as lightweight screening evidence. The baseline screen supported carrying forward Random Forest and XGBoost for focused finalist comparison.

Code
model_dir <- "../exploratory_code/modeling"

baseline_file   <- file.path(model_dir, "baseline_model_family_comparison.csv")
prediction_file <- file.path(model_dir, "final_rf_xgb_test_predictions.csv")
importance_file <- file.path(model_dir, "final_rf_top10_feature_importance.csv")

threshold_source_note <- "Test predictions were used for the recall >= 0.20 operating-point table because validation prediction probabilities were not available in the final delivery package."
rf_xgb_highlight_note <- "Yellow shading marks recall ranges where RF has higher precision than XGBoost and at least one model has precision >= 0.10."
Code
if (!file.exists(baseline_file)) {
  warning("Missing baseline comparison file: ", baseline_file)
} else {
  baseline_tbl <- read_csv(baseline_file, show_col_types = FALSE) %>%
    mutate(model_family = fct_reorder(model_family, validation_roc_auc))

  ggplot(baseline_tbl, aes(x = validation_roc_auc, y = model_family)) +
    geom_col(fill = "#1A9988", width = 0.68) +
    scale_x_continuous(labels = percent_format(accuracy = 1), limits = c(0, NA)) +
    labs(title = "Baseline Model-Family Screen", x = "Validation ROC-AUC", y = NULL) +
    theme_minimal(base_size = 15) +
    theme(plot.title = element_text(face = "bold", size = 18), panel.grid.minor = element_blank())
}

Validation ROC-AUC from the baseline screen. SVM used a capped training sample and is included as lightweight screening evidence.

3.2.3 Final RF vs XGBoost Comparison

After the baseline screen, we focused on Random Forest and XGBoost. Both model families can capture nonlinear relationships and interactions among schedule structure, stop context, lagged demand, vessel traffic, and environmental conditions.

The final tuning process used trees = 1500 as the practical central setting. Around that tree count, RF tuning mainly adjusted mtry and min_n. XGBoost tuning adjusted tree depth, min_n, learning rate, loss reduction, and feature sampling. The final comparison retained one RF operational candidate and one XGBoost benchmark candidate.

XGBoost remains a strong predictive benchmark, but the dashboard use case emphasizes conservative alerting rather than global ranking alone. In the higher-precision region, Random Forest provides a more useful recall balance, so we select RF as the operational model and retain XGBoost as the benchmark.

Code
if (!file.exists(prediction_file)) {
  warning("Missing final RF/XGB prediction file: ", prediction_file)
} else {
  pred_tbl <- read_csv(prediction_file, show_col_types = FALSE) %>%
    mutate(delay_gt_10 = factor(delay_gt_10, levels = c("no", "yes")))

  pr_tbl <- bind_rows(
    pred_tbl %>% transmute(model = "Final RF",  delay_gt_10, .pred_yes = rf_pred_yes)  %>%
      group_by(model) %>% pr_curve(truth = delay_gt_10, .pred_yes, event_level = "second"),
    pred_tbl %>% transmute(model = "Final XGB", delay_gt_10, .pred_yes = xgb_pred_yes) %>%
      group_by(model) %>% pr_curve(truth = delay_gt_10, .pred_yes, event_level = "second")
  )

  prep_curve <- function(dat, model_name) {
    dat %>%
      filter(model == model_name, is.finite(recall), is.finite(precision)) %>%
      group_by(recall) %>%
      summarise(precision = max(precision, na.rm = TRUE), .groups = "drop") %>%
      arrange(recall)
  }

  interp_precision <- function(curve, recall_grid) {
    approx(curve$recall, curve$precision, xout = recall_grid, rule = 2, ties = "ordered")$y
  }

  rf_curve  <- prep_curve(pr_tbl, "Final RF")
  xgb_curve <- prep_curve(pr_tbl, "Final XGB")
  max_shared_recall <- min(max(rf_curve$recall, na.rm = TRUE), max(xgb_curve$recall, na.rm = TRUE))
  recall_grid <- seq(0, max_shared_recall, length.out = 600)

  highlight_grid <- tibble(
    recall        = recall_grid,
    rf_precision  = interp_precision(rf_curve,  recall_grid),
    xgb_precision = interp_precision(xgb_curve, recall_grid)
  ) %>%
    mutate(
      high_precision_region = pmax(rf_precision, xgb_precision, na.rm = TRUE) >= 0.10,
      rf_above_xgb = rf_precision > xgb_precision,
      highlight    = rf_above_xgb & high_precision_region
    )

  if (!any(highlight_grid$highlight, na.rm = TRUE)) {
    highlight_grid <- highlight_grid %>% mutate(highlight = rf_above_xgb)
    rf_xgb_highlight_note <- "No RF-above-XGB intervals met the precision >= 0.10 rule, so the plot highlights all recall ranges where RF precision is above XGBoost."
  }

  highlight_intervals <- highlight_grid %>%
    mutate(run_id = cumsum(highlight != lag(highlight, default = first(highlight)))) %>%
    filter(highlight) %>%
    group_by(run_id) %>%
    summarise(xmin = min(recall), xmax = max(recall), .groups = "drop") %>%
    filter(xmax > xmin)

  ggplot(pr_tbl, aes(x = recall, y = precision, color = model)) +
    geom_rect(data = highlight_intervals,
              aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = 1),
              inherit.aes = FALSE, fill = "#F2E8CF", alpha = 0.55) +
    geom_path(linewidth = 1.15) +
    scale_color_manual(values = c("Final RF" = "#D95F59", "Final XGB" = "#1A9988")) +
    scale_x_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
    scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, NA)) +
    labs(title = "Final RF vs XGBoost Precision-Recall Curve",
         x = "Recall", y = "Precision", color = NULL) +
    theme_minimal(base_size = 15) +
    theme(plot.title = element_text(face = "bold", size = 18),
          panel.grid.minor = element_blank(), legend.position = "bottom")
}

Yellow shading marks recall ranges where RF has higher precision than XGBoost.

3.2.4 Final RF Feature Importance

We inspected feature importance from the final Random Forest model to understand what types of advance-available information the operational model relies on. This importance is model-based and descriptive; it is not a causal estimate.

The top features combine environmental / water-condition variables and recurring operational patterns.

Code
if (!file.exists(importance_file)) {
  warning("Missing RF feature importance file: ", importance_file)
} else {
  importance_tbl <- read_csv(importance_file, show_col_types = FALSE) %>%
    arrange(desc(importance)) %>%
    slice_head(n = 10) %>%
    mutate(
      display_label = dplyr::recode(
        feature,
        segment_hourly_vessel_count_all  = "Segment vessel count, 7-day lag",
        boardings_lag_7d                 = "Boardings, 7-day lag",
        alightings_lag_7d                = "Alightings, 7-day lag",
        curr_vessel_count_lag_7d         = "Current-stop vessel count, 7-day lag",
        prev_vessel_count_lag_7d         = "Previous-stop vessel count, 7-day lag",
        env_temperature_c                = "Temperature",
        env_battery_water_level_obs      = "Battery water level",
        env_narrows_current_obs          = "Narrows current",
        env_precipitation_mm             = "Precipitation",
        env_visibility_m                 = "Visibility",
        env_wind_speed_mps               = "Wind speed",
        expected_water_min               = "Expected water time",
        planned_turnaround_gap_min       = "Planned turnaround gap",
        prev_stop_sched_hour             = "Scheduled hour",
        .default = str_to_sentence(str_replace_all(feature, "_", " "))
      )
    )

  ggplot(importance_tbl, aes(x = importance, y = fct_reorder(display_label, importance))) +
    geom_col(fill = "#1A9988", width = 0.72) +
    labs(title = "Final RF Feature Importance", x = "Importance", y = NULL) +
    theme_minimal(base_size = 15) +
    theme(plot.title = element_text(face = "bold", size = 18), panel.grid.minor = element_blank())
}

The final RF relies on a mix of environmental conditions, water-current context, and recurring operational demand/vessel-pattern features.

4 Concluding Remarks

4.1 Takeaways

The analysis combined two complementary modeling approaches to understand and anticipate service delays in the NYC Ferry system. A stacked linear regression model — trained on stop-level observations across five feature domains — explained 74% of arrival lateness variation. The dominant driver was operational propagation: inherited delay from a prior trip alone accounted for 59% of explanatory power, with each minute of prior delay carrying forward 0.81 minutes, and each additional stop in sequence adding 0.52 minutes. Structural route context (notably the Rockaway route at +3.2 minutes above reference), temporal patterns (afternoon 15–17 windows, summer service periods), and environmental conditions (temperature, wind, water level) contributed smaller but monitorable signals.

The predictive component used a Random Forest classifier — selected over logistic regression, SVM, deep neural network, and XGBoost based on precision-oriented evaluation — to assign each trip an excessive-delay probability (defined as >10 minutes) up to 48 hours before departure, using only features available at that planning horizon. Same-day realized outcomes were excluded in favor of 7-day lag ridership and vessel counts alongside forecast-based environmental conditions. Predicted probabilities were mapped to low, medium, and high risk bands to support operator decision-making around vessel allocation and landing staffing.

Key limitations include class imbalance in training data, a training window restricted to January 2024 through June 2025, and dependency on the pre-December 2025 route structure.

4.2 Future Work

A major area for future work is operationalizing the data pipeline. In its current form, the project processes data from large CSVs using R scripts, which constrained both collaboration and the volume of data that could be ingested — limiting the training window to January 2024 through June 2025. Moving data into a database accessible via an API would enable a proper machine learning pipeline, allowing for faster model iteration and likely improved performance.

This infrastructure improvement would also directly benefit the web application. Currently, the app relies on a static precomputation of all possible Random Forest outputs, a workaround made necessary by treating temperature and precipitation as discrete bins rather than continuous values. It also substitutes historical lag variables from 2024–2025 in place of true 7-day rolling counts. A live data pipeline would resolve both limitations and, critically, allow new trip data to be incorporated into model training on an ongoing basis — particularly important given the December 2025 system redesign, which rendered the existing training data partially obsolete.

Finally, there is meaningful untapped signal in the captain notes recorded in the trip data. Exploratory analysis indicated that coast guard restrictions and landing closures measurably affect delays, but no structured ground-truth labels currently exist for these events. Combining natural language analysis of captain notes with operational variables such as marine traffic density could unlock a richer understanding of delay causes that the current modeling framework is unable to capture.

5 Acknowledgements

Thank you to Rebekah Adams of the NYC Economic Development Corporation — our client representative and MUSA alum — for helping put this project together and hosting us at the EDC headquarters. Her support was instrumental in shaping the project’s direction. Thank you as well to the entire NYC Ferry team for providing invaluable operational insights into the system. We all came away with a much deeper understanding of how the ferry network functions day to day.

Finally, thank you to Professor Michael Fichman and Matt Harris of the MUSA Program for guiding us through the practicum course. Their candid feedback helped us iterate on this project throughout the semester, culminating in a final presentation we felt proud of.

6 Code Appendix

Through this project process, we wrote a lot of code that is not meaningful toward our final results and biggest takeaways. This code lives in various markdown files in our Github Repo.