Analyzing and Predicting Delays for the NYC Ferry System

2026 MUSA Practicum Project Supporting the NYC Economic Development Corportation

Author

Yiming ‘Ming’ Cao, Sujan Kakumanu, Tyler Maynard, Guangze ‘Simon’ Sun

Code

# Set options
options(scipen = 999, dplyr.summarise.inform = FALSE)

# Packages
if(!require(pacman)){install.packages("pacman"); library(pacman, quietly = T)}
p_load(tidyverse, sf, viridis, lubridate, kableExtra, gridExtra, ggthemes, ggspatial, leaflet, spatstat, spdep, htmlwidgets, concaveman, patchwork, yardstick, arrow, broom, car, scales)

1 Use Case

NYC Ferry launched on May 1st, 2017 with a broad goal: connect underserved communities, complement existing transit, and support growing neighborhoods. That mission is reflected in how the system is designed — single-seat rides are possible between communities as far apart as the Bronx and the Rockaways. Since launch, the system has crossed 50 million riders and today spans 5 routes operated by 38 custom-built vessels across New York Harbor.

This project was done in partnership with the NYC Economic Development Corporation, the operator of NYC Ferry, to support on-time performance through better operational forecasting. The use case centers on an operational manager who needs to allocate resources — upland agents, sweep boats, vessel assignments — ahead of anticipated delays.
This breaks down into two key deliverables. First, an associative model exploring the historical contributing factors to delays between landings. Second, a predictive model that estimates the probability of an excessive delay (>10 minutes) between landings up to 48 hours ahead of a scheduled departure. The latter feeds into a web-based application designed for operational use.

2 Exploratory Data Analysis

2.1 Ridership and Trip Data

By: Guangze ‘Simon’ Sun and Yiming ‘Ming’ Cao

Definition of Delay

Arrival Time Variance (minutes): Actual Arrival − Scheduled Arrival

Thresholds:

Delayed: > 5 minutes
Excessively delayed: > 10 minutes

Rationale: Based on NYC Ferry operational definitions. The >5 minute threshold is used for stop-level late-rate analysis throughout this section, reflecting stable classification with less noise than stricter cutoffs.

Code

tz_local <- "America/New_York"

ridership_files <- c(
  "../exploratory_code/ridership_eda/NYCF_ridership_details_2024.01_06.csv",
  "../exploratory_code/ridership_eda/NYCF_ridership_details_2024.07_12.csv",
  "../exploratory_code/ridership_eda/NYCF_ridership_details_2025.01_06.csv"
)
stop_files <- c(
  "../exploratory_code/ridership_eda/NYCF_stop_data_2024.csv",
  "../exploratory_code/ridership_eda/NYCF_stop_data_2025.01_06.csv"
)

2.1.1 Overview

2.1.1.1 Scope & sample definition

This notebook performs exploratory data analysis (EDA) on compiled NYC Ferry stop-level operations data.

All downstream analysis uses records from 2024-01-01 to 2025-06-30 (inclusive), and excludes 2024-04-25 for route SB due to a known schedule mismatch issue. Passenger activity is measured using boarding_wake and alightings_wake throughout.

2.1.1.2 What this notebook produces

The notebook constructs a cleaned stop-level table, derives operational and reliability metrics (dwell, segment “water time,” delay, and late/early flags), and generates summary tables and plots across routes, months, and hours.

The final analysis sample is stored as dat (stop-level), which is also the only dataset exported at the end for downstream modeling and reporting.

2.1.2 Data Preparation

2.1.2.1 Import Data

The following code loads the compiled stop-level dataset and the ridership details dataset. Ridership details are used only to obtain vessel capacity for computing load factor.

Code

standardize_cols_ridership <- function(df) {
  nm <- names(df)
  nm <- ifelse(nm == "Original_On",  "Original_ON",  nm)
  nm <- ifelse(nm == "Original_Off", "Original_OFF", nm)
  nm <- ifelse(nm == "WheelChair_On",  "WheelChair_ON",  nm)
  nm <- ifelse(nm == "WheelChair_Off", "WheelChair_OFF", nm)
  names(df) <- nm
  df
}

ridership <- purrr::map_dfr(
  ridership_files,
  ~ readr::read_csv(.x, show_col_types = FALSE) %>%
    standardize_cols_ridership()
)

stops <- purrr::map_dfr(
  stop_files,
  ~ readr::read_csv(.x, show_col_types = FALSE)
)

glimpse(ridership)

Rows: 961,890
Columns: 15
$ Date1           <chr> "1/1/2024", "1/1/2024", "1/1/2024", "1/1/2024", "1/1/2…
$ Route           <chr> "ER", "ER", "ER", "ER", "ER", "ER", "ER", "ER", "ER", …
$ Direction       <chr> "NB", "NB", "NB", "NB", "NB", "NB", "NB", "SB", "SB", …
$ Trip_Pattern    <chr> "er_p11_gp_hps", "er_p11_gp_hps", "er_p11_gp_hps", "er…
$ Trip_ID         <dbl> 1101, 1101, 1101, 1101, 1101, 1101, 1101, 1102, 1102, …
$ Vessel_Id       <chr> "H201", "H201", "H201", "H201", "H201", "H201", "H201"…
$ Vessel_Capacity <dbl> 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150,…
$ Stop            <chr> "Hunters Point South", "Wall St/Pier 11", "Dumbo/Fulto…
$ Stop_Sequence   <dbl> NA, 1, 2, 3, 4, 5, 6, NA, 1, 2, 3, 4, 5, 6, NA, 1, 2, …
$ Departure_Time  <time> 10:53:00, 10:05:00, 10:13:00, 10:25:00, 10:33:00, 10:…
$ Original_ON     <dbl> 0, 8, 3, 1, 10, 16, 0, 0, 25, 37, 14, 13, 14, 12, 0, 2…
$ Original_OFF    <dbl> 0, 0, 3, 2, 2, 4, 27, 35, 0, 13, 2, 15, 6, 44, 15, 0, …
$ Left_Behind     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ WheelChair_ON   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ WheelChair_OFF  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Code

glimpse(stops)

Rows: 753,951
Columns: 27
$ Row_Number             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
$ `Sort Key`             <chr> "2024-01-011101ER  1", "2024-01-011101ER2", "20…
$ vehicle_id             <chr> "H201", "H201", "H201", "H201", "H201", "H201",…
$ trip_date              <chr> "1/1/2024", "1/1/2024", "1/1/2024", "1/1/2024",…
$ trip_no                <dbl> 1101, 1101, 1101, 1101, 1101, 1101, 1101, 1102,…
$ `Block Id`             <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,…
$ route_no               <chr> "ER", "ER", "ER", "ER", "ER", "ER", "ER", "ER",…
$ route_direction        <chr> "NB", "NB", "NB", "NB", "NB", "NB", "NB", "SB",…
$ route_pattern          <chr> "er_p11_gp_hps", "er_p11_gp_hps", "er_p11_gp_hp…
$ route_name             <chr> "East River", "East River", "East River", "East…
$ stop_name              <chr> "Wall St/Pier 11", "Dumbo/Fulton Ferry", "South…
$ scheduled_departure_dt <chr> "1/1/2024 10:05", "1/1/2024 10:13", "1/1/2024 1…
$ scheduled_arrival_dt   <chr> NA, "1/1/2024 10:13", "1/1/2024 10:25", "1/1/20…
$ actual_arrival_dt      <chr> NA, "1/1/2024 10:10", "1/1/2024 10:24", "1/1/20…
$ actual_departure_dt    <chr> "1/1/2024 10:06", "1/1/2024 10:15", "1/1/2024 1…
$ `Stop Sequence`        <dbl> 0, 1, 2, 3, 4, 5, 6, 2, 4, 5, 0, 1, 3, 6, 0, 5,…
$ boarding_wake          <dbl> 8, 3, 1, 10, 16, 0, 0, 14, 14, 12, 25, 37, 13, …
$ alightings_wake        <dbl> 0, 3, 2, 2, 4, 27, 0, 2, 6, 44, 0, 13, 15, 35, …
$ alightings             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ boardings              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lbr                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `Dwell Time`           <time> 00:04:49, 00:02:35, 00:03:06, 00:03:25, 00:03:…
$ dept_variance_flag     <chr> "On_Time", "On_Time", "On_Time", "On_Time", "On…
$ DepartureTimeVariance  <time> 00:01:16, 00:02:10, 00:02:06, 00:03:26, 00:04:…
$ ArrivalTimeVariance    <time>       NA, 00:02:39, 00:00:29, 00:00:20, 00:00:…
$ stop_id                <dbl> 87, 20, 8, 19, 18, 17, 4, 18, 8, 20, 4, 17, 19,…
$ ...27                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

2.1.2.2 Clean stops & build basic features

This step standardizes key fields and column names, cleans Sort Key, derives weekday/weekend indicators, and computes an onboard count per stop using cumulative boarding_wake minus alightings_wake within each trip ordered by Stop_Sequence.

Code

stops <- stops %>%
  mutate(`Sort Key` = str_replace_all(`Sort Key`, "\\s+", "")) %>%
  rename_with(~ str_replace_all(.x, "\\s+", "_")) %>%
  mutate(
    trip_no = as.character(trip_no),
    route_no = as.character(route_no),
    route_direction = as.character(route_direction),
    route_pattern = as.character(route_pattern),
    stop_name = as.character(stop_name),
    Stop_Sequence = as.integer(Stop_Sequence),
    sort_key_adj = paste0(str_sub(Sort_Key, 1, -2), Stop_Sequence),
    trip_date = mdy(trip_date),
    weekday = wday(trip_date, label = TRUE, abbr = FALSE),
    is_weekday = !wday(trip_date) %in% c(1, 7),
    lbr = coalesce(lbr, 0)
  ) %>%
  group_by(trip_date, trip_no, route_no, route_direction) %>%
  arrange(Stop_Sequence, .by_group = TRUE) %>%
  mutate(
    onboard = cumsum(coalesce(boarding_wake, 0)) - cumsum(coalesce(alightings_wake, 0))
  ) %>%
  ungroup() %>%
  select(-any_of(c("Row_Number", "Sort_Key", "route_name", "...27")))

2.1.2.3 Parse datetimes (NY local time)

Stop timestamps appear in different string formats across years. We parse both formats into POSIXct and keep them in America/New_York so that time-of-day patterns reflect local operations.

Code

parse_dt_nycf <- function(x, tz_local) {
  x_chr <- as.character(x)
  x_chr <- str_trim(na_if(x_chr, ""))

  has_ampm <- str_detect(x_chr, regex("\\b(am|pm)\\b", ignore_case = TRUE))
  has_ampm[is.na(has_ampm)] <- FALSE

  out <- as.POSIXct(rep(NA, length(x_chr)), tz = tz_local)

  idx_2024 <- !has_ampm & !is.na(x_chr)
  idx_2025 <-  has_ampm & !is.na(x_chr)

  out[idx_2024] <- mdy_hm(x_chr[idx_2024], tz = tz_local, quiet = TRUE)
  out[idx_2025] <- parse_date_time(
    x_chr[idx_2025],
    orders = c("mdY IMSp", "mdY IMp"),
    tz = tz_local,
    quiet = TRUE
  )

  out
}

stops <- stops %>%
  mutate(
    scheduled_arrival_dt   = parse_dt_nycf(scheduled_arrival_dt, tz_local),
    actual_arrival_dt      = parse_dt_nycf(actual_arrival_dt, tz_local),
    scheduled_departure_dt = parse_dt_nycf(scheduled_departure_dt, tz_local),
    actual_departure_dt    = parse_dt_nycf(actual_departure_dt, tz_local)
  )

2.1.2.4 Enforce last-stop departure as missing

For each trip, the final stop should not have a meaningful departure timestamp. We explicitly set actual_departure_dt to missing for last stops to avoid incorrect dwell and segment calculations.

Code

stops <- stops %>%
  arrange(trip_date, trip_no, route_no, route_direction, Stop_Sequence) %>%
  mutate(
    actual_departure_dt = if_else(
      row_number() == n(),
      as.POSIXct(NA, tz = tz_local),
      actual_departure_dt
    ),
    .by = c(trip_date, trip_no, route_no, route_direction)
  )

2.1.2.5 Compute delay / segment / dwell metrics

We compute arrival/departure delay relative to schedule, “water travel” time between consecutive stops (previous departure to current arrival), and dwell time at stops (arrival to departure). These derived metrics support distribution checks and late-rate analysis.

Code

stops <- stops %>%
  select(-any_of("dept_variance_flag")) %>%
  mutate(
    arr_diff = as.numeric(difftime(actual_arrival_dt, scheduled_arrival_dt, units = "mins")),
    dep_diff = as.numeric(difftime(actual_departure_dt, scheduled_departure_dt, units = "mins"))
  ) %>%
  arrange(trip_date, trip_no, route_no, route_direction, Stop_Sequence) %>%
  mutate(
    sched_water_min = as.numeric(difftime(scheduled_arrival_dt, lag(scheduled_departure_dt), units = "mins")),
    act_water_min   = as.numeric(difftime(actual_arrival_dt,    lag(actual_departure_dt),    units = "mins")),
    water_diff      = act_water_min - sched_water_min,
    .by = c(trip_date, trip_no, route_no, route_direction)
  ) %>%
  mutate(
    act_dwell_min = as.numeric(difftime(actual_departure_dt, actual_arrival_dt, units = "mins"))
  )

2.1.2.6 Define lateness / earliness flags

We define Late arrival as arr_diff > 5 minutes and Early departure as dep_diff < -1 minute. These thresholds are used consistently throughout the reliability summaries and plots.

Code

stops <- stops %>%
  mutate(
    arr_status = case_when(
      is.na(arr_diff) ~ NA_character_,
      arr_diff > 5    ~ "Late",
      TRUE            ~ "Not_Late"
    ),
    dep_status = case_when(
      is.na(dep_diff) ~ NA_character_,
      dep_diff < -1   ~ "Early",
      TRUE            ~ "Not_Early"
    )
  )

2.1.2.7 Join vessel capacity and compute load factor

We join Vessel_Capacity using four keys (trip_date, route_pattern, trip_no, stop_name) and compute load_factor = onboard / Vessel_Capacity.

Code

ridership_capacity_lookup <- ridership %>%
  transmute(
    trip_date     = mdy(Date1),
    route_pattern = as.character(Trip_Pattern),
    trip_no       = as.character(Trip_ID),
    stop_name     = as.character(Stop),
    Vessel_Capacity
  ) %>%
  filter(!is.na(Vessel_Capacity)) %>%
  distinct(trip_date, route_pattern, trip_no, stop_name, Vessel_Capacity)

stops <- stops %>%
  left_join(
    ridership_capacity_lookup,
    by = c(
      "trip_date" = "trip_date",
      "route_pattern" = "route_pattern",
      "trip_no" = "trip_no",
      "stop_name" = "stop_name"
    )
  ) %>%
  mutate(load_factor = onboard / Vessel_Capacity)

2.1.2.8 Define analysis sample

We construct the final analysis table dat by applying the study period filter and removing the known problematic day/route subset. All EDA that follows is based on dat unless explicitly noted.

Code

dat0 <- stops %>%
  filter(trip_date >= ymd("2024-01-01"),
         trip_date <= ymd("2025-06-30"))

removed_0425_sb <- dat0 %>%
  filter(trip_date == ymd("2024-04-25") & route_no == "SB")

dat <- dat0 %>%
  filter(!(trip_date == ymd("2024-04-25") & route_no == "SB"))

tibble(
  min_trip_date = min(dat$trip_date, na.rm = TRUE),
  max_trip_date = max(dat$trip_date, na.rm = TRUE),
  n_rows_in_range_before_drop = nrow(dat0),
  n_rows_removed_2024_04_25_SB = nrow(removed_0425_sb),
  n_rows_final = nrow(dat)
)

# A tibble: 1 × 5
  min_trip_date max_trip_date n_rows_in_range_before_drop n_rows_removed_2024_…¹
  <date>        <date>                              <int>                  <int>
1 2024-01-01    2025-06-30                         753951                    366
# ℹ abbreviated name: ¹n_rows_removed_2024_04_25_SB
# ℹ 1 more variable: n_rows_final <int>

2.1.3 EDA Tables

2.1.3.1 Derived tables

We create two helper tables to support EDA at different aggregation levels:

trip_tbl summarizes one row per trip (useful for route/month/trip-start timing summaries).
dist_dat keeps stop-level records with convenience fields (e.g., stop_volume, nonnegative onboard/load factor) for distributions and late-rate calculations.

Code

trip_tbl <- dat %>%
  mutate(trip_key = paste(trip_date, route_no, route_direction, trip_no, sep = "|")) %>%
  group_by(trip_key, trip_date, route_no, route_direction) %>%
  arrange(Stop_Sequence, .by_group = TRUE) %>%
  summarise(
    n_stops = n_distinct(stop_id),
    trip_boardings  = sum(boarding_wake, na.rm = TRUE),
    trip_alightings = sum(alightings_wake, na.rm = TRUE),
    trip_total_events = sum(boarding_wake + alightings_wake, na.rm = TRUE),
    trip_start_departure = first(actual_departure_dt),
    is_weekday = first(is_weekday),
    month = floor_date(first(trip_date), "month"),
    .groups = "drop"
  ) %>%
  mutate(
    day_type = if_else(is_weekday, "Weekday", "Weekend"),
    dep_hour = hour(trip_start_departure)
  )

dist_dat <- dat %>%
  mutate(
    boarding_wake = coalesce(boarding_wake, 0),
    alightings_wake = coalesce(alightings_wake, 0),
    stop_volume = boarding_wake + alightings_wake,
    dwell_min = act_dwell_min,
    onboard_nonneg = pmax(onboard, 0),
    load_factor_nonneg = onboard_nonneg / Vessel_Capacity
  )

2.1.3.2 Route overview (Weekday vs Weekend)

Route-level summaries compare weekday and weekend patterns to distinguish commute-driven demand from leisure and discretionary travel. We report total trips, total boardings, and average boardings per trip by route.

Code

route_weekpart <- trip_tbl %>%
  group_by(route_no, day_type) %>%
  summarise(
    total_trips = n(),
    total_boardings = sum(trip_boardings, na.rm = TRUE),
    .groups = "drop"
  )

route_avg_weekpart <- trip_tbl %>%
  group_by(route_no, day_type) %>%
  summarise(
    avg_boardings_per_trip = mean(trip_boardings, na.rm = TRUE),
    n_trips = n(),
    .groups = "drop"
  )

route_summary_by_stops <- trip_tbl %>%
  group_by(route_no, route_direction, n_stops) %>%
  summarise(
    n_trips = n(),
    total_boardings = sum(trip_boardings, na.rm = TRUE),
    avg_boardings_per_trip = total_boardings / n_trips,
    .groups = "drop"
  ) %>%
  arrange(route_no, route_direction, n_stops)

route_weekpart %>% arrange(desc(total_boardings))

# A tibble: 16 × 4
   route_no day_type total_trips total_boardings
   <chr>    <chr>          <int>           <dbl>
 1 ER       Weekday        24863         2730172
 2 AS       Weekday        15047         1533359
 3 ER       Weekend         7541         1302644
 4 SV       Weekday        14933          919075
 5 SB       Weekday        18874          709380
 6 AS       Weekend         4732          692798
 7 RW       Weekday        11950          655102
 8 SG       Weekday        19744          653251
 9 SB       Weekend         5189          482218
10 RW       Weekend         4582          334244
11 SV       Weekend         4617          326769
12 SG       Weekend         6534          253360
13 GI       Weekend         1292           49853
14 GI       Weekday          444           22963
15 RR       Weekend           84            8951
16 RR       Weekday            6             900

Code

route_summary_by_stops

# A tibble: 28 × 6
   route_no route_direction n_stops n_trips total_boardings
   <chr>    <chr>             <int>   <int>           <dbl>
 1 AS       NB                    7    9705         1014103
 2 AS       SB                    7   10074         1212054
 3 ER       NB                    7   16037         2012111
 4 ER       SB                    7   16367         2020705
 5 GI       NB                    2     897           30485
 6 GI       SB                    2     839           42331
 7 RR       NB                    3      45            4085
 8 RR       SB                    3      45            5766
 9 RW       NB                    3    8511          505836
10 RW       SB                    3    8021          483510
# ℹ 18 more rows
# ℹ 1 more variable: avg_boardings_per_trip <dbl>

2.1.3.3 Stop totals

Stop totals aggregate boardings and alightings across the analysis period to identify the highest-activity stops. These rankings help interpret which terminals or transfer points dominate system usage.

Code

stop_totals <- dat %>%
  group_by(stop_id, stop_name) %>%
  summarise(
    total_boardings = sum(boarding_wake, na.rm = TRUE),
    total_alightings = sum(alightings_wake, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_boardings))

stop_totals

# A tibble: 25 × 4
   stop_id stop_name           total_boardings total_alightings
     <dbl> <chr>                         <dbl>            <dbl>
 1      87 Wall St/Pier 11             2163272          2562409
 2      17 East 34th Street            1404634          1566502
 3      20 Dumbo/Fulton Ferry           990583           884817
 4      19 North Williamsburg           560850           520351
 5     113 East 90th St                 533628           489918
 6      88 Rockaway                     438062           435374
 7      18 Greenpoint                   431993           388197
 8       4 Hunters Point South          417259           397537
 9      89 Astoria                      414117           342461
10     137 St. George                   387301           331679
# ℹ 15 more rows

2.1.3.4 Monthly patterns

Monthly summaries provide a simple view of seasonality and demand shifts over time. We summarize trips and ridership by month, split by weekday vs weekend, and also compute average boardings per trip.

Code

monthly_weekpart <- trip_tbl %>%
  filter(month >= ymd("2024-01-01"), month <= ymd("2025-06-01")) %>%
  group_by(month, day_type) %>%
  summarise(
    total_trips = n(),
    total_boardings = sum(trip_boardings, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(month, day_type)

monthly_avg_weekpart <- trip_tbl %>%
  filter(month >= ymd("2024-01-01"), month <= ymd("2025-06-01")) %>%
  group_by(month, day_type) %>%
  summarise(
    avg_boardings_per_trip = mean(trip_boardings, na.rm = TRUE),
    n_trips = n(),
    .groups = "drop"
  ) %>%
  arrange(month, day_type)

monthly_weekpart

# A tibble: 36 × 4
   month      day_type total_trips total_boardings
   <date>     <chr>          <int>           <dbl>
 1 2024-01-01 Weekday         5912          254505
 2 2024-01-01 Weekend         1176           49751
 3 2024-02-01 Weekday         5547          250456
 4 2024-02-01 Weekend         1176           76538
 5 2024-03-01 Weekday         5671          307774
 6 2024-03-01 Weekend         1610          142854
 7 2024-04-01 Weekday         6015          392494
 8 2024-04-01 Weekend         1799          179812
 9 2024-05-01 Weekday         6290          485912
10 2024-05-01 Weekend         1982          233380
# ℹ 26 more rows

Code

monthly_avg_weekpart

# A tibble: 36 × 4
   month      day_type avg_boardings_per_trip n_trips
   <date>     <chr>                     <dbl>   <int>
 1 2024-01-01 Weekday                    43.0    5912
 2 2024-01-01 Weekend                    42.3    1176
 3 2024-02-01 Weekday                    45.2    5547
 4 2024-02-01 Weekend                    65.1    1176
 5 2024-03-01 Weekday                    54.3    5671
 6 2024-03-01 Weekend                    88.7    1610
 7 2024-04-01 Weekday                    65.3    6015
 8 2024-04-01 Weekend                   100.0    1799
 9 2024-05-01 Weekday                    77.3    6290
10 2024-05-01 Weekend                   118.     1982
# ℹ 26 more rows

2.1.3.5 Time-of-day patterns

Hourly patterns are summarized in two ways: (1) trip-level departure hour, and (2) stop-level departure hour. This helps distinguish “when trips start” from “when stop activity occurs” across the day.

Code

hour_weekpart <- trip_tbl %>%
  filter(!is.na(dep_hour)) %>%
  mutate(dep_hour = as.integer(dep_hour)) %>%
  group_by(dep_hour, day_type) %>%
  summarise(
    total_boardings = sum(trip_boardings, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(dep_hour, day_type)

hour_breaks <- sort(unique(hour_weekpart$dep_hour))

stop_hour_weekpart <- dat %>%
  filter(!is.na(actual_departure_dt)) %>%
  mutate(
    dep_hour = hour(actual_departure_dt),
    day_type = if_else(is_weekday, "Weekday", "Weekend")
  ) %>%
  filter(!is.na(dep_hour)) %>%
  group_by(dep_hour, day_type) %>%
  summarise(
    n_stop_departures = n(),
    total_boardings = sum(boarding_wake, na.rm = TRUE),
    avg_boardings_per_stop = total_boardings / n_stop_departures,
    .groups = "drop"
  ) %>%
  arrange(dep_hour, day_type)

stop_hour_breaks <- sort(unique(stop_hour_weekpart$dep_hour))

stop_hour_weekpart

# A tibble: 35 × 5
   dep_hour day_type n_stop_departures total_boardings avg_boardings_per_stop
      <int> <chr>                <int>           <dbl>                  <dbl>
 1        5 Weekday               3952           36452                   9.22
 2        6 Weekday              20969          150588                   7.18
 3        7 Weekday              34867          481665                  13.8 
 4        7 Weekend                226            2073                   9.17
 5        8 Weekday              35072          661987                  18.9 
 6        8 Weekend               2531           18147                   7.17
 7        9 Weekday              36180          405600                  11.2 
 8        9 Weekend               8745          107551                  12.3 
 9       10 Weekday              29757          310091                  10.4 
10       10 Weekend               9459          161616                  17.1 
# ℹ 25 more rows

2.1.3.6 Late rate patterns

Late rate is the primary reliability outcome in this EDA. We report overall shares of late arrivals and early departures, then summarize late rates by route, month, weekday vs weekend, and departure hour.

Code

arr_status_tab <- dist_dat %>%
  filter(!is.na(arr_status)) %>%
  count(arr_status, name = "n") %>%
  mutate(pct = n / sum(n)) %>%
  arrange(desc(n))

dep_status_tab <- dist_dat %>%
  filter(!is.na(dep_status)) %>%
  count(dep_status, name = "n") %>%
  mutate(pct = n / sum(n)) %>%
  arrange(desc(n))

arr_status_tab

# A tibble: 2 × 3
  arr_status      n    pct
  <chr>       <int>  <dbl>
1 Not_Late   558419 0.931 
2 Late        41247 0.0688

Code

dep_status_tab

# A tibble: 2 × 3
  dep_status      n     pct
  <chr>       <int>   <dbl>
1 Not_Early  599844 0.998  
2 Early        1155 0.00192

Code

late_dat <- dist_dat %>%
  filter(!is.na(arr_status)) %>%
  mutate(
    is_late = arr_status == "Late",
    month = floor_date(trip_date, "month"),
    day_type = if_else(is_weekday, "Weekday", "Weekend"),
    dep_hour = hour(actual_departure_dt)
  )

late_by_route <- late_dat %>%
  group_by(route_no) %>%
  summarise(n = n(), pct_late = mean(is_late), .groups = "drop") %>%
  arrange(desc(pct_late))

late_by_month <- late_dat %>%
  group_by(month, day_type) %>%
  summarise(n = n(), pct_late = mean(is_late), .groups = "drop") %>%
  arrange(month, day_type)

late_by_daytype <- late_dat %>%
  group_by(day_type) %>%
  summarise(n = n(), pct_late = mean(is_late), .groups = "drop")

late_by_hour <- late_dat %>%
  filter(!is.na(dep_hour)) %>%
  group_by(dep_hour, day_type) %>%
  summarise(n = n(), pct_late = mean(is_late), .groups = "drop") %>%
  arrange(dep_hour, day_type)

late_hour_breaks <- sort(unique(late_by_hour$dep_hour))

late_by_route %>% head(30)

# A tibble: 8 × 3
  route_no      n pct_late
  <chr>     <int>    <dbl>
1 RW        31870   0.185 
2 SV        95635   0.0900
3 RR          172   0.0872
4 GI         1622   0.0660
5 SB       111056   0.0648
6 ER       191089   0.0578
7 AS       116669   0.0526
8 SG        51553   0.0435

Code

late_by_month

# A tibble: 36 × 4
   month      day_type     n pct_late
   <date>     <chr>    <int>    <dbl>
 1 2024-01-01 Weekday  24647   0.0394
 2 2024-01-01 Weekend   4960   0.0756
 3 2024-02-01 Weekday  23016   0.0374
 4 2024-02-01 Weekend   5052   0.0954
 5 2024-03-01 Weekday  23804   0.0436
 6 2024-03-01 Weekend   7033   0.152 
 7 2024-04-01 Weekday  25555   0.0492
 8 2024-04-01 Weekend   8033   0.0919
 9 2024-05-01 Weekday  25981   0.0609
10 2024-05-01 Weekend   8548   0.0930
# ℹ 26 more rows

Code

late_by_daytype

# A tibble: 2 × 3
  day_type      n pct_late
  <chr>     <int>    <dbl>
1 Weekday  451367   0.0607
2 Weekend  148299   0.0934

Code

late_by_hour

# A tibble: 34 × 4
   dep_hour day_type     n pct_late
      <int> <chr>    <int>    <dbl>
 1        5 Weekday   2549   0.0118
 2        6 Weekday  14253   0.0149
 3        7 Weekday  25748   0.0116
 4        8 Weekday  26783   0.0286
 5        8 Weekend   1582   0.0107
 6        9 Weekday  27496   0.0293
 7        9 Weekend   6560   0.0194
 8       10 Weekday  24007   0.0276
 9       10 Weekend   6970   0.0191
10       11 Weekday  18792   0.0188
# ℹ 24 more rows

2.1.4 EDA Figures

2.1.4.1 Route-level figures

These figures visualize route-level differences in service volume and demand. Weekday vs weekend splits highlight routes that are more commute-oriented versus leisure-oriented.

Code

ggplot(route_weekpart,
       aes(x = reorder(route_no, total_boardings, FUN = sum),
           y = total_boardings, fill = day_type)) +
  geom_col(position = "stack") +
  coord_flip() +
  labs(x = "Route", y = "Total boardings", fill = NULL,
       title = "Total ridership by route (Weekday vs Weekend)")

Total boardings are highest on East River (ER) and Astoria (AS).

Code

ggplot(route_avg_weekpart,
       aes(x = reorder(route_no, avg_boardings_per_trip, FUN = max),
           y = avg_boardings_per_trip, fill = day_type)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.75) +
  coord_flip() +
  labs(x = "Route", y = "Avg boardings per trip", fill = NULL,
       title = "Average ridership per trip by route (Weekday vs Weekend)")

Average boardings per trip tend to be higher on weekends than weekdays. For the seasonal routes, weekday averages can appear high because weekday service days are holiday/peak-operation days.

2.1.4.2 Monthly figures

Monthly charts provide a high-level view of seasonality and temporal shifts in trips, total ridership, and average ridership per trip.

Code

ggplot(monthly_weekpart, aes(x = month, y = total_boardings, fill = day_type)) +
  geom_col(position = "stack") +
  scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m") +
  labs(x = NULL, y = "Total boardings", fill = NULL,
       title = "Monthly ridership (Weekday vs Weekend)") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Ridership varies much more strongly by season than trips do, especially on weekends.

2.1.4.3 Time-of-day figures

Time-of-day plots show how ridership and activity vary across hours.

Code

ggplot(hour_weekpart, aes(x = dep_hour, y = total_boardings, fill = day_type)) +
  geom_col(position = "stack") +
  scale_x_continuous(breaks = hour_breaks) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Total ridership by trip departure hour (Weekday vs Weekend)",
       x = "Departure hour", y = "Total boardings", fill = NULL)

On weekdays, stop activity shows morning and evening peaks consistent with commuting patterns. On weekends, higher activity is concentrated from midday through early evening.

2.1.4.4 Distributions

Distribution plots are used as diagnostics to understand skew, outliers, and reasonable ranges for modeling. These figures are intended to guide feature engineering and robustness checks.

Code

ggplot(dist_dat %>% filter(!is.na(stop_volume), stop_volume >= 0, stop_volume <= 350),
       aes(x = stop_volume)) +
  geom_histogram(bins = 50) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Distribution of stop activity (board_wake + alight_wake)",
       x = "stop_volume", y = "Count")

Stop activity is highly right-skewed: most stops have relatively low combined boardings/alightings, while a small number of high-activity stops drive the upper tail. The majority of observations are below ~50 per stop.

2.1.4.5 Late-rate figures

These figures visualize how late rate varies across routes, months, and hours.

Code

ggplot(late_by_route, aes(x = reorder(route_no, pct_late), y = pct_late)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Late rate by route", x = "Route", y = "Share Late (arr_status)")

Rockaway (RW) has the highest late rate by a wide margin relative to other routes. It is the longest route and more exposed to high seasonal peaks and variable marine conditions.

Code

ggplot(late_by_month, aes(x = month, y = pct_late, color = day_type)) +
  geom_line() +
  geom_point() +
  scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Late rate by month", x = NULL, y = "Share Late", color = NULL) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Late rates are generally higher in summer than winter, and higher on weekends than weekdays.

Code

ggplot(late_by_hour, aes(x = dep_hour, y = pct_late, color = day_type)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = late_hour_breaks) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Late rate by stop departure hour", x = "Hour", y = "Share Late", color = NULL)

Late rates tend to peak in the afternoon.

2.1.4.6 Late propagation proxy figures

To proxy delay propagation along a trip, we relate a stop’s lateness to conditions at the previous stop. We show late rates binned by previous-stop dwell time and by previous-stop passenger activity.

Code

prev_dat <- dist_dat %>%
  arrange(trip_date, route_no, route_direction, trip_no, Stop_Sequence) %>%
  group_by(trip_date, route_no, route_direction, trip_no) %>%
  mutate(
    prev_dwell_min   = lag(dwell_min),
    prev_stop_volume = lag(stop_volume)
  ) %>%
  ungroup() %>%
  filter(!is.na(arr_status),
         !is.na(prev_dwell_min), prev_dwell_min >= 0, prev_dwell_min <= 60,
         !is.na(prev_stop_volume), prev_stop_volume >= 0)

Code

prev_dwell_late_tbl <- prev_dat %>%
  mutate(
    prev_dwell_bin = cut(prev_dwell_min,
                         breaks = c(0, 1, 2, 3, 5, 8, 12, 16, Inf),
                         include.lowest = TRUE)
  ) %>%
  group_by(prev_dwell_bin) %>%
  summarise(n = n(), pct_late = mean(arr_status == "Late", na.rm = TRUE), .groups = "drop")

prev_dwell_late_tbl

# A tibble: 8 × 3
  prev_dwell_bin      n pct_late
  <fct>           <int>    <dbl>
1 [0,1]            2041   0.111 
2 (1,2]           26936   0.108 
3 (2,3]          117033   0.0843
4 (3,5]          239932   0.0633
5 (5,8]           68604   0.0814
6 (8,12]           4537   0.232 
7 (12,16]            98   0.612 
8 (16,Inf]           46   0.978

Code

ggplot(prev_dwell_late_tbl, aes(x = prev_dwell_bin, y = pct_late)) +
  geom_col() +
  geom_text(aes(label = paste0("n=", scales::comma(n))), vjust = -0.3, size = 3) +
  scale_y_continuous(labels = scales::percent_format(), expand = expansion(mult = c(0, 0.12))) +
  labs(title = "Late rate by previous-stop dwell time bin",
       x = "Previous stop dwell time bin (min)", y = "Late rate (current stop)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Within typical dwell time ranges, the relationship between previous-stop dwell and current-stop late rate is relatively weak. Once previous-stop dwell becomes unusually long (e.g., > ~8 minutes), late rates increase sharply.

Code

prev_vol_late_tbl <- prev_dat %>%
  mutate(
    prev_vol_bin = cut(prev_stop_volume,
                       breaks = c(0, 10, 20, 30, 50, 100, 150, 200, Inf),
                       include.lowest = TRUE)
  ) %>%
  group_by(prev_vol_bin) %>%
  summarise(n = n(), pct_late = mean(arr_status == "Late", na.rm = TRUE), .groups = "drop")

prev_vol_late_tbl

# A tibble: 8 × 3
  prev_vol_bin      n pct_late
  <fct>         <int>    <dbl>
1 [0,10]       134481   0.0540
2 (10,20]      105747   0.0544
3 (20,30]       71734   0.0677
4 (30,50]       78063   0.0873
5 (50,100]      56688   0.129 
6 (100,150]     11025   0.220 
7 (150,200]      1332   0.335 
8 (200,Inf]       157   0.573

Code

ggplot(prev_vol_late_tbl, aes(x = prev_vol_bin, y = pct_late)) +
  geom_col() +
  geom_text(aes(label = paste0("n=", scales::comma(n))), vjust = -0.3, size = 3) +
  scale_y_continuous(labels = scales::percent_format(), expand = expansion(mult = c(0, 0.12))) +
  labs(title = "Late rate by previous-stop activity bin (board_wake + alight_wake)",
       x = "Previous stop activity bin", y = "Late rate (current stop)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

When previous-stop activity is moderate, late rates change gradually; beyond higher activity levels (e.g., > ~50), late rates rise more noticeably.

Taken together, these patterns suggest that delay is driven less by commuter peaks and more by high-demand recreational periods, particularly on routes serving waterfront leisure destinations.

Code

export_dir <- "../exploratory_code/ridership_eda"
readr::write_csv(dat, file.path(export_dir, "dat_stops_2024-01-01_2025-06-30_drop0425SB.csv.gz"))
saveRDS(dat, file.path(export_dir, "dat_stops_2024-01-01_2025-06-30_drop0425SB.rds"))

2.1.5 Operational Analysis

By: Guangze ‘Simon’ Sun and Yiming ‘Ming’ Cao

This operational EDA complements the associative model by examining how ferry delay is recorded, decomposed, generated, and recovered. The four sections ask what issues are recorded in trip notes, how stop-level delay can be decomposed, where large delays are observed versus generated upstream, and whether inherited delay can be absorbed before the next trip begins.

Code

trip_files <- c(
  "../exploratory_code/ridership_eda/NYCF_trip_data_2024.csv",
  "../exploratory_code/ridership_eda/NYCF_trip_data_2025.01_06.csv"
)
decomp_file <- "../exploratory_code/modeling/stop_level_arrival_decomposition_table_corrected_second_stop.csv"

accent       <- "#1A9988"
gold         <- "#F2C14E"
coral        <- "#D95F59"
blue         <- "#7BA7C7"
muted_palette <- c(
  "Inherited previous-trip delay" = "#1A9988",
  "Carried within-trip delay"     = "#F2C14E",
  "Upstream landside delay"       = "#7BA7C7",
  "Upstream waterside delay"      = "#D95F59"
)
plot_theme <- theme_minimal(base_size = 14) +
  theme(
    plot.title    = element_text(face = "bold", size = 17),
    panel.grid.minor = element_blank(),
    legend.title  = element_blank(),
    axis.text     = element_text(color = "#333333")
  )

fmt_pct <- function(x) percent(x, accuracy = 0.1)
clean_kable <- function(x, caption = NULL, align = NULL) {
  if (is.null(align)) {
    align <- paste0(if_else(map_lgl(x, is.character), "l", "r"), collapse = "")
  }
  kable(x, caption = caption, align = align)
}

2.1.5.1 Text Analysis

Trip notes were extracted from raw NYC Ferry trip-level files and parsed into standardized categories. One trip can have multiple categories, so category shares are presence rates rather than mutually exclusive percentages. Results are descriptive associations, not causal effects.

Code

parse_note_segments <- function(trip_uid, note_text) {
  note_text <- ifelse(is.na(note_text), "", note_text)
  if (str_trim(note_text) == "") return(tibble(trip_uid = integer(), category = character(), description = character()))
  matches <- str_match_all(note_text, "\\[(\\d+)\\]\\s*([^:\\[]+):\\s*([^\\[]*)")[[1]]
  if (nrow(matches) == 0) return(tibble(trip_uid = trip_uid, category = "Unparsed", description = note_text))
  tibble(trip_uid = trip_uid, category = matches[, 3], description = matches[, 4])
}

clean_category <- function(x) {
  out <- str_squish(x) %>% str_replace_all("\\s+", " ") %>% str_to_title()
  dplyr::recode(out, "Uscg Restriction" = "USCG Restriction", "Nyc Ferry" = "NYC Ferry", "Ny Waterway" = "NY Waterway", .default = out)
}

raw_trips <- map_dfr(trip_files, ~ read_csv(.x, show_col_types = FALSE) %>% mutate(source_file = basename(.x))) %>%
  select(-matches("^\\.\\.\\."))

trips <- raw_trips %>%
  mutate(
    trip_uid = row_number(),
    route = if ("route_no" %in% names(.)) route_no else NA_character_,
    service_date = if ("trip_date" %in% names(.)) mdy(trip_date) else as.Date(NA),
    note_text = coalesce(notes, ""),
    scheduled_arrival_ts = mdy_hm(scheduled_arrival_dt, quiet = TRUE),
    actual_arrival_ts    = mdy_hm(actual_arrival_dt,    quiet = TRUE),
    arrival_delay_min_raw = as.numeric(difftime(actual_arrival_ts, scheduled_arrival_ts, units = "mins")),
    status_text = str_squish(str_c(coalesce(trip_status, ""), coalesce(trip_status_actual, ""), sep = " | ")),
    cancelled            = str_detect(str_to_lower(status_text), "trip not completed|cancel") | coalesce(`(16) Trip Cancelled`, FALSE),
    partial_completion   = str_detect(str_to_lower(status_text), "partial") | coalesce(`(17) Partially Completed`, FALSE),
    cancel_or_partial_completion = cancelled | partial_completion,
    completed_trip = !cancel_or_partial_completion & (str_detect(str_to_lower(status_text), "completed") | coalesce(`(10)  Completed`, FALSE)),
    valid_delay         = completed_trip & !is.na(arrival_delay_min_raw),
    delayed_5           = valid_delay & arrival_delay_min_raw > 5,
    excessive_delay_10  = valid_delay & arrival_delay_min_raw > 10
  )

note_segments <- map2_dfr(trips$trip_uid, trips$note_text, parse_note_segments) %>%
  mutate(category = clean_category(category), description = str_squish(description)) %>%
  filter(!is.na(category), category != "")

trip_category <- note_segments %>%
  distinct(trip_uid, category) %>%
  left_join(trips %>% select(trip_uid, route, service_date, note_text, arrival_delay_min_raw,
                             delayed_5, excessive_delay_10, cancel_or_partial_completion), by = "trip_uid") %>%
  filter(category != "Completed")

category_presence <- trip_category %>%
  group_by(category) %>%
  summarise(
    n_trips_with_category = n_distinct(trip_uid),
    share_all_trips = n_trips_with_category / nrow(trips),
    share_among_delayed_5 = n_distinct(trip_uid[delayed_5]) / sum(trips$delayed_5, na.rm = TRUE),
    share_among_excessive_delay_10 = n_distinct(trip_uid[excessive_delay_10]) / sum(trips$excessive_delay_10, na.rm = TRUE),
    share_among_cancel_or_partial_completion = n_distinct(trip_uid[cancel_or_partial_completion]) / sum(trips$cancel_or_partial_completion, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(pmax(share_among_excessive_delay_10, share_among_cancel_or_partial_completion, na.rm = TRUE)))

category_rates <- trip_category %>%
  group_by(category) %>%
  summarise(
    n_trips_with_category = n_distinct(trip_uid),
    delayed_5_rate = mean(delayed_5, na.rm = TRUE),
    excessive_delay_10_rate = mean(excessive_delay_10, na.rm = TRUE),
    cancel_or_partial_completion_rate = mean(cancel_or_partial_completion, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(pmax(excessive_delay_10_rate, cancel_or_partial_completion_rate, na.rm = TRUE)))

support_threshold <- if_else(sum(category_rates$n_trips_with_category >= 50) >= 8, 50, 20)

presence_display <- category_presence %>%
  slice_head(n = 12) %>%
  transmute(
    Category = category,
    `All trips` = fmt_pct(share_all_trips),
    `Delay >5` = fmt_pct(share_among_delayed_5),
    `Delay >10` = fmt_pct(share_among_excessive_delay_10),
    `Cancel / partial` = fmt_pct(share_among_cancel_or_partial_completion)
  )

rates_display <- category_rates %>%
  filter(n_trips_with_category >= support_threshold) %>%
  slice_head(n = 12) %>%
  transmute(
    Category = category,
    Trips = n_trips_with_category,
    `Delay >5 rate` = fmt_pct(delayed_5_rate),
    `Delay >10 rate` = fmt_pct(excessive_delay_10_rate),
    `Cancel / partial rate` = fmt_pct(cancel_or_partial_completion_rate)
  )

Code

clean_kable(presence_display, caption = "Category presence among outcome groups", align = "lrrrr")

Category presence among outcome groups
Category	All trips	Delay >5	Delay >10	Cancel / partial
Passenger Delays	10.8%	39.2%	62.3%	14.5%
Marine Traffic	11.7%	44.0%	34.0%	16.1%
Max Out	1.5%	5.2%	10.2%	0.8%
Partially Completed	0.8%	0.0%	0.0%	10.2%
Vessel Mechanical	0.8%	3.5%	7.7%	2.0%
Bypassed Landing	0.6%	0.0%	0.0%	7.5%
Landing Issue	2.2%	5.5%	6.7%	6.1%
Weather	0.4%	0.7%	1.7%	4.1%
Connectivity Issues	2.5%	2.4%	2.6%	3.7%
Trip Cancelled	0.3%	0.0%	0.0%	3.7%
USCG Restriction	0.2%	0.6%	1.8%	1.8%
Hny	0.3%	1.0%	1.6%	0.4%

Code

clean_kable(rates_display, caption = paste0("Outcome rates when category appears (support threshold n >= ", support_threshold, ")"), align = "lrrrr")

Outcome rates when category appears (support threshold n >= 50)
Category	Trips	Delay >5 rate	Delay >10 rate	Cancel / partial rate
Partially Completed	1160	0.0%	0.0%	100.0%
Trip Cancelled	426	0.0%	0.0%	100.0%
Bypassed Landing	881	0.3%	0.0%	96.9%
Weather	589	11.9%	8.0%	79.8%
USCG Restriction	347	19.3%	13.8%	60.2%
Emergency	80	25.0%	17.5%	38.8%
Landing Issue	3066	18.7%	5.8%	22.6%
Vessel Mechanical	1071	33.8%	19.4%	20.8%
Max Out	2167	24.9%	12.7%	4.1%
Connectivity Issues	3573	7.1%	2.0%	11.9%
Hny	392	25.3%	11.0%	11.7%
Marine Traffic	16381	27.9%	5.6%	11.2%

The first table shows which note categories appear most often among delayed or cancelled/partially completed trips. The second table flips the question: when a category appears, how often is that trip associated with delay or cancellation?

Passenger delays and marine traffic are the most visible delay-associated note categories, but they point to different mechanisms. Passenger delays become especially concentrated among delay >10 trips, while marine traffic is more prominent in the broader delay >5 group. Rare categories such as vessel mechanical issues, weather, and USCG restrictions appear less often overall but have high disruption rates when they occur, making them operationally important despite their low frequency.

2.1.5.2 Delay Decomposition

This is a median-based expected-time decomposition. A stop-level delay is decomposed into inherited previous-trip delay, carried within-trip delay, upstream landside delay, upstream waterside delay, and upstream wait when available. Vessel chains are reconstructed by vessel and operational sequence; consecutive trips by the same vessel are linked when they are operationally continuous. This is accounting/decomposition, not causal inference.

Code

decomp <- read_csv(decomp_file, show_col_types = FALSE)
component_cols <- c("inherited_prev_trip_delay", "carried_within_trip_delay", "upstream_landside_delay", "upstream_waterside_delay", "upstream_wait")
display_components <- c("inherited_prev_trip_delay", "carried_within_trip_delay", "upstream_landside_delay", "upstream_waterside_delay")
component_labels <- c(
  inherited_prev_trip_delay = "Inherited previous-trip delay",
  carried_within_trip_delay = "Carried within-trip delay",
  upstream_landside_delay   = "Upstream landside delay",
  upstream_waterside_delay  = "Upstream waterside delay",
  upstream_wait             = "Upstream wait"
)

eligible <- decomp %>%
  mutate(eligible_flag = eligible_arrival_decomp %in% c(TRUE, 1, "TRUE", "true", "1")) %>%
  filter(eligible_flag)

decomp_long <- eligible %>%
  mutate(`Delay >5` = arrival_delay_min > 5, `Delay >10` = arrival_delay_min > 10) %>%
  select(trip_key, stop_id, stop_sequence, `Delay >5`, `Delay >10`, all_of(component_cols)) %>%
  pivot_longer(all_of(component_cols), names_to = "component", values_to = "component_minutes") %>%
  mutate(Component = dplyr::recode(component, !!!component_labels), displayed = component %in% display_components)

decomp_groups <- bind_rows(
  decomp_long %>% filter(`Delay >5`)  %>% mutate(delay_group = "Delay >5"),
  decomp_long %>% filter(`Delay >10`) %>% mutate(delay_group = "Delay >10")
)

component_summary <- decomp_groups %>%
  group_by(delay_group, component, Component, displayed) %>%
  summarise(avg_min = mean(component_minutes, na.rm = TRUE),
            positive_avg_min = pmax(avg_min, 0), .groups = "drop") %>%
  group_by(delay_group) %>%
  mutate(share = if_else(displayed, positive_avg_min / sum(positive_avg_min[displayed], na.rm = TRUE), NA_real_)) %>%
  ungroup()

component_table_display <- component_summary %>%
  select(delay_group, Component, avg_min, share) %>%
  pivot_wider(names_from = delay_group, values_from = c(avg_min, share), names_glue = "{delay_group}_{.value}") %>%
  mutate(Component = factor(Component, levels = c("Inherited previous-trip delay", "Carried within-trip delay",
                                                   "Upstream landside delay", "Upstream waterside delay", "Upstream wait"))) %>%
  arrange(Component) %>%
  mutate(Component = as.character(Component)) %>%
  transmute(
    Component,
    `Delay >5 avg min`  = round(`Delay >5_avg_min`, 2),
    `Delay >5 share`    = if_else(is.na(`Delay >5_share`),  "—", fmt_pct(`Delay >5_share`)),
    `Delay >10 avg min` = round(`Delay >10_avg_min`, 2),
    `Delay >10 share`   = if_else(is.na(`Delay >10_share`), "—", fmt_pct(`Delay >10_share`))
  )

plot_components <- component_summary %>%
  filter(displayed) %>%
  mutate(Component = factor(Component, levels = rev(c("Inherited previous-trip delay", "Carried within-trip delay",
                                                       "Upstream landside delay", "Upstream waterside delay"))))

decomp_bar <- ggplot(plot_components, aes(x = delay_group, y = positive_avg_min, fill = Component)) +
  geom_col(width = 0.65) +
  scale_fill_manual(values = muted_palette) +
  labs(title = "Average Delay Components", x = NULL, y = "Average component minutes") +
  plot_theme

Code

print(decomp_bar)

Average component minutes among stop observations with delay greater than 5 or 10 minutes.

Code

clean_kable(component_table_display, caption = "Average delay components and displayed-component shares", align = "lrrrr")

Average delay components and displayed-component shares
Component	Delay >5 avg min	Delay >5 share	Delay >10 avg min	Delay >10 share
Inherited previous-trip delay	6.33	64.7%	11.16	68.4%
Carried within-trip delay	2.20	22.4%	3.45	21.1%
Upstream landside delay	0.56	5.7%	0.96	5.9%
Upstream waterside delay	0.71	7.2%	0.74	4.5%
Upstream wait	0.02	—	0.01	—

Average component minutes among stop observations with delay greater than 5 or 10 minutes.

The decomposition shows that most delayed stop observations are not mainly created at the current stop or segment. Inherited previous-trip delay is the largest component in both groups, accounting for 64.7% of displayed component minutes among delay >5 observations and 68.4% among delay >10 observations. Carried within-trip delay is the second-largest component, contributing about one-fifth of displayed delay in both groups. Newly generated upstream landside and waterside components are much smaller on average.

The severity difference is also important. Delay >10 observations have much larger inherited delay than delay >5 observations, while the shares of landside and waterside delay remain comparatively small. This suggests that severe delay is often the result of propagation through the vessel chain rather than a local event at the delayed stop.

2.1.5.3 Delay Genesis

This section traces delayed stop observations backward through the same vessel chain. Targets use arrival_delay_min > 5 and arrival_delay_min > 10. A major source is a single landside event greater than 5 minutes or a single waterside event greater than 5 minutes.

Code

stop_label      <- if ("stop_name" %in% names(decomp)) "stop_name" else "stop_id"
prev_stop_label <- if ("prev_stop_name" %in% names(decomp)) "prev_stop_name" else "prev_stop_id"
vessel_col      <- if ("vessel_id" %in% names(decomp)) "vessel_id" else if ("vessel" %in% names(decomp)) "vessel" else NA_character_
if (is.na(vessel_col)) stop("A vessel identifier is required for delay genesis tracing.")

timeline0 <- decomp %>%
  mutate(
    scheduled_ts  = suppressWarnings(as.POSIXct(scheduled_time_original, tz = "America/New_York")),
    eligible_flag = eligible_arrival_decomp %in% c(TRUE, 1, "TRUE", "true", "1")
  )

trip_bounds <- timeline0 %>%
  group_by(.data[[vessel_col]], trip_key) %>%
  summarise(trip_start_ts = min(scheduled_ts, na.rm = TRUE),
            trip_end_ts   = max(scheduled_ts, na.rm = TRUE), .groups = "drop") %>%
  arrange(.data[[vessel_col]], trip_start_ts, trip_end_ts, trip_key) %>%
  group_by(.data[[vessel_col]]) %>%
  mutate(
    prev_trip_end_ts = lag(trip_end_ts),
    trip_gap_min     = as.numeric(difftime(trip_start_ts, prev_trip_end_ts, units = "mins")),
    chain_break      = is.na(trip_gap_min) | trip_gap_min > 30,
    vessel_chain_id  = paste(.data[[vessel_col]], cumsum(chain_break), sep = "_")
  ) %>%
  ungroup() %>%
  select(.data[[vessel_col]], trip_key, vessel_chain_id, trip_start_ts, trip_gap_min)

timeline <- timeline0 %>%
  left_join(trip_bounds, by = c(vessel_col, "trip_key")) %>%
  mutate(
    positive_landside_delay  = pmax(replace_na(upstream_landside_delay,  0), 0),
    positive_waterside_delay = pmax(replace_na(upstream_waterside_delay, 0), 0),
    observed_landing  = coalesce(as.character(.data[[stop_label]]),      as.character(stop_id)),
    source_from_landing = coalesce(as.character(.data[[prev_stop_label]]), as.character(prev_stop_id), "Origin"),
    source_to_landing = observed_landing,
    observed_segment  = paste(source_from_landing, "->", observed_landing),
    source_segment    = paste(source_from_landing, "->", observed_landing),
    row_top_generated_delay = pmax(positive_landside_delay, positive_waterside_delay),
    row_top_source_type = case_when(
      row_top_generated_delay <= 0 ~ NA_character_,
      positive_landside_delay >= positive_waterside_delay ~ "Landside",
      TRUE ~ "Waterside"
    ),
    major_landside  = positive_landside_delay  > 5,
    major_waterside = positive_waterside_delay > 5
  ) %>%
  arrange(vessel_chain_id, scheduled_ts, trip_start_ts, trip_key, stop_sequence) %>%
  group_by(vessel_chain_id) %>%
  mutate(
    row_order_within_chain       = row_number(),
    cumulative_positive_generated = cumsum(positive_landside_delay + positive_waterside_delay),
    cumulative_major_sources      = cumsum(major_landside) + cumsum(major_waterside)
  ) %>%
  ungroup()

top_source_by_chain <- function(chain_df) {
  chain_df <- chain_df %>% arrange(row_order_within_chain)
  best_delay <- numeric(nrow(chain_df)); best_type <- character(nrow(chain_df))
  best_landing <- character(nrow(chain_df)); best_segment <- character(nrow(chain_df))
  current_best <- 0; current_type <- NA_character_; current_landing <- NA_character_; current_segment <- NA_character_
  for (i in seq_len(nrow(chain_df))) {
    if (!is.na(chain_df$row_top_generated_delay[i]) && chain_df$row_top_generated_delay[i] > current_best) {
      current_best    <- chain_df$row_top_generated_delay[i]
      current_type    <- chain_df$row_top_source_type[i]
      current_landing <- chain_df$source_from_landing[i]
      current_segment <- chain_df$source_segment[i]
    }
    best_delay[i] <- current_best; best_type[i] <- current_type
    best_landing[i] <- current_landing; best_segment[i] <- current_segment
  }
  chain_df %>% mutate(
    top_source_generated_minutes = best_delay,
    top_source_type    = na_if(best_type, ""),
    top_source_landing = na_if(best_landing, ""),
    top_source_segment = na_if(best_segment, "")
  )
}

timeline <- timeline %>% group_split(vessel_chain_id) %>% map_dfr(top_source_by_chain)

observed_landing_den  <- timeline %>% filter(eligible_flag) %>% count(observed_landing,    name = "Observed total observations")
observed_segment_den  <- timeline %>% filter(eligible_flag) %>% count(observed_segment,    name = "Observed total observations")
generated_landing_den <- timeline %>% filter(eligible_flag) %>% count(source_from_landing, name = "Generated total source observations")
generated_segment_den <- timeline %>% filter(eligible_flag) %>% count(source_segment,      name = "Generated total source observations")

target_base <- timeline %>%
  filter(eligible_flag, arrival_delay_min > 5) %>%
  transmute(
    target_id_base = row_number(),
    observed_landing, observed_segment, arrival_delay_min,
    cumulative_positive_generated,
    num_major_sources        = cumulative_major_sources,
    top_source_generated_minutes,
    top_source_type, top_source_landing, top_source_segment,
    top_source_share = if_else(cumulative_positive_generated > 0,
                               top_source_generated_minutes / cumulative_positive_generated, NA_real_)
  )

genesis_targets <- bind_rows(
  target_base %>% mutate(delay_group = "delay_gt5",  delay_group_label = "Delay >5"),
  target_base %>% filter(arrival_delay_min > 10) %>% mutate(delay_group = "delay_gt10", delay_group_label = "Delay >10")
) %>%
  mutate(
    target_id    = paste(delay_group, target_id_base, sep = "_"),
    genesis_type = case_when(
      cumulative_positive_generated <= 0 | is.na(cumulative_positive_generated) ~ "No traced positive source",
      num_major_sources == 1 & top_source_share >= 0.5 ~ "Single major",
      num_major_sources >= 2 ~ "Multiple major",
      TRUE ~ "Cumulative"
    )
  )

classified_targets <- genesis_targets %>% filter(genesis_type %in% c("Single major", "Multiple major", "Cumulative"))

genesis_type_summary <- classified_targets %>%
  count(delay_group, delay_group_label, genesis_type, name = "n_targets") %>%
  group_by(delay_group) %>%
  mutate(share_of_classified_targets = n_targets / sum(n_targets)) %>%
  ungroup()

genesis_type_bar <- ggplot(genesis_type_summary, aes(x = delay_group_label, y = share_of_classified_targets, fill = genesis_type)) +
  geom_col(width = 0.65) +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("Single major" = accent, "Multiple major" = gold, "Cumulative" = blue)) +
  labs(title = "Delay Genesis Type", x = NULL, y = "Share of classified delayed observations") +
  plot_theme

top_source_event_summary <- classified_targets %>%
  count(delay_group, delay_group_label, top_source_type, name = "n_targets") %>%
  group_by(delay_group) %>%
  mutate(share_of_targets = n_targets / sum(n_targets)) %>%
  ungroup()

top_source_event_plot <- ggplot(top_source_event_summary, aes(x = delay_group_label, y = share_of_targets, fill = top_source_type)) +
  geom_col(width = 0.65) +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("Landside" = accent, "Waterside" = coral)) +
  labs(title = "Top Source Event Type", x = NULL, y = "Share of classified delayed observations") +
  plot_theme

make_location_tables <- function(geography = c("landing", "segment")) {
  geography   <- match.arg(geography)
  obs_col     <- if (geography == "landing") "observed_landing" else "observed_segment"
  gen_col     <- if (geography == "landing") "top_source_landing" else "top_source_segment"
  obs_den     <- if (geography == "landing") observed_landing_den else observed_segment_den
  gen_den     <- if (geography == "landing") generated_landing_den else generated_segment_den
  obs_den_col <- if (geography == "landing") "observed_landing" else "observed_segment"
  gen_den_col <- if (geography == "landing") "source_from_landing" else "source_segment"

  observed <- genesis_targets %>%
    filter(delay_group == "delay_gt5") %>%
    count(Location = .data[[obs_col]], name = "Observed delayed count") %>%
    left_join(rename(obs_den, Location = all_of(obs_den_col)), by = "Location") %>%
    mutate(`Observed delay rate` = `Observed delayed count` / `Observed total observations`)

  generated <- classified_targets %>%
    filter(delay_group == "delay_gt5", !is.na(.data[[gen_col]])) %>%
    count(Location = .data[[gen_col]], name = "Generated linked delay count") %>%
    left_join(rename(gen_den, Location = all_of(gen_den_col)), by = "Location") %>%
    mutate(`Generated source-linked rate` = `Generated linked delay count` / `Generated total source observations`)

  obs_support <- if (sum(observed$`Observed total observations` >= 100, na.rm = TRUE) >= 5) 100 else 50
  gen_support <- if (sum(generated$`Generated total source observations` >= 100, na.rm = TRUE) >= 5) 100 else 50

  count_table <- full_join(
    observed  %>% arrange(desc(`Observed delayed count`))       %>% mutate(Rank = row_number()) %>% slice_head(n = 5) %>% select(Rank, `Observed delay location` = Location, `Observed delayed count`),
    generated %>% arrange(desc(`Generated linked delay count`)) %>% mutate(Rank = row_number()) %>% slice_head(n = 5) %>% select(Rank, `Generated source location` = Location, `Generated linked delay count`),
    by = "Rank"
  ) %>% arrange(Rank)

  rate_table <- full_join(
    observed  %>% filter(`Observed total observations`        >= obs_support) %>% arrange(desc(`Observed delay rate`),         desc(`Observed delayed count`))       %>% mutate(Rank = row_number()) %>% slice_head(n = 5) %>% select(Rank, `Observed delay location` = Location, `Observed delayed count`, `Observed total observations`, `Observed delay rate`),
    generated %>% filter(`Generated total source observations` >= gen_support) %>% arrange(desc(`Generated source-linked rate`), desc(`Generated linked delay count`)) %>% mutate(Rank = row_number()) %>% slice_head(n = 5) %>% select(Rank, `Generated source location` = Location, `Generated linked delay count`, `Generated total source observations`, `Generated source-linked rate`),
    by = "Rank"
  ) %>% arrange(Rank)

  list(count = count_table, rate = rate_table)
}

landing_gt5 <- make_location_tables("landing")
segment_gt5 <- make_location_tables("segment")

Code

print(genesis_type_bar)

Genesis types among classified delayed observations.

Most delayed observations are classified as cumulative rather than being explained by one large source event. This is especially true for the broader delay >5 group. For delay >10 observations, the share of single-major and multiple-major cases increases, indicating that severe delays are more likely to involve at least one large upstream landside or waterside increment.

Code

print(top_source_event_plot)

Top source event type for classified delayed observations.

For delay >5 observations, top source events are roughly balanced, with waterside sources slightly more prominent. For delay >10 observations, the split shifts more toward landside sources, suggesting that the most severe traced delays may be more connected to landing, dwell, boarding, or turnaround-related processes at the upstream source location.

Observed versus generated locations. The observed-location tables show where delays appear if we only look at delayed arrivals. The generated-source tables show where the top traced source events occur after applying the decomposition logic.

Code

format_rate_table <- function(x) {
  x %>% mutate(
    `Observed delay rate`       = fmt_pct(`Observed delay rate`),
    `Generated source-linked rate` = fmt_pct(`Generated source-linked rate`)
  )
}

clean_kable(landing_gt5$count %>% rename(`Observed delay landing` = `Observed delay location`, `Generated source landing` = `Generated source location`), caption = "Delay >5 landing comparison by count", align = "rlrlr")

Delay >5 landing comparison by count
Rank	Observed delay landing	Observed delayed count	Generated source landing	Generated linked delay count
1	Wall St/Pier 11	5565	Wall St/Pier 11	9898
2	East 34th Street	5068	East 34th Street	3448
3	Dumbo/Fulton Ferry	3330	Dumbo/Fulton Ferry	2851
4	Sunset Park/BAT	2774	East 90th St	2588
5	Greenpoint	2390	Hunters Point South	2038

Code

clean_kable(format_rate_table(landing_gt5$rate) %>% rename(`Observed delay landing` = `Observed delay location`, `Generated source landing` = `Generated source location`), caption = "Delay >5 landing comparison by local rate", align = "rlrrlrrr")

Delay >5 landing comparison by local rate
Rank	Observed delay landing	Observed delayed count	Observed total observations	Observed delay rate	Generated source landing	Generated linked delay count	Generated total source observations	Generated source-linked rate
1	Rockaway	869	7693	11.3%	Rockaway	2023	8173	24.8%
2	Ferry Point Park	1042	9265	11.2%	Corlears Hook	1688	8445	20.0%
3	Soundview	1929	18883	10.2%	Wall St/Pier 11	9898	63024	15.7%
4	Corlears Hook	853	8584	9.9%	Hunters Point South	2038	15911	12.8%
5	Sunset Park/BAT	2774	29165	9.5%	Bay Ridge	860	8364	10.3%

Code

clean_kable(segment_gt5$count %>% rename(`Observed delay segment` = `Observed delay location`, `Generated source segment` = `Generated source location`), caption = "Delay >5 segment comparison by count", align = "rlrlr")

Delay >5 segment comparison by count
Rank	Observed delay segment	Observed delayed count	Generated source segment	Generated linked delay count
1	Wall St/Pier 11 -> Dumbo/Fulton Ferry	1810	Wall St/Pier 11 -> Dumbo/Fulton Ferry	4423
2	East 34th Street -> Greenpoint	1577	Wall St/Pier 11 -> Stuyvesant Cove	2694
3	Dumbo/Fulton Ferry -> Wall St/Pier 11	1570	Hunters Point South -> East 34th Street	2038
4	Greenpoint -> North Williamsburg	1426	Rockaway -> Sunset Park/BAT	2020
5	North Williamsburg -> South Williamsburg	1306	Corlears Hook -> Dumbo/Fulton Ferry	1688

Code

clean_kable(format_rate_table(segment_gt5$rate) %>% rename(`Observed delay segment` = `Observed delay location`, `Generated source segment` = `Generated source location`), caption = "Delay >5 segment comparison by local rate", align = "rlrrlrrr")

Delay >5 segment comparison by local rate
Rank	Observed delay segment	Observed delayed count	Observed total observations	Observed delay rate	Generated source segment	Generated linked delay count	Generated total source observations	Generated source-linked rate
1	Sunset Park/BAT -> Wall St/Pier 11	1066	8148	13.1%	Wall St/Pier 11 -> Stuyvesant Cove	2694	9474	28.4%
2	East 90th St -> Soundview	1170	9498	12.3%	Bay Ridge -> Gov. Island/Yankee Pier	73	265	27.5%
3	Rockaway -> Sunset Park/BAT	961	8133	11.8%	Rockaway -> Sunset Park/BAT	2020	8133	24.8%
4	Sunset Park/BAT -> Red Hook/Atlantic Basin	461	4009	11.5%	Corlears Hook -> Dumbo/Fulton Ferry	1688	8445	20.0%
5	Soundview -> Ferry Point Park	1042	9265	11.2%	Wall St/Pier 11 -> Dumbo/Fulton Ferry	4423	24237	18.2%

The count-based location tables show that the places where delay is observed are not always the same as the places where delay is generated. Wall St/Pier 11 and East 34th Street are major observed delay locations by count, given their central role in the network. However, the generated-source rankings shift attention upstream to locations such as East 90th St and Hunters Point South. The rate-based tables identify locations that are delay-prone relative to their own activity volume — useful for diagnosing reliability risk at a specific landing or segment.

2.1.5.4 Delay Recovery

This section focuses only on trip pairs where the next trip inherits more than 5 minutes of delay. Origin recovery means the next trip departs within 1 minute of the median-based expected departure time. Scheduled slack is recomputed as current trip scheduled origin time minus the previous trip terminal scheduled time for the same vessel.

Code

terminal_rows <- decomp %>%
  group_by(trip_key) %>%
  slice_max(order_by = stop_sequence, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  transmute(trip_key, terminal_scheduled_time_original = scheduled_time_original)

origin_context <- decomp %>%
  group_by(trip_key) %>%
  slice_min(order_by = stop_sequence, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  transmute(
    trip_key, trip_date = trip_date_et, route_id, direction, vessel_id,
    schedule_daytype, schedule_season,
    origin_scheduled_time_original = prev_stop_scheduled_time_original,
    origin_departure_delay_total, inherited_prev_trip_delay
  )

recovery_pairs_recomputed <- origin_context %>%
  left_join(terminal_rows, by = "trip_key") %>%
  arrange(vessel_id, origin_scheduled_time_original, terminal_scheduled_time_original, trip_key) %>%
  group_by(vessel_id) %>%
  mutate(
    previous_terminal_scheduled_time_original = lag(terminal_scheduled_time_original),
    planned_turnaround_gap_min = as.numeric(difftime(origin_scheduled_time_original,
                                                      previous_terminal_scheduled_time_original, units = "mins"))
  ) %>%
  ungroup() %>%
  mutate(
    inherited_prev_trip_delay     = pmax(as.numeric(inherited_prev_trip_delay), 0, na.rm = FALSE),
    origin_departure_delay_total  = as.numeric(origin_departure_delay_total),
    origin_recovered_1min         = origin_departure_delay_total <= 1,
    origin_recovered_minutes      = inherited_prev_trip_delay - origin_departure_delay_total
  )

inherit_levels <- c("5–10", "10–15", "15–20", "20–25", "25–30", "30+")
slack_levels   <- c("5–10", "10–15", "15–20", "20–25", "25–30")
mask_lookup <- tibble(
  inherited_delay_bin = factor(inherit_levels, levels = inherit_levels),
  inherited_floor     = c(5, 10, 15, 20, 25, 30)
) %>%
  crossing(tibble(slack_bin = factor(slack_levels, levels = slack_levels), slack_ceiling = c(10, 15, 20, 25, 30))) %>%
  mutate(display_cell = inherited_floor < slack_ceiling)

recovery_base <- recovery_pairs_recomputed %>%
  mutate(
    inherited_delay = inherited_prev_trip_delay,
    slack           = planned_turnaround_gap_min,
    origin_recovered = origin_recovered_1min,
    inherited_delay_bin = case_when(
      inherited_delay <= 10 ~ "5–10",  inherited_delay <= 15 ~ "10–15",
      inherited_delay <= 20 ~ "15–20", inherited_delay <= 25 ~ "20–25",
      inherited_delay <= 30 ~ "25–30", TRUE ~ "30+"
    ),
    slack_bin = case_when(
      slack <= 10 ~ "5–10",  slack <= 15 ~ "10–15",
      slack <= 20 ~ "15–20", slack <= 25 ~ "20–25",
      slack <= 30 ~ "25–30", TRUE ~ NA_character_
    )
  ) %>%
  filter(inherited_delay > 5, !is.na(slack), slack >= 5, slack <= 30, !is.na(origin_recovered))

recovery_heatmap_data <- recovery_base %>%
  mutate(inherited_delay_bin = factor(inherited_delay_bin, levels = inherit_levels),
         slack_bin           = factor(slack_bin,           levels = slack_levels)) %>%
  group_by(inherited_delay_bin, slack_bin) %>%
  summarise(
    n = n(),
    origin_recovery_rate          = mean(origin_recovered,         na.rm = TRUE),
    median_inherited_delay        = median(inherited_delay,        na.rm = TRUE),
    median_slack                  = median(slack,                  na.rm = TRUE),
    median_origin_departure_delay = median(origin_departure_delay_total, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  left_join(mask_lookup, by = c("inherited_delay_bin", "slack_bin")) %>%
  filter(display_cell)

recovery_heatmap <- ggplot(recovery_heatmap_data,
                           aes(x = slack_bin, y = inherited_delay_bin, fill = origin_recovery_rate)) +
  geom_tile(color = "white", linewidth = 0.7) +
  geom_text(aes(label = paste0(percent(origin_recovery_rate, accuracy = 1), "\n", "n=", n)), size = 3.3) +
  scale_fill_gradient(low = "#F6DCA8", high = accent, labels = percent_format(), limits = c(0, 1)) +
  labs(title = "Origin Recovery by Inherited Delay and Slack",
       x = "Scheduled slack (minutes)", y = "Inherited delay (minutes)", fill = "Recovered") +
  plot_theme

scatter_data <- recovery_pairs_recomputed %>%
  mutate(inherited_delay  = inherited_prev_trip_delay,
         slack            = planned_turnaround_gap_min,
         origin_recovered = origin_recovered_1min) %>%
  filter(inherited_delay > 5, !is.na(slack), slack >= 5, slack <= 30,
         inherited_delay <= 40, !is.na(origin_recovered))

recovery_scatter <- ggplot(scatter_data, aes(x = inherited_delay, y = slack, color = origin_recovered)) +
  geom_point(alpha = 0.3, size = 1.2) +
  coord_cartesian(ylim = c(5, 30), xlim = c(5, 40)) +
  scale_color_manual(values = c(`TRUE` = accent, `FALSE` = coral),
                     labels = c(`TRUE` = "Recovered", `FALSE` = "Not recovered")) +
  labs(title = "Origin Recovery Scatter",
       x = "Inherited delay (minutes)", y = "Scheduled slack (minutes)", color = NULL) +
  plot_theme

Code

print(recovery_heatmap)

Origin recovery among trips inheriting more than 5 minutes of delay.

Code

print(recovery_scatter)

The recovery heatmap shows a clear interaction between inherited delay and scheduled slack. When inherited delay is only 5–10 minutes, recovery becomes much more likely as slack increases: recovery is 13% with 5–10 minutes of slack, 31% with 10–15 minutes, and reaches 70–83% once slack is above 15 minutes. The scatter plot confirms the heatmap pattern: recovered trips cluster where inherited delay is small and scheduled slack is large.

2.1.5.5 Conclusion

Together, these EDA modules show that ferry delay is operationally structured. Trip notes identify recurring disruption categories. The decomposition shows that delayed stop observations are dominated by inherited and carried delay rather than only newly generated local delay. Genesis tracing further shows that the place where delay is observed is not always the place where delay is generated; upstream landings and segments can create delay that is recorded later downstream. Finally, the recovery analysis shows that scheduled slack can absorb moderate inherited delay, but recovery becomes unlikely when inherited delay is large or when slack is only marginal.

2.2 Automatic Identification System (AIS) Data

By: Sujan Kakumanu

2.2.1 Background

At a high level, AIS (Automatic Identification System) allows large marine vehicles to report their status (location, speed, heading, voyage data, etc). This is used to communicate with other vessels and authorities on-shore (traffic management, for example).

Depending on the size of the vessel, data can be transmitted as frequently as every 2 seconds. As a result, the provided data sets are incredibly large. The analysis code below focuses on one CSV (AIS from June 2025 in NY Harbor). This file is still too large, and will not be uploaded alongside this script.

Sources:

AIS Data Dictionary: https://coast.noaa.gov/data/marinecadastre/ais/data-dictionary.pdf Vessel type codes: https://coast.noaa.gov/data/marinecadastre/ais/VesselTypeCodes2018.pdf US Coast Guard AIS Description: https://www.navcen.uscg.gov/automatic-identification-system-overview

Columns of Interest (grouped by loose categories):

Vessel features:
- mmsi: a unique identifying integer
- vessel_name: name of the ship, seems to include the operator name
- vessel_type: A code that defines vessel type, cargo type, etc. See link above.
- length; width: in meters
Travel features:
- sog: speed in knots
- cog: course in degrees (course is the actual direction of the vessel, taking into account wind and current)
- base_date_time: UTC date time
- heading: true heading in degrees (the direction of the bow w.r.t true north)
- latitude; longitude: in decimal degree

2.2.2 Distributions of Select Variables

For this initial exploration, I want to trim the data down to a specific time frame to help with analysis speed. I picked a Wednesday in June to hopefully get a typical working day of data in the harbor.

Code

# Loading the data. Trimming it to Wednesday June 4th for easier analysis at this point.
ais_06_2025 <- read.csv("AIS_2025.06.csv")

ais_06_16_2025_trimmed <- ais_06_2025 %>%
  mutate(base_date_time = ymd_hms(base_date_time)) %>%
  filter(date(base_date_time) == ymd("2025-06-16"))

write.csv(ais_06_16_2025_trimmed, "AIS_2025_06_16_filtered.csv", row.names = FALSE)

Code

ais_06_16_2025_trimmed <- read.csv("../exploratory_code/ais-eda/AIS_2025_06_16_filtered.csv")

Frequency of Vessel Types Below is a visual of how many pings certain vessel types are sending.

Code

ggplot(ais_06_16_2025_trimmed, aes(x = as.factor(vessel_type))) +
  geom_bar(fill = "#1A9988") +
  labs(title = "Frequency of Vessel Types",
       subtitle = "June 16, 2025 - NY Harbor",
       x = "Vessel Type Code",
       y = "Number of Pings") +
  theme_minimal()

Most Common Vessel codes:

31: Tug and Tow - Towing
37: Pleasure craft
52: Tug and Tow - Tugging
60: All ships of passenger type

Vessel Size

Code

ggplot(ais_06_16_2025_trimmed, aes(x = length)) +
  geom_histogram(binwidth = 10, fill = "#1A9988", color = "white") +
  labs(title = "Distribution of Vessel Lengths",
       x = "Length (meters)",
       y = "Count") +
  theme_minimal()

Traffic Volume

Code

# Note that I ignore docking/docked ships. Noticed a lot of zeros
ais_06_16_2025_trimmed %>%
  filter(sog > 0.5) %>% 
  ggplot(aes(x = sog)) +
  geom_histogram(binwidth = 1, fill = "#1A9988", color = "white") +
  labs(title = "Speed Distribution of Moving Vessels",
       subtitle = "Excludes stationary vessels (SOG > 0.5 knots)",
       x = "Speed Over Ground (knots)",
       y = "Number of Pings") +
  theme_minimal()

2.2.3 Hexbin Visualizations

We can aggregate traffic to hexbin cells to understand traffic patterns. Instead of counting pings (raw transmissions) per hexbin, we count unique vessels (distinct MMSIs). This matters because a slow-moving tanker anchored for hours generates far more pings than a fast tug crossing the harbor — ping count reflects dwell time, not traffic volume. Vessel count is a more honest measure of how busy a given area of the harbor actually is.

Data Loading Same June 16–22 subset as before.

Code

ais <- read.csv("../exploratory_code/ais-eda/AIS_2025_06_16_filtered.csv") %>%
  mutate(
    base_date_time = ymd_hms(base_date_time),
    date = as.Date(base_date_time),
    hour = hour(base_date_time)
  )

We created a geojson containing a stable hexbin grid to be used for our project.

Code

ais_sf <- ais %>%
  filter(sog > 0.5) %>%
  drop_na(longitude, latitude) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(32118)

fishnet <- st_read("../exploratory_code/hex_grid_100m.geojson") %>%
  st_transform(32118) %>%
  mutate(grid_id = row_number())

Reading layer `hex_grid_100m' from data source 
  `/Users/sujankakumanu/Documents/musa-8010-nycedc/exploratory_code/hex_grid_100m.geojson' 
  using driver `GeoJSON'
Simple feature collection with 70742 features and 0 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 293123.6 ymin: 43240.14 xmax: 314873.6 ymax: 71501.44
Projected CRS: NAD83 / New York Long Island

Aggregating to Vessel Count

We filter to moving vessels (SOG > 0.5 knots) to exclude anchored or docked ships. Count distinct MMSIs per cell per hour — not raw pings.

Code

traffic <- fishnet %>%
  st_join(ais_sf) %>%
  filter(!is.na(mmsi)) %>%
  group_by(grid_id, date, hour) %>%
  summarise(
    vessel_count = n_distinct(mmsi),
    ping_count = n(),
    .groups = "drop"
  )

Ping Count vs. Vessel Count This is a quick sanity check to make sure we’re on the right path. Most hexbin cells have more pings than unique vessels, which is exactly what we expect. Vessels passing through a cell generate multiple pings.

Code

traffic %>%
  st_drop_geometry() %>%
  group_by(grid_id) %>%
  summarise(
    total_pings = sum(ping_count),
    total_vessels = sum(vessel_count),
    .groups = "drop"
  ) %>%
  filter(total_pings > 0) %>%
  ggplot(aes(x = total_pings, y = total_vessels)) +
  geom_point(alpha = 0.3, color = "#1A9988") +
  geom_abline(linetype = "dashed", color = "#D95F59") +
  labs(
    title = "Ping Count vs. Vessel Count per Hexbin Cell",
    x = "Total Pings",
    y = "Total Unique Vessels"
  ) +
  theme_minimal()

2.2.3.1 Harbor-Wide Vessel Density

Aggregated across the full day to show overall traffic patterns in the harbor. Note the use of a log scale — vessel counts vary dramatically across cells, and a log transformation allows us to see variation in both busy and quiet areas. Log10 is used for better interpretability.

Code

traffic_day <- traffic %>%
  st_drop_geometry() %>%
  group_by(grid_id) %>%
  summarise(vessel_count = sum(vessel_count), .groups = "drop")

fishnet %>%
  left_join(traffic_day, by = "grid_id") %>%
  ggplot() +
  geom_sf(aes(fill = vessel_count), color = NA) +
  scale_fill_viridis_c(option = "inferno", trans = "log10", na.value = "grey90") +
  labs(
    title = "Harbor-Wide Vessel Density (Unique Vessels)",
    subtitle = "NY Harbor — June 16, 2025",
    fill = "Vessel Count (Log10)"
  ) +
  theme_minimal() +
  theme(axis.text = element_blank(), panel.grid = element_blank())

Here, we can see some clear patterns that are validated by our domain knowledge. The East River shows a lot of traffic in a tight corridor. Same with the Hudson River, flowing into the area north of the Staten Island.

2.2.3.2 Sample Stop-Level Traffic Exploration

Just to explore, I take 3 major ferry stops (hunters point south isn’t “major”, but just an interesting one): Question: how does vessel traffic in the surrounding area vary across the day? NOTE: Stops are hard coded here. This is only for exploration purposes. NOTE 2: We’re using 500m buffers to understand patterns further away from stops. The fishnet is still 200m.

Code

ferry_stops <- tibble(
  stop_name = c("Wall St / Pier 11", "East 34th St", "Hunters Point South"),
  longitude = c(-74.00625, -73.97095, -73.96137),
  latitude  = c(40.70322,  40.74409,  40.74185)
) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(32118)

stop_buffers <- ferry_stops %>%
  mutate(geometry = st_buffer(geometry, dist = 500))

stop_cells <- fishnet %>%
  st_join(stop_buffers) %>%
  filter(!is.na(stop_name)) %>%
  st_drop_geometry() %>%
  select(grid_id, stop_name)

stop_traffic <- traffic %>%
  st_drop_geometry() %>%
  inner_join(stop_cells, by = "grid_id") %>%
  group_by(stop_name, date, hour) %>%
  summarise(vessel_count = sum(vessel_count), .groups = "drop") %>%
  mutate(
    date = as.Date(date),
    datetime = ymd_h(paste(date, hour))
  )

ggplot(stop_traffic, aes(x = datetime, y = vessel_count, color = stop_name)) +
  geom_line() +
  geom_point(size = 1.5) +
  labs(
    title = "Vessel Traffic Near Ferry Stops",
    subtitle = "500m buffer — June 16, 2025",
    x = NULL,
    y = "Unique Vessels per Hour",
    color = "Stop"
  ) +
  scale_color_manual(values = c(
    "Wall St / Pier 11"    = "#1A9988",
    "East 34th St"         = "#D95F59",
    "Hunters Point South"  = "#7BA7C7"
  )) +
  theme_minimal() +
  theme(legend.position = "bottom")

2.2.3.3 In-Water Traffic Along Route Corridors

Beyond stop-level traffic, we want to look at traffic ferries encounter as they transit between stops. This is particularly valuable for understanding in-water delays.

The route geometry below is a straight-line placeholder between East 34th St and Hunters Point South. NOTE: This will be replaced with real SWIFTLY route geometry once that data is wrangled.

Code

# Placeholder: straight line East 34th St -> Hunters Point South
# To be replaced with SWIFTLY route geometry
route_placeholder <- st_sfc(
  st_linestring(
    matrix(c(-73.97095, 40.74409,
             -73.96137, 40.74185),
           ncol = 2, byrow = TRUE)
  ),
  crs = 4326
) %>%
  st_transform(32118) %>%
  st_sf() %>%
  mutate(route_id = "east_34th_to_hunters_point")

# Buffer 250m along the route corridor
route_buffer <- route_placeholder %>%
  st_buffer(dist = 250)

# Find fishnet cells intersecting the corridor
route_cells <- fishnet %>%
  st_join(route_buffer) %>%
  filter(!is.na(route_id)) %>%
  st_drop_geometry() %>%
  select(grid_id, route_id)

# Vessel count along corridor by hour
route_traffic <- traffic %>%
  st_drop_geometry() %>%
  inner_join(route_cells, by = "grid_id") %>%
  group_by(route_id, date, hour) %>%
  summarise(vessel_count = sum(vessel_count), .groups = "drop") %>%
  mutate(datetime = ymd_h(paste(date, hour)))

ggplot(route_traffic, aes(x = datetime, y = vessel_count)) +
  geom_line(color = "#1A9988") +
  geom_point(size = 1.5, color = "#1A9988") +
  labs(
    title = "Vessel Traffic Along Route Segment",
    subtitle = "Segment: East 34th St to Hunters Point South — 250m buffer, June 16, 2025",
    x = NULL,
    y = "Unique Vessels per Hour"
  ) +
  theme_minimal()

2.2.3.4 Sample Modeling Row

To make the feature concrete, the tables below show what joined observations will look like in the final modeling dataset. The first table shows the stop-level feature, the second shows the route corridor feature. In practice these will be joined to the stop-level delay data by stop, route, date, and hour.

Code

stop_traffic %>%
  filter(hour %in% c(8, 12, 17)) %>%
  select(stop_name, date, hour, vessel_count) %>%
  arrange(stop_name, hour) %>%
  knitr::kable(caption = "Sample: AIS Traffic Feature at Stop Level")

Sample: AIS Traffic Feature at Stop Level
stop_name	date	hour	vessel_count
East 34th St	2025-06-16	8	8
East 34th St	2025-06-16	12	70
East 34th St	2025-06-16	17	48
Hunters Point South	2025-06-16	8	7
Hunters Point South	2025-06-16	12	27
Hunters Point South	2025-06-16	17	15
Wall St / Pier 11	2025-06-16	8	1
Wall St / Pier 11	2025-06-16	12	104
Wall St / Pier 11	2025-06-16	17	63

Code

route_traffic %>%
  filter(hour %in% c(8, 12, 17)) %>%
  select(route_id, date, hour, vessel_count) %>%
  arrange(hour) %>%
  knitr::kable(caption = "Sample: AIS Traffic Feature Along Route Corridor")

Sample: AIS Traffic Feature Along Route Corridor
route_id	date	hour	vessel_count
east_34th_to_hunters_point	2025-06-16	8	8
east_34th_to_hunters_point	2025-06-16	12	73
east_34th_to_hunters_point	2025-06-16	17	46

Code

# 500m context buffer
crop_extent <- route_placeholder %>%
  st_buffer(500) %>%
  st_bbox()

corridor_traffic <- traffic %>%
  st_drop_geometry() %>%
  inner_join(route_cells, by = "grid_id") %>%
  filter(hour %in% c(8, 12, 17)) %>%
  group_by(grid_id, hour) %>%
  summarise(vessel_count = sum(vessel_count), .groups = "drop") %>%
  left_join(fishnet, by = "grid_id") %>%
  st_as_sf()

ggplot() +
  geom_sf(data = fishnet %>% st_crop(crop_extent), fill = "grey95", color = NA) +
  geom_sf(data = corridor_traffic, aes(fill = vessel_count), color = NA) +
  geom_sf(data = route_placeholder, color = "white", linewidth = 0.8) +
  scale_fill_viridis_c(option = "inferno", na.value = "grey80") +
  coord_sf(xlim = c(crop_extent["xmin"], crop_extent["xmax"]),
           ylim = c(crop_extent["ymin"], crop_extent["ymax"])) +
  facet_wrap(~hour, nrow = 1, labeller = as_labeller(c(
    "8"  = "08:00 AM",
    "12" = "12:00 PM",
    "17" = "05:00 PM"
  ))) +
  labs(
    title = "Vessel Traffic Along Route Corridor by Hour",
    subtitle = "East 34th St to Hunters Point South — 250m buffer, June 16, 2025",
    fill = "Vessel Count"
  ) +
  theme_minimal() +
  theme(axis.text = element_blank(), panel.grid = element_blank())

2.3 Swiftly

By: Tyler Maynard

2.3.1 Background

The Swiftly GPS dataset contains location data every 10-60 seconds from the NYC Ferry AVL system. Columns include vessel, route, trip, block, and coordinates. This will be especially useful in calculating the speed of the vehicle at various points in time, helping identify locations, routes, and trips where the system struggles with on-time performance.

Reference Week: June 16-22, 2025

For the sake of processing power, we selected a “sample week” to conduct EDA for Swiftly GPS and AIS data. Based on conversations with NYCEDC, ridership is much higher in the summer months, especially on weekends. Thus, we chose a week during the summer, also including a holiday to evaluate any unique characteristics (if any) during special service days.

Code

special_service <- data.frame(
  dates = as.Date(c("2025-06-19")))

2.3.2 Data Import

The Swiftly GPS data set is available by month in CSV files. The following chunk reads the June 2025 CSV file, filters to the dates included in the sample week, and parses out each record’s date and time into separate columns.

Code

# Original csv
jun2025_original <- read.csv("../exploratory_code/swiftly_eda/swiftly_gps.2025_06.csv")

# Removing extra 'data.' from column names, removing unneeded columns
jun2025 <- jun2025_original %>%
  rename_with(~ gsub("data.", "", .x)) %>%
  select(-(1:3), -(7))

# Converting to datetime
jun2025$time <- ymd_hms(jun2025$time)
jun2025$timeProcessed <- ymd_hms(jun2025$timeProcessed)

# Seeing how many unique values each column has
jun2025 %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(cols = everything()) %>%
  kable()

name	value
time	1510826
lat	63986
lon	31883
ispredictable	2
isLayover	2
isWaitStop	2
timeProcessed	2740360
vehicleId	33
headsign	16
blockassignmentorigin	3
assignmentId	2166
assignmentTime	2834
blockAssignmentMethod	8
blockId	28
directionId	3
headwayMsec	1779585
previousVehicleId	34
previousVehicleSchedAdhMsec	611794
routeId	8
routeShortName	8
schedAdh	3212
schedAdhMsec	758901
scheduledHeadwayMsec	193
serviceId	7
serviceName	7
tripId	2169
tripShortName	395
unpredictablestatus.type	2
unpredictablestatus.details.blockId	27
unpredictablestatus.details.serviceDate	31

Code

# Parse dates and times
jun2025 <- jun2025 %>%
  mutate(
    date_time = ymd_hms(time, tz = "America/New_York"),
    processed_date_time = ymd_hms(timeProcessed, tz = "America/New_York"),

    date = as.Date(date_time),
    time = format(date_time, "%H:%M:%S"),
    hour = substr(time,1,2),

    dateProcessed = as.Date(processed_date_time),
    timeProcessed = format(processed_date_time, "%H:%M:%S"),

    dayOfWeek = if_else(wday(date, week_start = 1) %in% c(6,7), "Weekend", "Weekday"),
    dayOfWeek = case_when(
      date %in% special_service$dates ~ "Holiday",
      TRUE ~ dayOfWeek)) %>%
  select(date, time, dateProcessed, timeProcessed, dayOfWeek, headsign, tripShortName, everything(), date_time, processed_date_time)

jun2025_reference_week <- jun2025 %>%
  filter(date %in% c("2025-06-16","2025-06-17","2025-06-18","2025-06-19","2025-06-20","2025-06-21","2025-06-22")) %>%
  filter(!is.na(tripId)) %>%
  arrange(vehicleId, date_time)
str(jun2025_reference_week$time)

 chr [1:851238] "06:21:24" "06:21:29" "06:21:44" "06:21:59" "06:22:14" ...

2.3.3 Shapefiles

As a way to add more spatial context to this analysis, the landings, routes, hydrology, and boroughs were extracted from the NYC Ferry and NYC OpenData websites. The hydrology object was trimmed to only include hydrology contained within the bounding box created around the study area.

Code

landings <- read_sf("../exploratory_code/swiftly_eda/Shapefiles/Landings/landings.shp") %>%
  st_transform(32118)

landings_east_river <- landings %>%
  filter(!stop_id %in% c("37","39","43","45","46","48","49","50",
    "51","52","55","56","57","58","61","62",
    "16","88","103","104","105","142",
    "145","149","150","151")) %>%
  st_transform(32118)

routes <- read_sf("../exploratory_code/swiftly_eda/Shapefiles/Routes/routes.shp") %>%
  group_by(route_id) %>%
  summarise() %>%
  st_transform(32118)

boroughs <- read_sf("../exploratory_code/swiftly_eda/Shapefiles/Boroughs/boroughs.shp") %>%
  st_transform(32118)

bbox_east_river <- st_bbox(landings_east_river)
bbox_expanded_east_river <- bbox_east_river
bbox_expanded_east_river["xmin"] <- bbox_east_river["xmin"] - 15000
bbox_expanded_east_river["ymin"] <- bbox_east_river["ymin"] - 15000
bbox_expanded_east_river["xmax"] <- bbox_east_river["xmax"] + 15000
bbox_expanded_east_river["ymax"] <- bbox_east_river["ymax"] + 15000
bbox_poly_east_river <- st_as_sfc(bbox_expanded_east_river)

water <- read_sf("../exploratory_code/swiftly_eda/Shapefiles/Hydrology/WBDHU2.shp") %>%
  st_transform(32118) %>%
  summarise(geometry = st_union(geometry)) %>%
  st_intersection(st_transform(st_as_sfc(st_bbox(bbox_expanded_east_river)), 32118))

Code

bbox <- st_bbox(c(
  xmin = -74.30603027343751,
  ymin = 40.492915026895815,
  xmax = -73.64959716796876,
  ymax = 40.919739051062216
), crs = 4326)

bbox_area <- st_as_sfc(bbox) %>%
  st_transform(32118)

bbox_service_area <- boroughs %>%
  st_transform(32118) %>%
  st_bbox()

2.3.4 Data Cleaning

The following chunk creates sf objects for each data point and computes new fields for feature engineering. Features created include:

Knots: based on the time and distance of the previous node of the same trip.
Key identifiers: for different analyses, generally including date, trip, and route.
OTP: classifies on-time performance at each node based on the schedule adherence column (schedAdhMsec).
Excessively Late: a binary field indicating whether a record is more than 10 minutes past the expected time.
First Record Late: indicates if the trip starts five minutes past its scheduled time.
OTP Transition: identifies trips where the OTP changes from one status to another (e.g., on-time to late).

Of the 851,238 data points within this sample week, 25,558 were identified as excessively late — approximately three percent of all records.

Code

pts <- st_as_sf(jun2025_reference_week, coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
  st_transform(32118)

xy <- st_coordinates(pts)

stopifnot(nrow(jun2025_reference_week) == nrow(xy))

jun2025_reference_week <- jun2025_reference_week %>%
  mutate(x = xy[, 1], y = xy[, 2]) %>%
  group_by(date, blockId, routeId, tripId, directionId, dayOfWeek) %>%
  arrange(date_time, .by_group = TRUE) %>%
  mutate(
    directionId = case_when(
      directionId == 1 ~ "NB",
      TRUE ~ "SB"
    ),
    isFirst = row_number() == 1,
    isSecond = row_number() == 2,
    dist_to_prev_m = round(sqrt((x - lag(x))^2 + (y - lag(y))^2), 2),
    time_diff_sec  = as.numeric(difftime(date_time, lag(date_time), units = "secs")),
    m_per_sec      = if_else(is.na(time_diff_sec) | time_diff_sec <= 0,
                             0,
                             round(dist_to_prev_m / time_diff_sec, 2)),
    knots = m_per_sec * 1.94384,
    date_route_trip_direction_vehicle_dow = paste(date, routeId, tripId, directionId, vehicleId, dayOfWeek, sep = " "),
    key_identifier = paste(date, blockId, routeId, tripId, directionId, dayOfWeek, sep = " "),
    dailyTrip = paste(dayOfWeek, routeId, tripId, sep = " "),
    countDailyTrip = n(),
    OTP = case_when(
      schedAdhMsec < -60000 ~ "Early",
      schedAdhMsec > 600000 ~ "Excessively Late",
      schedAdhMsec > 300000 ~ "Late",
      TRUE ~ "Ontime"
    ),
    excessivelyLate = case_when(
      OTP == "Excessively Late" ~ 1,
      TRUE ~ 0
    ),
    timeOfDay = as.integer(hour(date_time)),
    firstRecordLate = if_else(isSecond == 1 & (OTP == "Late" | OTP == "Excessively Late"), "Yes", "No"),
    OTP_prev = lag(OTP),
    OTP_changed = coalesce(OTP != OTP_prev, FALSE),
    OTP_transition = case_when(
      isFirst ~ "Start",
      is.na(OTP_prev) ~ "Start",
      OTP == OTP_prev ~ "No Change",
      TRUE ~ paste0(OTP_prev, " -> ", OTP)
    )
  ) %>%
  ungroup() %>%
  select(-x, -y, -m_per_sec) %>%
  select(dailyTrip, key_identifier, date_route_trip_direction_vehicle_dow, date, time, dateProcessed, timeProcessed, OTP, OTP_transition, excessivelyLate, timeOfDay, schedAdh, dayOfWeek,
         dist_to_prev_m, time_diff_sec, knots, headsign, tripShortName,
         everything(), date_time, processed_date_time)

summary_table <- data.frame(
  Statistic = names(summary(jun2025_reference_week$knots)),
  Value     = as.numeric(summary(jun2025_reference_week$knots)),
  row.names = NULL
)

jun2025_reference_week_sf <- st_as_sf(jun2025_reference_week, coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
  st_transform(32118)

kable(summary_table, digits = 2, caption = "Summary of Knots") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  ) %>%
  column_spec(1:2, width = "150px")

Summary of Knots
Statistic	Value
Min.	0.00
1st Qu.	3.11
Median	10.77
Mean	11.92
3rd Qu.	20.57
Max.	4308.46

Code

jun2025_reference_week %>%
  count(excessivelyLate, sort = TRUE)

# A tibble: 2 × 2
  excessivelyLate      n
            <dbl>  <int>
1               0 825681
2               1  25557

Code

jun2025_reference_week_excessively_lates <- jun2025_reference_week %>%
  filter(OTP == "Excessively Late")

jun2025_reference_week_excessively_lates_sf <- st_as_sf(jun2025_reference_week_excessively_lates, coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
  st_transform(32118)

hull <- concaveman(jun2025_reference_week_excessively_lates_sf)
bbox_hull <- st_bbox(hull)

ggplot() +
  geom_sf(data = boroughs, color = "white") +
  geom_sf(data = hull, color = "#D95F59", fill = NA, size = 3) +
  coord_sf(
    xlim = c(bbox_east_river["xmin"] - 10000, bbox_east_river["xmax"] + 10000),
    ylim = c(bbox_east_river["ymin"] - 10000, bbox_east_river["ymax"] + 10000),
    expand = FALSE
  ) +
  labs(title = "Excessively Late 'Zone' - Concave Polygon") +
  theme_solid() +
  theme(
    plot.title = element_text(size = 20),
    plot.title.position = "plot"
  )

Code

jun2025_reference_week %>%
  count(routeId)

# A tibble: 7 × 2
  routeId      n
  <chr>    <int>
1 AS      121703
2 ER      189153
3 GI        9921
4 RW      152398
5 SB      129553
6 SG      118302
7 SV      130208

2.3.5 OTP by Direction

This chunk computes on-time performance by direction (NB or SB), filtering to excessively late records. Northbound and southbound trips comprise roughly the same share of all trips; however, the higher number of excessively late southbound data points could be indicative of other issues.

Code

jun2025_reference_week_dir <- jun2025_reference_week %>%
  select(dailyTrip, directionId, OTP, routeId) %>%
  filter(OTP == "Excessively Late")

jun2025_reference_week_dir %>%
  count(directionId)

# A tibble: 2 × 2
  directionId     n
  <chr>       <int>
1 NB          12069
2 SB          13488

Code

jun2025_reference_week_dir %>%
  count(directionId, routeId)

# A tibble: 12 × 3
   directionId routeId     n
   <chr>       <chr>   <int>
 1 NB          AS         51
 2 NB          ER       2182
 3 NB          GI         58
 4 NB          RW       8300
 5 NB          SB        671
 6 NB          SG          5
 7 NB          SV        802
 8 SB          AS        472
 9 SB          ER       4569
10 SB          RW       6040
11 SB          SB        696
12 SB          SV       1711

2.3.6 Bar Charts

The bar charts below visualize key engineered features. The first chart shows the distribution of excessively late trips by the time spent excessively late (as a percentage of total data points per trip).

Three key takeaways from the second plot:

Even though weekdays account for the majority of data points, they account for under half of excessively late data points. Weekend service is the only category where excessively late counts surpass non-excessively late counts.
Nearly 80 percent of excessively late data points are attributed to the East River and Rockaway routes.
The PM peak period (approximately 1pm to 10pm) accounts for the overwhelming majority of problematic data points.

Code

excessivelyLatePlotVariables <-
  jun2025_reference_week %>%
  as.data.frame() %>%
  select(excessivelyLate, dayOfWeek, routeId, timeOfDay, blockId, directionId, firstRecordLate) %>%
  gather(variable, value, -excessivelyLate) %>%
  filter(!is.na(value))

jun2025_reference_week_wide <- jun2025_reference_week %>%
  count(key_identifier, OTP) %>%
  pivot_wider(
    names_from = OTP,
    values_from = n,
    values_fill = 0
  ) %>%
  mutate(Total = rowSums(across(-key_identifier)))

jun2025_reference_week_wide

# A tibble: 2,133 × 6
   key_identifier                   Early Ontime  Late `Excessively Late` Total
   <chr>                            <int>  <int> <int>              <int> <dbl>
 1 2025-06-16 11 ER 1220 NB Weekday    72    129     0                  0   201
 2 2025-06-16 11 ER 1224 NB Weekday   191    189     0                  0   380
 3 2025-06-16 11 ER 1257 SB Weekday   260    210    28                  0   498
 4 2025-06-16 11 ER 4104 NB Weekday    86    309     0                  0   395
 5 2025-06-16 11 ER 4109 NB Weekday    24    223     0                  0   247
 6 2025-06-16 11 ER 4113 NB Weekday    52    330     0                  0   382
 7 2025-06-16 11 ER 4116 NB Weekday    19    265    25                  0   309
 8 2025-06-16 11 ER 4119 NB Weekday    85    217     0                  0   302
 9 2025-06-16 11 ER 4123 NB Weekday     0    116   166                 24   306
10 2025-06-16 11 ER 4128 NB Weekday     0     32    30                293   355
# ℹ 2,123 more rows

Code

jun2025_reference_week_percents <- jun2025_reference_week_wide %>%
  mutate(round(across(-c(key_identifier, Total), ~ . / Total * 100), 2)) %>%
  pivot_longer(
    cols = -c(key_identifier, Total),
    names_to = "OTP",
    values_to = "Percent"
  ) %>%
  filter(OTP == "Excessively Late", Percent > 0) %>%
  mutate(Percent_Rounded = round(Percent, 0))

mean_val <- mean(jun2025_reference_week_percents$Percent_Rounded)
median_val <- median(jun2025_reference_week_percents$Percent_Rounded)

ggplot() +
  geom_histogram(data = jun2025_reference_week_percents, color = "white", fill = "#1A9988", linewidth = 0.3, aes(x = Percent_Rounded), binwidth = 10) +
  geom_vline(xintercept = mean_val, color = "#D95F59", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = median_val, color = "#F2C14E", linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean_val, y = 15, label = "Mean", color = "#D95F59", angle = 90, vjust = -0.5) +
  annotate("text", x = median_val, y = 15, label = "Median", color = "#F2C14E", angle = 90, vjust = -0.5) +
  labs(
    title = "Distribution of Excessively Late Trips by Time Spent Excessively Late",
    x = "Percent",
    y = "Count"
  ) +
  theme_excel_new() +
  theme(
    axis.title = element_text(size = 14),
    plot.title = element_text(size = 20),
    plot.subtitle = element_text(size = 16),
    plot.title.position = "plot"
  )

Code

plot_df <- excessivelyLatePlotVariables %>%
  group_by(excessivelyLate, variable, value) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(excessivelyLate, variable) %>%
  mutate(prop = count / sum(count)) %>%
  ungroup()

ggplot(plot_df, aes(x = ifelse(variable == "timeOfDay", as.numeric(value), value),
                    y = prop, fill = as.factor(excessivelyLate))) +
  geom_col(position = "dodge") +
  facet_wrap(~ variable, scales = "free_x") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("0" = "#1A9988", "1" = "#D95F59"),
                    labels = c("Not Excessively Late", "Excessively Late"),
                    name = "") +
  labs(x = NULL, y = "Percent within group") +
  theme_excel_new()

Code

jun2025_reference_week <- jun2025_reference_week %>%
  mutate(day_route = paste(dayOfWeek, routeId, sep = " "))

speed_by_route <- jun2025_reference_week %>%
  group_by(day_route, dayOfWeek, routeId) %>%
  summarise(
    avgKnots = mean(knots, na.rm = TRUE),
    medKnots = median(knots, na.rm = TRUE),
    n = dplyr::n(),
    .groups = "drop"
  )

ggplot() +
  geom_col(data = speed_by_route, aes(x = day_route, y = avgKnots, fill = dayOfWeek)) +
  labs(
    title = "Average Speed by Route and Service Level",
    x = "Service Level",
    y = "Average Knots"
  ) +
  scale_fill_manual(
    values = c(
      "Holiday" = "#F2C14E",
      "Weekday" = "#1A9988",
      "Weekend" = "#D95F59"
    )
  ) +
  theme_excel_new() +
  theme(
    plot.title = element_text(size = 20),
    plot.subtitle = element_text(size = 16),
    plot.title.position = "plot",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

2.3.7 Landings Map

Code

ggplot() +
  geom_sf(data = boroughs, color = "white") +
  geom_sf(data = landings, color = "black", size = 3) +
  labs(title = "NYC Ferry Landings") +
  theme_solid() +
  theme(
    plot.title = element_text(size = 16),
    plot.title.position = "plot"
  )

2.3.8 Fishnet and Kernel Density

Since these datasets contain thousands of points per day, aggregation is the most efficient way to process and observe the data. A fishnet was created using a grid cell side of 250 meters. The figure represents the data for the sample week, excluding any grid cell with a density of 0.

Code

fishnet_swiftly <-
  st_make_grid(water, cellsize = 250) %>%
  st_sf() %>%
  mutate(uniqueID = rownames(.))

jun2025_net <- st_join(
  fishnet_swiftly, jun2025_reference_week_excessively_lates_sf, join = st_intersects) %>%
  group_by(uniqueID) %>%
  summarise(mean_speed = mean(knots, na.rm = TRUE),
            n = n(), .groups = "drop") %>%
  filter(!is.na(mean_speed))

jun2025_ppp <- as.ppp(st_coordinates(jun2025_reference_week_excessively_lates_sf), W = st_bbox(jun2025_net))
jun2025_KD <- density.ppp(jun2025_ppp, sigma = 1000)

as.data.frame(jun2025_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(jun2025_net)) %>%
  aggregate(., jun2025_net, mean) %>%
  ggplot() +
    geom_sf(aes(fill = value), color = NA, linewidth = 0) +
    geom_sf(data = landings, color = "black", size = 1, alpha = 0.5) +
    scale_fill_viridis(name = "Density") +
    labs(title = "Kernel Density of Excessively Late Data Points, 6/16/25 to 6/22/25") +
    theme_solid() +
    theme(
      plot.title = element_text(size = 16),
      plot.title.position = "plot"
    )

2.3.9 Mean Speed

The following figure shows the average speed of each data point in knots per grid cell, regardless of on-time performance designation.

Code

ggplot(jun2025_net) +
  geom_sf(data = boroughs, color = "white") +
  geom_sf(data = landings, color = "black", size = 2, alpha = 0.5) +
  geom_sf(aes(fill = mean_speed), color = NA) +
  coord_sf(
    xlim = c(bbox_hull["xmin"] - 100, bbox_hull["xmax"] + 100),
    ylim = c(bbox_hull["ymin"] - 200, bbox_hull["ymax"] + 200),
    expand = FALSE
  ) +
  scale_fill_viridis_c(option = "D", name = "Knots") +
  labs(title = "Mean Vessel Speed by Cell") +
  theme_solid() +
  theme(
    plot.title = element_text(size = 16),
    plot.title.position = "plot"
  )

2.3.10 Kernel Density Plots — Hourly

To analyze how OTP changes spatially throughout the PM peak, kernel densities of excessively late data points were mapped for each hour between 1pm and 10pm.

Code

make_hourly_kd <- function(hour_val) {
  net <- st_join(fishnet_swiftly, jun2025_reference_week_excessively_lates_sf, join = st_intersects) %>%
    group_by(uniqueID) %>%
    filter(hour == hour_val) %>%
    summarise(mean_speed = mean(knots, na.rm = TRUE), n = n(), .groups = "drop")
  kd <- density.ppp(as.ppp(st_coordinates(net), W = st_bbox(jun2025_net)), sigma = 1000)
  kd
}

kd_list <- list(
  '1PM'  = make_hourly_kd("13"),
  '2PM'  = make_hourly_kd("14"),
  '3PM'  = make_hourly_kd("15"),
  '4PM'  = make_hourly_kd("16"),
  '5PM'  = make_hourly_kd("17"),
  '6PM'  = make_hourly_kd("18"),
  '7PM'  = make_hourly_kd("19"),
  '8PM'  = make_hourly_kd("20"),
  '9PM'  = make_hourly_kd("21"),
  '10PM' = make_hourly_kd("22")
)

all_values <- unlist(lapply(kd_list, \(kd) as.data.frame(kd)$value))
global_limits <- range(all_values, na.rm = TRUE)

plots <- imap(kd_list, ~{
  sf_pts <- st_as_sf(as.data.frame(.x), coords = c("x", "y"), crs = st_crs(jun2025_net))
  agg <- aggregate(sf_pts["value"], jun2025_net, mean, na.rm = TRUE)

  ggplot() +
    geom_sf(data = boroughs, color = "white") +
    geom_sf(data = agg, aes(fill = value), color = NA, linewidth = 0) +
    coord_sf(
      xlim = c(bbox_hull["xmin"] - 100, bbox_hull["xmax"] + 100),
      ylim = c(bbox_hull["ymin"] - 100, bbox_hull["ymax"] + 100),
      expand = FALSE
    ) +
    scale_fill_viridis_c(name = "Kernel Density", limits = global_limits, oob = scales::squish) +
    labs(title = paste("KD -", .y)) +
    theme_solid() +
    theme(
      plot.title = element_text(size = 12),
      plot.title.position = "plot"
    )
})

wrap_plots(plots, ncol = 5) +
  plot_layout(guides = "collect") +
  plot_annotation(
    title = "Excessively Late Across TOD",
    caption = "Unified KD Scale, 250m Grid Cell Size"
  ) &
  theme(
    legend.position = "right",
    legend.box = "horizontal",
    plot.margin = margin(1, 1, 1, 1)
  )

One potentially useful consideration is to observe locations where trips change OTP status. The following figure includes only instances where OTP status changes to either late or excessively late.

Code

jun2025_reference_week_late_sf <- jun2025_reference_week_sf %>%
  filter(OTP_transition %in% c("Early -> Late", "Late -> Excessively Late",
                                "Ontime -> Late", "Ontime -> Excessively Late"))

ggplot() +
  geom_sf(data = boroughs, color = "white") +
  geom_sf(data = landings, color = "black", size = 2, alpha = 0.2) +
  geom_sf(data = jun2025_reference_week_late_sf,
          aes(color = OTP_transition), size = 2, alpha = 0.2) +
  coord_sf(
    xlim = c(bbox_east_river["xmin"] - 10000, bbox_east_river["xmax"] + 10000),
    ylim = c(bbox_east_river["ymin"] - 10000, bbox_east_river["ymax"] + 10000),
    expand = FALSE
  ) +
  labs(title = "OTP Transition Locations") +
  scale_color_manual(
    name = "Transition Type",
    values = c(
      "Early -> Late"               = "#7BA7C7",
      "Ontime -> Late"              = "#1A9988",
      "Ontime -> Excessively Late"  = "#F2C14E",
      "Late -> Excessively Late"    = "#D95F59"
    ),
    labels = c(
      "Early -> Late"               = "Early to Late",
      "Ontime -> Late"              = "Ontime to Late",
      "Ontime -> Excessively Late"  = "Ontime to Excessively Late",
      "Late -> Excessively Late"    = "Late to Excessively Late"
    )
  ) +
  facet_wrap(~OTP_transition) +
  theme_solid() +
  theme(
    plot.title = element_text(size = 20),
    plot.title.position = "plot"
  )

2.3.11 Leaflet Distribution

Code

landings_buffers_100ft <- st_buffer(landings, dist = 30.48)
counts <- lengths(st_intersects(landings_buffers_100ft, jun2025_reference_week_excessively_lates_sf))
landings_buffers_100ft$record_count <- counts

points_plot  <- st_transform(landings, 4326)
buffers_plot <- st_transform(landings_buffers_100ft, 4326)
routes_plot  <- st_transform(routes, 4326)
boroughs_plot <- st_transform(boroughs, 4326)

jun2025_reference_week_elate_sf <- jun2025_reference_week_sf %>%
  filter(OTP == "Excessively Late") %>%
  st_transform(4326)

jun2025_reference_week_not_elate_sf <- jun2025_reference_week_sf %>%
  filter(OTP != "Excessively Late") %>%
  st_transform(4326)

leaflet() %>%
  addProviderTiles("Esri.WorldImagery", group = "Imagery",
                   options = tileOptions(opacity = 0.6)) %>%
  addPolygons(data = boroughs_plot, color = "#828282", weight = 1, fillOpacity = 0.75, group = "Boroughs") %>%
  addPolylines(data = routes_plot, color = "#1A9988", weight = 2, opacity = 0.2, group = "Routes") %>%
  addPolygons(data = buffers_plot, color = "#F2C14E", weight = 10, fillOpacity = 0.75, group = "Landings",
              label = ~paste0("Records: ", record_count)) %>%
  addCircleMarkers(data = points_plot, radius = 3, color = "#F2C14E", opacity = 0.6, group = "Landings") %>%
  addCircleMarkers(data = jun2025_reference_week_elate_sf, radius = 0.5, color = "#D95F59",
    opacity = 0.6, fillOpacity = 0.4, group = "Excessively Late",
    label = ~paste0("Speed: ", round(knots, 2)),
    popup = ~paste0("<b>Speed:</b> ", round(knots, 2), " knots",
                    "<br><b>Route:</b> ", routeId,
                    "<br><b>Trip:</b> ", tripId,
                    "<br><b>Vessel:</b> ", vehicleId,
                    "<br><b>Time:</b> ", time)) %>%
  addCircleMarkers(data = jun2025_reference_week_not_elate_sf, radius = 0.5, color = "#7BA7C7",
    opacity = 0.6, fillOpacity = 0.4, group = "Not Excessively Late",
    label = ~paste0("Speed: ", round(knots, 2)),
    popup = ~paste0("<b>Speed:</b> ", round(knots, 2), " knots",
                    "<br><b>Route:</b> ", routeId,
                    "<br><b>Trip:</b> ", tripId,
                    "<br><b>Vessel:</b> ", vehicleId,
                    "<br><b>Time:</b> ", time)) %>%
  addLayersControl(
    overlayGroups = c("Boroughs", "Routes", "Landings", "Excessively Late", "Not Excessively Late"),
    baseGroups = c("Imagery"),
    options = layersControlOptions(collapsed = FALSE)
  )

This Leaflet map shows the data with an imagery layer for context. For the sake of processing power, the results are not shown by default — remove the eval: false chunk option to render it.

3 Modeling

3.1 Associative Model

By: Yiming ‘Ming’ Cao

3.1.1 Objective and Design

This section presents an associative model for explaining stop-level arrival lateness in the NYC Ferry system. The goal is interpretation rather than advance prediction. Instead of asking only whether a future trip will be delayed, this model asks which historical factors are associated with observed delay patterns.

The unit of analysis is a stop-level observation. The dependent variable is arrival lateness in minutes relative to the original schedule. Predictors are organized into five explanatory domains: structural context, environmental conditions, temporal patterns, operational propagation, and vessel characteristics.

This grouped design allows us to compare how different types of factors contribute to delay variation while keeping the final model interpretable for operational use.

Code

set.seed(7900)

input_parquet <- "../exploratory_code/modeling/associative/baseline_modeling_table_final_direct.parquet"
outcome_cont  <- "lateness_vs_original_schedule_min"

week_window_days <- 3

candidate_predictors <- c(
  "route_id", "direction", "stop_sequence", "prev_stop_sched_hour", "prev_stop_sched_hour_group",
  "schedule_season", "schedule_daytype", "vessel_trip_seq_in_chain", "vessel_capacity",
  "is_terminal_stop", "prev_stop_shared_stop_flag", "curr_stop_shared_stop_flag",
  "planned_turnaround_gap_min", "scheduled_headway_min", "expected_water_min",
  "prev_stop_sequence", "prev_stop_boardings", "prev_stop_alightings", "prev_stop_onboard",
  "prev_stop_load_factor", "env_temperature_c", "env_wind_speed_mps", "env_visibility_m",
  "env_precipitation_mm", "env_battery_water_level_anomaly", "env_narrows_current_residual",
  "env_battery_water_level_obs", "env_narrows_current_obs"
)
prior_delay_predictors <- c("inherited_prev_trip_delay", "carried_within_trip_delay")

3.1.2 Data Scope

The model uses the stop-level baseline modeling table as its input dataset. After loading the data, we confirmed the number of records and the availability of candidate predictors before fitting the associative models.

Code

if (!file.exists(input_parquet)) stop("Missing input parquet: ", input_parquet)
df_raw <- read_parquet(input_parquet) %>% as_tibble()

df <- df_raw %>%
  mutate(
    trip_date = as.Date(substr(as.character(trip_key), 1, 8), format = "%Y%m%d")
  )

predictors_no_prior   <- intersect(candidate_predictors, names(df))
predictors_with_prior <- unique(c(predictors_no_prior, intersect(prior_delay_predictors, names(df))))

tibble(
  input_rows              = nrow(df_raw),
  rows_after_load         = nrow(df),
  predictors_no_prior_n   = length(predictors_no_prior),
  predictors_with_prior_n = length(predictors_with_prior)
) %>% kable(caption = "Data Scope and Predictor Availability")

Data Scope and Predictor Availability
input_rows	rows_after_load	predictors_no_prior_n	predictors_with_prior_n
613071	613071	28	30

3.1.3 Predictor Domains

To support interpretation, predictors in the associative model were organized into five explanatory domains. This grouped structure separates baseline route and stop context from environmental conditions, temporal patterns, operational propagation, and vessel-related characteristics. The goal is to interpret delay association at the domain level rather than focusing only on isolated individual coefficients.

3.1.3.1 Domain A: Structural Context

The structural domain captures baseline differences in how delay varies across the ferry network. These predictors describe route, stop position, and stop-context features that reflect how service is organized spatially and operationally. Variables in this domain include route and direction identifiers, stop sequence, terminal status, and shared-stop context. We interpret this domain as representing the structural baseline of delay rather than short-term changing conditions.

3.1.3.2 Domain B: Environmental Conditions

The environmental domain captures external natural conditions that may affect vessel movement and docking performance. These predictors include weather and water-related variables such as temperature, wind speed, visibility, precipitation, water-level, and current-related measures. This domain is especially important for the client because it helps distinguish external operating conditions from delay that is simply inherited from earlier parts of the trip.

3.1.3.3 Domain C: Temporal Patterns

The temporal domain captures recurring calendar and time-of-day structure in delay. These predictors describe when service occurs rather than where it occurs or what the vessel has already experienced. Variables in this domain include scheduled hour, scheduled hour group, day type, and season. We use this domain to represent regular service rhythms such as weekday versus weekend patterns, seasonal demand differences, and time-of-day effects.

3.1.3.4 Domain D: Operational Propagation

The operational propagation domain captures delay carried forward through the operating chain. These predictors reflect whether lateness has already accumulated earlier in the trip or in a prior linked trip on the same vessel. Variables in this domain include inherited previous-trip delay and carried within-trip delay. This domain is operationally meaningful, but it is interpreted as one component of the associative model rather than the sole basis for explaining delay.

3.1.3.5 Domain E: Vessel Characteristics

The vessel domain captures systematic differences associated with vessel capacity and operating role. These predictors help assess whether some delay patterns are associated with the characteristics of the vessel being used rather than only with route structure or environmental conditions. Variables in this domain include vessel capacity and related vessel-level operational identifiers retained in the final specification.

3.1.4 Data Preparation

Before model estimation, the stop-level modeling table was prepared to support a consistent and interpretable associative model. This section summarizes the key preparation steps: variable construction, missing-value treatment for environmental fields, and helper functions used for model fitting and summary extraction.

3.1.4.1 Data Cleaning and Variable Construction

The input table was converted into a stop-level analytic dataset. We created a trip date field from the trip identifier, checked which candidate predictors were available, and retained variables needed for the outcome and domain-based model structure.

3.1.4.2 Environmental Variable Imputation

Missing environmental values were imputed using a local time-slot rolling rule. The primary fill used observations from the same scheduled hour within a centered seven-day window around the trip date. If no value was available, the method fell back to the same scheduled hour group within the same centered window.

This approach keeps environmental predictors in the model without relying only on complete-case filtering. It also keeps the imputed values tied to nearby dates and comparable operating periods.

Code

build_week_fill_map <- function(day_slot_tbl, group_col, date_col = "trip_date", value_col = "slot_day_median", window_days = 3) {
  if (nrow(day_slot_tbl) == 0) return(tibble())
  split_tbl <- split(day_slot_tbl, day_slot_tbl[[group_col]], drop = TRUE)
  out <- lapply(split_tbl, function(gdf) {
    gdf <- gdf[order(gdf[[date_col]]), , drop = FALSE]
    d <- gdf[[date_col]]
    v <- gdf[[value_col]]
    week_fill <- vapply(seq_along(v), function(i) {
      idx <- abs(as.integer(d - d[i])) <= window_days
      val <- suppressWarnings(median(v[idx], na.rm = TRUE))
      ifelse(is.finite(val), val, NA_real_)
    }, numeric(1))
    gdf$week_fill <- week_fill
    gdf[, c(group_col, date_col, "week_fill"), drop = FALSE]
  })
  bind_rows(out)
}

impute_by_local_slot_week <- function(data, var_name, window_days = 3) {
  x <- data
  miss_before <- sum(is.na(x[[var_name]]))
  if (miss_before == 0) {
    return(list(data = x, summary = tibble(variable = var_name, n_missing_before = 0L, n_filled_primary = 0L, n_filled_fallback = 0L, n_missing_after = 0L)))
  }

  primary_daily <- x %>%
    filter(!is.na(.data[[var_name]]), !is.na(trip_date), !is.na(prev_stop_sched_hour)) %>%
    group_by(prev_stop_sched_hour, trip_date) %>%
    summarise(slot_day_median = median(.data[[var_name]], na.rm = TRUE), .groups = "drop")
  primary_fill <- build_week_fill_map(primary_daily, group_col = "prev_stop_sched_hour", value_col = "slot_day_median", window_days = window_days) %>%
    rename(fill_primary = week_fill)

  fallback_daily <- x %>%
    filter(!is.na(.data[[var_name]]), !is.na(trip_date), !is.na(prev_stop_sched_hour_group)) %>%
    group_by(prev_stop_sched_hour_group, trip_date) %>%
    summarise(slot_day_median = median(.data[[var_name]], na.rm = TRUE), .groups = "drop")
  fallback_fill <- build_week_fill_map(fallback_daily, group_col = "prev_stop_sched_hour_group", value_col = "slot_day_median", window_days = window_days) %>%
    rename(fill_fallback = week_fill)

  x <- x %>%
    left_join(primary_fill, by = c("prev_stop_sched_hour", "trip_date")) %>%
    left_join(fallback_fill, by = c("prev_stop_sched_hour_group", "trip_date"))

  was_missing      <- is.na(x[[var_name]])
  fill_primary_idx  <- was_missing & !is.na(x$fill_primary)
  fill_fallback_idx <- was_missing & is.na(x$fill_primary) & !is.na(x$fill_fallback)

  x[[var_name]][fill_primary_idx]  <- x$fill_primary[fill_primary_idx]
  x[[var_name]][fill_fallback_idx] <- x$fill_fallback[fill_fallback_idx]
  x <- x %>% select(-fill_primary, -fill_fallback)

  list(
    data = x,
    summary = tibble(
      variable          = var_name,
      n_missing_before  = miss_before,
      n_filled_primary  = sum(fill_primary_idx, na.rm = TRUE),
      n_filled_fallback = sum(fill_fallback_idx, na.rm = TRUE),
      n_missing_after   = sum(is.na(x[[var_name]]))
    )
  )
}

impute_vars <- intersect(
  c("env_temperature_c", "env_wind_speed_mps", "env_visibility_m",
    "env_precipitation_mm", "env_narrows_current_residual",
    "env_battery_water_level_obs", "env_narrows_current_obs"),
  names(df)
)

impute_summary_list <- list()
df_imp <- df
for (v in impute_vars) {
  res <- impute_by_local_slot_week(df_imp, v, window_days = week_window_days)
  df_imp <- res$data
  impute_summary_list[[length(impute_summary_list) + 1]] <- res$summary
}
imputation_summary <- bind_rows(impute_summary_list)

Code

imputation_summary %>%
  mutate(
    pct_missing_before = round(n_missing_before / nrow(df) * 100, 2),
    pct_missing_after  = round(n_missing_after  / nrow(df_imp) * 100, 2)
  ) %>%
  kable(caption = "Summary of Environmental Variable Imputation", digits = 2)

Summary of Environmental Variable Imputation
variable	n_missing_before	n_filled_fallback	n_missing_after	pct_missing_before	pct_missing_after
env_temperature_c	482	283	199	0.08	0.03
env_wind_speed_mps	1116	917	199	0.18	0.03
env_visibility_m	567	368	199	0.09	0.03
env_precipitation_mm	373	174	199	0.06	0.03
env_narrows_current_residual	52071	48636	3435	8.49	0.56
env_battery_water_level_obs	373	174	199	0.06	0.03
env_narrows_current_obs	52071	48636	3435	8.49	0.56

3.1.4.3 Modeling Utilities

A small set of reusable helper functions was created to prepare model inputs, fit linear models, extract model metrics, and generate standardized coefficient summaries.

Code

prepare_model_df <- function(data, outcome, predictors) {
  data %>% select(all_of(c(outcome, predictors))) %>% drop_na()
}

fit_linear <- function(data, model_name, outcome, predictors) {
  d   <- prepare_model_df(data, outcome, predictors)
  fml <- as.formula(paste(outcome, "~", paste(predictors, collapse = " + ")))
  fit <- lm(fml, data = d)
  pred <- predict(fit, newdata = d)
  rmse <- sqrt(mean((d[[outcome]] - pred)^2))
  g    <- glance(fit)

  list(
    metrics = tibble(
      model_name  = model_name, model_type = "linear",
      n = nrow(d), aic = g$AIC, bic = g$BIC,
      rsq = g$r.squared, adj_rsq = g$adj.r.squared, rmse = rmse,
      pseudo_rsq = NA_real_, accuracy = NA_real_,
      sensitivity = NA_real_, specificity = NA_real_, auc = NA_real_
    ),
    coefs        = tidy(fit) %>% mutate(model_name = model_name, model_type = "linear"),
    formula_text = paste(deparse(fml), collapse = "")
  )
}

standardize_for_lm <- function(data, outcome, predictors) {
  d       <- data %>% select(all_of(c(outcome, predictors))) %>% drop_na()
  num_vars <- names(d)[sapply(d, is.numeric)]
  d %>% mutate(across(all_of(num_vars), ~ as.numeric(scale(.x))))
}

fit_standardized_linear <- function(data, model_name, outcome, predictors) {
  d_std <- standardize_for_lm(data, outcome, predictors)
  fml   <- as.formula(paste(outcome, "~", paste(predictors, collapse = " + ")))
  fit   <- lm(fml, data = d_std)
  broom::tidy(fit) %>%
    filter(term != "(Intercept)") %>%
    mutate(model_name = model_name, abs_estimate = abs(estimate)) %>%
    arrange(desc(abs_estimate))
}

3.1.5 Variable Screening and Final Specification

This section explains how the final associative-model specification was selected from the broader predictor pool. The goal was not to include every available variable, but to keep interpretable coverage across the five explanatory domains while reducing redundancy among closely related predictors.

3.1.5.1 Initial Candidate Variables

The initial predictor pool was intentionally broad so that each explanatory domain was represented. Candidate variables included route and stop structure, environmental conditions, temporal indicators, operational propagation terms, and vessel/load-related characteristics. This broader pool provided the starting point for a smaller, more interpretable final specification.

Code

domain_a_predictors <- intersect(c(
  "route_id", "direction", "stop_sequence", "is_terminal_stop",
  "prev_stop_shared_stop_flag", "curr_stop_shared_stop_flag"
), names(df_imp))

domain_b_predictors <- intersect(c(
  "env_temperature_c", "env_wind_speed_mps", "env_visibility_m",
  "env_precipitation_mm", "env_narrows_current_residual",
  "env_battery_water_level_obs", "env_narrows_current_obs"
), names(df_imp))

domain_c_predictors <- intersect(c(
  "prev_stop_sched_hour", "schedule_season", "schedule_daytype"
), names(df_imp))

domain_d_predictors <- intersect(c(
  "inherited_prev_trip_delay", "carried_within_trip_delay"
), names(df_imp))

domain_e_predictors <- intersect(c(
  "vessel_capacity", "vessel_trip_seq_in_chain",
  "prev_stop_boardings", "prev_stop_alightings",
  "prev_stop_onboard", "prev_stop_load_factor"
), names(df_imp))

final_predictors <- unique(c(
  domain_a_predictors, domain_b_predictors,
  domain_c_predictors, domain_d_predictors, domain_e_predictors
))

3.1.5.2 Redundancy and Multicollinearity Checks

Because several candidate predictors measured closely related aspects of stop position, ridership, and operating context, we reviewed pairwise correlations among numeric predictors before fitting the final associative model. The goal was not to eliminate all correlation, but to identify clusters of variables that conveyed largely overlapping information and would make coefficient interpretation less clear.

Code

numeric_candidate_predictors <- intersect(
  candidate_predictors,
  names(df_imp)[sapply(df_imp, is.numeric)]
)

corr_vars   <- df_imp %>%
  select(all_of(numeric_candidate_predictors)) %>%
  select(where(~ !all(is.na(.))))

corr_matrix <- cor(corr_vars, use = "pairwise.complete.obs")

The figure below highlights selected variables reviewed during specification design. These variables are related to stop position and previous-stop load conditions, so they were checked carefully before finalizing the model.

Code

focus_vars <- c("stop_sequence", "prev_stop_sequence", "prev_stop_onboard", "prev_stop_load_factor")
focus_corr <- corr_matrix[focus_vars, focus_vars]

as.data.frame(as.table(focus_corr)) %>%
  rename(var1 = Var1, var2 = Var2, corr = Freq) %>%
  mutate(label = ifelse(var1 == var2, "", sprintf("%.2f", corr))) %>%
  ggplot(aes(x = var1, y = var2, fill = corr)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = label), size = 4) +
  scale_fill_gradient2(low = "#1A9988", mid = "white", high = "#EB5600", midpoint = 0, limits = c(-1, 1)) +
  coord_fixed() +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(size = 13, face = "bold", hjust = 0.5)
  ) +
  labs(title = "Correlation Among Selected Candidate Predictors", x = NULL, y = NULL, fill = "Correlation")

To support specification decisions more systematically, we also identified highly correlated numeric predictor pairs using an absolute correlation threshold of 0.7.

Code

as.data.frame(as.table(corr_matrix)) %>%
  rename(var1 = Var1, var2 = Var2, corr = Freq) %>%
  filter(var1 != var2) %>%
  mutate(abs_corr = abs(corr)) %>%
  filter(abs_corr >= 0.7) %>%
  rowwise() %>%
  mutate(pair_id = paste(sort(c(var1, var2)), collapse = " || ")) %>%
  ungroup() %>%
  distinct(pair_id, .keep_all = TRUE) %>%
  arrange(desc(abs_corr)) %>%
  select(var1, var2, corr, abs_corr) %>%
  head(10) %>%
  kable(caption = "Highly correlated candidate predictor pairs (|r| >= 0.7)")

Highly correlated candidate predictor pairs (|r| >= 0.7)
var1	var2	corr	abs_corr
prev_stop_sequence	stop_sequence	1.0000000	1.0000000
prev_stop_load_factor	prev_stop_onboard	0.8965757	0.8965757

3.1.5.3 Final Variable Set

The final variable set was selected to preserve coverage across the five explanatory domains while avoiding unnecessary duplication. When multiple variables represented similar information, we prioritized predictors that were easier to interpret operationally and more directly connected to the domain being represented.

After reviewing correlation patterns, we removed redundant candidate variables and retained a reduced final specification for the combined associative model.

Code

final_predictors_trimmed <- setdiff(
  final_predictors,
  c("prev_stop_sequence", "prev_stop_sched_hour_group")
)

tibble(
  model        = "Final combined associative model",
  n_predictors = length(final_predictors_trimmed)
) %>% kable(caption = "Predictor count in the final combined associative specification")

Predictor count in the final combined associative specification
model	n_predictors
Final combined associative model	24

Code

fit_vif_model <- function(data, outcome, predictors) {
  model_df <- data %>% select(all_of(unique(c(outcome, predictors)))) %>% drop_na()
  lm(as.formula(paste(outcome, "~", paste(predictors, collapse = " + "))), data = model_df)
}

extract_vif_table <- function(model, model_name) {
  vif_vals <- car::vif(model)
  if (is.matrix(vif_vals)) {
    out <- as.data.frame(vif_vals)
    out$term <- rownames(out)
    rownames(out) <- NULL
    out$model <- model_name
    out %>% rename(gvif = GVIF) %>%
      mutate(adjusted_vif = gvif^(1 / (2 * Df))) %>%
      select(model, term, Df, gvif, adjusted_vif)
  } else {
    tibble(model = model_name, term = names(vif_vals), vif = as.numeric(vif_vals)) %>%
      arrange(desc(vif))
  }
}

vif_model_final <- fit_vif_model(df_imp, outcome_cont, final_predictors_trimmed)
vif_table_final <- extract_vif_table(vif_model_final, "Final combined associative model")

We then checked variance inflation factors for the final combined specification. For multi-level categorical predictors, we report the adjusted GVIF, which is more comparable across terms with different degrees of freedom.

Code

if ("adjusted_vif" %in% names(vif_table_final)) {
  vif_table_final %>%
    arrange(desc(adjusted_vif)) %>%
    head(10) %>%
    mutate(gvif = round(gvif, 2), adjusted_vif = round(adjusted_vif, 2)) %>%
    kable(caption = "Top VIF results for the final combined associative specification")
} else {
  vif_table_final %>%
    arrange(desc(vif)) %>%
    head(10) %>%
    mutate(vif = round(vif, 2)) %>%
    kable(caption = "Top VIF results for the final combined associative specification")
}

Top VIF results for the final combined associative specification
model	term	Df	gvif	adjusted_vif
Final combined associative model	prev_stop_onboard	1	11.55	3.40
Final combined associative model	prev_stop_load_factor	1	11.30	3.36
Final combined associative model	vessel_capacity	1	4.37	2.09
Final combined associative model	env_temperature_c	1	3.73	1.93
Final combined associative model	stop_sequence	1	2.29	1.51
Final combined associative model	prev_stop_boardings	1	2.22	1.49
Final combined associative model	is_terminal_stop	1	1.77	1.33
Final combined associative model	prev_stop_alightings	1	1.54	1.24
Final combined associative model	env_narrows_current_residual	1	1.54	1.24
Final combined associative model	env_narrows_current_obs	1	1.51	1.23

3.1.6 Estimation Strategy

The associative model was estimated in two steps. First, we fit separate domain-specific linear models to summarize how much stop-level lateness was associated with each explanatory domain on its own. Second, we fit a combined associative model that included predictors from all five domains.

To make model comparisons more consistent, all models were estimated on the same analytic sample used for the final combined specification.

3.1.6.1 Domain-Specific Models

We estimated separate linear models for the five explanatory domains: structural context, environmental conditions, temporal patterns, operational propagation, and vessel/load characteristics. Each model used the same dependent variable: stop-level arrival lateness in minutes.

These models were not treated as competing final models. Instead, they were used to compare the relative explanatory signal of each domain when considered separately.

Code

df_model <- df_imp %>%
  drop_na(all_of(c(outcome_cont, final_predictors_trimmed)))

model_results <- list(
  domain_a_structural   = fit_linear(df_model, "domain_a_structural",   outcome_cont, domain_a_predictors),
  domain_b_environment  = fit_linear(df_model, "domain_b_environment",  outcome_cont, domain_b_predictors),
  domain_c_temporal     = fit_linear(df_model, "domain_c_temporal",     outcome_cont, domain_c_predictors),
  domain_d_operational  = fit_linear(df_model, "domain_d_operational",  outcome_cont, domain_d_predictors),
  domain_e_vessel       = fit_linear(df_model, "domain_e_vessel",       outcome_cont, domain_e_predictors),
  combined_associative  = fit_linear(df_model, "combined_associative",  outcome_cont, final_predictors_trimmed)
)

3.1.6.2 Combined Associative Model

The combined associative model uses the final predictor set across all five domains. This model serves as the main explanatory specification. It allows us to interpret individual coefficients while also discussing how different types of factors contribute to observed delay patterns within one framework.

3.1.6.3 Comparison Logic

Model comparison focused on interpretation rather than predictive model selection. We compared adjusted R-squared and RMSE across the domain-specific models and the combined model. The domain-specific models show how much delay variation is associated with each factor group on its own, while the combined model shows the explanatory power of the full specification.

3.1.7 Results: Explanatory Contribution by Domain

This section compares how much stop-level delay variation is explained by each explanatory domain. The goal is not to choose a predictive model, but to understand which types of factors are most strongly associated with observed lateness.

The domain-specific models show a clear hierarchy. Operational propagation explains the largest share of delay variation, followed by structural context. Environmental conditions, temporal patterns, and vessel/load characteristics explain smaller shares on their own, but they remain useful for interpretation because they describe monitorable operating conditions.

Code

model_comparison_summary <- bind_rows(lapply(model_results, `[[`, "metrics")) %>%
  mutate(
    domain = dplyr::recode(
      model_name,
      combined_associative  = "Combined Associative Model",
      domain_a_structural   = "Domain A: Structural Context",
      domain_b_environment  = "Domain B: Environmental Conditions",
      domain_c_temporal     = "Domain C: Temporal Patterns",
      domain_d_operational  = "Domain D: Operational Propagation",
      domain_e_vessel       = "Domain E: Vessel / Load Characteristics"
    )
  ) %>%
  arrange(match(domain, c(
    "Domain A: Structural Context", "Domain B: Environmental Conditions",
    "Domain C: Temporal Patterns",  "Domain D: Operational Propagation",
    "Domain E: Vessel / Load Characteristics", "Combined Associative Model"
  )))

model_comparison_summary %>%
  select(domain, n, rsq, adj_rsq, rmse, aic, bic) %>%
  mutate(across(c(rsq, adj_rsq, rmse), ~ round(.x, 4)),
         across(c(aic, bic), ~ round(.x, 0))) %>%
  kable(caption = "Explained Variation by Domain and Final Combined Associative Model")

Explained Variation by Domain and Final Combined Associative Model
domain	n	rsq	adj_rsq	rmse	aic	bic
Domain A: Structural Context	592720	0.1342	0.1342	3.6334	3211498	3211656
Domain B: Environmental Conditions	592720	0.0169	0.0169	3.8718	3286834	3286936
Domain C: Temporal Patterns	592720	0.0129	0.0129	3.8796	3289224	3289314
Domain D: Operational Propagation	592720	0.5865	0.5865	2.5109	2773416	2773461
Domain E: Vessel / Load Characteristics	592720	0.0783	0.0783	3.7489	3248611	3248701
Combined Associative Model	592720	0.7413	0.7412	1.9863	2495659	2496054

Code

model_comparison_summary %>%
  filter(domain != "Combined Associative Model") %>%
  mutate(domain = factor(domain, levels = domain), adj_rsq = round(adj_rsq, 4)) %>%
  ggplot(aes(x = domain, y = adj_rsq)) +
  geom_col(width = 0.65) +
  geom_text(aes(label = round(adj_rsq, 3)), vjust = -0.3, size = 3.5) +
  labs(title = "Explanatory Contribution by Domain", x = NULL, y = "Adjusted R-squared") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1), plot.title = element_text(face = "bold"))

3.1.7.1 Structural Context

The structural-context model explains a meaningful share of stop-level delay variation on its own. This indicates that route and stop configuration establish an important baseline pattern of delay across the system. In substantive terms, some portions of the ferry network appear structurally more delay-prone than others even before environmental or operational carry-forward effects are considered.

3.1.7.2 Environmental Conditions

The environmental model explains a smaller share of total variation on its own. This does not mean environmental signals are unimportant. Instead, it suggests that weather and water conditions are secondary to propagated delay and structural context, while still offering useful monitorable signals for operational awareness.

3.1.7.3 Temporal Patterns

The temporal model explains a modest share of variation, indicating that day type, scheduled hour, and season capture recurring service rhythms but do not dominate the delay process on their own. These results suggest that delay varies systematically across time, but those temporal patterns are weaker than operational propagation and structural context.

3.1.7.4 Operational Propagation

The operational-propagation model explains the largest share of variation among the individual domains. This confirms that delay accumulation within and across trips is the strongest single explanatory component in the associative model. In other words, stop-level delay is highly path-dependent: once lateness has entered the operating chain, a substantial portion of subsequent delay can be understood as propagated rather than newly generated at each stop.

3.1.7.5 Vessel / Load Characteristics

The vessel/load model explains the smallest share of variation among the five domains. This suggests that vessel capacity and load-related differences are present but relatively limited when considered in isolation. These factors are still useful as operational context, but they are secondary to route structure, external conditions, and propagated delay in explaining stop-level lateness.

3.1.8 Focused Interpretation of Key Predictors

The previous section compared explanatory contribution across domains. This section focuses on selected predictors used for coefficient interpretation and final presentation findings. These results are intended to make the model findings easier to interpret operationally, not to replace the full combined associative model.

3.1.8.1 Focused Interpretive Specification

In addition to the full combined associative model, we estimated a smaller interpretive specification using predictors selected for coefficient interpretation. This focused model includes prior delay, route context, seasonal and day-type patterns, time bins, selected environmental conditions, and stop sequence.

This model is used to support coefficient interpretation in the final presentation. It should be read as a simplified interpretive view of the larger associative framework.

Code

selected_predictors <- intersect(c(
  "inherited_prev_trip_delay", "route_id", "schedule_season",
  "prev_stop_sched_hour_group", "schedule_daytype",
  "env_temperature_c", "env_wind_speed_mps", "env_precipitation_mm",
  "env_battery_water_level_obs", "stop_sequence"
), names(df_model))

selected_model_df <- prepare_model_df(df_model, outcome_cont, selected_predictors)

selected_formula <- as.formula(paste(
  outcome_cont, "~",
  paste(selected_predictors, collapse = " + "),
  "+ prev_stop_sched_hour_group:schedule_daytype"
))

selected_fit <- lm(selected_formula, data = selected_model_df)

broom::glance(selected_fit) %>%
  transmute(
    model   = "Focused interpretive specification",
    n       = nobs(selected_fit),
    rsq     = round(r.squared, 4),
    adj_rsq = round(adj.r.squared, 4),
    rmse    = round(sqrt(mean(residuals(selected_fit)^2)), 4),
    aic     = round(AIC(selected_fit), 0),
    bic     = round(BIC(selected_fit), 0)
  ) %>%
  kable(caption = "Focused Interpretive Specification Fit")

Focused Interpretive Specification Fit
model	n	rsq	adj_rsq	rmse	aic	bic
Focused interpretive specification	592720	0.4112	0.4112	2.9963	2983010	2983326

3.1.8.2 Raw Coefficient Interpretation

To support operational interpretation, we report raw coefficients from the focused interpretive specification. These coefficients show the estimated change in arrival lateness, in minutes, associated with each predictor while holding other variables in the model constant.

For numeric predictors, the coefficient is interpreted per one-unit increase. For categorical predictors, coefficients are interpreted relative to the reference category.

Code

broom::tidy(selected_fit) %>%
  filter(term != "(Intercept)") %>%
  mutate(abs_estimate = abs(estimate)) %>%
  arrange(desc(abs_estimate)) %>%
  select(term, estimate, std.error, statistic, p.value) %>%
  mutate(across(c(estimate, std.error), ~ round(.x, 4)),
         statistic = round(statistic, 2),
         p.value   = round(p.value, 4)) %>%
  kable(caption = "Raw Coefficients for Focused Interpretive Specification")

Raw Coefficients for Focused Interpretive Specification
term	estimate	std.error	statistic	p.value
route_idRW	3.2060	0.0197	162.75	0.0000
route_idGI	1.9491	0.0762	25.58	0.0000
route_idRR	-1.5515	0.2317	-6.70	0.0000
route_idSG	0.9711	0.0167	58.17	0.0000
inherited_prev_trip_delay	0.8122	0.0014	561.44	0.0000
route_idSV	0.8064	0.0132	61.07	0.0000
prev_stop_sched_hour_group5_8:schedule_daytypeWEEKEND	-0.5679	0.0451	-12.59	0.0000
stop_sequence	0.5202	0.0025	208.73	0.0000
prev_stop_sched_hour_group15_17	0.4389	0.0155	28.26	0.0000
prev_stop_sched_hour_group5_8	0.4243	0.0156	27.17	0.0000
prev_stop_sched_hour_group9_11	-0.2685	0.0156	-17.24	0.0000
route_idER	-0.2424	0.0112	-21.65	0.0000
schedule_seasonSUMMER_1	-0.2001	0.0156	-12.83	0.0000
schedule_seasonSUMMER_2	0.1921	0.0166	11.54	0.0000
prev_stop_sched_hour_group18_22	-0.1628	0.0152	-10.71	0.0000
env_battery_water_level_obs	-0.1483	0.0080	-18.61	0.0000
prev_stop_sched_hour_group18_22:schedule_daytypeWEEKEND	-0.1227	0.0252	-4.88	0.0000
prev_stop_sched_hour_group9_11:schedule_daytypeWEEKEND	0.0883	0.0265	3.34	0.0009
schedule_seasonSPRING	-0.0846	0.0152	-5.56	0.0000
prev_stop_sched_hour_group15_17:schedule_daytypeWEEKEND	0.0525	0.0255	2.06	0.0397
env_temperature_c	0.0284	0.0008	36.36	0.0000
route_idSB	-0.0190	0.0127	-1.50	0.1334
env_wind_speed_mps	0.0180	0.0017	10.65	0.0000
env_precipitation_mm	-0.0157	0.0023	-6.83	0.0000
schedule_daytypeWEEKEND	-0.0123	0.0184	-0.67	0.5026
schedule_seasonWINTER	-0.0016	0.0169	-0.10	0.9231

3.1.8.3 Supporting Exploratory Patterns

The following exploratory plots provide visual context for the coefficient interpretation. They show how delay accumulates, how lateness varies across time bins and day type, and how selected environmental and route-level patterns relate to average lateness.

Code

tibble::tibble(
  component = factor(
    c("Inherited previous delay", "Carried within-trip delay", "New delay at current stop"),
    levels = c("Inherited previous delay", "Carried within-trip delay", "New delay at current stop")
  ),
  minutes = c(6.24, 1.85, 1.68)
) %>%
  ggplot(aes(x = "Excessive delay example", y = minutes, fill = component)) +
  geom_col(width = 0.45) +
  geom_text(aes(label = paste0(minutes, " min")),
            position = position_stack(vjust = 0.5),
            color = "white", fontface = "bold", size = 4) +
  scale_fill_manual(values = c(
    "Inherited previous delay"   = "#EB5600",
    "Carried within-trip delay"  = "#1A9988",
    "New delay at current stop"  = "#C7C7C7"
  )) +
  labs(title    = "Where Does Excessive Delay Come From?",
       subtitle = "Most excessive delay is inherited or carried forward, not newly created at the current stop",
       x = NULL, y = "Minutes of delay", fill = NULL) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        panel.grid.major.x = element_blank(), panel.grid.minor = element_blank(),
        legend.position = "bottom", plot.title = element_text(face = "bold"))

Code

df_model %>%
  group_by(route_id, stop_sequence) %>%
  summarise(avg_lateness = mean(.data[[outcome_cont]], na.rm = TRUE), n = n(), .groups = "drop") %>%
  filter(n >= 100) %>%
  ggplot(aes(x = stop_sequence, y = avg_lateness, color = route_id)) +
  geom_line(linewidth = 0.9) +
  geom_point(size = 1.8) +
  labs(title    = "Delay Accumulation Along Route",
       subtitle = "Average lateness tends to change as trips move through stop sequence",
       x = "Stop sequence", y = "Average lateness (minutes)", color = "Route") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"), legend.position = "bottom")

Code

lateness_col <- c(outcome_cont, "arr_lateness", "arr_late_min", "arrival_lateness",
                  "delay_min", "lateness")[
  c(outcome_cont, "arr_lateness", "arr_late_min", "arrival_lateness",
    "delay_min", "lateness") %in% names(df_model)][1]

if (is.na(lateness_col)) stop("Could not find the lateness column.")

time_delay_summary <- df_model %>%
  mutate(lateness_for_plot = .data[[lateness_col]]) %>%
  filter(!is.na(prev_stop_sched_hour_group), !is.na(schedule_daytype), !is.na(lateness_for_plot)) %>%
  group_by(prev_stop_sched_hour_group, schedule_daytype) %>%
  summarise(avg_lateness = mean(lateness_for_plot, na.rm = TRUE), n = n(), .groups = "drop") %>%
  mutate(prev_stop_sched_hour_group = factor(
    prev_stop_sched_hour_group, levels = c("5_8", "9_11", "12_14", "15_17", "18_22")
  ))

Code

ggplot(time_delay_summary, aes(x = prev_stop_sched_hour_group, y = avg_lateness, fill = schedule_daytype)) +
  geom_col(position = position_dodge(width = 0.75), width = 0.6) +
  scale_fill_manual(values = c("WEEKDAY" = "#EB5600", "WEEKEND" = "#1A9988")) +
  labs(title = "Average Lateness by Time Bin and Day Type",
       x = "Scheduled time bin", y = "Average lateness (minutes)", fill = "Day type") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"), legend.position = "bottom")

Code

temp_delay_summary <- df_model %>%
  filter(!is.na(env_temperature_c), !is.na(.data[[outcome_cont]])) %>%
  mutate(temp_bin = cut(env_temperature_c,
                        breaks = quantile(env_temperature_c, probs = seq(0, 1, 0.1), na.rm = TRUE),
                        include.lowest = TRUE)) %>%
  group_by(temp_bin) %>%
  summarise(avg_temperature = mean(env_temperature_c, na.rm = TRUE),
            avg_lateness    = mean(.data[[outcome_cont]], na.rm = TRUE),
            n = n(), .groups = "drop")

Code

ggplot(temp_delay_summary, aes(x = avg_temperature, y = avg_lateness)) +
  geom_point(color = "#1A9988", size = 2.5) +
  geom_smooth(method = "lm", se = FALSE, color = "#EB5600") +
  labs(title = "Temperature and Average Lateness",
       x = "Average temperature (°C)", y = "Average lateness (minutes)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Code

if ("expected_water_min" %in% names(df_model)) {
  df_model %>%
    group_by(route_id) %>%
    summarise(avg_delay              = mean(.data[[outcome_cont]], na.rm = TRUE),
              avg_expected_water_min = mean(expected_water_min, na.rm = TRUE),
              n = n(), .groups = "drop") %>%
    ggplot(aes(x = avg_expected_water_min, y = avg_delay, label = route_id)) +
    geom_point(size = 3) +
    geom_text(nudge_y = 0.15) +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = "Route Travel Time vs Average Lateness",
         x = "Average expected water travel time (minutes)\n(proxy for route length / exposure)",
         y = "Average lateness (minutes)") +
    theme_minimal()
}

3.1.8.4 Monitorable Signals Beyond Prior Delay

Although operational propagation explains the largest share of delay variation, smaller signals are still useful when they are monitorable. Environmental conditions, time patterns, and vessel/load context may not dominate the model, but they describe conditions that operators can check before or during service.

Code

domain_e_plot_predictors <- intersect(c(
  domain_e_predictors, "prev_stop_boardings", "prev_stop_alightings",
  "prev_stop_onboard", "prev_stop_load_factor"
), names(df_model))

bce_predictors       <- unique(c(domain_b_predictors, domain_c_predictors, domain_e_plot_predictors))
plot_model_predictors <- unique(c(domain_a_predictors, domain_b_predictors, domain_c_predictors,
                                   domain_d_predictors, domain_e_plot_predictors))
numeric_predictors   <- plot_model_predictors[sapply(df_model[plot_model_predictors], is.numeric)]

df_model_std <- df_model %>%
  mutate(across(all_of(numeric_predictors), ~ as.numeric(scale(.x))))

combined_std_fit <- lm(
  as.formula(paste(outcome_cont, "~", paste(plot_model_predictors, collapse = " + "))),
  data = df_model_std
)

label_lookup <- c(
  env_temperature_c              = "Temperature",
  env_wind_speed_mps             = "Wind speed",
  env_visibility_m               = "Visibility",
  env_precipitation_mm           = "Precipitation",
  env_battery_water_level_obs    = "Water level",
  env_battery_water_level_anomaly = "Water-level anomaly",
  env_narrows_current_obs        = "Current",
  env_narrows_current_residual   = "Current residual",
  prev_stop_sched_hour           = "Previous scheduled hour",
  prev_stop_sched_hour_group     = "Previous scheduled hour group",
  schedule_season                = "Schedule season",
  schedule_daytype               = "Schedule day type",
  vessel_capacity                = "Vessel capacity",
  vessel_trip_seq_in_chain       = "Vessel trip sequence",
  prev_stop_boardings            = "Boardings",
  prev_stop_alightings           = "Alightings",
  prev_stop_onboard              = "Onboard count",
  prev_stop_load_factor          = "Load factor"
)

get_base_var <- function(term, vars) {
  hits <- vars[vapply(vars, function(v) startsWith(term, v), logical(1))]
  if (length(hits) == 0) return(NA_character_)
  hits[which.max(nchar(hits))]
}

coef_plot_df <- broom::tidy(combined_std_fit) %>%
  filter(term != "(Intercept)") %>%
  rowwise() %>%
  mutate(variable = get_base_var(term, bce_predictors)) %>%
  ungroup() %>%
  filter(!is.na(variable)) %>%
  mutate(
    domain = case_when(
      variable %in% domain_b_predictors       ~ "Environmental Conditions",
      variable %in% domain_c_predictors       ~ "Temporal Patterns",
      variable %in% domain_e_plot_predictors  ~ "Demand and Vessel Context",
      TRUE ~ "Other"
    )
  ) %>%
  filter(domain != "Other",
         !str_detect(term, "^schedule_season"),
         !variable %in% c("prev_stop_load_factor", "prev_stop_onboard", "vessel_trip_seq_in_chain")) %>%
  mutate(
    clean_label = if_else(
      variable %in% names(label_lookup),
      unname(label_lookup[variable]),
      variable
    ),
    clean_label = case_when(
      variable == "schedule_daytype"           ~ str_replace(term, variable, "Schedule day type "),
      variable == "prev_stop_sched_hour_group" ~ str_replace(term, variable, "Previous scheduled hour "),
      TRUE ~ clean_label
    ),
    abs_estimate = abs(estimate)
  )

coef_plot_top <- bind_rows(
  coef_plot_df %>% filter(domain == "Environmental Conditions") %>% slice_max(abs_estimate, n = 4, with_ties = FALSE),
  coef_plot_df %>% filter(domain == "Temporal Patterns")        %>% slice_max(abs_estimate, n = 3, with_ties = FALSE),
  coef_plot_df %>% filter(domain == "Demand and Vessel Context") %>% slice_max(abs_estimate, n = 4, with_ties = FALSE)
) %>%
  distinct(term, .keep_all = TRUE) %>%
  arrange(abs_estimate)

ggplot(coef_plot_top, aes(x = reorder(clean_label, abs_estimate), y = abs_estimate, fill = domain)) +
  geom_col(width = 0.72) +
  coord_flip() +
  scale_fill_manual(values = c(
    "Environmental Conditions"  = "#1A9988",
    "Temporal Patterns"         = "#7A7A7A",
    "Demand and Vessel Context" = "#EB5600"
  )) +
  labs(title = "Monitorable Signals Beyond Prior Delay",
       x = NULL, y = "Absolute standardized coefficient", fill = NULL) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom", panel.grid.major.y = element_blank(),
        plot.title = element_text(face = "bold"))

3.1.8.5 Operational Interpretation

The focused coefficient results and supporting plots reinforce three operational takeaways.

First, prior delay remains the strongest signal. This means delay management needs to account for accumulation across stops and linked trips, not only conditions at the current stop.

Second, route and stop structure matter. Some parts of the system appear to have a less favorable baseline, meaning they may be more vulnerable to delay even before short-run conditions are considered.

Third, weather, water, time, and vessel/load patterns are smaller signals, but they are monitorable. They can help operators understand when conditions may be less favorable and where extra attention may be useful.

3.2 Predictive Model

By: Guangze ‘Simon’ Sun

3.2.1 Feature Table Construction

The prediction feature table consolidates schedule structure, stop context, lagged demand, vessel traffic (AIS), and environmental conditions into a single modeling dataset. Each row represents one stop-level trip observation. The table below lists the required predictors; the build chunk is set to eval: false since it writes output files and depends on the full source parquet.

Code

source_parquet <- "../exploratory_code/modeling/baseline_prediction_table_stop_level_v2.parquet"
out_csv     <- "../exploratory_code/modeling/final_predictive_feature_table.csv"
out_parquet <- "../exploratory_code/modeling/final_predictive_feature_table.parquet"

Code

feature_table <- arrow::read_parquet(source_parquet) %>% as_tibble()

required_predictors <- c(
  "alightings_lag_7d", "boardings_lag_7d", "curr_stop_shared_stop_flag",
  "curr_vessel_count_lag_7d", "direction", "expected_water_min",
  "is_terminal_stop", "planned_turnaround_gap_min", "prev_stop_sched_hour",
  "prev_stop_shared_stop_flag", "prev_vessel_count_lag_7d", "schedule_daytype",
  "schedule_season", "scheduled_headway_min", "stop_sequence", "vessel_capacity",
  "vessel_trip_seq_in_chain", "env_battery_water_level_obs", "env_narrows_current_obs",
  "env_precipitation_mm", "env_temperature_c", "env_visibility_m",
  "env_wind_speed_mps", "from_stop_id", "to_stop_id", "segment_hourly_vessel_count_all"
)

required_columns <- c("trip_date", "lateness_vs_original_schedule_min", required_predictors)
missing_columns  <- setdiff(required_columns, names(feature_table))
if (length(missing_columns) > 0) stop("Missing required columns: ", paste(missing_columns, collapse = ", "))
stopifnot(nrow(feature_table) > 0)
stopifnot(anyDuplicated(names(feature_table)) == 0)
stopifnot("segment_hourly_vessel_count_all" %in% names(feature_table))
stopifnot("prev_stop_sched_hour" %in% names(feature_table))

readr::write_csv(feature_table, out_csv)
arrow::write_parquet(feature_table, out_parquet)
stopifnot(file.exists(out_csv), file.exists(out_parquet))

3.2.2 Baseline Model-Family Comparison

We used a chronological train / validation / test split. The most recent 30 days were held out as the test set, the previous 30 days were used as validation, and earlier observations were used for training. This matches the operational setting where future trips are predicted from historical data.

The baseline screen compared five model families: Logistic Regression, SVM, DNN, Random Forest, and XGBoost. SVM used a capped 20,000-row stratified training sample, so it should be interpreted as lightweight screening evidence. The baseline screen supported carrying forward Random Forest and XGBoost for focused finalist comparison.

Code

model_dir <- "../exploratory_code/modeling"

baseline_file   <- file.path(model_dir, "baseline_model_family_comparison.csv")
prediction_file <- file.path(model_dir, "final_rf_xgb_test_predictions.csv")
importance_file <- file.path(model_dir, "final_rf_top10_feature_importance.csv")

threshold_source_note <- "Test predictions were used for the recall >= 0.20 operating-point table because validation prediction probabilities were not available in the final delivery package."
rf_xgb_highlight_note <- "Yellow shading marks recall ranges where RF has higher precision than XGBoost and at least one model has precision >= 0.10."

Code

if (!file.exists(baseline_file)) {
  warning("Missing baseline comparison file: ", baseline_file)
} else {
  baseline_tbl <- read_csv(baseline_file, show_col_types = FALSE) %>%
    mutate(model_family = fct_reorder(model_family, validation_roc_auc))

  ggplot(baseline_tbl, aes(x = validation_roc_auc, y = model_family)) +
    geom_col(fill = "#1A9988", width = 0.68) +
    scale_x_continuous(labels = percent_format(accuracy = 1), limits = c(0, NA)) +
    labs(title = "Baseline Model-Family Screen", x = "Validation ROC-AUC", y = NULL) +
    theme_minimal(base_size = 15) +
    theme(plot.title = element_text(face = "bold", size = 18), panel.grid.minor = element_blank())
}

Validation ROC-AUC from the baseline screen. SVM used a capped training sample and is included as lightweight screening evidence.

3.2.3 Final RF vs XGBoost Comparison

After the baseline screen, we focused on Random Forest and XGBoost. Both model families can capture nonlinear relationships and interactions among schedule structure, stop context, lagged demand, vessel traffic, and environmental conditions.

The final tuning process used trees = 1500 as the practical central setting. Around that tree count, RF tuning mainly adjusted mtry and min_n. XGBoost tuning adjusted tree depth, min_n, learning rate, loss reduction, and feature sampling. The final comparison retained one RF operational candidate and one XGBoost benchmark candidate.

XGBoost remains a strong predictive benchmark, but the dashboard use case emphasizes conservative alerting rather than global ranking alone. In the higher-precision region, Random Forest provides a more useful recall balance, so we select RF as the operational model and retain XGBoost as the benchmark.

Code

if (!file.exists(prediction_file)) {
  warning("Missing final RF/XGB prediction file: ", prediction_file)
} else {
  pred_tbl <- read_csv(prediction_file, show_col_types = FALSE) %>%
    mutate(delay_gt_10 = factor(delay_gt_10, levels = c("no", "yes")))

  pr_tbl <- bind_rows(
    pred_tbl %>% transmute(model = "Final RF",  delay_gt_10, .pred_yes = rf_pred_yes)  %>%
      group_by(model) %>% pr_curve(truth = delay_gt_10, .pred_yes, event_level = "second"),
    pred_tbl %>% transmute(model = "Final XGB", delay_gt_10, .pred_yes = xgb_pred_yes) %>%
      group_by(model) %>% pr_curve(truth = delay_gt_10, .pred_yes, event_level = "second")
  )

  prep_curve <- function(dat, model_name) {
    dat %>%
      filter(model == model_name, is.finite(recall), is.finite(precision)) %>%
      group_by(recall) %>%
      summarise(precision = max(precision, na.rm = TRUE), .groups = "drop") %>%
      arrange(recall)
  }

  interp_precision <- function(curve, recall_grid) {
    approx(curve$recall, curve$precision, xout = recall_grid, rule = 2, ties = "ordered")$y
  }

  rf_curve  <- prep_curve(pr_tbl, "Final RF")
  xgb_curve <- prep_curve(pr_tbl, "Final XGB")
  max_shared_recall <- min(max(rf_curve$recall, na.rm = TRUE), max(xgb_curve$recall, na.rm = TRUE))
  recall_grid <- seq(0, max_shared_recall, length.out = 600)

  highlight_grid <- tibble(
    recall        = recall_grid,
    rf_precision  = interp_precision(rf_curve,  recall_grid),
    xgb_precision = interp_precision(xgb_curve, recall_grid)
  ) %>%
    mutate(
      high_precision_region = pmax(rf_precision, xgb_precision, na.rm = TRUE) >= 0.10,
      rf_above_xgb = rf_precision > xgb_precision,
      highlight    = rf_above_xgb & high_precision_region
    )

  if (!any(highlight_grid$highlight, na.rm = TRUE)) {
    highlight_grid <- highlight_grid %>% mutate(highlight = rf_above_xgb)
    rf_xgb_highlight_note <- "No RF-above-XGB intervals met the precision >= 0.10 rule, so the plot highlights all recall ranges where RF precision is above XGBoost."
  }

  highlight_intervals <- highlight_grid %>%
    mutate(run_id = cumsum(highlight != lag(highlight, default = first(highlight)))) %>%
    filter(highlight) %>%
    group_by(run_id) %>%
    summarise(xmin = min(recall), xmax = max(recall), .groups = "drop") %>%
    filter(xmax > xmin)

  ggplot(pr_tbl, aes(x = recall, y = precision, color = model)) +
    geom_rect(data = highlight_intervals,
              aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = 1),
              inherit.aes = FALSE, fill = "#F2E8CF", alpha = 0.55) +
    geom_path(linewidth = 1.15) +
    scale_color_manual(values = c("Final RF" = "#D95F59", "Final XGB" = "#1A9988")) +
    scale_x_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
    scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, NA)) +
    labs(title = "Final RF vs XGBoost Precision-Recall Curve",
         x = "Recall", y = "Precision", color = NULL) +
    theme_minimal(base_size = 15) +
    theme(plot.title = element_text(face = "bold", size = 18),
          panel.grid.minor = element_blank(), legend.position = "bottom")
}

Yellow shading marks recall ranges where RF has higher precision than XGBoost.

3.2.4 Final RF Feature Importance

We inspected feature importance from the final Random Forest model to understand what types of advance-available information the operational model relies on. This importance is model-based and descriptive; it is not a causal estimate.

The top features combine environmental / water-condition variables and recurring operational patterns.

Code

if (!file.exists(importance_file)) {
  warning("Missing RF feature importance file: ", importance_file)
} else {
  importance_tbl <- read_csv(importance_file, show_col_types = FALSE) %>%
    arrange(desc(importance)) %>%
    slice_head(n = 10) %>%
    mutate(
      display_label = dplyr::recode(
        feature,
        segment_hourly_vessel_count_all  = "Segment vessel count, 7-day lag",
        boardings_lag_7d                 = "Boardings, 7-day lag",
        alightings_lag_7d                = "Alightings, 7-day lag",
        curr_vessel_count_lag_7d         = "Current-stop vessel count, 7-day lag",
        prev_vessel_count_lag_7d         = "Previous-stop vessel count, 7-day lag",
        env_temperature_c                = "Temperature",
        env_battery_water_level_obs      = "Battery water level",
        env_narrows_current_obs          = "Narrows current",
        env_precipitation_mm             = "Precipitation",
        env_visibility_m                 = "Visibility",
        env_wind_speed_mps               = "Wind speed",
        expected_water_min               = "Expected water time",
        planned_turnaround_gap_min       = "Planned turnaround gap",
        prev_stop_sched_hour             = "Scheduled hour",
        .default = str_to_sentence(str_replace_all(feature, "_", " "))
      )
    )

  ggplot(importance_tbl, aes(x = importance, y = fct_reorder(display_label, importance))) +
    geom_col(fill = "#1A9988", width = 0.72) +
    labs(title = "Final RF Feature Importance", x = "Importance", y = NULL) +
    theme_minimal(base_size = 15) +
    theme(plot.title = element_text(face = "bold", size = 18), panel.grid.minor = element_blank())
}

The final RF relies on a mix of environmental conditions, water-current context, and recurring operational demand/vessel-pattern features.

4 Concluding Remarks

4.1 Takeaways

The analysis combined two complementary modeling approaches to understand and anticipate service delays in the NYC Ferry system. A stacked linear regression model — trained on stop-level observations across five feature domains — explained 74% of arrival lateness variation. The dominant driver was operational propagation: inherited delay from a prior trip alone accounted for 59% of explanatory power, with each minute of prior delay carrying forward 0.81 minutes, and each additional stop in sequence adding 0.52 minutes. Structural route context (notably the Rockaway route at +3.2 minutes above reference), temporal patterns (afternoon 15–17 windows, summer service periods), and environmental conditions (temperature, wind, water level) contributed smaller but monitorable signals.

The predictive component used a Random Forest classifier — selected over logistic regression, SVM, deep neural network, and XGBoost based on precision-oriented evaluation — to assign each trip an excessive-delay probability (defined as >10 minutes) up to 48 hours before departure, using only features available at that planning horizon. Same-day realized outcomes were excluded in favor of 7-day lag ridership and vessel counts alongside forecast-based environmental conditions. Predicted probabilities were mapped to low, medium, and high risk bands to support operator decision-making around vessel allocation and landing staffing.

Key limitations include class imbalance in training data, a training window restricted to January 2024 through June 2025, and dependency on the pre-December 2025 route structure.

4.2 Future Work

A major area for future work is operationalizing the data pipeline. In its current form, the project processes data from large CSVs using R scripts, which constrained both collaboration and the volume of data that could be ingested — limiting the training window to January 2024 through June 2025. Moving data into a database accessible via an API would enable a proper machine learning pipeline, allowing for faster model iteration and likely improved performance.

This infrastructure improvement would also directly benefit the web application. Currently, the app relies on a static precomputation of all possible Random Forest outputs, a workaround made necessary by treating temperature and precipitation as discrete bins rather than continuous values. It also substitutes historical lag variables from 2024–2025 in place of true 7-day rolling counts. A live data pipeline would resolve both limitations and, critically, allow new trip data to be incorporated into model training on an ongoing basis — particularly important given the December 2025 system redesign, which rendered the existing training data partially obsolete.

Finally, there is meaningful untapped signal in the captain notes recorded in the trip data. Exploratory analysis indicated that coast guard restrictions and landing closures measurably affect delays, but no structured ground-truth labels currently exist for these events. Combining natural language analysis of captain notes with operational variables such as marine traffic density could unlock a richer understanding of delay causes that the current modeling framework is unable to capture.

5 Acknowledgements

Thank you to Rebekah Adams of the NYC Economic Development Corporation — our client representative and MUSA alum — for helping put this project together and hosting us at the EDC headquarters. Her support was instrumental in shaping the project’s direction. Thank you as well to the entire NYC Ferry team for providing invaluable operational insights into the system. We all came away with a much deeper understanding of how the ferry network functions day to day.

Finally, thank you to Professor Michael Fichman and Matt Harris of the MUSA Program for guiding us through the practicum course. Their candid feedback helped us iterate on this project throughout the semester, culminating in a final presentation we felt proud of.

6 Code Appendix

Through this project process, we wrote a lot of code that is not meaningful toward our final results and biggest takeaways. This code lives in various markdown files in our Github Repo.