In September 2024, we (Open Innovations) downloaded the live locations from the BODS GTFS-RT feed every minute for 10 days.
Our intention was to use the data to create “real” versions of the timetables that were based on where the buses actually were, rather than where they were supposed to be.
You can read more in two blog posts:
https://open-innovations.org/blog/2025-01-15-tracking-the-buses
One of the largest problems we faced was the number of trip_id values that were missing from the live location data. For a given download of the live feed, sometimes as many as 50% of the individual entities (buses) in the GTFS-RT feed did not contain a trip_id. They also didn’t contain a route_id or start_date.
Even after removing all duplicate data (records sharing the same timestamp, vehicle_id, and coordinates), we found that for a given day, up to 20% of the data was missing a trip_id, along with any other information that could be used to match the bus with the timetable.
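As a rough illustration of how such a figure can be measured, the sketch below de-duplicates records on timestamp, vehicle_id, and coordinates, then reports the share with no trip_id. It assumes each feed entity has already been flattened into a plain dictionary; the function name `missing_trip_share` and the dictionary layout are hypothetical and not taken from our actual pipeline.

```python
def missing_trip_share(records):
    """Return the fraction of de-duplicated records that have no trip_id.

    Each record is assumed to be a dict with the keys timestamp,
    vehicle_id, latitude, longitude and (optionally) trip_id.
    """
    seen = set()
    unique = []
    for rec in records:
        # Two records are treated as duplicates when the timestamp,
        # vehicle and position all match.
        key = (rec["timestamp"], rec["vehicle_id"],
               rec["latitude"], rec["longitude"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    if not unique:
        return 0.0
    # A missing key, None, or an empty string all count as "no trip_id".
    missing = sum(1 for rec in unique if not rec.get("trip_id"))
    return missing / len(unique)
```

Running this once per download and once per day would give figures comparable to the per-download and per-day percentages quoted above.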
We noticed that if a feed entity doesn’t contain a trip_id, then it also doesn’t contain any information about which stop the bus is near (i.e. stop_sequence and current_stop_status). However, each entity does continue to report latitude, longitude, bearing, and vehicle_id.
One method we used to mitigate this issue was to order the data by time and group it by vehicle_id. If we noticed that a trip_id was reported, dropped out, and then reappeared, we were able to fill in the gaps. However, the majority of the missing trip_id values were at the start or end of routes, and so we weren’t able to assign these entities a trip_id with confidence.
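The gap-filling idea can be sketched roughly as follows. This is a minimal illustration rather than our exact code: it assumes records are plain dictionaries with vehicle_id, timestamp, and trip_id keys, and it only fills a gap when the same trip_id appears on both sides of it (our reading of "dropped out, and then reappeared"). The function name `fill_trip_gaps` and the `trip_id_inferred` flag are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def fill_trip_gaps(records):
    """Fill missing trip_id values that sit between two sightings of the
    same trip_id for the same vehicle. Returns new dicts; input is untouched."""
    ordered = sorted(records, key=itemgetter("vehicle_id", "timestamp"))
    filled = []
    for _, group in groupby(ordered, key=itemgetter("vehicle_id")):
        rows = [dict(r) for r in group]
        n = len(rows)

        # Forward pass: most recent trip_id reported at or before each row.
        prev_known = [None] * n
        last = None
        for i, row in enumerate(rows):
            if row.get("trip_id"):
                last = row["trip_id"]
            prev_known[i] = last

        # Backward pass: next trip_id reported at or after each row.
        next_known = [None] * n
        nxt = None
        for i in range(n - 1, -1, -1):
            if rows[i].get("trip_id"):
                nxt = rows[i]["trip_id"]
            next_known[i] = nxt

        for i, row in enumerate(rows):
            bounded = prev_known[i] is not None and prev_known[i] == next_known[i]
            if not row.get("trip_id") and bounded:
                row["trip_id"] = prev_known[i]
                row["trip_id_inferred"] = True  # mark filled values for auditing
        filled.extend(rows)
    return filled
```

Gaps at the start or end of a vehicle's records, or gaps that sit between two different trip_id values, are deliberately left empty, which mirrors the limitation described above.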
We have GTFS-RT data covering 15th-23rd September 2024, both raw and ‘cleaned’. We’re happy to share this and our methods to figure out how we can improve the data quality and ensure every feed entity has the necessary information.