The Subway

A path through the New York City subway.

New York City has almost 500 subway stations. There is a Guinness World Record for the fastest time to visit each station.

It seems like the fastest path to visit all stations can be found using a graph traversal algorithm. Maybe something like an algorithm for the travelling salesman problem. This post is about my attempt to use graph theory to find a fast path through the subway.

In order to come up with a solution using graph theory you need to be able to represent the New York City Subway as a graph. The subway data is publiclly available on the MTA's website.

Understanding the data

The data is in the GTFS format. GTFS consists of text files. routes.txt is a good place to start.

Routes

routes.txt contains the subway's routes. It looks like this.

The route_id column contains the familiar names for the routes like "the A" or "the 2" or "the Q". The schedule for a route varies depending on the day of the week. The schedule for the A might be different on weekdays than on Saturdays or Sundays. That brings us to services.

Services

Services are represented in calendar.txt. There are weekday services, Saturday services, and Sunday services. Several trains run on a route at a time and the stops they take and the times that they stop vary with the service. That brings us to trips.txt.

Trips

Each record in trips.txt maps to a real train that runs on a route on a given day. Each train has a trip_id and is tied back to a service with a service_id and a route with a route_id. It also tells you which direction the train runs in (e.g. uptown or downtown). To find out where a train stops and when you need to look at stop_times.txt.

Stop Times

stop_times.txt tells you which stations each train stops at and what time each train stops there. You can look up the train with the trip_id. The stop_id tells you which station the train stops at. The record tells you what time the train arrives and what time the train leaves. The records belonging to a trip_id each have a stop_sequence. This tells you the order in which the train visits the stops. To get more information on the actual stops you can look at stops.txt.

Stops

You look up a stop using the stop_id. Each stop has a name and a latitude and longitude. Stops are grouped in threes. There is the parent stop, 127, Times Sq - 42 St, for instance. And there are two child stops: 127N and 127S. These stops are part of the same station but one is a stop for uptown trains, and the other is a stop for downtown trains.

Transfers

Finally, there is transfers.txt which tells you where you can transfer between stops. By transfering to a different stop you can transfer to a different route. Each transfer has a from_stop_id and a to_stop_id. The min_transfer_time is related to how long it takes to get from one stop to another. See the details on transfer_type.

Turning the data into a graph

Alright, now that we understand the GTFS data we can work on turning it into a graph. A graph is represented as a collection of nodes (stops in our case) and edges (trains). We want a collection of stops and we want to be able to look up trains going out of each stop by stop_id. We will have two data structures: a map of stop_ids to stops and a map of stop_ids to edges.

I will be using Python and a GTFS module for Python called transitfeed.

Building the nodes

The stops.txt file maps almost exactly to the data structure we want. We can load the GTFS data with the transitfeed module and create a map of stop ids to stops with code like this.

This code filters out all the child stops. This makes things easier because we do not necessarily want to visit the north and south stop of every station. I think its sufficient to visit each station going in either direction.

In the full source I remove some of the nodes. Some of the nodes are removed because they do not have any edges. Maybe these stops are under construction. The other nodes are removed because they are on Staten Island and I did not want to account for the Staten Island Ferry.

Building the edges

We want to know which stops we can reach from a given stop. We can get this data by combining the stop times data and the transfers data. Two stop times with the same trip_id and consecutive stop_sequences can be transformed into an edge. Similarly a transfer represents an edge from one stop to another. We can write code like this to get the edges.

Code notes: First we get all the trips that are in service on weekdays. I chose weekdays only to limit the data set and I figure the most edges will exist on weekdays. Then we get all the edges from a trip by combining consecutive stop times into a single edge. Then we get all the edges from transfers.

There's an important difference between transfer edges and train edges. You take a transfer by walking from one station to another. That means as soon as you get off the train at a stop you can immediately start walking to the next stop. Whereas when you take a train edge, you have to wait for the train you are taking to arrive.

To account for this difference every Edge object has a depart_at and arrive_at time. This time is an integer that represents seconds after midnight (this makes comparing times easy). Edges created from transfers have a special value in depart_at. We use ANYTIME which is bound to 999999.

When we visit a stop we might like to know which stop we can get to most quickly. To do this we will look up all the edges that have the current stop as the origin. Then we will take all of those edges that depart after the current time (including the transfer edges). Then we will find the edge from that set that has the smallest arrive_at time.

Differences from the travelling salesman problem

You can think of this problem as a graph traversal problem where you want to minimize the total cost to visit each node in the graph. It is something like the travelling salesman problem with a few differences.

First in the travelling salesman problem you start at a node, visit every other node, and return to the node from which you started. Returning to the start node is not a requirement for this problem.

Second, the travelling salesman problem usually involves a complete graph (a graph in which every node is connected to every other node). The NYC subway is not like that. You can't necessarily take a single train from one station to any other.

Third, some stations are at the "end of the line". If you go to one of these stations, the only station you can go to next is the last one you passed through. This makes a difference because many heuristics for solving the travelling salesman problem depend on always being able to go to an unvisited node. In the NYC subway you have to backtrack through stops you have already visited.

Fourth, the weights of the edges change over time. Depending on when you arrive at a stop, you may have to wait a few minutes or a few seconds for the next train. Additionally, trains run more often at 9:00 a.m. than they do at midnight. The wait time at each stop changes the weight of an edge. The weight of an edge is the sum of the time it take for the next train to leave and the time it takes that train to arrive at the next stop.

Running Nearest Neighbor on the Graph

One approach we can take to compute the shortest path through the subway is the Nearest Neighbor Algorithm. The Nearest Neighbor Algorithm is a greedy approach to the travelling salesman problem where at each node you simply take the smallest edge to the next unvisited node. The Nearest Neighbor Algorithm does not necessarily lead to the minimum path through the graph. But, because it is easy enough to understand, it is a good first algorithm to use for this problem.

Remember there are some stops where we will not be able to reach an unvisited stop. In this case we will find the shortest path to an unvisited stop using Dijkstra's Algorithm. Let's look at an implementation of Dijkstra's algorithm and then see how we can use it to implement the modified Nearest Neighbor algorithm.

Running Dijksta's Algorithm

This is a little different from the typical dijkstra code. First we consider only edges that represent trains that leave after we arrive at a stop using get_edges_after_time(). Second we use the time we arrive at a destination as an edge's weight. Instead of computing the shortest path to every node from a given origin, we compute the earliest we can arrive at every node if we leave the given origin at the given time.

Nearest Neighbor

Nearest Neighbor picks a start node and a start time and adds that node to the path. Then it adds nodes to the path until it has visited all the nodes. The next node is selected by choosing edges that leave from the current stop after the arrival time. Then those edges are sorted by the time you arrive at the destination node. The smallest edge is chosen and the destination becomes the current node. If there are no unvisited nodes, we call path_to_nearest_unvisited_stop() which uses Dijkstra's algorithm.

path_to_nearest_unvisited_stop() runs Dijkstra's algorithm on the graph from the current node at the current time. Then it gets all the unvisited stops and finds the one we can get to the earliest. It returns the path from the source to that node. Sometimes you can't find a path if the current time is after midnight. In this case you subtract 24 hours from the current time and run the procedure again. This should work because we are only considering weekday trains. So if you start on a Monday you can assume the schedule is the same on Tuesday.

Results

Using the nearest neighbor algorithm I found that you could visit every stop in the subway in 27 hours and 23 minutes. That's around 6 hours more than the current record. The total time varies quite a bit depending on what time you start and what station you start from. You can do it in 27 hours and 23 minutes if you start at 9:00 a.m. at 14th - Union Sq. You can see the full path at the top of the page.

It takes something like one minute to run nearest_neighbor() on the graph. I did not run nearest_neighbor() on every possible start because it would have taken ~9 hours1. It's possible that you could do significantly better than 27 hours and 23 minutes by starting from a different station.

A different algorithm should be able to find a time at least as good as the current record. I would like to try Nearest Insertion or a branch and bound algorithm. Here is the full source.

Footnotes

Xianny's code to generate the graph in Go is much faster. If performance is a limiting factor to solving this problem maybe switching to another language would help.