The London Tube as a Graph

Representing a rail network as a graph is nothing new; it's the most obvious way to do it. Nodes are the stations and edges are the connections between them. But what happens when you apply algorithms like PageRank to the graph? Will it pick out the stations a human would intuitively identify as important? Let's find out.

Imports first. I'll use NetworkX as the graph library; it seems to be the easiest and most full-featured option around.

In [1]:
%matplotlib inline

import colorsys
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from collections import Counter
from bokeh.plotting import figure, show
from bokeh.resources import CDN
from bokeh.io import output_notebook
output_notebook(resources=CDN)

pd.set_option('max_colwidth', 200)
BokehJS successfully loaded.

I found it surprisingly hard to find a nicely structured dataset of stations and the connections between them. Luckily this library had some CSVs buried in it with just what I was looking for, so I yanked them out and opened them in pandas.

In [2]:
lines       = pd.read_csv('london.lines.csv', index_col=0)
stations    = pd.read_csv('london.stations.csv', index_col=0)
connections = pd.read_csv('london.connections.csv')

Look at the nice data! Some of it is probably out of date, but not in any major way.

In [3]:
lines.head(3)
Out[3]:
                         name  colour stripe
line
1               Bakerloo Line  AE6017    NaN
3                 Circle Line  FFE02B    NaN
6     Hammersmith & City Line  F491A8    NaN
In [4]:
stations.head(3)
Out[4]:
    latitude  longitude          name       display_name  zone  total_lines  rail
id
1    51.5028    -0.2801    Acton Town    Acton<br />Town     3            2     0
2    51.5143    -0.0755       Aldgate                NaN     1            2     0
3    51.5154    -0.0726  Aldgate East  Aldgate<br />East     1            2     0
In [5]:
# drop the display_name column; it's ugly and we won't need it
stations.drop('display_name', axis=1, inplace=True)
In [6]:
connections.head(3)
Out[6]:
   station1  station2  line  time
0        11       163     1     1
1        11       212     1     2
2        49        87     1     1

We can create a naive graph super easily from this.

A simplified graph

In [7]:
graph = nx.Graph()

# one node per station, one edge per connection, weighted by travel time
for connection_id, connection in connections.iterrows():
    station1_name = stations.loc[connection['station1'], 'name']
    station2_name = stations.loc[connection['station2'], 'name']
    graph.add_edge(station1_name, station2_name, time=connection['time'])
    
#add the connection between Bank and Monument manually
graph.add_edge('Bank', 'Monument', time = 1)
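
Before doing anything clever, it's worth a quick sanity check that the graph is roughly the right size (the exact numbers depend on which snapshot of the CSVs you use):

# rough sanity check: we expect on the order of 300 stations
# and a few hundred connections between them
print(graph.number_of_nodes(), 'stations')
print(graph.number_of_edges(), 'connections')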

Already we can do some interesting things, like getting a reasonable path between Oxford Circus and Canary Wharf:

In [8]:
nx.shortest_path(graph, 'Oxford Circus', 'Canary Wharf', weight='time')
Out[8]:
['Oxford Circus',
 'Tottenham Court Road',
 'Holborn',
 'Chancery Lane',
 "St. Paul's",
 'Bank',
 'Shadwell',
 'Limehouse',
 'Westferry',
 'West India Quay',
 'Canary Wharf']
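
If we only care about how long that route takes rather than which stations it passes through, the same query with nx.shortest_path_length returns the sum of the time weights along the path (roughly minutes, or whatever unit the dataset uses):

# total weighted cost of the Oxford Circus -> Canary Wharf route,
# i.e. the sum of the 'time' attributes along the shortest path
nx.shortest_path_length(graph, 'Oxford Circus', 'Canary Wharf', weight='time')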

And run PageRank on the network!

In [9]:
# pagerank_numpy was removed in NetworkX 3.0; plain nx.pagerank gives near-identical scores
pagerank = nx.pagerank_numpy(graph)
pagerank = pd.DataFrame(list(pagerank.items()), columns=['name', 'pagerank'])
# attach the scores to the stations table, keeping the station id index
stations = stations.reset_index().merge(pagerank, on='name').set_index('id')
In [10]:
stations.sort_values('pagerank', ascending=False).head(10)
Out[10]:
     latitude  longitude                      name  zone  total_lines  rail  pagerank
id
145   51.5308    -0.1238  King's Cross St. Pancras   1.0            6     1  0.007915
11    51.5226    -0.1571              Baker Street   1.0            5     0  0.007613
13    51.5133    -0.0886                      Bank   1.0            4     0  0.007140
74    51.4920    -0.1973              Earl's Court   1.5            2     0  0.007047
193   51.5154    -0.1755                Paddington   1.0            4     1  0.006178
265   51.4951    -0.2547             Turnham Green   2.5            2     0  0.006108
279   51.5036    -0.1143                  Waterloo   1.0            4     1  0.006082
107   51.5067    -0.1428                Green Park   1.0            3     0  0.005852
225   51.5117    -0.0560                  Shadwell   2.0            2     0  0.005845
192   51.5150    -0.1415             Oxford Circus   1.0            3     0  0.005757

Those results look incredibly good considering how little work we've put in! With the exception of Turnham Green, this seems like a very reasonable list of the most important tube stations in London. And this is all without taking into account the Overground network, or looking at any traffic stats!
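
Turnham Green is the one surprise. My guess, and it is only a guess without traffic data, is that it does well because several District line branches (plus the Piccadilly line) pass through it, so it sits next to a lot of reasonably well-connected stations. Listing its neighbours in the graph is an easy way to eyeball this:

# stations directly connected to Turnham Green in our graph
sorted(graph.neighbors('Turnham Green'))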

NetworkX also implements the HITS algorithm. It was originally designed to differentiate between web pages which act as hubs of information and those which act as authoritative sources of information. It does this by looking at the incoming and outgoing edges of each node. In an undirected graph (like the one we're using), incoming and outgoing edges are the same, but the results I found when I applied it to the Tube graph were still quite interesting!

In [11]:
# hits_scipy was removed in NetworkX 3.0; plain nx.hits works the same way
hits = nx.hits_scipy(graph, max_iter=1000)[0]
hits = pd.DataFrame(list(hits.items()), columns=['name', 'hits'])
stations = stations.reset_index().merge(hits, on='name').set_index('id')
In [12]:
stations.sort_values('hits', ascending=False).head(10)
Out[12]:
     latitude  longitude                  name  zone  total_lines  rail  pagerank      hits
id
192   51.5150    -0.1415         Oxford Circus     1            3     0  0.005757  0.059742
107   51.5067    -0.1428            Green Park     1            3     0  0.005852  0.059523
197   51.5098    -0.1342      Picadilly Circus     1            2     0  0.003862  0.047062
28    51.5142    -0.1494           Bond Street     1            2     0  0.004143  0.042528
285   51.5010    -0.1254           Westminster     1            3     0  0.004058  0.035861
279   51.5036    -0.1143              Waterloo     1            4     1  0.006082  0.033215
259   51.5165    -0.1310  Tottenham Court Road     1            2     0  0.004067  0.031668
151   51.5113    -0.1281      Leicester Square     1            2     0  0.004019  0.031330
11    51.5226    -0.1571          Baker Street     1            5     0  0.007613  0.029949
13    51.5133    -0.0886                  Bank     1            4     0  0.007140  0.028016

Where PageRank finds the important stations, the HITS algorithm seems to be pretty good at finding the busy stations, still without any traffic data! Neat!
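
One aside: because our graph is undirected, the hub and authority scores that HITS produces should coincide, which is why keeping only the first element of the returned tuple above loses nothing. A quick check with the plain nx.hits function (a sketch, not part of the original analysis):

# on an undirected graph the adjacency matrix is symmetric, so hub and
# authority scores converge to (numerically) the same values
hubs, authorities = nx.hits(graph, max_iter=1000)
max(abs(hubs[n] - authorities[n]) for n in graph)  # ~0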

Let's visualise the importance of stations as defined by PageRank. Less important stations will be coloured green, and more important stations will be coloured red.

In [13]:
def pseudocolor(val):
    # map val in [0, 1] to a hue between green (120 degrees) and red (0 degrees)
    h = (1.0 - val) * 120 / 360
    r, g, b = colorsys.hsv_to_rgb(h, 1., 1.)
    return int(r * 255), int(g * 255), int(b * 255)
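
A quick check of the endpoints shows the mapping behaves as advertised:

pseudocolor(0.0)  # (0, 255, 0) -> green, least important
pseudocolor(1.0)  # (255, 0, 0) -> red, most important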
In [14]:
normed = stations[['longitude', 'latitude', 'pagerank']]
normed = normed - normed.min()
normed = normed / normed.max()
locations = dict(zip(stations['name'], normed[['longitude', 'latitude']].values))
pageranks = dict(zip(stations['name'], normed['pagerank'].values))

p = figure(
    x_range=(0.4, 0.7),
    y_range=(0.2, 0.5),
    height=700,
    width=900,
)

# draw an edge for every connection between two stations
for edge in graph.edges():
    p.line(
        x=[locations[pt][0] for pt in edge],
        y=[locations[pt][1] for pt in edge],
    )

# draw every station, sized, coloured and labelled by its PageRank score
for node in graph.nodes():
    x = [locations[node][0]]
    y = [locations[node][1]]
    p.circle(
        x, y,
        radius=0.01 * pageranks[node],
        fill_color=pseudocolor(pageranks[node]),
        line_alpha=0)
    p.text(
        x, y,
        text={'value': node},
        text_font_size=str(min(pageranks[node] * 12, 10)) + "pt",
        text_alpha=pageranks[node],
        text_align='center',
        text_font_style='bold')

show(p)