Checking nearby stations

I noticed that the data about the bike stations in Barcelona has an interesting property: for each station, the IDs of 2-4 nearby stations are given (key: "nearbyStations"). This would allow us to draw connections between close nodes, but since the previous visualization already contained many points, the extra edges would quickly overwhelm it, so I left them out.

The question, however, is whether we should blindly trust such data. We need a way to verify that these IDs are correct; otherwise we would be plotting the wrong connections.

Here is a small excerpt from the original data, which I reshaped to include only the station ID, street name and number, latitude, longitude and "nearbyStations". Since the last field contains commas, I had to use a semicolon as the separator to keep the fields unambiguous.

    1;Gran Via Corts Catalanes 760;41.397952;2.180042;24, 369, 387, 426
    2;Roger de Flor/ Gran Vía 126;41.39553;2.17706;360, 368, 387, 414
    3;Ali Bei 44;41.393699;2.181137;4, 6, 119, 419
    4;Ribes 13;41.39347;2.18149;3, 5, 359, 419
    5;Pg Lluís Companys 11;41.391075;2.180223;6, 7, 359, 418
    6;Pg Lluís Companys 18;41.391349;2.18061;5, 8, 359, 419
    7;Pg Lluís Companys 1;41.388856;2.183251;8, 118, 389
    8;Pg Lluís Companys 2;41.389088;2.183568;6, 118, 389
    9;Marquès de l'Argentera 17;41.385031;2.185249;14, 115, 389
    10;Carrer Comerç 27;41.38498;2.18417;9, 14, 115, 389
    11;Passeig Marítim 19;41.381689;2.193914;116, 124, 125, 396
    12;Pg Marítim Barceloneta 23;41.384538;2.195679;11, 13, 116, 396
    13;Avinguda Litoral 16;41.386861;2.195761;11, 12, 46, 69
    14;Avinguda del Marques Argentera 19;41.384825;2.185074;9, 115, 389
    15;Girona 74;41.39515;2.17076;23, 25, 362, 413
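As a side note, rows in this format can also be read with Python's standard csv module. The following is only a minimal sketch: it parses two rows copied from the excerpt instead of opening the file, and the function name parse_stations and the dictionary layout are my own illustrative choices, not part of the original data or code.

```python
import csv
import io

# Two rows copied from the excerpt; a real run would open the file instead.
sample = (
    "1;Gran Via Corts Catalanes 760;41.397952;2.180042;24, 369, 387, 426\n"
    "7;Pg Lluís Companys 1;41.388856;2.183251;8, 118, 389\n"
)

def parse_stations(handle):
    """Return {station_id: record} from semicolon-separated rows.

    Hypothetical helper: the record keys are chosen for illustration only.
    """
    stations = {}
    for row in csv.reader(handle, delimiter=';'):
        if not row:  # skip blank lines
            continue
        station_id, street, lat, lon, nearby = row
        stations[int(station_id)] = {
            'street': street,
            'lat': float(lat),
            'lon': float(lon),
            # Variable-length list, so no padding value is needed
            'nearby': [int(n) for n in nearby.split(',')],
        }
    return stations

stations = parse_stations(io.StringIO(sample))
print(stations[7]['nearby'])
```

Keying the records by station ID avoids having to assume that the IDs are consecutive, and keeping "nearbyStations" as a list of its natural length handles stations that list only three neighbors.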

Looking at the excerpt, how easy is it to tell whether station 1 really has stations 24, 369, 387 and 426 as its nearest neighbors? The following code checks this visually:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import NearestNeighbors

    # Read the semicolon-separated file, skipping blank lines
    rows = []
    with open('transformed_barcelona_stations.csv', 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(line.split(';'))
    rows = np.array(rows)

    neighbors_to_find = 4
    stations_to_plot = 5

    X = rows[:, 2:4].astype(float)  # latitudes and longitudes only

    # Map station ID -> row index instead of assuming that IDs run from 1
    # without gaps (if they do, this is the same as subtracting 1)
    id_to_row = {int(station_id): row for row, station_id in enumerate(rows[:, 0])}

    # Claimed neighbors as row indices; a station may list fewer than four
    claimed_neighbors = []
    for item in rows[:, 4]:
        ids = [int(s) for s in item.split(',')]
        claimed_neighbors.append([id_to_row[i] for i in ids if i in id_to_row])

    # Ask for one extra neighbor, since each point is returned as its own nearest neighbor
    neigh = NearestNeighbors(n_neighbors=neighbors_to_find + 1)
    neigh.fit(X)
    _, indices = neigh.kneighbors(X)
    found_neighbors = indices[:, 1:]  # drop the index of the point itself

    # Compare the first five stations visually: claimed (left) vs. found (right)
    fig, ax = plt.subplots(stations_to_plot, 2, figsize=(8, 8))
    for i in range(stations_to_plot):
        station_id = rows[i, 0]
        ax[i, 0].set_title('Claimed neighbors to station ' + station_id)
        ax[i, 1].set_title('Found neighbors to station ' + station_id)
        for k, neighbor_rows in enumerate((claimed_neighbors[i], found_neighbors[i])):
            # Plot all stations in grey
            ax[i, k].plot(X[:, 0], X[:, 1], '.', markersize=2, color='#AAAAAA')
            # Mark the station in red, its neighbors in blue
            ax[i, k].plot(X[i, 0], X[i, 1], '.', markersize=5, color='red')
            for idx in neighbor_rows:
                ax[i, k].plot(X[idx, 0], X[idx, 1], '.', markersize=5, color='blue')
                ax[i, k].plot([X[i, 0], X[idx, 0]], [X[i, 1], X[idx, 1]], '-', lw=1, color='black')
            ax[i, k].axis('off')

    plt.tight_layout()
    plt.show()

This produces the following figure:

Comparison of claimed and found neighbors for the first five bike stations in Barcelona

Considering how densely the bike stations are packed in these small regions (see the grey points), it is strange that many edges on the left pass right by stations that are visibly closer to the starting point. A list of nearby stations should contain the ones at minimal distance. The neighbors found by scikit-learn (on the right) are much closer, which is an improvement over the original data. We checked only the first five stations out of many, but that is enough to remind us never to trust our data blindly.
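The visual check covers only five stations, so a numeric check over all of them is a natural next step. The following is a rough sketch rather than the method used for the figure: it uses the great-circle (haversine) distance instead of the plain Euclidean distance on degrees that scikit-learn applies by default, and it searches by brute force. The function names, the dictionary layout and the three toy stations are all hypothetical and only serve to make the example self-contained.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_ids(stations, station_id, k):
    """IDs of the k stations closest to station_id (brute force)."""
    lat, lon = stations[station_id]['lat'], stations[station_id]['lon']
    others = [
        (haversine_km(lat, lon, s['lat'], s['lon']), other_id)
        for other_id, s in stations.items()
        if other_id != station_id
    ]
    return [other_id for _, other_id in sorted(others)[:k]]

def agreement(stations):
    """Per station: fraction of claimed neighbors that are among the nearest ones."""
    scores = {}
    for station_id, s in stations.items():
        # Ignore claimed IDs that do not exist in the data
        claimed = [n for n in s['nearby'] if n in stations]
        if not claimed:
            continue
        found = set(nearest_ids(stations, station_id, len(claimed)))
        scores[station_id] = len(found & set(claimed)) / len(claimed)
    return scores

# Hypothetical toy data (not taken from the Barcelona file):
# station 3 claims station 1 as its neighbor, although station 2 is closer.
toy = {
    1: {'lat': 41.3900, 'lon': 2.1800, 'nearby': [2]},
    2: {'lat': 41.3901, 'lon': 2.1801, 'nearby': [1]},
    3: {'lat': 41.4000, 'lon': 2.1900, 'nearby': [1]},
}

scores = agreement(toy)
print(scores)
```

A score of 1.0 means that every claimed neighbor is also among the nearest stations, and 0.0 means that none is. Averaging the scores over the full dataset would give a single number for how far the claimed neighbors can be trusted.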