Clustering teams from NBA standings

# Data attributes: wins, losses, win percentage, games behind,
# conference record (2 values), division record (2 values),
# home record (2 values), away record (2 values),
# last ten record (2 values), streak (+ for wins, - for losses).
# Source: NBA standings, 11.03.2018

teams = {
    'eastern conference': {
        'teams': ['Toronto', 'Boston', 'Cleveland', 'Indiana', 'Washington',
                  'Philadelphia', 'Miami', 'Milwaukee', 'Detroit', 'Charlotte',
                  'New York', 'Chicago', 'Brooklyn', 'Atlanta', 'Orlando'],
        'team data': [
            [48, 17, 0.738, 0, 31, 8, 8, 3, 28, 5, 20, 12, 9, 1, 7],
            [46, 20, 0.697, 2.5, 29, 13, 10, 3, 23, 11, 23, 9, 6, 4, 2],
            [38, 27, 0.585, 10, 27, 14, 9, 5, 22, 11, 16, 16, 5, 5, -1],
            [38, 28, 0.576, 10.5, 28, 16, 10, 6, 23, 12, 15, 16, 7, 3, 1],
            [38, 29, 0.567, 11, 24, 18, 7, 6, 19, 14, 19, 15, 5, 5, -1],
            [35, 29, 0.547, 12.5, 21, 17, 4, 7, 20, 10, 15, 19, 6, 4, -1],
            [36, 31, 0.537, 13, 25, 18, 9, 5, 20, 13, 16, 18, 6, 4, 2],
            [35, 31, 0.530, 13.5, 22, 21, 5, 9, 21, 14, 14, 17, 3, 7, 1],
            [30, 36, 0.455, 18.5, 19, 26, 7, 7, 21, 14, 9, 22, 3, 7, 1],
            [29, 38, 0.433, 20, 17, 23, 9, 4, 19, 17, 10, 21, 5, 5, 1],
            [24, 42, 0.364, 24.5, 12, 26, 6, 7, 16, 14, 8, 28, 1, 9, -6],
            [22, 43, 0.338, 26, 17, 22, 4, 8, 15, 18, 7, 25, 3, 7, -1],
            [21, 45, 0.318, 27.5, 14, 24, 1, 9, 12, 21, 9, 24, 2, 8, 1],
            [20, 46, 0.303, 28.5, 9, 33, 3, 8, 15, 19, 5, 27, 3, 7, -2],
            [20, 47, 0.299, 29, 12, 28, 4, 9, 13, 18, 7, 29, 2, 8, -4]
        ]
    },
    'western conference': {
        'teams': ['Houston', 'Golden State', 'Portland', 'New Orleans',
                  'Oklahoma City', 'Minnesota', 'San Antonio', 'LA Clippers',
                  'Denver', 'Utah', 'Los Angeles Lakers', 'Sacramento',
                  'Dallas', 'Phoenix', 'Memphis'],
        'team data': [
            [51, 14, 0.785, 0, 31, 8, 8, 3, 25, 6, 26, 8, 9, 1, -1],
            [51, 15, 0.773, 0.5, 28, 12, 8, 2, 26, 7, 25, 8, 8, 2, -1],
            [40, 26, 0.606, 11.5, 25, 15, 7, 6, 22, 11, 18, 15, 9, 1, 9],
            [38, 27, 0.585, 13, 20, 19, 6, 4, 17, 13, 21, 14, 9, 1, -1],
            [39, 29, 0.574, 13.5, 23, 20, 5, 9, 23, 11, 16, 18, 7, 3, 2],
            [38, 29, 0.567, 14, 28, 13, 9, 4, 25, 8, 13, 21, 4, 6, -3],
            [37, 29, 0.561, 14.5, 21, 19, 7, 5, 23, 8, 14, 21, 2, 8, -2],
            [36, 29, 0.554, 15, 22, 19, 11, 3, 20, 14, 16, 15, 7, 3, 2],
            [36, 30, 0.545, 15.5, 23, 21, 6, 6, 25, 10, 11, 20, 6, 4, 1],
            [36, 30, 0.545, 15.5, 23, 16, 6, 8, 21, 11, 15, 19, 8, 2, 5],
            [29, 36, 0.446, 22, 14, 26, 5, 8, 16, 15, 13, 21, 6, 4, -1],
            [21, 45, 0.318, 30.5, 10, 31, 3, 9, 11, 22, 10, 23, 3, 7, 1],
            [21, 45, 0.318, 30.5, 12, 32, 5, 9, 14, 21, 7, 24, 3, 7, 2],
            [19, 49, 0.279, 33.5, 13, 30, 3, 8, 9, 24, 10, 25, 1, 9, -5],
            [18, 48, 0.273, 33.5, 15, 27, 5, 10, 13, 21, 5, 27, 0, 10, -17]
        ]
    }
}

import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

clusters = 3  # best result from the silhouette score, see the commented block below
opacity_step = int(255 / clusters)  # step for indexing a continuous colormap

ss = StandardScaler()
pca = PCA(n_components=2)
kmeans = KMeans(n_clusters=clusters)

conferences = ['eastern', 'western']
pastel_colors = plt.cm.Pastel2

for conf in conferences:
    selected_conference = teams[conf + ' conference']
    team_labels = selected_conference['teams']
    data = selected_conference['team data']

    scaled = ss.fit_transform(data)
    reduced = pca.fit_transform(scaled)
    clustered = kmeans.fit_predict(reduced)

    plt.title('NBA %s Conference 2018' % conf.title())
    # Connect the teams in the standings' sequence
    plt.plot(reduced[:, 0], reduced[:, 1], '-', lw=0.5, color='#777777')
    for i, idx in enumerate(clustered):
        rx, ry = reduced[i, 0], reduced[i, 1]
        # For another color palette, e.g. cw = plt.cm.coolwarm:
        # plt.plot(rx, ry, '.', markersize=12, color=cw(idx * opacity_step))
        plt.plot(rx, ry, '.', markersize=12, color=pastel_colors(idx))
        plt.annotate(team_labels[i], xy=(rx, ry), xytext=(rx, ry + 0.06),
                     ha='center', fontsize=8)

    # Obtain the optimal number of clusters
    # range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    # for n_clusters in range_n_clusters:
    #     kmeans = KMeans(n_clusters=n_clusters)
    #     clustered = kmeans.fit_predict(reduced)
    #     silhouette_avg = silhouette_score(reduced, clustered)
    #     print("%d clusters, silhouette score: %.4f" % (n_clusters, silhouette_avg))

    plt.axis('off')
    plt.tight_layout()
    plt.show()

Figure: Clustering of the teams in the NBA's Eastern Conference, connected according to the table standings (11.03.2018)
Figure: Clustering of the teams in the NBA's Western Conference (2018), connected according to the table standings (11.03.2018)

We obtain the highest silhouette score when the number of clusters is around three, which we chose for both conferences for consistency. The sequence scaling -> dimensionality reduction -> clustering can often be found in practice, and at each step various approaches can be chosen. For instance, we might have used RobustScaler, followed by t-SNE, followed by DBSCAN. The result would again be points with known cluster labels, which allows us to pick distinct colors based on those labels.
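The alternative pipeline mentioned above can be sketched as follows. This is only an illustration on synthetic data standing in for one conference's records; the `perplexity` and `eps` values are assumptions that would need tuning for real standings data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic "team data": 15 teams with 15 attributes, like one conference
data = rng.normal(size=(15, 15))

# Scaling -> dimensionality reduction -> clustering, with swapped-in methods
scaled = RobustScaler().fit_transform(data)            # robust to outlier records
reduced = TSNE(n_components=2, perplexity=5,
               random_state=0).fit_transform(scaled)   # nonlinear 2-D embedding
labels = DBSCAN(eps=3.0, min_samples=2).fit_predict(reduced)

# One cluster label per team; DBSCAN marks noise points with -1
print(labels.shape)  # (15,)
```

The labels can then be mapped to colors exactly as in the KMeans version, e.g. `pastel_colors(idx)` per point.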

A person who checks the standings may not immediately recognize the relationships between the variables, especially when there are many of them. PCA attempts to extract the latent information from the data, using SVD in the background. For instance, teams that have not yet played all of their games may sit lower in the standings. The viewer may not pay much attention to these teams, yet they may have the potential to climb in the ranking once those games are played. PCA can capture such differences and give us some hints about this potential.
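A quick way to see how much of that latent information the two components actually retain is PCA's explained variance ratio. The sketch below uses synthetic data in place of the real team records, so the printed ratios are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
data = rng.normal(size=(15, 15))  # stands in for one conference's records

scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

# Fraction of the original variance captured by each of the two components
print(pca.explained_variance_ratio_)
print(reduced.shape)  # (15, 2)
```

If the two ratios sum to a large fraction, the 2-D scatter plot is a faithful summary of the full 15-attribute records; if not, some structure is lost in the projection.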

By connecting the teams in the order they appear in the standings, we can look for a correspondence between the clustering we observe and the current table. If the points in a cluster lie very close together, so that we see many line crossings, we might expect that things have not yet settled at that position in the table and changes in the rankings are still possible. This is only an intuition, of course, purely enabled by the visualization.