You probably remember that I wrote about the FIFA dataset available on Kaggle a while ago. This time I wanted to look at it again from another perspective: finding features that are correlated, in other words features whose values tend to vary together, either in the same direction (positive correlation) or in opposite directions (negative correlation).
For this, I needed to slightly modify the original dataset. Some feature values weren't given as exact numbers but as a value with a tolerance, written like val+tol or val-tol. Whenever I saw such an entry in a column, I replaced it with the computed value so that the data could be used as plain numbers.
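A minimal sketch of how such a cleanup might look is shown here. The helper name `resolve_tolerance`, the regular expression and the sample values are my own assumptions for illustration; they are not the code used for the original preprocessing.

```python
import re

import pandas as pd

# Hypothetical pattern for entries such as "70+2" or "65-3".
_TOL_PATTERN = re.compile(r'^\s*(\d+)\s*([+-])\s*(\d+)\s*$')


def resolve_tolerance(value):
    """Return the computed number for 'val+tol' / 'val-tol' strings."""
    if isinstance(value, str):
        match = _TOL_PATTERN.match(value)
        if match:
            base, sign, tol = match.groups()
            result = int(base) + int(tol) if sign == '+' else int(base) - int(tol)
            return float(result)
    # Plain numbers and numeric strings pass through; anything else becomes NaN.
    try:
        return float(value)
    except (TypeError, ValueError):
        return float('nan')


# Made-up sample values, not taken from the real dataset.
sample = pd.Series(['70+2', '65-3', '80', 77, None])
cleaned = sample.map(resolve_tolerance)
print(cleaned.tolist())
```

Values that cannot be interpreted end up as NaN, so a later `dropna()` removes them instead of breaking the correlation step.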
Then I wrote the following code to extract the strongest correlations and present them on a heatmap.
```python
from itertools import combinations

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler


def plot_shift(vals):
    """Shift tick positions by 0.5 so labels sit at the center of each cell."""
    return [val + 0.5 for val in vals]


df = pd.read_csv('FIFA complete dataset.csv', low_memory=False)
cols = (
    'Acceleration,Aggression,Agility,Balance,Ball control,Composure,Crossing,'
    'Curve,Dribbling,Finishing,Free kick accuracy,GK diving,GK handling,'
    'GK kicking,GK positioning,GK reflexes,Heading accuracy,Interceptions,'
    'Jumping,Long passing,Long shots,Marking,Penalties,Positioning,Reactions,'
    'Short passing,Shot power,Sliding tackle,Sprint speed,Stamina,'
    'Standing tackle,Strength,Vision,Volleys'
).split(',')

# Drop incomplete rows, then standardize every feature.
res = df[cols].dropna()
ss = StandardScaler()
scaled = ss.fit_transform(res)
df = pd.DataFrame(scaled, columns=cols)

# Compute the correlation of every feature pair once and mirror it in the matrix.
seaborn_df = pd.DataFrame(index=cols, columns=cols)
correlations = []
for feature1, feature2 in combinations(cols, 2):
    corr_val = df[[feature1, feature2]].corr().loc[feature1, feature2]
    corr_type = 'Positive' if corr_val > 0 else 'Negative'
    correlations.append((abs(corr_val), '-'.join([feature1, feature2]), corr_type))
    seaborn_df.loc[feature1, feature2] = seaborn_df.loc[feature2, feature1] = corr_val
correlations.sort(reverse=True)

# The diagonal was never set: each feature correlates perfectly with itself.
# Cast to float because the frame was created with object dtype.
seaborn_df = seaborn_df.fillna(1).astype(float)

# Print the 50 strongest relationships as an HTML table.
table = ['<table><tr><th>Correlation value</th><th>Relationship</th><th>Correlation type</th></tr>']
for corr_tuple in correlations[:50]:
    table.append('<tr><td>%.4f</td><td>%s</td><td>%s</td></tr>' % corr_tuple)
table.append('</table>')
print(''.join(table))

# Draw the full correlation matrix as a heatmap.
fontsize = 8
plot = sns.heatmap(seaborn_df, linewidths=0.01)
plot.set_title('Pairwise correlations among features in the FIFA dataset')
plot.set_xticks(plot_shift(range(len(cols))))
plot.set_yticks(plot_shift(range(len(cols))))
plot.set_xticklabels(cols, rotation=90, fontsize=fontsize)
plot.set_yticklabels(cols, rotation=0, fontsize=fontsize)
plt.tight_layout()
plt.show()
```
| Correlation value | Relationship | Correlation type |
|---|---|---|
| 0.9727 | GK diving-GK reflexes | Positive |
| 0.9694 | GK diving-GK handling | Positive |
| 0.9693 | GK handling-GK reflexes | Positive |
| 0.9691 | GK positioning-GK reflexes | Positive |
| 0.9688 | GK diving-GK positioning | Positive |
| 0.9686 | GK handling-GK positioning | Positive |
| 0.9684 | Sliding tackle-Standing tackle | Positive |
| 0.9652 | GK kicking-GK reflexes | Positive |
| 0.9643 | GK diving-GK kicking | Positive |
| 0.9639 | GK handling-GK kicking | Positive |
| 0.9629 | GK kicking-GK positioning | Positive |
| 0.9052 | Ball control-Short passing | Positive |
| 0.8951 | Long passing-Short passing | Positive |
| 0.8574 | Curve-Free kick accuracy | Positive |
| 0.8283 | Ball control-Shot power | Positive |
Although the table ranks the pairs by absolute correlation value, the strongest relationships all turn out to be positive. We also see that the features related to goalkeepers are strongly correlated with one another. This means that a goalkeeper possessing one of these qualities is very likely to possess the others as well, at least according to the data we were given. Further, we see relationships that we would expect intuitively: that tackling skills are related, whether sliding or standing; that ball control is related to good dribbling; that acceleration is related to sprint speed. Almost at the end of the list, we see a 0.7868 correlation between penalties and shot power, which is also interesting.
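The ranking by absolute value can also be sketched without an explicit loop over pairs, by computing the whole correlation matrix once and keeping its upper triangle. This is only an alternative illustration on synthetic data; the column names `a` to `d` are made up and it is not the code that produced the table.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the column names are illustrative only.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(scale=0.1, size=500),   # strongly positive with 'a'
    'c': -base + rng.normal(scale=0.5, size=500),  # negative with 'a' and 'b'
    'd': rng.normal(size=500),                     # unrelated noise
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort by absolute value but keep the sign for the correlation type.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
for (f1, f2), value in ranked.items():
    corr_type = 'Positive' if value > 0 else 'Negative'
    print('%.4f  %s-%s  %s' % (abs(value), f1, f2, corr_type))
```

Sorting on the absolute value is what allows a strong negative pair to appear near the top of the list, right next to the positive ones.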
The heatmap shows all relationships as a matrix of values:
Interestingly, we see that the other qualities in the dataset are not strongly related to the goalkeeper qualities. In other words, it does not matter much whether a goalkeeper is very good at acceleration or dribbling, for instance, if they are exceptional at typical goalkeeping. Still, among the other features, jumping, reactions and strength tend to have a weaker negative relationship to the goalkeeper features. Another interesting observation is the long lines of homogeneous color.
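To put a number on what the heatmap suggests, one could slice out the block of the correlation matrix that relates goalkeeper features to everything else. The sketch below uses a small synthetic frame with two independent, made-up latent factors, so its values say nothing about the real dataset; only the slicing technique carries over.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 400
keeper = rng.normal(size=n)    # hypothetical latent "goalkeeping" factor
outfield = rng.normal(size=n)  # hypothetical latent "outfield" factor

toy = pd.DataFrame({
    'GK diving': keeper + rng.normal(scale=0.2, size=n),
    'GK reflexes': keeper + rng.normal(scale=0.2, size=n),
    'Dribbling': outfield + rng.normal(scale=0.2, size=n),
    'Acceleration': outfield + rng.normal(scale=0.2, size=n),
})

corr = toy.corr()
gk_cols = [c for c in corr.columns if c.startswith('GK ')]
other_cols = [c for c in corr.columns if not c.startswith('GK ')]

# Rows are goalkeeper features, columns are all remaining features.
cross_block = corr.loc[gk_cols, other_cols]
print(cross_block.round(3))
print('mean |r| across groups:', cross_block.abs().to_numpy().mean())
```

A long row of similar color in the heatmap corresponds to one row of such a block having similar values across its columns.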
Also, with the snippet `seaborn_df.sum(axis=1).sort_values()[::-1]` we can see the sum of all correlations for each feature, that is, which feature tends to be most correlated with the other features as a whole. Ordered as a table, we obtain:
| Feature | Sum of correlations |
|---|---|
| Free kick accuracy | 13.2896 |
Ball control, short passing and dribbling seem to be the most positively correlated with other features, whereas goalkeeper features tend to be negatively correlated with other player features.
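The per-feature sum is easy to try on a toy example. The frame below is synthetic and its column names `x`, `y` and `z` are made up. Note that each row sum includes the feature's correlation with itself, which is always 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
shared = rng.normal(size=n)
toy = pd.DataFrame({
    'x': shared + rng.normal(scale=0.3, size=n),
    'y': shared + rng.normal(scale=0.3, size=n),
    'z': -shared + rng.normal(scale=0.3, size=n),  # moves against x and y
})

corr = toy.corr()

# Sum every row of the correlation matrix and sort from largest to smallest.
totals = corr.sum(axis=1).sort_values(ascending=False)
print(totals)
```

Because positive and negative correlations cancel in the sum, a feature with a low total can still be strongly related to the others, just in the opposite direction, as `z` is here.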
As an important finishing note we should mention that correlation does not imply causation.
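A small simulation illustrates the point. In this sketch, which uses entirely made-up data, two variables are both driven by a hidden third one: they correlate strongly although neither influences the other, and the correlation vanishes once the hidden driver is subtracted.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
confounder = rng.normal(size=n)  # hidden common driver (hypothetical)

# x and y each depend on the confounder, but not on each other.
x = confounder + rng.normal(scale=0.3, size=n)
y = confounder + rng.normal(scale=0.3, size=n)

r_raw = np.corrcoef(x, y)[0, 1]
# Remove the common driver and correlate what is left.
r_residual = np.corrcoef(x - confounder, y - confounder)[0, 1]

print('correlation of x and y: %.3f' % r_raw)
print('after removing the confounder: %.3f' % r_residual)
```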