Finding the strength of pairwise relationships among features in the FIFA dataset

You may remember that I wrote about the FIFA dataset available on Kaggle a while ago. This time I wanted to look at it from another perspective: finding features that are correlated, in other words features whose values tend to vary together, either in the same direction (positive correlation) or in opposite directions (negative correlation).

For this, I needed to slightly modify the original dataset. Some feature values weren't given as exact numbers but as a base value with a tolerance. Whenever I saw something like val+tol or val-tol in a column, I simply replaced it with the computed value (the sum or the difference) so that the column could be treated as numeric.
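A minimal sketch of that cleanup step, assuming the affected cells are strings such as `70+2` or `65-3`. The exact cell format in the Kaggle file is an assumption here, and `resolve_tolerance` is a hypothetical helper name, not something from the original code.

```python
import re

# Matches strings like "70+2" or "65-3": a base value, a sign, and a tolerance.
_TOLERANCE_PATTERN = re.compile(r"^\s*(\d+)\s*([+-])\s*(\d+)\s*$")


def resolve_tolerance(cell):
    """Return the computed value for 'val+tol' / 'val-tol' strings.

    Plain numbers are returned as floats; unparseable cells become None.
    """
    text = str(cell)
    match = _TOLERANCE_PATTERN.match(text)
    if match:
        base, sign, tol = match.groups()
        if sign == "+":
            return float(int(base) + int(tol))
        return float(int(base) - int(tol))
    try:
        return float(text)
    except ValueError:
        return None


print(resolve_tolerance("70+2"))
print(resolve_tolerance("65-3"))
print(resolve_tolerance(81))
```

On a DataFrame, such a helper could be applied column by column, for example with `df[col].map(resolve_tolerance)`, before calling `dropna()`.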

Then I wrote the following code to extract the strongest correlations and present them on a heatmap.

from itertools import combinations

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler


def plot_shift(vals):
    # Move each tick to the centre of its heatmap cell.
    return [val + 0.5 for val in vals]


df = pd.read_csv('FIFA complete dataset.csv', low_memory=False)
cols = 'Acceleration,Aggression,Agility,Balance,Ball control,Composure,Crossing,Curve,Dribbling,Finishing,Free kick accuracy,GK diving,GK handling,GK kicking,GK positioning,GK reflexes,Heading accuracy,Interceptions,Jumping,Long passing,Long shots,Marking,Penalties,Positioning,Reactions,Short passing,Shot power,Sliding tackle,Sprint speed,Stamina,Standing tackle,Strength,Vision,Volleys'.split(',')

# Drop rows with missing values, then standardize every feature.
res = df[cols].dropna()
ss = StandardScaler()
ss = ss.fit_transform(res)
df = pd.DataFrame(ss, columns=cols)

# Collect every pairwise correlation, both as a sortable list and as a matrix.
seaborn_df = pd.DataFrame(index=cols, columns=cols)
correlations = []
for feature1, feature2 in combinations(cols, 2):
    corr_val = df[[feature1, feature2]].corr().loc[feature1, feature2]
    corr_type = 'Positive' if corr_val > 0 else 'Negative'
    correlations.append((abs(corr_val), '-'.join([feature1, feature2]), corr_type))
    seaborn_df.loc[feature1, feature2] = seaborn_df.loc[feature2, feature1] = corr_val

correlations.sort(reverse=True)
# Only the diagonal is still empty; a feature correlates perfectly with itself.
# The matrix was created with object dtype, so convert it to float for plotting.
seaborn_df = seaborn_df.fillna(1).astype(float)

# Print the strongest correlations as an HTML table.
table = ['<table><tr><th>Correlation value</th><th>Relationship</th><th>Correlation type</th></tr>']
for corr_tuple in correlations[:50]:
    table.append('<tr><td>%.4f</td><td>%s</td><td>%s</td></tr>' % corr_tuple)
table.append('</table>')
print(''.join(table))

# Draw the full correlation matrix as a heatmap.
fontsize = 8
plot = sns.heatmap(seaborn_df, linewidths=0.01)
plot.set_title('Pairwise correlations among features in the FIFA dataset')
plot.set_xticks(plot_shift(range(len(cols))))
plot.set_yticks(plot_shift(range(len(cols))))
plot.set_xticklabels(cols, rotation=90, fontsize=fontsize)
plot.set_yticklabels(cols, rotation=0, fontsize=fontsize)
plt.tight_layout()
plt.show()
| Correlation value | Relationship | Correlation type |
|---|---|---|
| 0.9727 | GK diving-GK reflexes | Positive |
| 0.9694 | GK diving-GK handling | Positive |
| 0.9693 | GK handling-GK reflexes | Positive |
| 0.9691 | GK positioning-GK reflexes | Positive |
| 0.9688 | GK diving-GK positioning | Positive |
| 0.9686 | GK handling-GK positioning | Positive |
| 0.9684 | Sliding tackle-Standing tackle | Positive |
| 0.9652 | GK kicking-GK reflexes | Positive |
| 0.9643 | GK diving-GK kicking | Positive |
| 0.9639 | GK handling-GK kicking | Positive |
| 0.9629 | GK kicking-GK positioning | Positive |
| 0.9607 | Marking-Sliding tackle | Positive |
| 0.9570 | Marking-Standing tackle | Positive |
| 0.9326 | Ball control-Dribbling | Positive |
| 0.9308 | Interceptions-Standing tackle | Positive |
| 0.9200 | Interceptions-Sliding tackle | Positive |
| 0.9153 | Acceleration-Sprint speed | Positive |
| 0.9052 | Ball control-Short passing | Positive |
| 0.8951 | Long passing-Short passing | Positive |
| 0.8607 | Ball control-Positioning | Positive |
| 0.8574 | Curve-Free kick accuracy | Positive |
| 0.8394 | Ball control-Crossing | Positive |
| 0.8363 | Dribbling-Short passing | Positive |
| 0.8322 | Ball control-Curve | Positive |
| 0.8283 | Ball control-Shot power | Positive |
| 0.8212 | Shot power-Volleys | Positive |
| 0.8102 | Crossing-Short passing | Positive |
| 0.8008 | Positioning-Shot power | Positive |
| 0.7994 | Finishing-Shot power | Positive |
| 0.7970 | Dribbling-Shot power | Positive |
| 0.7927 | Ball control-Volleys | Positive |
| 0.7868 | Penalties-Shot power | Positive |
| 0.7848 | Curve-Shot power | Positive |
| 0.7846 | Ball control-Finishing | Positive |
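The same kind of ranking can also be produced without an explicit loop over pairs. The sketch below is an alternative, not the code used above: it calls `DataFrame.corr()` once and reads off the upper triangle of the matrix. It runs on a small made-up frame (columns `a`, `b`, `c` are hypothetical) so that it is self-contained, and `rank_pairs` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def rank_pairs(frame):
    """Return feature pairs sorted by absolute Pearson correlation."""
    corr = frame.corr()
    # Indices of the strict upper triangle: each pair once, no diagonal.
    i, j = np.triu_indices(len(corr.columns), k=1)
    pairs = pd.DataFrame({
        "feature1": corr.columns[i],
        "feature2": corr.columns[j],
        "corr": corr.to_numpy()[i, j],
    })
    pairs["abs_corr"] = pairs["corr"].abs()
    pairs["type"] = np.where(pairs["corr"] > 0, "Positive", "Negative")
    return pairs.sort_values("abs_corr", ascending=False).reset_index(drop=True)


# Toy data, made up for illustration: b rises with a, c falls with a.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [9, 7, 6, 4, 1],
})
ranked = rank_pairs(toy)
print(ranked)
```

Since the Pearson correlation does not change when a feature is shifted or rescaled by a positive factor, this approach gives the same values on the raw columns as on the standardized ones.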

Although the table is sorted by absolute correlation value, the strongest relationships all turn out to be positive. We also see that the goalkeeper features are strongly correlated with one another. This means that a goalkeeper who possesses one of these qualities very likely possesses the others as well, at least according to the data we were given. Further down are relationships we would also expect intuitively: the two kinds of tackling, sliding and standing, go together; ball control goes with good dribbling; and acceleration goes with sprint speed. Near the end of the list there is a correlation of 0.7868 between penalties and shot power, which is also interesting.

The heatmap shows all relationships as a matrix of values:

Showing the strength of the pairwise relationships among the features in the FIFA dataset

Interestingly, the other qualities in the dataset are not strongly related to the goalkeeper qualities. In other words, it matters little whether a goalkeeper is very good at, say, acceleration or dribbling if they are exceptional at typical goalkeeping. Still, among the other features, jumping, reactions and strength stand out as having a weaker negative relationship to the goalkeeper features. Another interesting observation is the long lines of homogeneous color in the heatmap.

Also, the snippet seaborn_df.sum(axis=1).sort_values()[::-1] gives the sum of all correlations for each feature, which shows which features tend to be most correlated with the whole set of other features. Ordered in a table, we obtain:

| Feature | Sum of correlations |
|---|---|
| Ball control | 15.0211 |
| Short passing | 14.9013 |
| Long passing | 13.8498 |
| Shot power | 13.7828 |
| Free kick accuracy | 13.2896 |
| Sprint speed | 10.8582 |
| Heading accuracy | 10.6983 |
| Long shots | 10.2845 |
| Standing tackle | 8.3768 |
| Sliding tackle | 7.9307 |
| GK positioning | -10.7483 |
| GK kicking | -10.7730 |
| GK handling | -10.8028 |
| GK reflexes | -10.8246 |
| GK diving | -10.8456 |
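Because the diagonal of the matrix was filled with 1, each of these sums includes the feature's correlation with itself. A small sketch on made-up numbers (the 3×3 matrix and the labels `x`, `y`, `z` are hypothetical, not taken from the dataset) shows the row sum used in the text next to a variant that excludes the diagonal:

```python
import pandas as pd

labels = ["x", "y", "z"]
# Made-up symmetric correlation matrix with ones on the diagonal.
corr = pd.DataFrame(
    [[1.0, 0.8, -0.5],
     [0.8, 1.0, -0.3],
     [-0.5, -0.3, 1.0]],
    index=labels, columns=labels,
)

# Same idea as seaborn_df.sum(axis=1).sort_values()[::-1] in the text.
with_diagonal = corr.sum(axis=1).sort_values(ascending=False)

# Subtracting 1 removes each feature's correlation with itself.
without_diagonal = (corr.sum(axis=1) - 1).sort_values(ascending=False)

print(with_diagonal)
print(without_diagonal)
```

The ordering is the same in both variants, since every row sum is shifted by the same constant; only the absolute values differ.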

Ball control, short passing and long passing have the highest sums in the table, so they seem to be the most positively correlated with the other features, whereas the goalkeeper features tend to be negatively correlated with the other player features.
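To get a feel for how negative that is, the two tables above can be combined into a back-of-the-envelope check. The sketch assumes that the sum for GK diving includes its diagonal value of 1 (the code fills the diagonal with 1 before summing) and that all 34 features in `cols` enter the sum; the variable names are hypothetical.

```python
# Values copied from the two tables above.
gk_diving_sum = -10.8456
# GK diving vs. GK reflexes, handling, positioning and kicking.
gk_diving_with_other_gk = [0.9727, 0.9694, 0.9688, 0.9643]

n_features = 34               # length of cols in the code above
n_outfield = n_features - 5   # features that are not goalkeeper-specific

# Remove the diagonal (1.0) and the four within-goalkeeper correlations.
outfield_sum = gk_diving_sum - 1.0 - sum(gk_diving_with_other_gk)
mean_outfield_corr = outfield_sum / n_outfield

print(round(outfield_sum, 4))
print(round(mean_outfield_corr, 4))
```

Under those assumptions, the correlations between GK diving and the 29 non-goalkeeper features average out to roughly -0.54.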

As an important finishing note we should mention that correlation does not imply causation.