Reconstructing broken data

Please note the caveats below the code.

    import numpy as np
    # from sklearn.decomposition import FastICA
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(3, 1, figsize=(8, 6))
    np.set_printoptions(suppress=True)

    series, values_in_series = 12, 30
    all_vals = []
    k = 0
    for i in range(series):
        # Base value drawn from a range whose upper bound grows with k,
        # plus uniform noise in [0, 10)
        from_val = np.random.randint(k + 5)
        to_val = np.random.randint(k + 5, k + 15)
        rand_val = np.random.randint(from_val, to_val)
        all_vals.append(rand_val + np.random.rand(values_in_series) * 10)
        k += 100

    kmeans = KMeans(n_clusters=series)

    # Original data
    all_vals = np.array(all_vals)
    ax[0].set_title('Original')
    ax[0].matshow(all_vals, cmap=plt.cm.coolwarm)
    ax[0].axis('off')

    # Break the structure in the data
    # (copy, so the in-place shuffle does not touch the original array)
    all_vals1d = all_vals.ravel().copy()
    np.random.shuffle(all_vals1d)
    all_vals = all_vals1d.reshape(series, values_in_series)
    ax[1].set_title('With broken structure')
    ax[1].matshow(all_vals, cmap=plt.cm.coolwarm)
    ax[1].axis('off')

    # Attempt to restore it
    # FastICA does not seem to help here
    # fast_ica = FastICA(n_components=series)
    # res = fast_ica.fit_transform(all_vals)
    # plt.matshow(res, cmap=plt.cm.coolwarm)

    # Cluster the individual values (one feature per sample)
    predicted_indices = kmeans.fit_predict(all_vals.ravel().reshape(-1, 1))

    # Collect the values of each cluster into its own row
    reconstructed = [[] for _ in range(series)]
    for i in range(series):
        for idx, label in enumerate(predicted_indices):
            if label == i:
                reconstructed[i].append(all_vals1d[idx])

    ax[2].set_title('Reconstructed')
    ax[2].matshow(reconstructed, cmap=plt.cm.coolwarm)
    ax[2].axis('off')

    plt.tight_layout()
    plt.show()

The original data matrix shows 12 rows of 30 homogeneous values each. The second plot shows how random shuffling breaks that structure, and the third shows the data reconstructed with the help of k-means. It helps a lot here that we know the number of clusters in advance and that the values in different rows are far apart. As they get closer, k-means puts a different number of values in each cluster, the rows end up with unequal lengths, and the matrix can no longer be reconstructed or visualized. The vertical order of the colors doesn't matter much here; what we wanted was to get all close values into their own bins. But you might be able to understand why I was trying to achieve the same with ICA.
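If you still want to look at the result when the clusters come out with unequal sizes, one possible workaround is to pad the shorter rows with NaN. This is only a rough sketch and not part of the code above: the helper name `rows_from_labels` and its return values are made up for illustration, and it assumes the labels are integers in `0..n_clusters-1`, as returned by `fit_predict`. It also sorts the rows by their mean, in case you do care about the vertical order.

```python
import numpy as np

def rows_from_labels(values, labels, n_clusters):
    """Group 1-D values by cluster label into a NaN-padded 2-D array.

    Hypothetical helper: rows are ordered by ascending cluster mean,
    and rows shorter than the largest cluster are padded with NaN.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)

    # How many values landed in each cluster
    sizes = np.bincount(labels, minlength=n_clusters)

    # One row per cluster, wide enough for the largest cluster
    padded = np.full((n_clusters, sizes.max()), np.nan)
    for label in range(n_clusters):
        members = values[labels == label]
        padded[label, :members.size] = members

    # Mean of each cluster (empty clusters get 0 to avoid dividing by zero)
    sums = np.bincount(labels, weights=values, minlength=n_clusters)
    means = sums / np.maximum(sizes, 1)
    order = np.argsort(means)

    return padded[order], sizes[order]


# Tiny hand-made example with unequal cluster sizes
values = np.array([1.0, 101.0, 2.0, 102.0, 3.0])
labels = np.array([1, 0, 1, 0, 1])
matrix, sizes = rows_from_labels(values, labels, n_clusters=2)
```

With the script above this would be called as `rows_from_labels(all_vals1d, predicted_indices, series)`, and the returned matrix can be passed to `matshow` in place of `reconstructed`; as far as I know the NaN cells should simply be left blank. The returned sizes also make it easy to check whether every cluster really got 30 values.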