t-SNE on DXOMark digital camera data

DXOMark has published various scores on the picture quality of digital cameras, which might be of interest to any ambitious photographer. I came to it through an article in Wired mentioning that the Google Pixel 2 currently has the highest score among smartphone cameras.

We can take a look at the camera sensor database and extract the features that seem most interesting to us. After cleaning up the data, we reduce it to this file. Then we can plot the results with t-SNE in 2D and look for data points that differ from the rest. Here is the code:

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    from sklearn.preprocessing import StandardScaler
    from heapq import heappush, nlargest
    import pandas as pd
    import numpy as np

    df = pd.read_csv('dxomark_score.csv')
    # Drop incomplete rows before extracting the labels, so that the
    # labels stay aligned with the rows of the embedding
    df = df.dropna().reset_index(drop=True)

    # Encode the sensor format as an integer and strip the currency sign
    unique_sensors = df['Sensor Format'].unique().tolist()
    df['Sensor Format'] = [unique_sensors.index(v) for v in df['Sensor Format']]
    df['Launch Price'] = df['Launch Price'].str.slice(1).astype(int)
    labels = df['Model']
    del df['Model']

    tsne = TSNE(n_components=2, random_state=60)
    embedded = tsne.fit_transform(df)
    embedded_x, embedded_y = embedded[:,0], embedded[:,1]

    plt.title('t-SNE on DXOMark digital camera data')
    plt.plot(embedded_x, embedded_y, '.', markersize=2, color='black', alpha=0.5)
    # Examine which coordinates enclose a box around the points of interest
    #plt.axhline(16)
    #plt.axhline(22)
    #plt.axvline(-15)
    plt.text(-12,27,'Source: DXOMark camera sensor database, http://bit.ly/2EtmKEK', fontsize=9, color='#777777')
    plt.tight_layout()
    plt.show()

[Figure: t-SNE on DXOMark digital camera data]
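The two preprocessing steps in the code above (turning the sensor format into an integer and the price string into a number) can also be written with pandas built-ins. The following is only a sketch on a made-up three-row table: the camera names, the format names and the "$" prefix are assumptions for illustration, since the original code only shows that the first character of the price is sliced off.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of dxomark_score.csv
toy = pd.DataFrame({
    "Model": ["Camera A", "Camera B", "Camera C"],
    "Sensor Format": ["Full frame", "APS-C", "Full frame"],
    "Launch Price": ["$2100", "$800", "$3300"],
})

# factorize assigns integer codes in order of first appearance,
# which matches the unique().tolist().index(v) approach
codes, formats = pd.factorize(toy["Sensor Format"])
toy["Sensor Format"] = codes

# Remove the assumed leading currency sign and convert to integers
toy["Launch Price"] = toy["Launch Price"].str.lstrip("$").astype(int)

print(toy)
print(list(formats))
```

Keeping `formats` around makes it possible to map the integer codes back to the original format names later.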

We notice a couple of clusters here that we want to examine more closely. First, we see very few points in the top left, and we'd like to see which cameras they correspond to. We could use the following code for this task:

    print(labels[np.argwhere((embedded_x < -15) & (embedded_y > 22)).ravel()].tolist())
    """
    ['Phase One IQ180 Digital Back', 'Phase One P65', 'Phase One P40 Plus',
     'Hasselblad H3DII 50', 'Leaf Aptus75S', 'Phase One P45 Plus', 'Leica S',
     'Hasselblad H3DII 39']
    """

A closer look at the data tells us that these are among the most expensive cameras examined overall. And if we looked at the y-coordinates between 16 and 22 ((embedded_y > 16) & (embedded_y < 22)), we would find the next most expensive cameras. Yet, this is not what we are interested in.
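Since the same kind of box query is repeated with different bounds, it can be wrapped in a small helper. This is a minimal sketch, not part of the original code: the name `labels_in_box` and the demo points are made up for illustration.

```python
import numpy as np

def labels_in_box(labels, x, y, x_min=-np.inf, x_max=np.inf,
                  y_min=-np.inf, y_max=np.inf):
    """Return the labels of all points strictly inside the given bounds."""
    labels = np.asarray(labels)
    x = np.asarray(x)
    y = np.asarray(y)
    # Boolean mask, one entry per point; bounds left at +/-inf are ignored
    mask = (x > x_min) & (x < x_max) & (y > y_min) & (y < y_max)
    return labels[mask].tolist()

# Made-up demo points standing in for the t-SNE embedding
demo_labels = ["A", "B", "C", "D"]
demo_x = np.array([-20.0, -16.0, 5.0, 30.0])
demo_y = np.array([25.0, 18.0, 0.0, 5.0])

top_left = labels_in_box(demo_labels, demo_x, demo_y, x_max=-15, y_min=22)
band = labels_in_box(demo_labels, demo_x, demo_y, y_min=16, y_max=22)
print(top_left, band)
```

With the variables from the post, the first query would read `labels_in_box(labels, embedded_x, embedded_y, x_max=-15, y_min=22)`. Converting the labels to a NumPy array means they are selected by position, so the result does not depend on the index of the pandas Series.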

Now we look at the cluster on the right, which has considerably more points. They lie very far from the rest, suggesting that they are different. Let's see which cameras these are:

    print(labels[np.argwhere((embedded_x > 28) & (embedded_y < 10)).ravel()].tolist())
    """
    ['Olympus E3', 'Olympus PEN EP1', 'Nikon 1 V1', 'Canon PowerShot G16',
     'Nikon Coolpix P7800', 'Panasonic Lumix DMC GF1', 'Nikon 1 J4', 'Nikon 1 V3',
     'Olympus E410', 'Nikon 1 AW1', 'Panasonic Lumix DMC GF5',
     'Panasonic Lumix DMC GF3', 'Panasonic Lumix DMC LX7', 'Canon PowerShot S100',
     'Olympus XZ-2 iHS', 'Pentax Q10', 'Fujifilm FinePix X S1', 'Pentax MX-1',
     'Samsung EX2F', 'DJI Phantom 4', 'Canon Powershot S110', 'Canon PowerShot S95',
     'Canon PowerShot G12', 'Canon PowerShot SX50 HS', 'Pentax Q',
     'Canon Powershot G11', 'Canon Powershot G15', 'Canon PowerShot S90',
     'Panasonic Lumix DMC-ZS50', 'Fujifilm FinePix S100fs', 'Nokia Lumia 1020',
     'Panasonic Lumix DMC-FZ70', 'Fujifilm FinePix F800EXR',
     'Panasonic Lumix DMC LX5', 'Panasonic LUMIX DMC-FZ150',
     'Fujifilm FinePix F600EXR', 'Nikon D2H', 'Samsung EX1',
     'Fujifilm FinePix F550EXR', 'Nikon Coolpix P7000', 'Panasonic Lumix DMC LX3',
     'Canon PowerShot SX60 HS', 'GoPro HERO5 Black', 'Panasonic Lumix DMC-FZ330',
     'Panasonic LUMIX DMC-FZ200', 'Panasonic Lumix DMC-ZS60', 'Canon Powershot G10',
     'Nikon Coolpix P6000', 'Canon Powershot G9', 'Olympus XZ1', 'YUNEEC Breeze 4K']
    """

That is a lot of cameras. As far as we can see, what they have in common is a very low overall score. Yet, this is not what we are interested in.

Where should we look next? We could check an area that is sufficiently distant from the two we have already evaluated. This directs us towards the points at the bottom.

    print(labels[np.argwhere((embedded_x > 8) & (embedded_y < -15)).ravel()].tolist())
    """
    ['Canon PowerShot G9 X Mark II', 'Nikon 1 J5', 'Nikon D60',
     'Canon PowerShot G9 X', 'Samsung NX 100', 'Sony Alpha 100',
     'Konica Minolta DYNAX 7D', 'Canon PowerShot S120', 'Nikon D40', 'Olympus E420',
     'Olympus E450', 'Canon EOS 300D', 'Olympus E520', 'Panasonic Lumix DMC G1',
     'Nikon 1 J3', 'Panasonic Lumix DMC G10', 'Nikon 1 V2']
    """

These seem to be relatively cheap cameras, sometimes with slightly above-average results. Could we now start thinking in ranges? We know that the top left is very expensive and the right is cheaper and low-scoring. So we want points that are as far to the left as possible (indicating high quality) but also as far toward the bottom as possible (likely cheaper). In the larger body of points, we notice a small bump on the left.

    print(labels[np.argwhere((embedded_x > -15) & (embedded_x < -8) & (embedded_y > -12) & (embedded_y < -2)).ravel()].tolist())
    """
    ['Samsung NX500', 'Nikon D3400', 'Sony A6300', 'Nikon D5500', 'Nikon D5200',
     'Nikon D5600', 'Nikon D5300', 'Sony A6000', 'Pentax K-5 II', 'Nikon D3300',
     'Nikon D3200', 'Sony A5100', 'Sony Alpha 580', 'Ricoh GR II', 'Nikon D5100',
     'Sony A5000', 'Pentax K 01', 'Pentax K-50', 'Pentax K-30', 'Pentax K-500',
     'Canon EOS 200D', 'Sony NEX-6', 'Canon EOS M100', 'Sony A3000', 'Sony NEX-5T',
     'Canon EOS M6', 'Pentax K-S1', 'Sony NEX-5N', 'Canon EOS M5', 'Sony NEX-3N',
     'Sony NEX-F3', 'Sony NEX-C3', 'Canon EOS M3']
    """

These models seem much more interesting. The first on the list, for instance, has a whopping 28 megapixels at a launch price of only $800. Its overall score is said to be the same as that of the 50.6 megapixel Canon EOS 5DS at $3700 and the Phase One P40 Plus at $19500. Now, this is impressive. The 24.2 megapixel Nikon D3400 at $650 gets the same overall score as the much more expensive Canon EOS 5DS R ($3900). And the Sony A6300 gets the same overall score as the Canon EOS 6D Mark II.
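Comparisons like these pair models that share an overall score but differ widely in price. A minimal sketch of how the cheapest model for each overall score could be found automatically, assuming the column names 'Model', 'Overall Score' and 'Launch Price' from the code in this post. The rows are made-up placeholders, not DXOMark values, and `cheapest_per_score` is a hypothetical helper name.

```python
import pandas as pd

# Placeholder rows; names, scores and prices are invented for illustration
toy = pd.DataFrame({
    "Model": ["Camera A", "Camera B", "Camera C", "Camera D"],
    "Overall Score": [87, 87, 86, 86],
    "Launch Price": [800, 3700, 650, 3900],
})

def cheapest_per_score(frame):
    """For every distinct overall score, keep the row with the lowest price."""
    idx = frame.groupby("Overall Score")["Launch Price"].idxmin()
    return (frame.loc[idx]
                 .sort_values("Overall Score", ascending=False)
                 .reset_index(drop=True))

result = cheapest_per_score(toy)
print(result)
```

On the real data this needs the 'Model' column to still be part of the frame, so it would have to run before that column is deleted.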

In the end, we choose a much simpler metric not related to t-SNE:

    ss = StandardScaler()
    res = ss.fit_transform(df[['Mpix', 'Portrait', 'Landscape', 'Sports', 'Overall Score', 'Launch Price']])
    # Product of the standardized quality columns divided by the standardized price
    metric = res[:,0]*res[:,1]*res[:,2]*res[:,3]*res[:,4] / res[:,5]

    h = []
    # Look the labels up through df's index so they stay aligned with its rows
    for idx, val in zip(df.index, metric):
        heappush(h, (val, labels[idx]))

    for score, model in nlargest(20, h):
        print("%.2f %s" % (score, model))
    """
    1485.93 Nikon D600
    553.96 Sony A7R
    374.49 Sony A7R III
    299.75 Nikon D850
    238.05 Sony A7R II
    180.75 Nikon D750
    180.71 Sony Cyber-shot DSC-RX1R II
    164.74 Nikon D810
    152.39 Nikon D800
    134.74 Nikon D800E
    128.01 Pentax 645Z
    128.01 Hasselblad X1D-50c
    63.87 Sony SLT Alpha 99 II
    46.55 Sony Cyber-shot DSC-RX1
    33.68 Canon EOS 5D Mark IV
    31.82 YUNEEC Breeze 4K
    30.90 Nikon Coolpix P7100
    28.59 Sony Cyber-shot DSC-RX1R
    25.31 Canon Powershot G9
    21.69 Canon EOS 6D Mark II
    """

As is often the case, many of the top camera models seem to be especially good in the Sports category. DXOMark's Sports score reflects low-light ISO performance rather than shooting speed, so these sensors likely handle high ISO settings well. The Nikon D600, at a launch price of $2100, received a very high overall score (94), while its Sports score was measured to be the same as that of the $3300 Nikon D800E, which has a score of 96. As you can see, the Sony A7R also gets a very high score for its price-performance ratio. It costs only $200 more than the Nikon, but has almost 12 megapixels more. Despite this, it has the same overall score as the Nikon and is even slightly weaker in the Sports category.
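One property of the metric above is worth keeping in mind when reading the ranking: standardized values are negative for below-average entries and close to zero near the mean. Dividing by a standardized price can therefore inflate the value for a camera priced near the average, and a below-average camera at a below-average price can still get a positive value. The sketch below illustrates this with a single quality score instead of five, on made-up numbers. This may be one reason why a low-scoring model such as the YUNEEC Breeze 4K shows up in the top 20, though that is a guess and not something verified against the data.

```python
import numpy as np

def zscore(values):
    """Standardize to zero mean and unit variance, like StandardScaler."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Made-up cameras, ordered from worst/cheapest to best/most expensive
scores = np.array([60.0, 70.0, 85.0, 90.0, 95.0])
prices = np.array([500.0, 1000.0, 2010.0, 3000.0, 3500.0])

# Simplified version of the post's metric: one quality column over price
metric = zscore(scores) / zscore(prices)

for s, p, m in zip(scores, prices, metric):
    print("score=%.0f price=%.0f metric=%.2f" % (s, p, m))

# The third camera is priced almost exactly at the mean, so its
# denominator is close to zero and its metric dominates the ranking.
# The worst camera has a negative score and a negative price z-score,
# so its metric comes out positive as well.
```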

These results should be questioned. They are likely imperfect due to possible inconsistencies in the way the tests were conducted or the measurements were made. Yet, they can still serve as a rough guide. Sharing the data here means that you can look at it from your own viewpoint or share your own results.