GPU comparison

Finding a good graphics card is not always an easy task, especially given the choice we have today. To produce beautiful 2D/3D animations, both within and outside the browser, we need a good graphics card. To fully utilize the capabilities of deep learning frameworks, we once again need a great graphics card. GPUs shine when we have a large number of operations that can be performed in parallel. They are optimized primarily for high throughput rather than low latency.

Graphics cards are described by features such as launch price, GPU clock, memory clock, shading units, TMUs, ROPs, compute units, pixel rate, texture rate, floating point performance, memory size, memory bandwidth, TDP and others. There can be a lot of variability across all these dimensions, so it can be especially hard to evaluate cards properly. This causes a lot of misunderstanding; many people prefer making the easy choice rather than spending the time to match a card to their needs and intended workflow.

We might think that GPU clock and memory clock largely determine the performance of a graphics card, but this is not necessarily the case. A graphics card is a product of engineering thought, and as such the entire architecture matters when evaluating the end result. For instance, while working with the TechPowerUp database, I noticed that many GPUs had relatively high core and memory clock rates, yet their pixel and texture fill rates remained relatively low. On the other hand, some cards with slightly lower clock rates outperformed them. This means that clock rates aren't everything.
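To see why clock rates alone say little, it helps to recall how theoretical fill rates are usually derived: pixel rate is roughly the number of ROPs times the core clock, and texture rate is the number of TMUs times the core clock. The sketch below applies that rule of thumb to two made-up cards. The function name, the card names and all numbers are assumptions chosen for illustration, not entries from the TechPowerUp database, and real architectures add further limits on top.

```python
def fill_rates(gpu_clock_mhz, rops, tmus):
    """Return theoretical (pixel_rate, texture_rate) in GPixels/s and GTexels/s."""
    pixel_rate = rops * gpu_clock_mhz / 1000    # one pixel per ROP per clock
    texture_rate = tmus * gpu_clock_mhz / 1000  # one texel per TMU per clock
    return pixel_rate, texture_rate


# Hypothetical cards: the first has the higher clock, the second has more units.
high_clock_card = fill_rates(gpu_clock_mhz=1500, rops=16, tmus=32)
low_clock_card = fill_rates(gpu_clock_mhz=1200, rops=32, tmus=64)

for label, (pixel_rate, texture_rate) in (("high clock", high_clock_card),
                                          ("low clock", low_clock_card)):
    print("%s: %.1f GPixels/s, %.1f GTexels/s" % (label, pixel_rate, texture_rate))
```

With these invented numbers the card with the lower clock ends up with the higher fill rates, simply because it has twice as many ROPs and TMUs. The same rule also shows that a card with equal ROP and TMU counts would report identical GPixels/s and GTexels/s figures.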

We might read the following sentence: "Small primitives result in high vertex processing demands, while large primitives generate many more fragments than vertices". This means that a GPU with a great pixel rate but a bad texel rate would behave differently from a card with a bad pixel rate but a good texel rate. To ensure high frame rates, the elements of the graphics pipeline need to be tuned as a whole, so that they work well together rather than individually. This can also be seen in the design of some graphics cards, where the GPixels/s and GTexels/s rates are actually the same.

Memory bandwidth in relation to TDP is also an important topic. Many cards offered low memory bandwidth while having a high TDP, which hints that optimizing them for efficiency was not the primary goal. This is a problematic line of thinking in a world that is resource-constrained in general. The memory on a card is one of its main resources, and to achieve the high throughput we desire, the memory bandwidth must be correspondingly high. A wide memory bus helps, but it is not sufficient on its own: a card with a narrow bus can still achieve higher memory bandwidth than a card with a wider one. Memory bandwidth therefore matters more than memory bus width. It is also a disservice to everyone to build power-hungry graphics cards that deplete laptop batteries very fast (some mobile GPUs had a TDP of ≈200 W!). The focus should instead be on energy-efficient cards that allow people to enjoy full 3D capabilities for long hours. Currently some laptops have two graphics cards, which allows turning off some capabilities when they aren't needed, but this still requires wiring for both cards and space for them in the case. Having two components deal with graphics is a form of redundancy.
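The point about bus width can be made concrete with the usual back-of-the-envelope formula: bandwidth in GB/s is the bus width in bits divided by 8, multiplied by the effective data rate per pin in Gbit/s. The sketch below is only an illustration; the helper names and every number in it (bus widths, data rates, TDP values) are assumptions, not measurements of real cards.

```python
def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Theoretical bandwidth in GB/s from bus width and per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps


def bandwidth_per_watt(bandwidth_gbs, tdp_watts):
    """Memory bandwidth delivered per watt of TDP, in (GB/s)/W."""
    return bandwidth_gbs / tdp_watts


# Hypothetical cards: a wide bus with slow memory vs. a narrow bus with fast memory.
wide_bus = memory_bandwidth_gbs(bus_width_bits=256, data_rate_gbps=4.0)
narrow_bus = memory_bandwidth_gbs(bus_width_bits=128, data_rate_gbps=10.0)

print("wide bus:   %.1f GB/s, %.2f (GB/s)/W" % (wide_bus, bandwidth_per_watt(wide_bus, 200)))
print("narrow bus: %.1f GB/s, %.2f (GB/s)/W" % (narrow_bus, bandwidth_per_watt(narrow_bus, 80)))
```

In this made-up comparison the 128-bit card comes out ahead both in raw bandwidth and in bandwidth per watt, which is the kind of ratio the script further down reports as "memory bandwidth per watt".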

It is true that some cards labeled x2 and x4 reach higher memory bandwidths, but this doesn't make them more efficient, since they very often also have a much higher TDP. This deserves to be mentioned, especially when we know that higher efficiency in some big data centers has been achieved by using fewer components, not more. It is also reasonable to expect that whatever we can't optimize at the lowest level of granularity won't be optimal at the highest level either. This explains why something as powerful as the NVIDIA Tesla S2050 can still demand 900 W and a dedicated power supply.

Below you can see the code used to look at the different types of graphics cards. The goal was to have working code quickly, not to make it beautiful.

# Source:
import matplotlib.pyplot as plt
from gpu_data import (graphics_cards_desktop, graphics_cards_desktop_workstation,
                      graphics_cards_mobile, graphics_cards_mobile_workstation)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

ss = StandardScaler()
pca = PCA(n_components=2)
type_labels = ['Desktop', 'Desktop Workstation', 'Mobile', 'Mobile Workstation']

k = 0  # index of the current GPU type in type_labels
for gtype in (graphics_cards_desktop, graphics_cards_desktop_workstation,
              graphics_cards_mobile, graphics_cards_mobile_workstation):
    labels = []
    perf_data = []
    fig, ax = plt.subplots(figsize=(10, 8))

    # Best ratio seen so far and the name of the card that achieved it.
    max_mem_bandwidth_per_watt = -1
    max_mem_bandwidth_per_watt_name = ''
    max_mem_bandwidth_per_dollar = -1
    max_mem_bandwidth_per_dollar_name = ''
    max_float_perf_per_dollar = -1
    max_float_perf_per_dollar_name = ''
    max_pixel_rate_per_dollar = -1
    max_pixel_rate_per_dollar_name = ''
    max_texture_rate_per_dollar = -1
    max_texture_rate_per_dollar_name = ''
    max_mem_size_per_dollar = -1
    max_mem_size_per_dollar_name = ''

    # Each line describes one card as 14 comma-separated fields.
    for line in gtype.splitlines():
        (name, launch_price, gpu_clock, mem_clock, shading_units, tmus, rops,
         compute_units, pixel_rate, texture_rate, float_perf, mem_size,
         mem_bandwidth, tdp) = line.split(', ')

        if tdp != 'None':
            mem_bandwidth = float(mem_bandwidth)
            tdp = int(tdp)
            if mem_bandwidth / tdp > max_mem_bandwidth_per_watt:
                max_mem_bandwidth_per_watt = mem_bandwidth / tdp
                max_mem_bandwidth_per_watt_name = name
            # Only cards with a known TDP enter the PCA, so the label is
            # appended here to keep labels and perf_data aligned.
            labels.append(name)
            perf_data.append((
                int(gpu_clock), int(mem_clock), int(shading_units), int(tmus),
                int(rops), int(compute_units), float(pixel_rate),
                float(texture_rate), float(float_perf), int(mem_size),
                float(mem_bandwidth), int(tdp)
            ))

        if launch_price != 'None':
            float_perf = float(float_perf)
            mem_bandwidth = float(mem_bandwidth)
            launch_price = int(launch_price)
            pixel_rate = float(pixel_rate)
            texture_rate = float(texture_rate)
            mem_size = int(mem_size)
            if float_perf / launch_price > max_float_perf_per_dollar:
                max_float_perf_per_dollar = float_perf / launch_price
                max_float_perf_per_dollar_name = name
            if pixel_rate / launch_price > max_pixel_rate_per_dollar:
                max_pixel_rate_per_dollar = pixel_rate / launch_price
                max_pixel_rate_per_dollar_name = name
            if texture_rate / launch_price > max_texture_rate_per_dollar:
                max_texture_rate_per_dollar = texture_rate / launch_price
                max_texture_rate_per_dollar_name = name
            if mem_bandwidth / launch_price > max_mem_bandwidth_per_dollar:
                max_mem_bandwidth_per_dollar = mem_bandwidth / launch_price
                max_mem_bandwidth_per_dollar_name = name
            if mem_size / launch_price > max_mem_size_per_dollar:
                max_mem_size_per_dollar = mem_size / launch_price
                max_mem_size_per_dollar_name = name

    # Standardize the features, then project them onto two principal components.
    perf_data = ss.fit_transform(perf_data)
    perf_data = pca.fit_transform(perf_data)
    perf_data = np.array(perf_data)
    firstcol, secondcol = perf_data[:, 0], perf_data[:, 1]
    minx = np.min(firstcol)
    maxx = np.max(firstcol)
    miny = np.min(secondcol)
    maxy = np.max(secondcol)

    for name, (x, y) in zip(labels, perf_data):
        ax.scatter(x, y, label=name, s=5, color='black', alpha=0.5)
        # Annotate only the cards far from the bulk; thresholds were picked per GPU type.
        if ((k == 0 and x > 1.5) or (k == 1 and (x > 2 or y > 6))
                or (k == 2 and (x > 4 or y > 6)) or (k == 3 and (x > 3 or y > 3))):
            ax.annotate(name, xy=(x, y), fontsize=7, color='black', alpha=0.4)

    #ax.set_title(type_labels[k] + ' GPU comparison, considering only performance features and TDP')
    plt.title(type_labels[k] + ' GPU comparison, considering all features except price')
    plt.xlim(minx - 0.5, maxx + 2)
    plt.ylim(miny - 0.5, maxy + 0.5)
    plt.axis('off')
    plt.tight_layout()

    print(type_labels[k])
    print('Max memory bandwidth per watt has %s: %.4f(GB/s)/W' % (max_mem_bandwidth_per_watt_name, max_mem_bandwidth_per_watt))
    print('Max memory size per dollar has %s: %.4fMB/$' % (max_mem_size_per_dollar_name, max_mem_size_per_dollar))
    print('Max memory bandwidth per dollar has %s: %.4f(GB/s)/$' % (max_mem_bandwidth_per_dollar_name, max_mem_bandwidth_per_dollar))
    print('Max float performance per dollar has %s: %.4fGFLOPS/$' % (max_float_perf_per_dollar_name, max_float_perf_per_dollar))
    print('Max pixel rate per dollar has %s: %.4f(GPixels/s)/$' % (max_pixel_rate_per_dollar_name, max_pixel_rate_per_dollar))
    print('Max texture rate per dollar has %s: %.4f(GTexels/s)/$' % (max_texture_rate_per_dollar_name, max_texture_rate_per_dollar))
    k += 1

"""
Desktop
Max memory bandwidth per watt has AMD Radeon R9 Nano: 2.9257(GB/s)/W
Max memory size per dollar has NVIDIA GeForce GT 720: 41.7959MB/$
Max memory bandwidth per dollar has AMD Radeon RX 550: 1.4177(GB/s)/$
Max float performance per dollar has AMD Radeon RX 570: 30.1479GFLOPS/$
Max pixel rate per dollar has NVIDIA GeForce GTX 1050: 0.4272(GPixels/s)/$
Max texture rate per dollar has AMD Radeon RX 570: 0.9420(GTexels/s)/$

Desktop Workstation
Max memory bandwidth per watt has NVIDIA Tesla P4: 4.2707(GB/s)/W
Max memory size per dollar has AMD Radeon Pro Duo Polaris: 32.8168MB/$
Max memory bandwidth per dollar has AMD Radeon Pro Duo: 0.6831(GB/s)/$
Max float performance per dollar has AMD Radeon Vega Frontier Edition: 13.1201GFLOPS/$
Max pixel rate per dollar has AMD Radeon PRO WX 2100: 0.1309(GPixels/s)/$
Max texture rate per dollar has AMD Radeon Vega Frontier Edition: 0.4100(GTexels/s)/$

Mobile
Max memory bandwidth per watt has AMD Radeon HD 7490M: 3.3778(GB/s)/W
Max memory size per dollar has AMD Xbox One X GPU: 24.6253MB/$
Max memory bandwidth per dollar has AMD Xbox One X GPU: 0.6541(GB/s)/$
Max float performance per dollar has AMD Xbox One X GPU: 12.0261GFLOPS/$
Max pixel rate per dollar has AMD Xbox One X GPU: 0.0752(GPixels/s)/$
Max texture rate per dollar has AMD Xbox One X GPU: 0.3758(GTexels/s)/$

Mobile Workstation
Max memory bandwidth per watt has NVIDIA Quadro M620 Mobile: 2.6730(GB/s)/W
Max memory size per dollar has NVIDIA Quadro K4100M: 2.7325MB/$
Max memory bandwidth per dollar has NVIDIA Quadro K4100M: 0.0683(GB/s)/$
Max float performance per dollar has NVIDIA Quadro K4100M: 1.0854GFLOPS/$
Max pixel rate per dollar has NVIDIA Quadro K4100M: 0.0113(GPixels/s)/$
Max texture rate per dollar has NVIDIA Quadro K4100M: 0.0452(GTexels/s)/$

Note: Only cards with given TDP and/or launch price have been considered.
"""
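The script above tracks each "best ratio" with its own pair of variables. A more compact way to express the same idea might look like the sketch below, which parses each row into a small record and lets `max()` pick the winner. `Card`, `parse_card`, `best_ratio` and the three sample rows are hypothetical names and made-up data that only mimic the 14-field row format; they are not part of the original `gpu_data` module.

```python
from typing import NamedTuple, Optional


class Card(NamedTuple):
    name: str
    launch_price: Optional[int]
    float_perf: float
    mem_bandwidth: float
    tdp: Optional[int]


def parse_card(line: str) -> Card:
    """Parse one 14-field row; the string 'None' marks a missing value."""
    (name, launch_price, _gpu_clock, _mem_clock, _shading_units, _tmus, _rops,
     _compute_units, _pixel_rate, _texture_rate, float_perf, _mem_size,
     mem_bandwidth, tdp) = line.split(', ')
    return Card(
        name=name,
        launch_price=None if launch_price == 'None' else int(launch_price),
        float_perf=float(float_perf),
        mem_bandwidth=float(mem_bandwidth),
        tdp=None if tdp == 'None' else int(tdp),
    )


def best_ratio(cards, numerator, denominator):
    """Return (name, ratio) for the card with the highest numerator/denominator."""
    # Cards with a missing denominator (price or TDP) are skipped.
    candidates = [c for c in cards if getattr(c, denominator) is not None]
    best = max(candidates,
               key=lambda c: getattr(c, numerator) / getattr(c, denominator))
    return best.name, getattr(best, numerator) / getattr(best, denominator)


# Made-up rows in the same comma-separated layout as the real data.
sample_rows = """
Card A, 200, 1200, 1750, 1024, 64, 32, 16, 38.4, 76.8, 2457.6, 4096, 112.0, 75
Card B, None, 1500, 2000, 2048, 128, 64, 32, 96.0, 192.0, 6144.0, 8192, 256.0, 150
Card C, 500, 1400, 1900, 2304, 144, 32, 36, 44.8, 201.6, 6451.2, 8192, 224.0, None
""".strip()

cards = [parse_card(line) for line in sample_rows.splitlines()]
print('Max memory bandwidth per watt has %s: %.4f(GB/s)/W' % best_ratio(cards, 'mem_bandwidth', 'tdp'))
print('Max float performance per dollar has %s: %.4fGFLOPS/$' % best_ratio(cards, 'float_perf', 'launch_price'))
```

Adding another ratio then becomes a single extra `best_ratio` call rather than two more variables and another `if` block.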

Below you can see the positioning of desktop, desktop workstation, mobile and mobile workstation GPUs when all features except price were considered and when only performance features (pixel rate, texture rate, floating point performance, memory bandwidth) and TDP were considered. Whenever you see lots of white space, it means that one or more GPUs are likely very different from the rest.
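The white-space effect can be reproduced in one dimension with the standardization step that precedes PCA in the script. The sketch below is a simplified stand-in, not the full two-component projection: it standardizes a single made-up feature using the population standard deviation, which is what `StandardScaler` uses by default. The bandwidth values are invented for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical memory bandwidths in GB/s: four similar cards and one outlier.
bandwidths = [100, 110, 120, 130, 900]
z = standardize(bandwidths)

ordinary_spread = max(z[:-1]) - min(z[:-1])  # how far apart the similar cards sit
outlier_gap = z[-1] - max(z[:-1])            # empty space before the outlier

print("z-scores:", ["%.2f" % v for v in z])
print("spread of ordinary cards: %.2f" % ordinary_spread)
print("gap to the outlier: %.2f" % outlier_gap)
```

The four similar cards end up squeezed into a narrow band while the outlier sits far away, so most of the axis is empty. The same thing happens on the plots: a few very different GPUs stretch the axes and push the rest into a dense cluster.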