Nvidia GPU Architecture

1 Overview

The GPU architecture is built around a scalable array of Streaming Multiprocessors (SMs).

The key components of an SM:

CUDA cores (ALU + FPU)
Double Precision Units (DPU)
Special Function Units (SFU, written as SPU in the notes below)
Load/Store Units (LD/ST)

Register File
Shared Memory/L1 Cache
Warp Scheduler

Cards:

GTX980 (2048 CUDA cores, 16SMs, 28nm, 5.2billion, 4GB, 4.981 TFLOPS / DPU: 0.1556TFLOPs, 165W, 2014.9, $549) evga gtx980
GTX1050 (640 CUDA cores, 5SMs, 14nm, 3.3billion, 4GB, 1.458 TFLOPS / DPU: 0.04556TFLOPs, 75W, 2018.1) NVIDIA GeForce GTX 1050 Max-Q ----> Pascal GP107
GTX1050 Ti (768 CUDA cores, 6SMs, 14nm, 3.3billion, 4GB, 1.983 TFLOPS / DPU: 0.06197TFLOPs, 75W, 2018.1) NVIDIA GeForce GTX 1050 Ti Max-Q ----> Pascal GP107
RTX3050 (2048CUDA cores, 16SMs, 64 TensorCore, 16 RTCore, 8nm, 12billion, 4GB, 4.329 TFLOPS/ DPU: 0.06765TFLOPs, 75W, 2021.5 ) NVIDIA GeForce RTX 3050 Mobile -----> Ampere GA107
RTX3050 Ti (2560CUDA cores, 20SMs, 80 TensorCore, 20 RTCore, 8nm, 12billion, 4GB, 5.299 TFLOPS/ DPU: 0.08280TFLOPs, 75W, 2021.5 ) NVIDIA GeForce RTX 3050 Ti Mobile -----> Ampere GA106
GTX1070 / GTX1080 (2560 CUDA cores, 20 SMs, 16nm, 7.2billion, 8GB, 8.2TFLOPs / DPU: 0.257TFLOPs, 180W, 2016.5)
GTX1080 Ti / TITAN X (3584 CUDA cores, 28 SMs, 16nm, 12billion, 12GB, 10TFLOPs / DPU: 0.317TFLOPs, 250W, 2016.8)
TITAN X (3584 CUDA cores, 28 SMs, 16nm, 12billion, 12GB, 10.97TFLOPs / DPU: 0.3429TFLOPs, 250W, 2016.8) NVIDIA TITAN X Pascal ----> Pascal GP102
RTX3080 (8704 CUDA cores, 68SMs, 272 TensorCore, 68 RTCore, 8nm, 28.3billion, 10GB, 29.77 TFLOPs / DPU: 0.465 TFLOPs, 320W, 2020.9 $699) RTX3080 RTX3080 Ti
RTX3080 Ti (10240 CUDA cores, 80SMs, 320 TensorCore, 80 RTCore, 8nm, 28.3billion, 12GB, 34.10 TFLOPs / DPU: 0.5328 TFLOPs, 350W, 2021.5 $1199) RTX3080 Ti ---> Ampere GA102
RTX3090 Ti (10752 CUDA cores, 84SMs, 336 TensorCore, 84 RTCore, 8nm, 28.3billion, 24GB, 40TFLOPs / DPU:0.625TFLOPs, 450W, 2022.1) Nvidia RTX3090 Ti RTX3090 ----> Ampere GA102

Tesla K80 (Kepler GK210/28nm/300W/2x12GB/FP16 8 TFLOPS/FP32 8TFLOPS/2x2496 CUDA Cores/2x208 TMUs/2x48 ROPs/2x13 SMX Count) 2014.11
RTX 4050 (Ada Lovelace AD107/5nm/100W/6GB/FP16 13.5 TFLOPS/FP32 13.5 TFLOPS/2560 CUDA cores/80 TMUs/32 ROPs/18 SM Count/120 Tensor Cores/18 RT Cores)
Tesla V100 (Volta GV100/12nm/300W/16GB/FP32 14TFlops/FP16 28TFlops/5120 CUDA Cores/320 TMUs/128 ROPs/80 SM Count/640 Tensor Cores/no RT Cores) 2017.6
Tesla T4 (Turing TU104/12nm/70W/16GB/FP32 8TFlops/FP16 65TFlops/INT8 130 TOPS/2560 CUDA Cores/160 TMUs/64 ROPs/40 SM Count/320 Tensor Cores/40 RT Cores) 2018.9

https://www.techpowerup.com/gpu-specs/
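
The figures in the list above allow a rough efficiency comparison between generations. The following is a minimal sketch, not a benchmark: it only divides the peak FP32 throughput by the board power, both copied from the list. The function name gflops_per_watt is illustrative, and peak numbers say little about sustained performance on real workloads.

```python
# Peak FP32 throughput (TFLOPS) and board power (W), copied from the card list above.
cards = {
    "GTX980": (4.981, 165),
    "GTX1080": (8.2, 180),
    "RTX3080": (29.77, 320),
    "RTX3090 Ti": (40.0, 450),
}


def gflops_per_watt(tflops, watts):
    """Peak FP32 GFLOPS per watt of board power (illustrative helper)."""
    return tflops * 1000.0 / watts


for name, (tflops, watts) in cards.items():
    print(f"{name}: {gflops_per_watt(tflops, watts):.1f} GFLOPS/W")
```

On these numbers the Ampere cards deliver roughly three times the peak FP32 throughput per watt of the GTX980.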

2 Fermi Micro Architecture

The Fermi architecture was the first complete GPU computing architecture to deliver the features required for the most demanding HPC applications.

1 SM: 32 CUDA cores + 16 Load/Store Units + 4 SPUs + 2 Warp Schedulers
1 SM: 2 warps issued concurrently (one per Warp Scheduler)
1 Warp: 16 CUDA cores + 16 Load/Store Unit(shared) + 4 SPU(shared) + [32 threads context ?]
Handle 48 warps per SM for a total of 1536 (48x32) threads resident in a single SM at a time [48 Warps context ?]
1 CUDA core: 1 ALU + 1 FPU
Register file is 32K 32-bit registers per SM (128KB)
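
The resident-thread figure above is plain multiplication, and it can be extended to a whole chip. A minimal sketch, assuming the 48 warps per SM quoted above and the 15 SMs listed for the GTX480; the function name resident_threads is illustrative.

```python
WARP_SIZE = 32  # threads per warp


def resident_threads(warps_per_sm, sm_count=1):
    """Maximum number of threads resident at once on sm_count SMs."""
    return warps_per_sm * WARP_SIZE * sm_count


per_sm = resident_threads(48)         # Fermi: 48 warps per SM
whole_gpu = resident_threads(48, 15)  # GTX480: 15 SMs
print(per_sm, whole_gpu)              # 1536 23040
```

With only 32 CUDA cores per SM, the 1536 resident threads are far more than can execute in one cycle; the extra warps give the schedulers something to issue while other warps wait on memory.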

GTX480:

15 SM (32 CUDA cores/SM)
480 CUDA cores
1345 GFLOPs
40 nm
3.2 billion transistors
250 W
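
The GTX480 figures above can be cross-checked against each other. A minimal sketch, assuming the usual convention that each CUDA core completes one fused multiply-add per cycle and that this counts as 2 FLOPs; the clock is not given in these notes, so it is derived here from the quoted 1345 GFLOPs rather than asserted. Both function names are illustrative.

```python
def peak_gflops(cuda_cores, clock_ghz, flops_per_cycle=2):
    """Peak FP32 GFLOPS, assuming one FMA (2 FLOPs) per core per cycle."""
    return cuda_cores * flops_per_cycle * clock_ghz


def implied_clock_ghz(gflops, cuda_cores, flops_per_cycle=2):
    """Shader clock implied by a quoted peak throughput (same assumption)."""
    return gflops / (cuda_cores * flops_per_cycle)


cores = 15 * 32                         # 15 SMs x 32 CUDA cores per SM
clock = implied_clock_ghz(1345, cores)  # from the quoted 1345 GFLOPs
print(cores, round(clock, 3))           # 480 1.401
```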

2.1 Video Cards

2.1.1 GeForce 400 Series

Release date: April 12, 2010
Codename: GF10x
Architecture: Fermi

Models:

GeForce Series
GeForce GT Series
GeForce GTS Series
GeForce GTX Series

Fabrication process and transistors:

260M 40 nm (GT218)
585M 40 nm (GF108)
1,170M 40 nm (GF106)
1,950M 40 nm (GF104)
1,950M 40 nm (GF114)
3,200M 40 nm (GF100)

Cards:

Entry-level GT420 GT430
Mid-range GT440 GTS450 GTX460
High-end GTX465 GTX470
Enthusiast GTX480 (2010.3, 3.2 billion Transistors, 15 SMs, 1536MB, 1345 GFLOPS, 250W)

2.1.2 GeForce 500 Series

Release date: 8 November 2010
Codename: GF11x
Architecture: Fermi

Models:

GeForce Series
GeForce GT Series
GeForce GTX Series

Fabrication process and transistors:

292M 40 nm (GF119)
585M 40 nm (GF108)
1,170M 40 nm (GF116)
1,950M 40 nm (GF114)
3,000M 40 nm (GF110)

Cards:

Entry-level 510 GT520 GT530
Mid-range GT545 GTX550Ti GTX560 GTX560Ti
High-end GTX570 GTX580 GTX590(2011.3, 2x3 billion transistors, 32 SMs, 2x1536MB, 2488GFLOPS, 365W)

2.2 GPGPU Cards

Goto: http://wiki.jackslab.org/Nvidia_GPU_Architecture#Nvidia_Tesla_GPGPU_Cards

3 Kepler Micro Architecture

Released in 2012: GeForce GTX 680 (GK104) in March, the GK110-based Tesla K20/K20X in the fall

1 SM (SMX): 4 Warp Schedulers (2 instruction dispatchers per Warp Scheduler)
1 Warp: [32 threads context ?]
1 SM: 192 CUDA cores + 64 DPU (shared) + 32 Load/Store Unit (shared) + 32 SPU (shared) + 4 Warp Scheduler

Handle 64 warps/SM for a total of 2048 (64x32) threads resident in a single SM at a time [64 Warps context ?]
Register file size is 64K 32-bit registers per SM (256KB)
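
The register file size and the warp limit together bound how many registers each thread can use before occupancy drops. A minimal sketch, assuming "64K" means 65,536 32-bit registers per SM; it ignores register allocation granularity and the other occupancy limits (shared memory, blocks per SM), and the function names are illustrative.

```python
REGISTERS_PER_SM = 64 * 1024  # assumption: 65,536 32-bit registers per SM
MAX_WARPS_PER_SM = 64         # Kepler limit quoted above
WARP_SIZE = 32


def registers_per_thread(resident_warps):
    """Registers available to each thread when resident_warps warps are resident."""
    return REGISTERS_PER_SM // (resident_warps * WARP_SIZE)


def max_resident_warps(regs_per_thread):
    """Warps that fit in the register file, capped by the per-SM warp limit."""
    by_registers = REGISTERS_PER_SM // (regs_per_thread * WARP_SIZE)
    return min(MAX_WARPS_PER_SM, by_registers)


print(registers_per_thread(64))  # 32 registers per thread at full occupancy
print(max_resident_warps(64))    # 32 warps if each thread needs 64 registers
```

Under these assumptions a kernel that needs more than 32 registers per thread cannot keep all 64 warps resident.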

K20X:

14 SM
2688 CUDA cores, 6GB
3.935 TFLOPs / DPU: 1.312 TFLOPs
28 nm
235Watts

GTX690:

2x8 SM
3072 CUDA cores
2x2.8TFLOPs
2x3.54 billion transistors
300Watts (2012.4)

3.1 Video Cards

3.1.1 GeForce 600 series

Release date: March 22, 2012
Codename: GK10x

Models

GeForce Series
GeForce GT Series
GeForce GTX Series

Fabrication process and transistors

292M 40 nm (GF119)
585M 40 nm (GF108)
1,170M 40 nm (GF116)
1,950M 40 nm (GF114)
1,270M 28 nm (GK107)
1,270M 28 nm (GK208)
2,540M 28 nm (GK106)
3,540M 28 nm (GK104)

Cards:

Entry-level GT610 GT620 GT630 GT640
Mid-range GTX650 GTX650Ti GTX650Ti Boost GTX 660
High-end GTX660Ti GTX670
Enthusiast GTX680 GTX690

3.1.2 GeForce 700 series

Release date: May 2013
Codename: GK110 GK208

Models:

GeForce Series
GeForce GT Series
GeForce GTX Series

Fabrication process and transistors:

585M 28 nm (GF117)
1,020M 28 nm (GK208)
1,270M 28 nm (GK107)
3,540M 28 nm (GK104)
7,080M 28 nm (GK110)

Cards

Entry-level: GeForce GT 705 GeForce GT 710 GeForce GT 720 GeForce GT 730 GeForce GT 740 GeForce GTX 745
Mid-range: GeForce GTX 750 GeForce GTX 750 Ti GeForce GTX 760 192-Bit GeForce GTX 760 GeForce GTX 760 Ti
High-end: GeForce GTX 770 GeForce GTX 780
Enthusiast: GeForce GTX 780 Ti GeForce GTX Titan GeForce GTX Titan Black GeForce GTX Titan Z

3.2 GPGPU Cards

Goto: http://wiki.jackslab.org/Nvidia_GPU_Architecture#Nvidia_Tesla_GPGPU_Cards

4 Maxwell Micro Architecture

The SM arch of Maxwell GM204:

1 SM (SMM): 4 Warp Schedulers (2 instruction dispatchers per Warp Scheduler)
1 Warp: 32 CUDA cores + 1 DPU + 8 Load/Store Units + 8 SPU
1 SM (SMM): 128 CUDA cores + 4 DPU + 32 Load/Store Units + 32 SPU
e.g. GTX980: 16 SM (SMM), 2048 CUDA cores, 64 DPUs, 4612 GFLOPs / DPU: 144 GFLOPs, 28 nm, 5.2 billion transistors, 165W
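
The chip-level unit counts follow from the per-SM layout multiplied by the number of SMs. A minimal sketch using the SMM figures above; the dictionary keys and the function name chip_totals are illustrative.

```python
# Per-SM (SMM) unit counts for Maxwell GM204, copied from the list above.
SMM = {"cuda_cores": 128, "dpu": 4, "ld_st": 32, "spu": 32}


def chip_totals(sm_count, per_sm=SMM):
    """Scale the per-SM unit counts to a chip with sm_count SMs."""
    return {unit: count * sm_count for unit, count in per_sm.items()}


gtx980 = chip_totals(16)  # GTX980: 16 SMMs
print(gtx980)
# {'cuda_cores': 2048, 'dpu': 64, 'ld_st': 512, 'spu': 512}
```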

The arch of Maxwell GM204:

TITAN X (GM200):

4.1 Video Cards

4.1.1 GeForce 900 series

Release date: September 2014
Codename: GM20x

Models

GeForce Series
GeForce GT Series
GeForce GTX Series

Cards

Mid-range GTX950 / GTX960
High-end GTX970 / GTX980
Enthusiast GTX980 Ti / GTX Titan X

4.2 GPGPU Cards

Goto: http://wiki.jackslab.org/Nvidia_GPU_Architecture#Nvidia_Tesla_GPGPU_Cards

5 Pascal Micro Architecture

The SM arch of Pascal GP100

1 SM: 2 Warp Schedulers (2 instruction dispatchers per Warp Scheduler)
1 Warp: 32 CUDA cores + 16 DPU + 8 Load/Store Units + 8 SPU
1 SM: 64 CUDA cores + 32 DPU + 16 Load/Store Units + 16 SPU
e.g. Tesla P100: 60 SMs (56 enabled), 3584 CUDA cores, 1792 DPUs, 16GB, 9.5 TFLOPs / DPU: 4.7 TFLOPs, 300Watts

The arch of Pascal GP100:

The SM arch of Pascal GP104

1 SM: 4 Warp Schedulers (2 instruction dispatchers per Warp Scheduler)
1 Warp: 32 CUDA cores + 1 DPU + 8 Load/Store Units + 8 SPU
1 SM: 128 CUDA cores + 4 DPU + 32 Load/Store Units + 32 SPU
e.g. GTX1080, GTX1080Ti, TITAN X
GTX1080 (GP104): 20 SMs, 2560 CUDA cores, 80 DPUs, 16nm, 7.2billion, 8GB, 8.2 TFLOPs / DPU: 257 GFLOPs, 180Watts
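
The ratio of DPUs to CUDA cores in each SM layout largely determines the double-precision throughput relative to single precision. A minimal sketch that compares the unit ratio with the ratio of the throughputs quoted in these notes; the per-SM counts and TFLOPs values are copied from the sections above, the labels and the function name unit_ratio are illustrative, and small mismatches are expected because the quoted figures are rounded.

```python
from fractions import Fraction

# (DPUs per SM, CUDA cores per SM, quoted FP64 TFLOPs, quoted FP32 TFLOPs)
layouts = {
    "Kepler SMX (K20X)": (64, 192, 1.312, 3.935),
    "Maxwell SMM (GTX980)": (4, 128, 0.144, 4.612),
    "Pascal GP100 (Tesla P100)": (32, 64, 4.7, 9.5),
    "Pascal GP104 (GTX1080)": (4, 128, 0.257, 8.2),
}


def unit_ratio(dpu, cores):
    """Exact ratio of double-precision units to CUDA cores in one SM."""
    return Fraction(dpu, cores)


for name, (dpu, cores, fp64, fp32) in layouts.items():
    print(f"{name}: units {unit_ratio(dpu, cores)}, quoted {fp64 / fp32:.4f}")
```

The compute-oriented parts (K20X at 1/3, Tesla P100 at 1/2) keep far more double-precision hardware per SM than the GeForce parts at 1/32.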

The arch of Pascal GP104:

5.1 Video Cards

5.1.1 GeForce 1000 series

Release date: May 2016
Codename: GP10x

Models

GeForce GTX Series

Fabrication process and transistors:

3.3B 14 nm (GP107)
4.4B 16 nm (GP106)
7.2B 16 nm (GP104)
12B 16 nm (GP102)

Cards:

Entry-level: GTX1050 / GTX1050 Ti
Mid-range: GTX1060
High-end: GTX1070 / GTX1080(2016.5, 2560 CUDA cores, 20 SMs, 16nm, 7.2billion, 8GB, 8.2TFLOPs / DPU: 0.257TFLOPs, 180Watts)
Enthusiast: GTX1080 Ti / NVIDIA Titan X(2016.8, 3584 CUDA cores, 28 SMs, 16nm, 12billion, 12GB, 10TFLOPs / DPU: 0.317TFLOPs, 250Watts)

5.2 GPGPU Cards

Goto: http://wiki.jackslab.org/Nvidia_GPU_Architecture#Nvidia_Tesla_GPGPU_Cards

6 Nvidia Tesla GPGPU Cards

Tesla products target the high-performance computing market.

As of 2012, Nvidia Teslas power some of the world's fastest supercomputers, including Titan at Oak Ridge National Laboratory and Tianhe-1A, in Tianjin, China.