Nvidia GPU Architecture

Latest revision as of 13:42, 8 April 2022 (Friday)

1 Overview

The GPU architecture is built around a scalable array of Streaming Multiprocessors (SMs).

The key components of an SM:

CUDA cores (ALU + FPU)
Double Precision Units (DPU)
Special Function Units (SFU; written SPU on this page)
Load/Store Units (LD/ST)

Register File
Shared Memory/L1 Cache
Warp Scheduler

Cards:

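GTX980 (2048 CUDA cores, 16 SMs, 28 nm, 5.2 billion transistors, 4 GB, 4.981 TFLOPS / DPU: 0.1556 TFLOPS, 165 W, 2014.9, $549). Source: EVGA GTX980, https://www.techpowerup.com/gpu-specs/evga-gtx-980.b3061
GTX1050 (640 CUDA cores, 5 SMs, 14 nm, 3.3 billion transistors, 4 GB, 1.458 TFLOPS / DPU: 0.04556 TFLOPS, 75 W, 2018.1), Pascal GP107. Source: NVIDIA GeForce GTX 1050 Max-Q, https://www.techpowerup.com/gpu-specs/geforce-gtx-1050-max-q.c3074
GTX1050 Ti (768 CUDA cores, 6 SMs, 14 nm, 3.3 billion transistors, 4 GB, 1.983 TFLOPS / DPU: 0.06197 TFLOPS, 75 W, 2018.1), Pascal GP107. Source: NVIDIA GeForce GTX 1050 Ti Max-Q, https://www.techpowerup.com/gpu-specs/geforce-gtx-1050-ti-max-q.c3075
RTX3050 (2048 CUDA cores, 16 SMs, 64 Tensor Cores, 16 RT Cores, 8 nm, 12 billion transistors, 4 GB, 4.329 TFLOPS / DPU: 0.06765 TFLOPS, 75 W, 2021.5), Ampere GA107. Source: NVIDIA GeForce RTX 3050 Mobile, https://www.techpowerup.com/gpu-specs/geforce-rtx-3050-mobile.c3788
RTX3050 Ti (2560 CUDA cores, 20 SMs, 80 Tensor Cores, 20 RT Cores, 8 nm, 12 billion transistors, 4 GB, 5.299 TFLOPS / DPU: 0.08280 TFLOPS, 75 W, 2021.5), Ampere GA107. Source: NVIDIA GeForce RTX 3050 Ti Mobile, https://www.techpowerup.com/gpu-specs/geforce-rtx-3050-ti-mobile.c3812
Mid-range: GTX1060
High-end: GTX1070 / GTX1080 (2560 CUDA cores, 20 SMs, 16 nm, 7.2 billion transistors, 8 GB, 8.2 TFLOPS / DPU: 0.257 TFLOPS, 180 W, 2016.5)
GTX1080 Ti / TITAN X (3584 CUDA cores, 28 SMs, 16 nm, 12 billion transistors, 12 GB, 10 TFLOPS / DPU: 0.317 TFLOPS, 250 W, 2016.8)
TITAN X (3584 CUDA cores, 28 SMs, 16 nm, 12 billion transistors, 12 GB, 10.97 TFLOPS / DPU: 0.3429 TFLOPS, 250 W, 2016.8), Pascal GP102. Source: NVIDIA TITAN X Pascal, https://www.techpowerup.com/gpu-specs/titan-x-pascal.c2863
RTX3070
RTX3080 (8704 CUDA cores, 68 SMs, 272 Tensor Cores, 68 RT Cores, 8 nm, 28.3 billion transistors, 10 GB, 29.77 TFLOPS / DPU: 0.465 TFLOPS, 320 W, 2020.9, $699). Source: RTX3080, https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621
RTX3080 Ti (10240 CUDA cores, 80 SMs, 320 Tensor Cores, 80 RT Cores, 8 nm, 28.3 billion transistors, 12 GB, 34.10 TFLOPS / DPU: 0.5328 TFLOPS, 350 W, 2021.5, $1199), Ampere GA102. Source: RTX3080 Ti, https://www.techpowerup.com/gpu-specs/geforce-rtx-3080-ti.c3735
RTX3090 Ti (10752 CUDA cores, 84 SMs, 336 Tensor Cores, 84 RT Cores, 8 nm, 28.3 billion transistors, 24 GB, 40 TFLOPS / DPU: 0.625 TFLOPS, 450 W, 2022.1), Ampere GA102. Sources: Nvidia RTX3090 Ti, https://www.techpowerup.com/gpu-specs/geforce-rtx-3090-ti.c3829 and RTX3090, https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622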
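
As a quick consistency check on the list above, the CUDA-core count of each card should divide evenly by its SM count. The sketch below uses only (CUDA cores, SMs) pairs copied from the list. The `CARDS` table and the helper name `cores_per_sm` are illustrative and not part of any NVIDIA tool.

```python
# (CUDA cores, SMs) pairs copied from the card list above.
CARDS = {
    "GTX980": (2048, 16),
    "GTX1050": (640, 5),
    "GTX1050 Ti": (768, 6),
    "RTX3050": (2048, 16),
    "RTX3050 Ti": (2560, 20),
    "GTX1080": (2560, 20),
    "TITAN X": (3584, 28),
    "RTX3080": (8704, 68),
    "RTX3080 Ti": (10240, 80),
    "RTX3090 Ti": (10752, 84),
}


def cores_per_sm(cores: int, sms: int) -> int:
    """Return CUDA cores per SM, failing if the division is not exact."""
    per_sm, remainder = divmod(cores, sms)
    if remainder:
        raise ValueError(f"{cores} cores do not divide evenly over {sms} SMs")
    return per_sm


if __name__ == "__main__":
    for name, (cores, sms) in CARDS.items():
        print(f"{name}: {cores_per_sm(cores, sms)} CUDA cores per SM")
```

Every card in the list works out to 128 CUDA cores per SM, so a typo in either column would show up as a `ValueError` or as an odd per-SM value.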

2 Fermi Micro Architecture

The Fermi architecture was the first complete GPU computing architecture to deliver the features required for the most demanding HPC applications.

1 SM: 32 CUDA cores + 16 Load/Store Units + 4 SPUs + 2 warp schedulers
1 SM: issues instructions from 2 warps concurrently (one per warp scheduler)
1 warp (32 threads): each instruction is issued to 16 CUDA cores, to the 16 shared Load/Store Units, or to the 4 shared SPUs
Each SM keeps up to 48 warps resident, for a total of 1536 (48 x 32) threads at a time
1 CUDA core: 1 ALU + 1 FPU
Register file: 32K 32-bit registers per SM (128 KB)
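
The counts above imply some simple arithmetic, sketched here under the assumption that each execution unit handles one thread per pass. A 32-thread warp then needs several passes when it is issued to fewer than 32 units. The function names are illustrative only.

```python
import math

WARP_SIZE = 32        # threads per warp
RESIDENT_WARPS = 48   # warps resident per Fermi SM (from the list above)

# Units one warp instruction can be issued to (from the list above).
UNITS = {"CUDA cores": 16, "Load/Store Units": 16, "SPU": 4}


def resident_threads(warps: int, warp_size: int = WARP_SIZE) -> int:
    """Threads an SM keeps resident at once."""
    return warps * warp_size


def issue_passes(units: int, warp_size: int = WARP_SIZE) -> int:
    """Passes needed to push one warp through `units` lanes,
    assuming each lane handles one thread per pass."""
    return math.ceil(warp_size / units)


if __name__ == "__main__":
    print("resident threads:", resident_threads(RESIDENT_WARPS))
    for name, count in UNITS.items():
        print(f"{name}: {issue_passes(count)} passes per warp")
```

Under this model a warp takes 2 passes on 16 CUDA cores and 8 passes on the 4 SPUs, which is one way to read why the SM keeps many more warps resident than it can issue at once.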

GTX480:

15 SMs (32 CUDA cores/SM)
480 CUDA cores
1345 GFLOPS
40 nm
3.2 billion transistors
250 W
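
Peak single-precision throughput is conventionally counted as CUDA cores x 2 FLOPs per cycle (one fused multiply-add) x clock frequency. Assuming that convention applies to the GTX480 figures above, the sketch below derives the clock they imply. The clock itself is not stated on this page, and the function names are hypothetical.

```python
FLOPS_PER_CORE_PER_CYCLE = 2  # assumption: one fused multiply-add per cycle


def peak_gflops(cuda_cores: int, clock_ghz: float) -> float:
    """Peak single-precision GFLOPS for a given core count and clock."""
    return cuda_cores * FLOPS_PER_CORE_PER_CYCLE * clock_ghz


def implied_clock_ghz(gflops: float, cuda_cores: int) -> float:
    """Clock (GHz) at which `cuda_cores` would reach `gflops`."""
    return gflops / (cuda_cores * FLOPS_PER_CORE_PER_CYCLE)


if __name__ == "__main__":
    clock = implied_clock_ghz(1345, 480)  # GTX480 figures from the list above
    print(f"implied clock: {clock:.3f} GHz")
```

For the GTX480 (480 CUDA cores, 1345 GFLOPS) this gives an implied clock of roughly 1.40 GHz.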

2.1 Video Cards

2.1.1 GeForce 400 Series

Release date: April 12, 2010
Codename: GF10x
Architecture: Fermi

Models:

GeForce Series
GeForce GT Series
GeForce GTS Series
GeForce GTX Series

Fabrication process and transistors:

260M 40 nm (GT218)
585M 40 nm (GF108)
1,170M 40 nm (GF106)
1,950M 40 nm (GF104)
1,950M 40 nm (GF114)
3,200M 40 nm (GF100)

Cards:

Entry-level: GT420 GT430
Mid-range: GT440 GTS450 GTX460
High-end: GTX465 GTX470
Enthusiast: GTX480 (2010.3, 3.2 billion transistors, 15 SMs, 1536 MB, 1345 GFLOPS, 250 W)

2.1.2 GeForce 500 Series

Release date: 8 November 2010
Codename: GF11x
Architecture: Fermi

Models:

GeForce Series
GeForce GT Series
GeForce GTX Series

Fabrication process and transistors:

292M 40 nm (GF119)
585M 40 nm (GF108)
1,170M 40 nm (GF116)
1,950M 40 nm (GF114)
3,000M 40 nm (GF110)

Cards:

Entry-level: 510 GT520 GT530
Mid-range: GT545 GTX550Ti GTX560 GTX560Ti
High-end: GTX570 GTX580 GTX590 (2011.3, 2x3 billion transistors, 32 SMs, 2x1536 MB, 2488 GFLOPS, 365 W)

2.2 GPGPU Cards

See: http://wiki.jackslab.org/Nvidia_GPU_Architecture#Nvidia_Tesla_GPGPU_Cards

3 Kepler Micro Architecture

GK110 (Tesla K20/K20X) was released in the fall of 2012; the first Kepler GPU, GK104, shipped in March 2012.

1 SM (SMX): 4 warp schedulers (2 instruction dispatch units per warp scheduler)
1 warp: 32 threads
1 SM (SMX): 192 CUDA cores + 64 DPUs (shared) + 32 Load/Store Units (shared) + 32 SPUs (shared) + 4 warp schedulers

Each SM keeps up to 64 warps resident, for a total of 2048 (64 x 32) threads at a time
Register file: 64K 32-bit registers per SM (256 KB)
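
The register file and the resident-warp limit interact. Assuming the 64K figure counts 32-bit registers, a fully occupied SM leaves each thread 32 registers, and kernels that use more registers per thread keep fewer warps resident. The sketch below is a simplified model that ignores allocation granularity and other limits such as shared memory. The function names are illustrative.

```python
KEPLER_REGISTERS_PER_SM = 64 * 1024  # assumption: "64K" counts 32-bit registers
KEPLER_MAX_WARPS_PER_SM = 64         # from the list above
WARP_SIZE = 32                       # threads per warp


def registers_per_thread_at_full_occupancy(
    regs_per_sm: int, max_warps: int, warp_size: int = WARP_SIZE
) -> int:
    """Registers available per thread when every warp slot is in use."""
    return regs_per_sm // (max_warps * warp_size)


def resident_warps(
    regs_per_thread: int, regs_per_sm: int, max_warps: int,
    warp_size: int = WARP_SIZE,
) -> int:
    """Warps that fit when registers are the only limit (simplified)."""
    by_registers = regs_per_sm // (regs_per_thread * warp_size)
    return min(max_warps, by_registers)


if __name__ == "__main__":
    print(registers_per_thread_at_full_occupancy(
        KEPLER_REGISTERS_PER_SM, KEPLER_MAX_WARPS_PER_SM))
    for regs in (16, 32, 64, 128):
        warps = resident_warps(
            regs, KEPLER_REGISTERS_PER_SM, KEPLER_MAX_WARPS_PER_SM)
        print(f"{regs} registers/thread -> {warps} resident warps")
```

In this model, doubling register use from 32 to 64 per thread halves the resident warps from 64 to 32.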

K20X:

14 SMs
2688 CUDA cores, 6 GB
3.935 TFLOPS / DPU: 1.312 TFLOPS
28 nm
235 W
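
The K20X figures can be cross-checked against the per-SM unit counts in the Kepler SM description (192 CUDA cores and 64 DPUs per SM). A minimal sketch, assuming double-precision throughput scales with the DPU-to-CUDA-core ratio; the constant and function names are illustrative.

```python
from fractions import Fraction

SM_COUNT = 14            # K20X, from the list above
CUDA_CORES_PER_SM = 192  # from the Kepler SM description
DPU_PER_SM = 64          # from the Kepler SM description


def chip_total(per_sm: int, sm_count: int = SM_COUNT) -> int:
    """Total number of a given unit across all SMs."""
    return per_sm * sm_count


def dp_to_sp_unit_ratio() -> Fraction:
    """Ratio of double-precision units to CUDA cores in one SM."""
    return Fraction(DPU_PER_SM, CUDA_CORES_PER_SM)


if __name__ == "__main__":
    print("CUDA cores:", chip_total(CUDA_CORES_PER_SM))
    print("DPUs:", chip_total(DPU_PER_SM))
    print("unit ratio:", dp_to_sp_unit_ratio())
    print("quoted TFLOPS ratio:", round(1.312 / 3.935, 3))
```

The unit ratio is 1/3, and the quoted 1.312 / 3.935 TFLOPS ratio agrees with it to about three decimal places.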

GTX690:

2x8 SMs
3072 CUDA cores
2x2.8 TFLOPS
2x3.54 billion transistors
300 W (2012.4)

3.1 Video Cards

3.1.1 GeForce 600 series

Release date: March 22, 2012
Codename: GK10x

Models

GeForce Series
GeForce GT Series
GeForce GTX Series

Fabrication process and transistors

292M 40 nm (GF119)
585M 40 nm (GF108)
1,170M 40 nm (GF116)
1,950M 40 nm (GF114)
1,270M 28 nm (GK107)
1,270M 28 nm (GK208)
2,540M 28 nm (GK106)
3,540M 28 nm (GK104)

Cards:

Entry-level GT610 GT620 GT630 GT640
Mid-range GTX650 GTX650Ti GTX650Ti Boost GTX 660
High-end GTX660Ti GTX670
Enthusiast GTX680 GTX690

3.1.2 GeForce 700 series

Release date: May 2013
Codename: GK110 GK208

Models:

GeForce Series
GeForce GT Series
GeForce GTX Series

Fabrication process and transistors:

585M 28 nm (GF117)
1,020M 28 nm (GK208)
1,270M 28 nm (GK107)
3,540M 28 nm (GK104)
7,080M 28 nm (GK110)

Cards

Entry-level: GeForce GT 705 GeForce GT 710 GeForce GT 720 GeForce GT 730 GeForce GT 740 GeForce GTX 745
Mid-range: GeForce GTX 750 GeForce GTX 750 Ti GeForce GTX 760 192-Bit GeForce GTX 760 GeForce GTX 760 Ti
High-end: GeForce GTX 770 GeForce GTX 780
Enthusiast: GeForce GTX 780 Ti GeForce GTX Titan GeForce GTX Titan Black GeForce GTX Titan Z

3.2 GPGPU Cards

See: http://wiki.jackslab.org/Nvidia_GPU_Architecture#Nvidia_Tesla_GPGPU_Cards

4 Maxwell Micro Architecture

The SM architecture of Maxwell GM204 (figure: Maxwell-GTX980-SM-arch.png):

1 SM (SMM): 4 warp schedulers (2 instruction dispatch units per warp scheduler)
1 warp scheduler (processing block): 32 CUDA cores + 1 DPU + 8 Load/Store Units + 8 SPUs
1 SM (SMM): 128 CUDA cores + 4 DPUs + 32 Load/Store Units + 32 SPUs
e.g. GTX980: 16 SMs (SMM), 2048 CUDA cores, 64 DPUs, 4612 GFLOPS / DPU: 144 GFLOPS, 28 nm, 5.2 billion transistors, 165 W
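
The per-SM and per-chip totals above follow from the per-warp-scheduler unit counts by multiplication. The sketch below makes that roll-up explicit. The `Units` class and the constant names are illustrative, and the model assumes all four scheduler partitions of an SMM and all 16 SMMs are identical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Units:
    """Execution-unit counts at one level of the hierarchy."""
    cuda_cores: int
    dpu: int
    load_store: int
    spu: int

    def times(self, n: int) -> "Units":
        """Unit counts for n identical copies of this level."""
        return Units(self.cuda_cores * n, self.dpu * n,
                     self.load_store * n, self.spu * n)


# Per-warp-scheduler counts, copied from the list above.
PER_SCHEDULER = Units(cuda_cores=32, dpu=1, load_store=8, spu=8)
SCHEDULERS_PER_SMM = 4
SMMS_IN_GTX980 = 16

SMM = PER_SCHEDULER.times(SCHEDULERS_PER_SMM)
GTX980 = SMM.times(SMMS_IN_GTX980)

if __name__ == "__main__":
    print("SMM:", SMM)
    print("GTX980:", GTX980)
```

The roll-up reproduces the 128 CUDA cores and 4 DPUs per SMM and the 2048 CUDA cores and 64 DPUs quoted for the GTX980.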

The architecture of Maxwell GM204 (figure: Maxwell-arch.png)

TITAN X (GM200) (figure: TITAN-X-arch.png)

4.1 Video Cards

4.1.1 GeForce 900 series

Release date: September 2014
Codename: GM20x

Models

GeForce Series
GeForce GT Series
GeForce GTX Series

Cards

Mid-range GTX950 / GTX960
High-end GTX970 / GTX980
Enthusiast GTX980 Ti / GTX Titan X

4.2 GPGPU Cards

See: http://wiki.jackslab.org/Nvidia_GPU_Architecture#Nvidia_Tesla_GPGPU_Cards

5 Pascal Micro Architecture

The SM architecture of Pascal GP100:

1 SM: 2 warp schedulers (2 instruction dispatch units per warp scheduler)
1 warp scheduler (processing block): 32 CUDA cores + 16 DPUs + 8 Load/Store Units + 8 SPUs
1 SM: 64 CUDA cores + 32 DPUs + 16 Load/Store Units + 16 SPUs
e.g. Tesla P100: 60 SMs (56 enabled), 3584 CUDA cores, 1792 DPUs, 16 GB, 9.5 TFLOPS / DPU: 4.7 TFLOPS, 300 W

The architecture of Pascal GP100 (figure: Pascal-GP100-arch.png)

The SM architecture of Pascal GP104:

1 SM: 4 warp schedulers (2 instruction dispatch units per warp scheduler)
1 warp scheduler (processing block): 32 CUDA cores + 1 DPU + 8 Load/Store Units + 8 SPUs
1 SM: 128 CUDA cores + 4 DPUs + 32 Load/Store Units + 32 SPUs
e.g. GTX1080 (GP104); the GTX1080 Ti and TITAN X use the same SM design on the larger GP102 chip
GTX1080 (GP104): 20 SMs, 2560 CUDA cores, 80 DPUs, 16 nm, 7.2 billion transistors, 8 GB, 8.2 TFLOPS / DPU: 257 GFLOPS, 180 W

The architecture of Pascal GP104 (figure: Pascal-GP104-arch.png)

5.1 Video Cards

5.1.1 GeForce 1000 series

Release date: May 2016
Codename: GP10x

Models

GeForce GTX Series

Fabrication process and transistors:

3.3B 14 nm (GP107)
4.4B 16 nm (GP106)
7.2B 16 nm (GP104)
12B 16 nm (GP102)

Cards:

Entry-level: GTX1050 / GTX1050 Ti
Mid-range: GTX1060
High-end: GTX1070 / GTX1080 (2016.5, 2560 CUDA cores, 20 SMs, 16 nm, 7.2 billion transistors, 8 GB, 8.2 TFLOPS / DPU: 0.257 TFLOPS, 180 W)
Enthusiast: GTX1080 Ti / NVIDIA Titan X (2016.8, 3584 CUDA cores, 28 SMs, 16 nm, 12 billion transistors, 12 GB, 10 TFLOPS / DPU: 0.317 TFLOPS, 250 W)

5.2 GPGPU Cards

See: http://wiki.jackslab.org/Nvidia_GPU_Architecture#Nvidia_Tesla_GPGPU_Cards

6 Nvidia Tesla GPGPU Cards

Tesla products target the high-performance computing market.

As of 2012, Nvidia Teslas power some of the world's fastest supercomputers, including Titan at Oak Ridge National Laboratory and Tianhe-1A, in Tianjin, China.

6.1 Overview

Figure: Nvidia-tesla-lineup-1.jpg

