Nvidia GPU Architecture
Latest revision as of 13:42, 8 April 2022 (Friday)
1 Overview
The GPU architecture is built around a scalable array of Streaming Multiprocessors (SMs).
The key components of an SM:
- CUDA cores (ALU + FPU)
- Double Precision Units (DPU)
- Special Function Units (SFU; abbreviated SPU on this page)
- Load/Store Units (LD/ST)
- Register File
- Shared Memory/L1 Cache
- Warp Scheduler
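Cards:
- GTX980 (2048 CUDA cores, 16 SMs, 28 nm, 5.2 billion transistors, 4 GB, 4.981 TFLOPs / DPU: 0.1556 TFLOPs, 165 W, 2014.9, $549). Source: evga gtx980 (https://www.techpowerup.com/gpu-specs/evga-gtx-980.b3061)
- GTX1050 (640 CUDA cores, 5 SMs, 14 nm, 3.3 billion transistors, 4 GB, 1.458 TFLOPs / DPU: 0.04556 TFLOPs, 75 W, 2018.1), Pascal GP107. Source: NVIDIA GeForce GTX 1050 Max-Q (https://www.techpowerup.com/gpu-specs/geforce-gtx-1050-max-q.c3074)
- GTX1050 Ti (768 CUDA cores, 6 SMs, 14 nm, 3.3 billion transistors, 4 GB, 1.983 TFLOPs / DPU: 0.06197 TFLOPs, 75 W, 2018.1), Pascal GP107. Source: NVIDIA GeForce GTX 1050 Ti Max-Q (https://www.techpowerup.com/gpu-specs/geforce-gtx-1050-ti-max-q.c3075)
- RTX3050 (2048 CUDA cores, 16 SMs, 64 Tensor Cores, 16 RT Cores, 8 nm, 12 billion transistors, 4 GB, 4.329 TFLOPs / DPU: 0.06765 TFLOPs, 75 W, 2021.5), Ampere GA107. Source: NVIDIA GeForce RTX 3050 Mobile (https://www.techpowerup.com/gpu-specs/geforce-rtx-3050-mobile.c3788)
- RTX3050 Ti (2560 CUDA cores, 20 SMs, 80 Tensor Cores, 20 RT Cores, 8 nm, 12 billion transistors, 4 GB, 5.299 TFLOPs / DPU: 0.08280 TFLOPs, 75 W, 2021.5), Ampere GA107. Source: NVIDIA GeForce RTX 3050 Ti Mobile (https://www.techpowerup.com/gpu-specs/geforce-rtx-3050-ti-mobile.c3812)
- Mid-range: GTX1060
- High-end: GTX1070 / GTX1080 (2560 CUDA cores, 20 SMs, 16 nm, 7.2 billion transistors, 8 GB, 8.2 TFLOPs / DPU: 0.257 TFLOPs, 180 W, 2016.5)
- GTX1080 Ti / TITAN X (3584 CUDA cores, 28 SMs, 16 nm, 12 billion transistors, 12 GB, 10 TFLOPs / DPU: 0.317 TFLOPs, 250 W, 2016.8)
- TITAN X (3584 CUDA cores, 28 SMs, 16 nm, 12 billion transistors, 12 GB, 10.97 TFLOPs / DPU: 0.3429 TFLOPs, 250 W, 2016.8), Pascal GP102. Source: NVIDIA TITAN X Pascal (https://www.techpowerup.com/gpu-specs/titan-x-pascal.c2863)
- RTX3070
- RTX3080 (8704 CUDA cores, 68 SMs, 272 Tensor Cores, 68 RT Cores, 8 nm, 28.3 billion transistors, 10 GB, 29.77 TFLOPs / DPU: 0.465 TFLOPs, 320 W, 2020.9, $699). Sources: RTX3080 (https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621), RTX3080 Ti (https://www.techpowerup.com/gpu-specs/geforce-rtx-3080-ti.c3735)
- RTX3080 Ti (10240 CUDA cores, 80 SMs, 320 Tensor Cores, 80 RT Cores, 8 nm, 28.3 billion transistors, 12 GB, 34.10 TFLOPs / DPU: 0.5328 TFLOPs, 350 W, 2021.5, $1199), Ampere GA102. Source: RTX3080 Ti (https://www.techpowerup.com/gpu-specs/geforce-rtx-3080-ti.c3735)
- RTX3090 Ti (10752 CUDA cores, 84 SMs, 336 Tensor Cores, 84 RT Cores, 8 nm, 28.3 billion transistors, 24 GB, 40 TFLOPs / DPU: 0.625 TFLOPs, 450 W, 2022.1), Ampere GA102. Sources: Nvidia RTX3090 Ti (https://www.techpowerup.com/gpu-specs/geforce-rtx-3090-ti.c3829), RTX3090 (https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622)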
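The FP32 and DPU (FP64) figures in the card list imply a fixed single-to-double precision ratio per chip family. The following is a minimal Python sketch that recovers that ratio using only numbers copied from the list; the names `CARDS` and `fp32_to_fp64_ratio` are illustrative and do not come from any Nvidia tool.

```python
# Peak throughput in TFLOPs, copied from the card list: (FP32, FP64).
CARDS = {
    "GTX980": (4.981, 0.1556),
    "GTX1050": (1.458, 0.04556),
    "TITAN X (Pascal)": (10.97, 0.3429),
    "RTX3050": (4.329, 0.06765),
    "RTX3080": (29.77, 0.465),
}


def fp32_to_fp64_ratio(fp32_tflops: float, fp64_tflops: float) -> int:
    """Return the FP32:FP64 throughput ratio, rounded to the nearest integer."""
    return round(fp32_tflops / fp64_tflops)


for name, (fp32, fp64) in CARDS.items():
    print(f"{name}: FP32 is {fp32_to_fp64_ratio(fp32, fp64)}x FP64")
```

With these inputs the Maxwell and Pascal GeForce cards come out at 32:1 and the Ampere cards at 64:1.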
2 Fermi Micro Architecture
The Fermi architecture was the first complete GPU computing architecture to deliver the features required for the most demanding HPC applications.
- 1 SM: 32 CUDA cores + 16 Load/Store Units + 4 SPUs + 2 Warp Schedulers
- 1 SM: 2 Warps issued concurrently (one per Warp Scheduler)
- 1 Warp: 16 CUDA cores + 16 Load/Store Units (shared) + 4 SPUs (shared) + [32 threads context ?]
- Handles 48 warps per SM, for a total of 1536 (48x32) threads resident in a single SM at a time [48 Warps context ?]
- 1 CUDA core: 1 ALU + 1 FPU
- Register file: 32K (32,768) 32-bit registers per SM, i.e. 128 KB
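The resident-thread figure in the list is simply the warp count times the warp size, and it also bounds how many registers each thread can use when the SM is fully occupied. The sketch below works this out in Python. It assumes a register file of 32,768 32-bit registers per Fermi SM (the figure given in Nvidia's Fermi whitepaper) and ignores register allocation granularity; the function names are hypothetical.

```python
WARP_SIZE = 32  # threads per warp


def resident_threads(warps_per_sm: int) -> int:
    """Threads that can be resident on one SM at a time."""
    return warps_per_sm * WARP_SIZE


def registers_per_thread(registers_per_sm: int, warps_per_sm: int) -> int:
    """Average register budget per thread when every warp slot is occupied."""
    return registers_per_sm // resident_threads(warps_per_sm)


# Fermi: 48 resident warps per SM; 32K registers per SM is an assumption.
fermi_threads = resident_threads(48)
fermi_regs = registers_per_thread(32 * 1024, 48)
print(fermi_threads, fermi_regs)
```

A kernel that needs more registers per thread than this budget cannot keep all 48 warp slots of an SM occupied.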
GTX480:
- 15 SM (32 CUDA cores/SM)
- 480 CUDA cores
- 1345 GFLOPs
- 40 nm
- 3.2 billion transistors
- 250 W
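The GTX480 numbers can be cross-checked against each other. The sketch below multiplies SMs by cores per SM and then derives the shader clock implied by the quoted peak GFLOPs. It assumes each CUDA core completes one fused multiply-add (2 floating-point operations) per clock; the helper names are illustrative only.

```python
FLOPS_PER_CORE_PER_CLOCK = 2  # assumption: one fused multiply-add per clock


def cuda_cores(sm_count: int, cores_per_sm: int) -> int:
    """Total CUDA cores on the chip."""
    return sm_count * cores_per_sm


def implied_clock_ghz(peak_gflops: float, cores: int) -> float:
    """Shader clock (GHz) that would produce the quoted peak throughput."""
    return peak_gflops / (cores * FLOPS_PER_CORE_PER_CLOCK)


gtx480_cores = cuda_cores(sm_count=15, cores_per_sm=32)
gtx480_clock = implied_clock_ghz(peak_gflops=1345, cores=gtx480_cores)
print(f"{gtx480_cores} cores, implied shader clock {gtx480_clock:.3f} GHz")
```

The same two functions apply to the other cards on this page once their SM count, cores per SM, and peak throughput are plugged in.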
2.1 Video Cards
2.1.1 GeForce 400 Series
- Release date: April 12, 2010
- Codename: GF10x
- Architecture: Fermi
- Models:
  - GeForce Series
  - GeForce GT Series
  - GeForce GTS Series
  - GeForce GTX Series
- Fabrication process and transistors:
  - 260M, 40 nm (GT218)
  - 585M, 40 nm (GF108)
  - 1,170M, 40 nm (GF106)
  - 1,950M, 40 nm (GF104)
  - 1,950M, 40 nm (GF114)
  - 3,200M, 40 nm (GF100)
- Cards:
  - Entry-level: GT420, GT430
  - Mid-range: GT440, GTS450, GTX460
  - High-end: GTX465, GTX470
  - Enthusiast: GTX480 (2010.3, 3.2 billion transistors, 15 SMs, 1536 MB, 1345 GFLOPs, 250 W)
2.1.2 GeForce 500 Series
- Release date: November 8, 2010
- Codename: GF11x
- Architecture: Fermi
- Models:
  - GeForce Series
  - GeForce GT Series
  - GeForce GTX Series
- Fabrication process and transistors:
  - 292M, 40 nm (GF119)
  - 585M, 40 nm (GF108)
  - 1,170M, 40 nm (GF116)
  - 1,950M, 40 nm (GF114)
  - 3,000M, 40 nm (GF110)
- Cards:
  - Entry-level: 510, GT520, GT530
  - Mid-range: GT545, GTX550Ti, GTX560, GTX560Ti
  - High-end: GTX570, GTX580, GTX590 (2011.3, 2x3 billion transistors, 32 SMs, 2x1536 MB, 2488 GFLOPs, 365 W)
2.2 GPGPU Cards
Goto: http://wiki.jackslab.org/Nvidia_GPU_Architecture#Nvidia_Tesla_GPGPU_Cards
3 Kepler Micro Architecture
Released in the fall of 2012
- 1 SM: 4 Warp Schedulers (2 instruction dispatch units per Warp Scheduler)
- 1 Warp: [32 threads context ?]
- 1 SM: 192 CUDA cores + 64 DPUs (shared) + 32 Load/Store Units (shared) + 32 SPUs (shared) + 4 Warp Schedulers
- Handles 64 warps/SM, for a total of 2048 (64x32) threads resident in a single SM at a time [64 Warps context ?]
- Register file: 64K (65,536) 32-bit registers per SM, i.e. 256 KB
K20X:
- 14 SMs
- 2688 CUDA cores, 6 GB
- 3.935 TFLOPs / DPU: 1.312 TFLOPs
- 28 nm
- 235 W
GTX690:
- 2x8 SMs
- 3072 CUDA cores
- 2x2.8 TFLOPs
- 2x3.54 billion transistors
- 300 W (2012.4)
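The K20X and GTX690 core counts follow directly from the Kepler per-SM figures given earlier in this section (192 CUDA cores and 64 resident warps per SM). This short Python sketch scales those per-SM figures to a whole GPU; the constant and function names are made up for illustration.

```python
KEPLER_CORES_PER_SM = 192  # from the per-SM list in this section
KEPLER_WARPS_PER_SM = 64   # from the per-SM list in this section
WARP_SIZE = 32


def gpu_totals(sm_count: int) -> tuple[int, int]:
    """Return (CUDA cores, maximum resident threads) for a Kepler GPU."""
    cores = sm_count * KEPLER_CORES_PER_SM
    threads = sm_count * KEPLER_WARPS_PER_SM * WARP_SIZE
    return cores, threads


k20x = gpu_totals(14)       # K20X: 14 SMs
gtx690 = gpu_totals(2 * 8)  # GTX690: two GPUs with 8 SMs each
print("K20X:", k20x)
print("GTX690:", gtx690)
```

The core counts reproduce the 2688 and 3072 quoted for the two cards; the resident-thread totals are derived values that the page does not state.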
3.1 Video Cards
3.1.1 GeForce 600 series
- Release date: March 22, 2012
- Codename: GK10x
- Models:
  - GeForce Series
  - GeForce GT Series
  - GeForce GTX Series
- Fabrication process and transistors:
  - 292M, 40 nm (GF119)
  - 585M, 40 nm (GF108)
  - 1,170M, 40 nm (GF116)
  - 1,950M, 40 nm (GF114)
  - 1,270M, 28 nm (GK107)
  - 1,270M, 28 nm (GK208)
  - 2,540M, 28 nm (GK106)
  - 3,540M, 28 nm (GK104)
- Cards:
  - Entry-level: GT610, GT620, GT630, GT640
  - Mid-range: GTX650, GTX650Ti, GTX650Ti Boost, GTX660
  - High-end: GTX660Ti, GTX670
  - Enthusiast: GTX680, GTX690
3.1.2 GeForce 700 series
- Release date: May 2013
- Codename: GK110, GK208
- Models:
  - GeForce Series
  - GeForce GT Series
  - GeForce GTX Series
- Fabrication process and transistors:
  - 585M, 28 nm (GF117)
  - 1,020M, 28 nm (GK208)
  - 1,270M, 28 nm (GK107)
  - 3,540M, 28 nm (GK104)
  - 7,080M, 28 nm (GK110)
- Cards:
  - Entry-level: GeForce GT 705, GeForce GT 710, GeForce GT 720, GeForce GT 730, GeForce GT 740, GeForce GTX 745
  - Mid-range: GeForce GTX 750, GeForce GTX 750 Ti, GeForce GTX 760 192-Bit, GeForce GTX 760, GeForce GTX 760 Ti
  - High-end: GeForce GTX 770, GeForce GTX 780
  - Enthusiast: GeForce GTX 780 Ti, GeForce GTX Titan, GeForce GTX Titan Black, GeForce GTX Titan Z
3.2 GPGPU Cards
Goto: http://wiki.jackslab.org/Nvidia_GPU_Architecture#Nvidia_Tesla_GPGPU_Cards
4 Maxwell Micro Architecture
The SM arch of Maxwell GM204:
- 1 SM (SMM): 4 Warp Schedulers (2 instruction dispatch units per Warp Scheduler)
- 1 Warp: 32 CUDA cores + 1 DPU + 8 Load/Store Units + 8 SPUs
- 1 SM (SMM): 128 CUDA cores + 4 DPUs + 32 Load/Store Units + 32 SPUs
- e.g. GTX980: 16 SMs (SMM), 2048 CUDA cores, 64 DPUs, 4612 GFLOPs / DPU: 144 GFLOPs, 28 nm, 5.2 billion transistors, 165 W
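The SMM and GTX980 totals are the units of one 32-core group (the hardware served by a single warp scheduler) multiplied up, first by the 4 schedulers per SMM and then by the 16 SMMs on the chip. The Python sketch below shows the aggregation. The `Units` class is a hypothetical helper, and the whole-chip Load/Store and SPU totals it produces are derived values, not figures quoted on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Units:
    """Execution-unit counts for one slice of the GPU."""
    cuda_cores: int
    dpus: int
    load_store: int
    spus: int

    def scaled(self, n: int) -> "Units":
        """Return the unit counts for n identical copies of this slice."""
        return Units(self.cuda_cores * n, self.dpus * n,
                     self.load_store * n, self.spus * n)


PER_SCHEDULER = Units(cuda_cores=32, dpus=1, load_store=8, spus=8)
SMM = PER_SCHEDULER.scaled(4)  # 4 warp schedulers per SMM
GTX980 = SMM.scaled(16)        # 16 SMMs on the GTX980

print("SMM:", SMM)
print("GTX980:", GTX980)
```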
The arch of Maxwell GM204: (figure: Maxwell-arch.png)
TITAN X (GM200): (figure: TITAN-X-arch.png)
4.1 Video Cards
4.1.1 GeForce 900 series
- Release date: September 2014
- Codename: GM20x
- Models:
  - GeForce Series
  - GeForce GT Series
  - GeForce GTX Series
- Cards:
  - Mid-range: GTX950 / GTX960
  - High-end: GTX970 / GTX980
  - Enthusiast: GTX980 Ti / GTX Titan X
4.2 GPGPU Cards
Goto: http://wiki.jackslab.org/Nvidia_GPU_Architecture#Nvidia_Tesla_GPGPU_Cards
5 Pascal Micro Architecture
The SM arch of Pascal GP100:
- 1 SM: 2 Warp Schedulers (2 instruction dispatch units per Warp Scheduler)
- 1 Warp: 32 CUDA cores + 16 DPUs + 8 Load/Store Units + 8 SPUs
- 1 SM: 64 CUDA cores + 32 DPUs + 16 Load/Store Units + 16 SPUs
- e.g. Tesla P100: 60 SMs (56 enabled), 3584 CUDA cores, 1792 DPUs, 16 GB, 9.5 TFLOPs / DPU: 4.7 TFLOPs, 300 W
The arch of Pascal GP100: (figure: Pascal-GP100-arch.png)
The SM arch of Pascal GP104:
- 1 SM: 4 Warp Schedulers (2 instruction dispatch units per Warp Scheduler)
- 1 Warp: 32 CUDA cores + 1 DPU + 8 Load/Store Units + 8 SPUs
- 1 SM: 128 CUDA cores + 4 DPUs + 32 Load/Store Units + 32 SPUs
- e.g. GTX1080 (GP104); GTX1080 Ti and TITAN X use the larger GP102, which is built from the same SM design
  - GTX1080 (GP104): 20 SMs, 2560 CUDA cores, 80 DPUs, 16 nm, 7.2 billion transistors, 8 GB, 8.2 TFLOPs / DPU: 257 GFLOPs, 180 W
The arch of Pascal GP104: (figure: Pascal-GP104-arch.png)
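The gap between GP100 and GP104 double-precision throughput tracks the number of DPUs per SM. As a rough check, the sketch below scales each card's FP32 figure by its DPU-to-CUDA-core ratio. It assumes a DPU sustains the same operations per clock as a CUDA core, which is a simplification; `predicted_fp64_tflops` is an illustrative name.

```python
def predicted_fp64_tflops(fp32_tflops: float, dpus_per_sm: int,
                          cores_per_sm: int) -> float:
    """Estimate FP64 throughput from FP32 throughput and the unit ratio."""
    return fp32_tflops * dpus_per_sm / cores_per_sm


# Tesla P100 (GP100): 32 DPUs and 64 CUDA cores per SM, 9.5 TFLOPs FP32.
p100_fp64 = predicted_fp64_tflops(9.5, dpus_per_sm=32, cores_per_sm=64)

# GTX1080 (GP104): 4 DPUs and 128 CUDA cores per SM, 8.2 TFLOPs FP32.
gtx1080_fp64 = predicted_fp64_tflops(8.2, dpus_per_sm=4, cores_per_sm=128)

print(f"P100 FP64 estimate:    {p100_fp64:.2f} TFLOPs (listed: 4.7)")
print(f"GTX1080 FP64 estimate: {gtx1080_fp64 * 1000:.0f} GFLOPs (listed: 257)")
```

Both estimates land within about 1% of the listed DPU figures, consistent with a 1:2 FP64 rate on GP100 and a 1:32 rate on GP104.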
5.1 Video Cards
5.1.1 GeForce 1000 series
- Release date: May 2016
- Codename: GP10x
- Models:
  - GeForce GTX Series
- Fabrication process and transistors:
  - 3.3B, 14 nm (GP107)
  - 4.4B, 16 nm (GP106)
  - 7.2B, 16 nm (GP104)
  - 12B, 16 nm (GP102)
- Cards:
  - Entry-level: GTX1050 / GTX1050 Ti
  - Mid-range: GTX1060
  - High-end: GTX1070 / GTX1080 (2016.5, 2560 CUDA cores, 20 SMs, 16 nm, 7.2 billion transistors, 8 GB, 8.2 TFLOPs / DPU: 0.257 TFLOPs, 180 W)
  - Enthusiast: GTX1080 Ti / NVIDIA Titan X (2016.8, 3584 CUDA cores, 28 SMs, 16 nm, 12 billion transistors, 12 GB, 10 TFLOPs / DPU: 0.317 TFLOPs, 250 W)
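The peak FP32 and board power figures quoted on this page allow a rough efficiency comparison across generations. The Python sketch below computes GFLOPs per watt for one flagship per generation, using only values stated on this page; the table layout and function name are illustrative, and peak figures say nothing about sustained performance.

```python
# (card, architecture, peak FP32 GFLOPs, board power in watts), from this page.
FLAGSHIPS = [
    ("GTX480", "Fermi", 1345, 250),
    ("GTX590", "Fermi", 2488, 365),
    ("GTX690", "Kepler", 2 * 2800, 300),
    ("GTX980", "Maxwell", 4612, 165),
    ("GTX1080", "Pascal", 8200, 180),
]


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Peak FP32 throughput per watt of board power."""
    return gflops / watts


efficiency = [gflops_per_watt(g, w) for _, _, g, w in FLAGSHIPS]
for (card, arch, _, _), eff in zip(FLAGSHIPS, efficiency):
    print(f"{card} ({arch}): {eff:.1f} GFLOPs/W")
```

On these inputs, efficiency rises with every entry in the list, from about 5 GFLOPs/W on the GTX480 to about 46 GFLOPs/W on the GTX1080.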
5.2 GPGPU Cards
Goto: http://wiki.jackslab.org/Nvidia_GPU_Architecture#Nvidia_Tesla_GPGPU_Cards
6 Nvidia Tesla GPGPU Cards
Tesla products target the high-performance computing market.
As of 2012, Nvidia Teslas power some of the world's fastest supercomputers, including Titan at Oak Ridge National Laboratory and Tianhe-1A, in Tianjin, China.
6.1 Overview
7 Reference
- https://en.wikipedia.org/wiki/Fermi_(microarchitecture)
- https://en.wikipedia.org/wiki/GeForce_400_series
- https://en.wikipedia.org/wiki/GeForce_500_series
- https://en.wikipedia.org/wiki/GeForce_600_series
- https://en.wikipedia.org/wiki/GeForce_700_series
- https://en.wikipedia.org/wiki/GeForce_800M_series