Nvidia GPU Architecture

1 Overview

The GPU architecture is built around a scalable array of Streaming Multiprocessors (SMs).

The key components of an SM:

CUDA cores (ALU + FPU)
Double Precision Units (DPU)
Special Function Units (SPU; NVIDIA's documentation abbreviates these as SFU)
Load/Store Units (LD/ST)

Register File
Shared Memory/L1 Cache

Warp Scheduler

2 Fermi Micro Architecture

The Fermi architecture was the first complete GPU computing architecture to deliver the features required for the most demanding HPC applications.

1 CUDA core: 1 ALU + 1 FPU
1 SM: 32 CUDA cores + 16 Load/Store Units + 4 SPUs + 2 Warp Schedulers
1 SM: 2 warps issued concurrently (one per Warp Scheduler)
1 Warp: 32 threads, issued to a group of 16 CUDA cores, 16 Load/Store Units (shared), or 4 SPUs (shared)
Each SM keeps the context of up to 48 warps, for a total of 1536 (48 x 32) threads resident in a single SM at a time
Register file: 32K 32-bit registers (128 KB) per SM
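
As a rough check of the figures above, the following Python sketch recomputes the resident-thread count and estimates how many cycles one warp instruction occupies each group of units. The cycle estimate assumes the 32 threads of a warp are simply spread over the units of a group, one thread per unit per cycle, which is a simplification. All names are illustrative.

```python
import math

WARP_SIZE = 32        # threads per warp
RESIDENT_WARPS = 48   # warps resident per Fermi SM (from the list above)

# Units that one warp instruction can be issued to (from the list above).
UNITS_PER_GROUP = {"cuda_cores": 16, "load_store": 16, "spu": 4}


def resident_threads(warps: int, warp_size: int = WARP_SIZE) -> int:
    """Threads resident on one SM when every warp slot is occupied."""
    return warps * warp_size


def issue_cycles(units: int, warp_size: int = WARP_SIZE) -> int:
    """Cycles to push one warp through `units` lanes (assumed: one thread per lane per cycle)."""
    return math.ceil(warp_size / units)


print(resident_threads(RESIDENT_WARPS))  # 1536
print({name: issue_cycles(n) for name, n in UNITS_PER_GROUP.items()})
# {'cuda_cores': 2, 'load_store': 2, 'spu': 8}
```

Under this model a warp instruction needs 2 cycles on the CUDA cores or Load/Store Units and 8 cycles on the SPUs.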

GTX480:

15 SM (32 CUDA cores/SM)
480 CUDA cores
1345 GFLOPS (single precision)
40 nm
3.2 billion transistors
250 W
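
The core count and the peak throughput above are related by simple arithmetic. The sketch below assumes the usual estimate of 2 floating-point operations per CUDA core per cycle (one fused multiply-add). These notes do not list a clock, so the code works backwards from the listed figures to the clock they imply. Function names are illustrative.

```python
FLOPS_PER_CORE_PER_CYCLE = 2  # assumption: one fused multiply-add per cycle


def total_cores(sm_count: int, cores_per_sm: int) -> int:
    """CUDA cores on the whole chip."""
    return sm_count * cores_per_sm


def implied_clock_ghz(peak_gflops: float, cores: int) -> float:
    """Core clock (GHz) implied by a peak GFLOPS figure and a core count."""
    return peak_gflops / (cores * FLOPS_PER_CORE_PER_CYCLE)


cores = total_cores(sm_count=15, cores_per_sm=32)
print(cores)                                   # 480
print(round(implied_clock_ghz(1345, cores), 3))  # 1.401
```

The result, about 1.4 GHz, is the clock implied by the listed numbers, not a value stated in these notes.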

2.1 Fermi Cards

2.1.1 GeForce 400 Series

Release date: April 12, 2010
Codename: GF10x
Architecture: Fermi

Models:

GeForce Series
GeForce GT Series
GeForce GTS Series
GeForce GTX Series

Fabrication process and transistors:

260M 40 nm (GT218)
585M 40 nm (GF108)
1,170M 40 nm (GF106)
1,950M 40 nm (GF104)
1,950M 40 nm (GF114)
3,200M 40 nm (GF100)

Cards:

Entry-level: GT420, GT430
Mid-range: GT440, GTS450, GTX460
High-end: GTX465, GTX470
Enthusiast: GTX480 (March 2010, 3.2 billion transistors, 15 SMs, 1536 MB, 1345 GFLOPS, 250 W)

2.1.2 GeForce 500 Series

Release date: November 8, 2010
Codename: GF11x
Architecture: Fermi

Models:

GeForce Series
GeForce GT Series
GeForce GTX Series

Fabrication process and transistors:

292M 40 nm (GF119)
585M 40 nm (GF108)
1,170M 40 nm (GF116)
1,950M 40 nm (GF114)
3,000M 40 nm (GF110)

Cards:

Entry-level: 510, GT520, GT530
Mid-range: GT545, GTX550Ti, GTX560, GTX560Ti
High-end: GTX570, GTX580, GTX590 (GTX590: March 2011, 2 x 3 billion transistors, 32 SMs, 2 x 1536 MB, 2488 GFLOPS, 365 W)

3 Kepler Micro Architecture

Released in 2012: the first GeForce cards in the spring, the Tesla K20/K20X described below in the fall

1 SM: 4 Warp Schedulers (2 instruction dispatch units per Warp Scheduler)
1 Warp: 32 threads
1 SM: 192 CUDA cores + 64 DPUs (shared) + 32 Load/Store Units (shared) + 32 SPUs (shared) + 4 Warp Schedulers

Each SM keeps the context of up to 64 warps, for a total of 2048 (64 x 32) threads resident in a single SM at a time
Register file: 64K 32-bit registers (256 KB) per SM
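
The per-SM limits above fix how many registers each thread can use while every warp slot stays occupied. The sketch below works this out, reading "64K" as 65,536 32-bit registers, which is an assumption about the unit. The dictionary keys and function names are illustrative.

```python
WARP_SIZE = 32  # threads per warp

KEPLER_SM = {
    "warp_schedulers": 4,
    "dispatch_units_per_scheduler": 2,
    "resident_warps": 64,
    "registers": 64 * 1024,  # assumption: "64K" counts 32-bit registers
}


def dispatch_units(sm: dict) -> int:
    """Instruction dispatch units per SM."""
    return sm["warp_schedulers"] * sm["dispatch_units_per_scheduler"]


def resident_threads(sm: dict) -> int:
    """Threads resident on one SM when every warp slot is occupied."""
    return sm["resident_warps"] * WARP_SIZE


def registers_per_thread(sm: dict) -> int:
    """Register budget per thread at full occupancy."""
    return sm["registers"] // resident_threads(sm)


print(dispatch_units(KEPLER_SM))        # 8
print(resident_threads(KEPLER_SM))      # 2048
print(registers_per_thread(KEPLER_SM))  # 32
```

In this simple model, a kernel that needs more than 32 registers per thread cannot keep all 64 warp slots filled.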

K20X:

14 SM
2688 CUDA cores, 6 GB
3.935 TFLOPS single precision / 1.312 TFLOPS double precision
28 nm
235 W
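
The double-precision figure can be cross-checked against the unit counts: each SM has 192 CUDA cores and 64 DPUs, a 3:1 ratio. The sketch below assumes every DPU sustains the same number of operations per cycle as a CUDA core at the same clock, which is a simplifying assumption. Names are illustrative.

```python
SM_COUNT = 14
CUDA_CORES_PER_SM = 192
DPU_PER_SM = 64
SP_TFLOPS = 3.935  # single-precision peak listed above


def total_units(sm_count: int, units_per_sm: int) -> int:
    """Units of one kind on the whole chip."""
    return sm_count * units_per_sm


def dp_estimate_tflops(sp_tflops: float, cores: int, dpus: int) -> float:
    """Scale the single-precision peak by the DPU-to-core ratio."""
    return sp_tflops * dpus / cores


cores = total_units(SM_COUNT, CUDA_CORES_PER_SM)
dpus = total_units(SM_COUNT, DPU_PER_SM)
print(cores, dpus)                                           # 2688 896
print(round(dp_estimate_tflops(SP_TFLOPS, cores, dpus), 3))  # 1.312
```

The estimate matches the listed double-precision peak and the listed core count.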

GTX690:

2x8 SM
3072 CUDA cores
2 x 2.8 TFLOPS
2 x 3.54 billion transistors
300 W (April 2012)

4 Maxwell Micro Architecture

GM204:

1 SM (SMM): 4 Warp Schedulers (8 instruction dispatch units)
1 Warp Scheduler (processing block): 32 CUDA cores + 1 DPU + 8 Load/Store Units + 8 SPUs

1 SM (SMM): 128 CUDA cores + 4 DPUs + 32 Load/Store Units + 32 SPUs
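
The per-SMM totals follow from the four-way split: each of the four warp schedulers has its own 32 CUDA cores, 1 DPU, 8 Load/Store Units, and 8 SPUs. A minimal sketch of that multiplication, with illustrative names:

```python
# Units attached to one warp scheduler (from the list above).
PER_SCHEDULER = {"cuda_cores": 32, "dpu": 1, "load_store": 8, "spu": 8}
SCHEDULERS_PER_SMM = 4


def smm_totals(per_scheduler: dict, schedulers: int) -> dict:
    """Multiply the per-scheduler unit counts up to a whole SMM."""
    return {name: count * schedulers for name, count in per_scheduler.items()}


print(smm_totals(PER_SCHEDULER, SCHEDULERS_PER_SMM))
# {'cuda_cores': 128, 'dpu': 4, 'load_store': 32, 'spu': 32}
```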

GTX980:

16 SM (SMM)
2048 CUDA cores
4612 GFLOPS single precision / 144 GFLOPS double precision
28 nm
5.2 billion transistors
165 W
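
To compare the three generations covered in these notes, the peak single-precision throughput of one card per generation can be divided by its power figure. The sketch below uses only the numbers listed in the sections above. Peak GFLOPS is a theoretical value and the wattage is a nominal rating, so the ratio is only a rough indicator.

```python
# Single-precision peak (GFLOPS) and power (W) as listed in these notes.
CARDS = {
    "GTX480 (Fermi)": {"gflops": 1345, "watts": 250},
    "K20X (Kepler)": {"gflops": 3935, "watts": 235},
    "GTX980 (Maxwell)": {"gflops": 4612, "watts": 165},
}


def gflops_per_watt(card: dict) -> float:
    """Peak single-precision GFLOPS per watt of rated power."""
    return card["gflops"] / card["watts"]


for name, spec in CARDS.items():
    print(name, round(gflops_per_watt(spec), 2))
# GTX480 (Fermi) 5.38
# K20X (Kepler) 16.74
# GTX980 (Maxwell) 27.95
```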

5 References

https://en.wikipedia.org/wiki/GeForce_900_series

https://en.wikipedia.org/wiki/GeForce_10_series

https://en.wikipedia.org/wiki/Nvidia_Tesla