History
Tesla->Fermi->Kepler->Maxwell->Pascal->Volta (Tesla V100, Titan V; June 2017, CUDA Compute Capability 7.0)
Pascal is the codename for a GPU microarchitecture developed by Nvidia as the successor to the Maxwell architecture.
Architecture
In Pascal, an SM (streaming multiprocessor) consists of 64 CUDA cores. Maxwell packed 128, Kepler 192, Fermi 32 and Tesla only 8 CUDA cores into an SM. The GP100 SM is partitioned into two processing blocks, each having 32 single-precision CUDA cores, an instruction buffer, a warp scheduler, 2 texture mapping units and 2 dispatch units. Double-precision performance is lower than single-precision performance.
CUDA Compute Capability 6.0.
Key performance-acceleration points:
16-bit (FP16) floating-point operations (colloquially “half precision”) can be executed at twice the rate of 32-bit floating-point operations (“single precision”)[8], while 64-bit floating-point operations (colloquially “double precision”) execute at half the rate of 32-bit floating-point operations.[9]
More registers: twice the number of registers per CUDA core compared to Maxwell.
More shared memory.
Dynamic load balancing scheduling system.[10] This allows the scheduler to dynamically adjust the amount of the GPU assigned to multiple tasks, ensuring that the GPU remains saturated with work except when there is no more work that can safely be distributed.[10] Nvidia has therefore safely enabled asynchronous compute in Pascal’s driver.
Instruction-level and thread-level preemption.
CUDA Compute Capability 6.1.
Architecture Overview
Graphics Processor Cluster
A chip is partitioned into Graphics Processor Clusters (GPCs).
A “Streaming Multiprocessor” corresponds to AMD’s Compute Unit. An SM encompasses 64 single-precision ALUs (“CUDA cores”) on GP100 chips.
One Pascal SM on the GP100 combines 64 single-precision (FP32) shader processors with 32 double-precision (FP64) units (at least on GP100 GPUs), providing a 2:1 ratio of single- to double-precision throughput. On the GP104, an SM combines 128 single-precision ALUs with 4 double-precision ALUs, providing a 32:1 ratio, and one half-precision ALU that contains a vector of two half-precision floats which can execute the same instruction on both floats, providing a 64:1 ratio if the same instruction is used on both elements. GP100, however, uses more flexible FP32 cores that are able to process either one single-precision number or two half-precision numbers in a two-element vector.[15] Nvidia intends to address the calculation of deep-learning algorithms with those.
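As a hedged illustration of that two-element FP16 vector path (kernel name invented for the example; half2 arithmetic intrinsics such as __hadd2 require compute capability 5.3 or higher):

#include <cuda_fp16.h>

// Adds n packed __half2 elements (2*n FP16 values): one __hadd2
// instruction operates on both halves of each vector element.
__global__ void add_half2(const __half2* a, const __half2* b,
                          __half2* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = __hadd2(a[i], b[i]);  // two FP16 adds per instruction
}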
GP100: Nvidia Tesla P100 GPU accelerator is targeted at GPGPU applications such as FP64 double precision compute and deep learning training that uses FP16. It uses HBM2 memory.[17] Quadro GP100 also uses the GP100 GPU.
GP102: This GPU is used in the TITAN Xp, Titan X and the GeForce GTX 1080 Ti. It is also used in the Quadro P6000 & Tesla P40.
GP104: This GPU is used in the GeForce GTX 1070, GTX 1070 Ti and the GTX 1080. The GTX 1070 has 15/20 and the GTX 1070 Ti has 19/20 of its SMs enabled. Both are connected to GDDR5 memory, while the GTX 1080 is a full chip and is connected to GDDR5X memory. It is also used in the Quadro P5000, Quadro P4000 and Tesla P4.
Successor Architecture
After Pascal, the next architecture is preliminarily codenamed Volta. Nvidia announced that the Volta GPU would feature High Bandwidth Memory, Unified Memory, complete FP16 support (two times its FP32 rate) and NVLink.[28] It is not yet known whether Volta will be part of the consumer lineup; it may be replaced there by Ampere.
Compiling options for a certain architecture
A complete description is somewhat complicated, but there are intended to be relatively simple, easy-to-remember canonical usages. Compile for the architecture (both virtual and real) that represents the GPUs you wish to target. A fairly simple form is: -gencode arch=compute_XX,code=sm_XX
where XX is the two digit compute capability for the GPU you wish to target. If you wish to target multiple GPUs, simply repeat the entire sequence for each XX target. This is approximately the approach taken with the CUDA sample code projects. (If you’d like to include PTX in your executable, include an additional -gencode with the code option specifying the same PTX virtual architecture as the arch option).
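As a hedged example (file names invented), a build line targeting both Pascal compute capabilities plus embedded PTX for forward compatibility might look like this:

nvcc my_kernel.cu -o my_kernel \
  -gencode arch=compute_60,code=sm_60 \
  -gencode arch=compute_61,code=sm_61 \
  -gencode arch=compute_61,code=compute_61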
Another fairly simple form, when targeting only a single GPU, is just to use: -arch=sm_XX
By default, compilation targets sm_20 (see below); be aware that compiling for too old an architecture disables double-precision arithmetic, which requires compute capability 1.3 or higher.
When no -gencode switch is used, and no -arch switch is used, nvcc assumes a default -arch=sm_20 is appended to your compile command (this is for CUDA 7.5, the default -arch setting may vary by CUDA version). sm_20 is a real architecture, and it is not legal to specify a real architecture on the -arch option when a -code option is also supplied.
Matching between architecture type and compute capability
Pascal (CUDA 8 and later)
SM60 or SM_60, compute_60 – GP100/Tesla P100 – DGX-1 (Generic Pascal)
SM61 or SM_61, compute_61 – GTX 1080, GTX 1070, GTX 1060, GTX 1050, GT 1030, Titan Xp, Tesla P40, Tesla P4
SM62 or SM_62, compute_62 – Drive-PX2, Tegra (Jetson) TX2, Denver-based GPU
Volta (CUDA 9 and later)
SM70 or SM_70, compute_70 – Tesla V100
SM71 or SM_71, compute_71 – probably not implemented
SM72 or SM_72, compute_72 – currently unknown
check CUDA version: nvcc --version
check Nvidia GPU usage: nvidia-smi
Programming model for CUDA
The programmer writes code for a single thread in a simple C program.
- All threads execute the same code, but can take different paths.
- Threads are grouped into a block.
- Threads within the same block can synchronize execution.
- Blocks are grouped into a grid.
- Blocks are independently scheduled on the GPU, can be executed in any order.
- A kernel is executed as a grid of blocks of threads.
Each block is executed by one SM and does not migrate. Several concurrent blocks can reside on one SM depending on the blocks’ memory requirements and the SM’s memory resources.
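A minimal sketch of this model (kernel and array names invented): each thread computes one element, the block and grid shapes are chosen at launch time, and blocks may run in any order:

#include <cuda_runtime.h>

// Each thread handles one element; blocks are scheduled independently.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    // A kernel is executed as a grid of blocks of threads:
    // here 256 threads per block, enough blocks to cover n elements.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}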
A warp consists of 32 threads
- A warp is the basic schedule unit in kernel execution.
- A thread block consists of 32-thread warps.
- Each cycle, a warp scheduler selects one ready warp and dispatches it to CUDA cores for execution.
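Because the warp is the scheduling unit, a branch that splits threads within one warp forces the hardware to execute both paths serially. A small sketch (kernel name invented):

__global__ void divergent(int* out)
{
    int i = threadIdx.x;
    // Threads 0..31 form one warp; this branch splits every warp in half,
    // so both sides of the if are executed one after the other.
    if (i % 2 == 0)
        out[i] = i * 2;
    else
        out[i] = i + 1;
}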
NVCC compiling options
-pg: Instrument generated code/executable for use by gprof (Linux only).
-g: Generate debug information for host code.
-G: Generate debug information for device code. Turns off all optimizations. Don’t use for profiling; use -lineinfo instead.
-O: Specify optimization level for host code.
-shared: Generate a shared library during linking. Use option --linker-options when other linker options are required for more control.
-arch
Specify the name of the class of NVIDIA virtual GPU architecture for which the CUDA input files must be compiled.
With the exception as described for the shorthand below, the architecture specified with this option must be a virtual architecture (such as compute_50). Normally, this option alone does not trigger assembly of the generated PTX for a real architecture (that is the role of nvcc option --gpu-code, see below); rather, its purpose is to control preprocessing and compilation of the input to PTX.
For convenience, in the case of simple nvcc compilations, the following shorthand is supported. If no value for option --gpu-code is specified, then the value of this option defaults to the value of --gpu-architecture. In this situation, as the only exception to the description above, the value specified for --gpu-architecture may be a real architecture (such as sm_50), in which case nvcc uses the specified real architecture and its closest virtual architecture as effective architecture values. For example, nvcc --gpu-architecture=sm_50 is equivalent to nvcc --gpu-architecture=compute_50 --gpu-code=sm_50,compute_50.
Existing libraries for HPC
cuBLAS, cuDNN.
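As a hedged sketch of host-side cuBLAS usage (function and buffer names invented; matrices are column-major n x n arrays already resident on the device), a single-precision GEMM C = alpha*A*B + beta*C looks roughly like this:

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Computes C = alpha*A*B + beta*C on the GPU (column-major, as BLAS expects).
void gemm_example(const float* dA, const float* dB, float* dC, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,            // m, n, k
                &alpha, dA, n,      // A and its leading dimension
                dB, n,              // B and its leading dimension
                &beta, dC, n);      // C and its leading dimension
    cublasDestroy(handle);
}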
Floating-point number calculation
IEEE 754 supports both 32-bit and 64-bit number operations, corresponding to float and double. Devices with compute capability 2.0 and above support both single- and double-precision IEEE 754, including fused multiply-add in both single and double precision. Operations such as square root and division will, by default, result in the floating-point value closest to the correct mathematical result in both single and double precision. A floating-point number consists of a sign bit, an exponent (the power weight), and a significand of the form 1.XXX.
The fused multiply-add operation is faster and more accurate than performing separate multiply and add operations; a minimal sketch follows. The compile flags listed further below affect only single precision; they have no effect on double precision or on devices of compute capability below 2.0.
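For instance, a*b + c computed as one fused operation is rounded only once. A minimal sketch using the standard CUDA device function fmaf (the wrapper name dot_step is invented for illustration):

__device__ float dot_step(float a, float b, float c)
{
    // Fused multiply-add: a*b + c with a single rounding step.
    return fmaf(a, b, c);
}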
Faster compiling options:
-ftz=true
-prec-div=false
-prec-sqrt=false
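These flags can also be switched on together with --use_fast_math, which, to the best of my knowledge, implies -ftz=true, -prec-div=false and -prec-sqrt=false and additionally substitutes the fast intrinsic math functions. A hedged example command (file name hypothetical):

nvcc -O3 --use_fast_math -arch=sm_61 my_kernel.cu -o my_kernel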
It is possible to call cuBLAS from a user kernel.
As noted above, the GP100-based Tesla P100 accelerator targets GPGPU applications such as FP64 double-precision compute and deep-learning training; the deep-learning workloads are the ones that aim at FP16 floating-point numbers.
Synchronous & Asynchronous programming
A CUDA kernel launch is asynchronous / non-blocking: the host call starts the kernel execution but does not wait for it to finish before going on to the next instruction. Similarly, cudaMemcpyAsync starts the copy but does not wait for completion; it has to be done through a “stream” with page-locked memory (also known as pinned memory) – see the documentation. In both cases, the host eventually waits at a cudaDeviceSynchronize() call.
The benefit? In general this does not affect correct execution, and it might improve performance by overlapping CPU and GPU execution.
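A minimal sketch of the pattern (buffer names and sizes invented): pinned host memory, an asynchronous copy on a stream, CPU work overlapped in between, then an explicit synchronization point:

#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // page-locked (pinned) host memory
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns immediately; the copy proceeds in the background.
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);

    // ... the CPU can do unrelated work here, overlapped with the copy ...

    cudaDeviceSynchronize();                 // host waits here
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}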
Applications manage concurrency through streams. Within each stream, CUDA operations are carried out in order (i.e. FIFO – first in, first out); one finishes before the next starts. The key to getting better performance is using multiple streams to overlap things, as sketched below.
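A minimal two-stream sketch (kernel and buffer names invented); each stream’s operations stay in order, but one stream’s kernel can overlap the other stream’s copies:

#include <cuda_runtime.h>

__global__ void process(float* d) { d[threadIdx.x] += 1.0f; }

int main()
{
    const size_t bytes = 256 * sizeof(float);
    float *h[2], *d[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaMallocHost(&h[i], bytes);   // pinned, required for async copies
        cudaMalloc(&d[i], bytes);
        cudaStreamCreate(&s[i]);
    }
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
        process<<<1, 256, 0, s[i]>>>(d[i]);   // ordered after its own copy
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();   // wait for both streams to finish
    return 0;
}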
Thrust usage to maximize efficiency
As this example shows, the = operator can be used to copy a host_vector to a device_vector (or vice-versa). The = operator can also be used to copy host_vector to host_vector or device_vector to device_vector. Also note that individual elements of a device_vector can be accessed using the standard bracket notation. However, because each of these accesses requires a call to cudaMemcpy, they should be used sparingly. We’ll look at some more efficient techniques later.
Minimize the data transfers caused by element-wise [] accesses and = copies.
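A brief sketch of both points (values invented): the = assignments are single bulk transfers, while each bracketed element access costs its own cudaMemcpy:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main()
{
    thrust::host_vector<int> h(4, 7);      // 4 elements, all set to 7
    thrust::device_vector<int> d = h;      // one bulk host-to-device copy

    d[0] = 42;       // convenient, but each [] access is a cudaMemcpy
    int x = d[1];    // another single-element transfer - use sparingly

    thrust::host_vector<int> back = d;     // one bulk device-to-host copy
    return (x == 7 && back[0] == 42) ? 0 : 1;
}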
Thrust can be called from within kernels.
Thrust algorithms may be called from global or device functions when invoked with an execution policy.
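For example, a hedged sketch (kernel name sort_chunks is invented) using the sequential execution policy thrust::seq, so each thread sorts its own slice of the array:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Each thread sorts its own contiguous chunk of `len` elements.
__global__ void sort_chunks(int* data, int len)
{
    int* chunk = data + threadIdx.x * len;
    thrust::sort(thrust::seq, chunk, chunk + len);  // sequential, per-thread
}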
Math API Optimization
The answer lies in Appendix D of the programming guide. The intrinsics for the transcendental, trigonometric, and special functions are faster, but have more domain restrictions and generally lower accuracy than their software counterparts. For the primary purpose of the hardware (i.e., graphics), having fast approximate functions for sin, cos, square root, reciprocal, etc. allows for improved shader performance when ultimate mathematical accuracy is not critical. For some compute tasks, the less accurate versions are also fine. For other applications, the intrinsics may not be sufficient.
Having both allows the informed programmer to have a choice: speed or accuracy.
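As an illustration (kernel name invented), sinf is the accurate software implementation while __sinf is the fast, lower-accuracy hardware intrinsic; compiling with -use_fast_math silently replaces the former with the latter:

__global__ void sin_both(float x, float* out)
{
    out[0] = sinf(x);    // accurate software version
    out[1] = __sinf(x);  // fast intrinsic: lower accuracy, limited domain
}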