Teragon CUDA 2016 Deployment Technical Document
######## Building a GPGPU CUDA HPC Environment
Date   : July 1, 2016
Author : 서진우 (alang@clunix.com)
To set up a CUDA environment, the freeglut and freeglut-devel RPM packages
must already be installed on the OS.
1. Downloading and Installing the CUDA Toolkit
https://developer.nvidia.com/
id : muchunalang@gmail.com
pw : Root21!
# wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run
# wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/gdk/gdk_linux_amd64_352_79_release.run
# wget http://developer.download.nvidia.com/compute/cuda/5_5/rel/nvml/tdk_5.319.85.tar.gz
or
# cd TRG_PKG_2016/cuda
# ./cuda_7.5.18_linux.run --help
# ./cuda_7.5.18_linux.run --driver --toolkit --samples --silent --verbose
# ./cuda_7.5.18_linux.run --toolkit --silent --verbose
# ./cuda_7.5.18_linux.run --samples --silent --verbose
# vi /etc/profile.d/cuda_env.sh
-----------------------------------------------------------------
#!/bin/sh
CUDA_HOME=/usr/local/cuda
PATH=${CUDA_HOME}/bin:$PATH
LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
NVIDIA_CUDA_SDK=${CUDA_HOME}/samples
export CUDA_HOME PATH LD_LIBRARY_PATH NVIDIA_CUDA_SDK
------------------------------------------------------------------
# source /etc/profile.d/cuda_env.sh
# which nvcc
/usr/local/cuda/bin/nvcc
# sh gdk_linux_amd64_352_79_release.run
Logging to /tmp/gdk_install_23161.log
Welcome to the GPU Deployment Installer.
Enter installation directory [ default is / ]: /usr/local/cuda
Installation complete!
Installation directory: /usr/local/cuda
# tar xzvf tdk_5.319.85.tar.gz
# cp -a tdk_5.319.85/nvml /usr/local/cuda
2. GPU BIOS Settings
- Turning ECC mode on/off
Depending on the development environment, you may be asked to turn the ECC
feature of the GPU memory on or off.
# nvidia-smi
Wed Jul 6 12:48:55 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 0000:03:00.0     Off |                    0 |
| 23%   29C    P0    64W / 235W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID K2             Off  | 0000:84:00.0     Off |                  Off |
| N/A   32C    P0    45W / 117W |     11MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID K2             Off  | 0000:85:00.0     Off |                  Off |
| N/A   29C    P0    34W / 117W |     11MiB /  4095MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
# nvidia-smi -q        ;; check ECC mode
Ecc Mode
Current : Disabled
Pending : Disabled
# nvidia-smi -i <GPU num> --ecc-config=0    ;; Off
# nvidia-smi -i <GPU num> --ecc-config=1    ;; On
- Changing persistence mode
The nvidia-smi command sometimes responds slowly.
In most cases this is because persistence mode is disabled.
Enable persistence mode as shown below.
(When most programs that use the NVML API respond slowly, this usually fixes it.)
# nvidia-smi -q | grep Persis
Persistence Mode : Disabled
Persistence Mode : Disabled
Persistence Mode : Disabled
# nvidia-smi -pm 1
Persistence Mode : Enabled
Persistence Mode : Enabled
Persistence Mode : Enabled
- Controlling GPU device assignment (compute mode)
On a system with multiple GPUs, when several GPU jobs are submitted at the same
time, by default they all land on GPU 0.
There are two ways to deal with this: select the GPU explicitly in the CUDA code
with cudaSetDevice(devId), or change the compute mode at the driver level, as in
the sketch and the commands below.
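A minimal sketch of the code-level approach (the file name and the command-line
argument handling are illustrative, not taken from the CUDA samples):
-----------------------------------------------------------------
/* select_gpu.c -- pick a GPU explicitly before any other CUDA call.
   Build (assumed): cc -std=c99 select_gpu.c -I/usr/local/cuda/include \
                    -L/usr/local/cuda/lib64 -lcudart -o select_gpu */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int dev = (argc > 1) ? atoi(argv[1]) : 0;   /* device id chosen by the caller */
    int count = 0;

    cudaGetDeviceCount(&count);
    if (count == 0) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    /* Without this call, every process defaults to device 0. */
    cudaSetDevice(dev % count);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev % count);
    printf("using GPU %d: %s\n", dev % count, prop.name);
    return 0;
}
-----------------------------------------------------------------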
# nvidia-smi -q | grep Compute
Compute Mode : Default
Compute Mode : Default
Compute Mode : Default
# nvidia-smi --compute-mode=1
To prohibit jobs from being assigned to a particular GPU:
# nvidia-smi --compute-mode=2
- Key nvidia-smi options for GPU device operations
-pm,  --persistence-mode=   Set persistence mode: 0/DISABLED, 1/ENABLED
-e,   --ecc-config=         Toggle ECC support: 0/DISABLED, 1/ENABLED
-p,   --reset-ecc-errors=   Reset ECC error counts: 0/VOLATILE, 1/AGGREGATE
-c,   --compute-mode=       Set MODE for compute applications:
                            0/DEFAULT, 1/EXCLUSIVE_THREAD,
                            2/PROHIBITED, 3/EXCLUSIVE_PROCESS
      --gom=                Set GPU Operation Mode:
                            0/ALL_ON, 1/COMPUTE, 2/LOW_DP
-r    --gpu-reset           Trigger reset of the GPU.
                            Can be used to reset the GPU HW state in situations
                            that would otherwise require a machine reboot.
                            Typically useful if a double bit ECC error has
                            occurred.
                            Reset operations are not guaranteed to work in
                            all cases and should be used with caution.
                            --id= switch is mandatory for this switch
-ac   --applications-clocks= Specifies <memory,graphics> clocks as a
                            pair (e.g. 2000,800) that defines GPU's
                            speed in MHz while running applications on a GPU.
-rac  --reset-applications-clocks
                            Resets the applications clocks to the default values.
-acp  --applications-clocks-permission=
                            Toggles permission requirements for -ac and -rac commands:
                            0/UNRESTRICTED, 1/RESTRICTED
-pl   --power-limit=        Specifies maximum power management limit in watts.
-am   --accounting-mode=    Enable or disable Accounting Mode: 0/DISABLED, 1/ENABLED
-caa  --clear-accounted-apps
                            Clears all the accounted PIDs in the buffer.
      --auto-boost-default= Set the default auto boost policy to 0/DISABLED
                            or 1/ENABLED, enforcing the change only after the
                            last boost client has exited.
      --auto-boost-permission=
                            Allow non-admin/root control over auto boost mode:
                            0/UNRESTRICTED, 1/RESTRICTED
3. Setting Up the MPI Environment
- Installing CUDA-enabled OpenMPI
# cd ~/TRG_PKG_2016/mpi/openmpi-intel
# tar xjvf openmpi-1.8.7.tar.bz2
# cd openmpi-1.8.7
export FC=ifort
export F77=ifort
export CC=icc
export CXX=icpc
export RSHCOMMAND=/usr/bin/ssh
# ./configure --prefix=/APP/enhpc/mpi/openmpi-intel-cuda --enable-mpi-cxx --enable-mpi-fortran --enable-shared --enable-mpi-thread-multiple --with-sge --with-cuda=/usr/local/cuda
# make -j 4 && make install
# vi /APP/enhpc/profile.d/openmpi-intel-cuda.sh
----------------------------------------------------------------
#!/bin/sh
MPI_HOME=/APP/enhpc/mpi/openmpi-intel-cuda
PATH=${MPI_HOME}/bin:$PATH
LD_LIBRARY_PATH=${MPI_HOME}/lib:$LD_LIBRARY_PATH
export MPI_HOME PATH LD_LIBRARY_PATH
----------------------------------------------------------------
# source /APP/enhpc/profile.d/openmpi-intel-cuda.sh
4. Basic CUDA Tests
# source /etc/profile.d/cuda_env.sh
# cd /usr/local/cuda/samples/
# make 2> error.log
Inspect error.log. If the build finishes without any notable errors, the CUDA
compile environment can be considered healthy.
Move to the directory where the CUDA samples were built.
# cd /usr/local/cuda/samples/bin/x86_64/linux/release
// For a description of basic usage, see the cuda-install-2012 document.
- GPU detection test
# ./deviceQuery
- GPU PCI-E bandwidth test
# ./bandwidthTest --memory=pinned --device=<N>
There is another program for measuring bandwidth between the GPU and PCI-E.
# cd ~/TRG_PKG_2016/cuda
# icc -o concBandwidthTest -std=c99 -I /usr/local/cuda/include concBandwidthTest.c -L /usr/local/cuda/lib -lcuda -lpthread
# ./concBandwidthTest 0 1 2 3
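For reference, the kind of measurement these tools perform can be reproduced
with a few CUDA runtime calls. Below is a minimal sketch (the file name, buffer
size, and iteration count are illustrative) that times pinned host-to-device copies:
-----------------------------------------------------------------
/* bw_sketch.c -- time pinned-memory host-to-device copies.
   Build (assumed): cc -std=c99 bw_sketch.c -I/usr/local/cuda/include \
                    -L/usr/local/cuda/lib64 -lcudart -o bw_sketch */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 32 << 20;          /* 32 MiB per transfer */
    const int    iters = 10;
    void *h = NULL, *d = NULL;
    cudaMallocHost(&h, bytes);              /* pinned host buffer */
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0, 0);
    for (int i = 0; i < iters; i++)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("HtoD: %.1f MB/s\n",
           (double)iters * bytes / (1024.0 * 1024.0) / (ms / 1000.0));

    cudaFree(d);
    cudaFreeHost(h);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return 0;
}
-----------------------------------------------------------------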
- Flops (floating-point performance) test with nbody
# ./nbody -benchmark -numbodies=131072 -device=<dev_id>
To run the computation on all installed GPU devices:
# ./nbody -benchmark -numbodies=131072 -numdevices=<N>
- Basic CUDA-MPI test
# source /APP/enhpc/profile.d/openmpi-intel-cuda.sh
# cd /usr/local/cuda/samples/bin/x86_64/linux/release
# mpirun -np 4 -machinefile ./gpuhosts ./simpleMPI
-------------------------------------------------------------------------------
Running on 4 nodes
Average of square roots is: 0.667305
PASSED
-------------------------------------------------------------------------------
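When running several MPI ranks per node with your own CUDA-MPI code, each rank
normally has to pick its own GPU (see the compute-mode discussion in section 2).
A minimal sketch, assuming the openmpi-intel-cuda build above (the file name is
illustrative):
-----------------------------------------------------------------
/* mpi_gpu_bind.c -- spread MPI ranks over the GPUs visible on the node.
   Build (assumed): mpicc -std=c99 mpi_gpu_bind.c -I/usr/local/cuda/include \
                    -L/usr/local/cuda/lib64 -lcudart -o mpi_gpu_bind */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank = 0, ndev = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);
    /* NOTE: on multi-node runs the node-local rank should be used;
       plain rank % ndev is only correct on a single node. */
    int dev = (ndev > 0) ? rank % ndev : 0;
    cudaSetDevice(dev);
    printf("rank %d -> GPU %d of %d\n", rank, dev, ndev);

    MPI_Finalize();
    return 0;
}
-----------------------------------------------------------------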
5. BMT Performance Measurements
For the remaining details, see the "2012 Teragon GPGPU" document.
Basic information and measured performance of the K40 and GRID K2 used in the tests:
- Installing the monitoring command
# cc gpu_sensor.c -I/usr/local/cuda/nvml/include -lnvidia-ml -DSTANDALONE -o gpu_sensor
# ./gpu_sensor
begin
wpsvr01:gpu.0.name:Tesla K40c
wpsvr01:gpu.0.busId:0000:03:00.0
wpsvr01:gpu.0.fanspeed:23
wpsvr01:gpu.0.clockspeed:745
wpsvr01:gpu.0.memfree:12859617280
wpsvr01:gpu.0.memused:25088000
wpsvr01:gpu.0.memtotal:12884705280
wpsvr01:gpu.0.utilgpu:0
wpsvr01:gpu.0.utilmem:0
wpsvr01:gpu.0.temperature:33
wpsvr01:gpu.1.name:GRID K2
wpsvr01:gpu.1.busId:0000:84:00.0
wpsvr01:gpu.1.fanspeed:0
wpsvr01:gpu.1.clockspeed:745
wpsvr01:gpu.1.memfree:4283052032
wpsvr01:gpu.1.memused:11718656
wpsvr01:gpu.1.memtotal:4294770688
wpsvr01:gpu.1.utilgpu:0
wpsvr01:gpu.1.utilmem:0
wpsvr01:gpu.1.temperature:37
wpsvr01:gpu.2.name:GRID K2
wpsvr01:gpu.2.busId:0000:85:00.0
wpsvr01:gpu.2.fanspeed:0
wpsvr01:gpu.2.clockspeed:745
wpsvr01:gpu.2.memfree:4283052032
wpsvr01:gpu.2.memused:11718656
wpsvr01:gpu.2.memtotal:4294770688
wpsvr01:gpu.2.utilgpu:7
wpsvr01:gpu.2.utilmem:0
wpsvr01:gpu.2.temperature:33
end
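The gpu_sensor.c source itself is not reproduced here; a minimal NVML sketch
that collects the same kind of fields (name, memory, utilization, temperature)
could look like the following, using the header/library locations from the
compile line above (the file name is illustrative):
-----------------------------------------------------------------
/* nvml_sketch.c -- query per-GPU name/memory/utilization/temperature via NVML.
   Build (assumed): cc nvml_sketch.c -I/usr/local/cuda/nvml/include \
                    -lnvidia-ml -o nvml_sketch */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int i, count;

    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    nvmlDeviceGetCount(&count);

    for (i = 0; i < count; i++) {
        nvmlDevice_t dev;
        char name[64];
        nvmlMemory_t mem;
        nvmlUtilization_t util;
        unsigned int temp;

        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetMemoryInfo(dev, &mem);
        nvmlDeviceGetUtilizationRates(dev, &util);
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);

        printf("gpu.%u.name:%s\n",        i, name);
        printf("gpu.%u.memused:%llu\n",   i, (unsigned long long)mem.used);
        printf("gpu.%u.memtotal:%llu\n",  i, (unsigned long long)mem.total);
        printf("gpu.%u.utilgpu:%u\n",     i, util.gpu);
        printf("gpu.%u.temperature:%u\n", i, temp);
    }

    nvmlShutdown();
    return 0;
}
-----------------------------------------------------------------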
- NVIDIA K40 baseline performance
# ./deviceQuery
CUDA core : 2880
GPU Memory : 12GB
GPU Clock : 745 MHz (0.75 GHz)
memory bandwidth (ECC off) : 288GB/s
Rpeak (double) : 1.4 TFlops
Rpeak (single) : 4.3 TFlops
# ./bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla K40c
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10298.1
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10301.4
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 187232.0
# /data3/TRG_PKG_2016/cuda/concBandwidthTest 0
Device 0 took 635.972717 ms
Average HtoD bandwidth in MB/s: 10063.325195
Device 0 took 636.052002 ms
Average DtoH bandwidth in MB/s: 10062.070312
# ./nbody -benchmark -numbodies=131072 -device=0
gpuDeviceInit() CUDA Device [0]: "Tesla K40c
number of bodies = 131072
131072 bodies, total time for 10 iterations: 1971.515 ms
= 87.140 billion interactions per second
= 1742.809 single-precision GFLOP/s at 20 flops per interaction
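As a sanity check on these numbers: nbody evaluates all 131072^2 = 1.718 x 10^10
pair interactions per iteration, so 10 iterations in 1.9715 s is about
87.1 x 10^9 interactions/s; at 20 flops per interaction that is roughly
1743 single-precision GFLOP/s, i.e. about 40% of the 4.3 TFlops single-precision
Rpeak listed above.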
- NVIDIA GRID K2 baseline performance
# ./deviceQuery
CUDA core : 1536
GPU Clock : 745 MHz (0.75 GHz)
GPU Memory : 4GB
memory bandwidth (ECC off) :
Rpeak (double) :
Rpeak (single) :
# ./bandwidthTest --memory=pinned --device=1
[CUDA Bandwidth Test] - Starting...
Running on...
Device 1: GRID K2
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3820.2
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 9191.3
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 120137.0
# /data3/TRG_PKG_2016/cuda/concBandwidthTest 1
Device 1 took 1526.583496 ms
Average HtoD bandwidth in MB/s: 4192.368164
Device 1 took 751.509583 ms
Average DtoH bandwidth in MB/s: 8516.192383
# ./nbody -benchmark -numbodies=131072 -device=1
gpuDeviceInit() CUDA Device [1]: "GRID K2
number of bodies = 131072
131072 bodies, total time for 10 iterations: 3464.428 ms
= 49.589 billion interactions per second
= 991.787 single-precision GFLOP/s at 20 flops per interaction
- HPL test results
// Run MPI with MPICH2 if possible; when run with OpenMPI, the OpenMP
// threading does not seem to work properly.
# source /etc/profile.d/mpich2-intel-hd.sh
# tar xzvf hpl-2.0_FERMI_v15.gz
# cd hpl-2.0_FERMI_v15
# vi Make.CUDA
TOPdir = /APP/enhpc/hpl-cuda
.
LAdir = /APP/enhpc/compiler/intel/v15/mkl/lib/intel64
.
LAlib = -L $(TOPdir)/src/cuda -ldgemm -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -L$(LAdir) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
.
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc) -I/usr/local/cuda/include
.
# make arch=CUDA
# vi run_linpack
---------------------------------------------------------------------------------
#!/bin/bash
#location of HPL
HPL_DIR=/APP/enhpc/hpl-cuda
# Number of CPU cores ( per GPU used = per MPI process )
CPU_CORES_PER_GPU=1
# FOR MKL
export MKL_NUM_THREADS=$CPU_CORES_PER_GPU
# FOR GOTO
export GOTO_NUM_THREADS=$CPU_CORES_PER_GPU
# FOR OMP
export OMP_NUM_THREADS=$CPU_CORES_PER_GPU
export MKL_DYNAMIC=FALSE
# hint: for 2050 or 2070 card
# try 350/(350 + MKL_NUM_THREADS*4*cpu frequency in GHz)
export CUDA_DGEMM_SPLIT=0.90
# hint: try CUDA_DGEMM_SPLIT – 0.10
export CUDA_DTRSM_SPLIT=0.80
export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:$LD_LIBRARY_PATH
#$HPL_DIR/bin/CUDA/xhpl
mpirun -np 1 -machinefile ./mpihosts $HPL_DIR/bin/CUDA/xhpl
----------------------------------------------------------------------------------
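As a worked example of the CUDA_DGEMM_SPLIT hint in the script above: with
MKL_NUM_THREADS=1 and a 2.6 GHz core it evaluates to 350/(350 + 1*4*2.6) ~= 0.97.
The hint is written for Fermi-class (2050/2070) cards; the 0.90 / 0.80 values
above are simply the starting points used for this setup.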
// With the code below, jobs do get registered in GPU memory, but GPU
// utilization stays at 0 and the flops do not appear GPU-accelerated either.
// For reference only.
# git clone git://github.com/avidday/hpl-cuda.git
# cd hpl-cuda
# vi Make.CUDA
-------------------------------------------------------------------
TOPdir = /APP/enhpc/hpl-cuda
.
.
### Add the following
MPICH2_INSTALL_PATH=/APP/enhpc/mpi/mpich2-intel-hd
BLAS_INSTALL_PATH=/APP/enhpc/compiler/intel/v15/mkl/lib/intel64
MPICH2_INCLUDES=-I/APP/enhpc/mpi/mpich2-intel-hd/include
MPICH2_LIBRARIES=-L/APP/enhpc/mpi/mpich2-intel-hd/lib -lmpich -lmpl
BLAS_INCLUDES=-I/APP/enhpc/compiler/intel/v15/mkl/include
BLAS_LIBRARIES=-L/APP/enhpc/compiler/intel/v15/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
CUBLAS_INCLUDES=-I/usr/local/cuda/include
CUBLAS_LIBRARIES=-L/usr/local/cuda/lib64 -lcublas -lcudart -lcuda
######################################################################
.
.
CC = /APP/enhpc/mpi/mpich2-intel-hd/bin/mpicc
CXX = /APP/enhpc/mpi/mpich2-intel-hd/bin/mpic++
LINKER = /APP/enhpc/mpi/mpich2-intel-hd/bin/mpicc
# make arch=CUDA
CPU-only run:
Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz * 20 core
(Rpeak : avx2 clock 2.2GHz = 704Gflops, Rmax : 620Gflops)
CCFLAGS : -O3 -march=core-avx-i -mtune=core-avx-i -mavx2
Column=005376 Fraction=0.065 Mflops=611659.01
With the NVIDIA K40 GPU:
N : 115200
NB : 1024
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 1ring
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : no-transposed form
EQUIL : yes
ALIGN : 8 double precision words
With 1 core + 1 GPU: up to 953 GFlops measured (theoretical peak 1.4 TFlops)
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L2      115200  1024     1     1            1069.11              9.534e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0026113 ...... PASSED
================================================================================
With 4 cores + 1 GPU: up to 1.180 TFlops measured (CPU contribution about 140 GFlops)
1180 GFlops - 140 GFlops (4 cores) = 1040 GFlops from the GPU (about 74% of Rpeak)
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L2      115200  1024     1     1             863.87              1.180e+03
With 8 cores + 1 GPU (1333 - 281 = 1052 GFlops from the GPU):
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L4      115200  1152     1     1             764.66              1.333e+03
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0027419 ...... PASSED
================================================================================
With 10 cores + 1 GPU (1407 - 352 = 1055 GFlops from the GPU):
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L4      115200  1152     1     1             724.59              1.407e+03
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0028644 ...... PASSED
================================================================================
6. Using GPUs in CAE Software
- ansys mechanical
ansys150 -acc nvidia -na <N>    ; the -na option defaults to 1
The ANSYS GPU licensing scheme counts a GPU the same way as a CPU core.
In other words, if you had a license covering 8 cores, you can use it as
6 cores + 2 GPUs instead.
;; ANSYS license accounting
1 ANSYS job license covers 1 job and includes 2-core parallel.
1 anshpc_pack license  (2 cores x 4 = 8 cores)
2 anshpc_pack licenses (2 cores x 4 x 4 = 32 cores)
3 anshpc_pack licenses (2 cores x 4 x 4 x 4 = 128 cores)
- fluent
fluent 3d -t8 -gpgpu=<N>
- abaqus
abaqus job=<> cpus=<> gpus=<gpu_num> int
For Abaqus, explicit models do not support GPGPU; only Abaqus/Standard is
supported.
Using GPGPU requires a license with the "gpgpu" feature name.
The gpgpu license is not token-based; it is counted by the number of GPUs used.
That is, using 1 GPU requires 1 gpgpu license.
GPU resource usage pattern: the job allocates essentially the full memory of
the GPU, while GPU utilization fluctuates at roughly 3-30% from moment to
moment.
s4b.inp analysis
1core, 1gpu = 23m 05s
1core, 0gpu = 57m 33s
1core, 1gpu x 2job = 23m 52s, 55m 57s
1core, 0gpu x 2job = 58 m21s, 57m 21s
;; Two jobs can be submitted to one GPU at the same time, but the job assigned
first allocates more than 95% of that GPU's memory, so the second job only gets
the remaining ~5%. As a result, the second job eventually prints the message
below during the solve and does not get real GPU acceleration:
Supernode 27182 is too large for the GPU: 232 blocks required,
74 blocks available clique 408 front 1245 0
2core, 1gpu = 14m 44s
2core, 0gpu = 31m 27s
4core, 1gpu = 10m 54s
4core, 0gpu = 19m 16s
8core, 1gpu = 8m 46s
8core, 0gpu = 12m 08s
16core, 1gpu = 7m 58s
16core, 0gpu = 8m 52s
20core, 1gpu = 8m 2s
20core, 0gpu = 8m 25s
150K_BS_V31_F_sag2-fine.inp analysis
// Some variance exists because other jobs were running at the same time.
//2core, 1gpu = 8h 3m 58s
//2core, 0gpu = 12h 45m 28s
1core, 1gpu = 11h 29m 2s
1core, 0gpu = 25h 01m 39s
2core, 1gpu = 8h 4m 17s
2core, 0gpu = 12h 42m 24s
4core, 1gpu = 3h 33m 5s
4core, 0gpu = 7h 12m 03
8core, 1gpu = 3h 47m 22s
8core, 0gpu = 4h 27m 19
16core, 1gpu = 3h 00m 56s
16core, 0gpu = 3h 11m 03s
20core, 1gpu = 2h 33m 24s
20core, 0gpu = 2h 45m 28s
- lammps
- wrf
- amber
- namd
- CHARMM