Building an HPC Cluster Environment with InfiniBand (IBGD)
Author: 서진우
1. Setting up the basic Linux HPC cluster environment
– Network configuration (ifcfg-eth0, hostname, etc.)
– rsh and ssh setup
– Intel compiler installation (icc, ifc, etc.)
2. Installing IBGD (InfiniBand Gold Distribution)
IBGD is a scripted tool that automatically sets up an InfiniBand development environment.
It performs the basic configuration, then builds RPMs optimized for the system and installs them, in sequence.
It provides the HCA card modules used by InfiniBand, an MPICH build that supports the InfiniBand protocol,
and benchmark tools.
A basic development environment must already be in place, with glib-devel installed.
Unpack the IBGD-1.8.0.gz source and run install.sh.
[root@noco01 infini]# tar xzvf IBGD-1.8.0.gz
[root@noco01 infini]# cd IBGD-1.8.0
[root@noco01 IBGD-1.8.0]# ./install.sh
InfiniBand Gold Distribution (IBGD) Software Installation Menu
1) View IBGD Installation Guide
2) Install IBGD Software
3) Show Installed Software
4) Configure IPoIB Network Interface, IBADM Server, and OpenSM Server
5) Uninstall IBGD Software
6) Build IBGD Software RPMs
Q) Exit
Select Option [1-6]:
-> choose 2
Select IBGD Software
1) Typical (ib-verbs, ib-ipoib, opensm, ibadm and mpi)
2) Minimal (ib-verbs only)
3) All packages (ib-verbs, ib-ipoib, ib-cm, ib-sdp, ib-dapl, ib-srp, opensm, ibadm, mpi, pdsh)
4) Customize
Q) Exit
ib-ipoib -> IP over InfiniBand (IPoIB) network driver
ib-verbs -> core InfiniBand verbs layer (HCA access)
opensm -> InfiniBand subnet manager
ibadm -> administration commands
mpi -> MPI over InfiniBand
Select Option [1-4]:
-> choose 1
The following compiler(s) on your system can be used to build/install MPI: gcc intel
Next you will be prompted to choose the compiler(s) with which to build/install the MPI RPM(s)
Do you wish to create/install an MPI RPM with gcc? [Y/n]:
Do you wish to create/install an MPI RPM with intel? [Y/n]:
Next you will be prompted to enter the number of CPUs in your cluster.
small: 1 – 63 CPUs
medium: 64 – 255 CPUs
big: 256+ CPUs
Please select the size of your cluster [small/medium/big]:
-> small
Following is the list of IBGD packages that you have chosen
(some may have been added by the installation program due to package dependencies):
ib-ipoib
ib-verbs
opensm
mpi_osu
ibadm
Preparing to build the IBGD RPMs:
RPM build process uses a temporary directory.
Please enter the temporary directory [/var/tmp/IBGD]:
Please enter IBGD installation directory [/usr/local/ibgd]:
The following compiler(s) will be used to build the MPI RPM(s): intel gcc
Checking dependencies. Please wait …
Building InfiniBand Software RPMs. Please wait…
Building ib RPMs. Please wait…
Running /tmp/ib-1.8.0/build_rpm.sh --prefix /usr/local/ibgd --build_root /var/tmp/IBGD \
--packages ib_ipoib ib_verbs -- -kver 2.6.9-11.EL.rootsmp --ksrc /lib/modules/2.6.9-11.EL.rootsmp/source
.
.
RPMs optimized for the system are built at this point. When the build completes, the installer continues:
Configuring IPoIB:
The default IPoIB interface configuration is based on a LAN interface configuration.
You may change this default configuration in the following steps.
Enter LAN interface to be used for setting ib0 interface [eth0]:ib0
Configuring IPoIB:
The default IPoIB interface configuration is based on a LAN interface configuration.
You may change this default configuration in the following steps.
Enter LAN interface to be used for setting ib0 interface [eth0]:ib1
ib0 configuration:
Current IPOIB configuration for ib0
DEVICE=ib0
BOOTPROTO=static
IPADDR=193.168.123.111
NETMASK=255.255.255.0
NETWORK=193.168.123.0
BROADCAST=193.168.123.255
ONBOOT=yes
Do you want to change this configuration? [y/N]: n
IPOIB interface configured successfully
Configuring OpenSM:
Enter OpenSM Server IP Address [192.168.123.111]:
Configuring IBADM:
Please provide IBADM (Infiniband Administration Package) configuration:
Enter IBADM Name Server and In-Band Server IP Address (one per IB subnet) [192.168.123.111]:
Enter In-Band Server Hostname [H-1]: -> node01
Enter firmware work directory [/tmp]:
Creating FW directory to be used by IBADM server
Running tar xzvf /usr/local/src/infini/IBGD-1.8.0/SOURCES/ibgd1.8.0.fwrel.tgz
/usr/local/ibgd/FW directory updated with new FW release
Do you want to install IBGranite Cluster Verification Suite (ibgfvs-1.0.0 – beta version) [y/N]?
InfiniBand Gold Distribution (IBGD) Software Installation Menu
1) View IBGD Installation Guide
2) Install IBGD Software
3) Show Installed Software
4) Configure IPoIB Network Interface, IBADM Server, and OpenSM Server
5) Uninstall IBGD Software
6) Build IBGD Software RPMs
Q) Exit
Select Option [1-6]:
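At this point the installation is complete.
Starting openibd loads the kernel modules and configures the network automatically, but this automation is buggy,
so it is better to configure the network manually.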
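A minimal sketch of the manual route, assuming a Red Hat style system where interface files live under /etc/sysconfig/network-scripts (that path, the function name make_ifcfg, and the address used here are assumptions for illustration). It only prints a file with the same fields the installer displayed for ib0, so the values can be checked before anything is written.

```shell
# Print an ifcfg-style file for an IPoIB interface.
# Arguments (all illustrative): device name, IP address, netmask.
make_ifcfg() {
    dev=$1
    ip=$2
    mask=$3
    cat <<EOF
DEVICE=$dev
BOOTPROTO=static
IPADDR=$ip
NETMASK=$mask
ONBOOT=yes
EOF
}

# Preview the file for ib0. Once the values are right, redirect the output
# to /etc/sysconfig/network-scripts/ifcfg-ib0 (assumed location).
make_ifcfg ib0 192.168.123.111 255.255.255.0
```

Each node needs its own address, so the same call would be repeated per node with that node's IP.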
With the installation complete, configure ibadm.
[root@noco01 ~]# vi /etc/ibadm.hosts
——————————————————————————————
192.168.123.111
192.168.123.112
.
.
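(list every node on the InfiniBand fabric)
Then copy the following files from the first node to all other nodes (dua2 is used here as the copy-to-all-nodes helper).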
[root@noco01 ~]# dua2 /etc/ibadm.hosts
[root@noco01 ~]# dua2 /etc/ibadm.conf
[root@noco01 ~]# dua2 /etc/ibfw
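dua2 is not a standard tool, so on a cluster without it the same step can be approximated with scp. The sketch below is a hypothetical stand-in, not the dua2 implementation: the function name push_to_nodes and the DRY_RUN variable are made up for illustration, and it assumes the password-less ssh access set up in step 1. By default it only prints the commands it would run.

```shell
# Hypothetical stand-in for dua2: copy each given path to the same
# location on every host listed (one per line) in a hosts file.
# With DRY_RUN=1 (the default) the scp commands are printed, not run.
push_to_nodes() {
    hosts_file=$1
    shift
    while read -r host; do
        # Skip blank lines and comment lines.
        case "$host" in ''|'#'*) continue ;; esac
        for path in "$@"; do
            if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "scp -rp $path $host:$path"
            else
                # Keep scp from consuming the hosts file on stdin.
                scp -rp "$path" "$host:$path" < /dev/null
            fi
        done
    done < "$hosts_file"
}

# Example (prints the commands only):
# push_to_nodes /etc/ibadm.hosts /etc/ibadm.hosts /etc/ibadm.conf /etc/ibfw
```

Setting DRY_RUN=0 performs the copies. The host list includes the first node itself, so that entry can be removed from the list if copying onto the local machine is not wanted.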
Then run the following daemon commands in order on all nodes (dush2 is used here as the run-on-all-nodes helper).
[root@noco01 ~]# dush2 /etc/rc.d/init.d/ibadmd stop
[root@noco01 ~]# dush2 /etc/rc.d/init.d/opensmd stop
[root@noco01 ~]# dush2 /etc/rc.d/init.d/openibd restart
3. Basic network performance tests (perf_main, NetPIPE)
– perf_main test:
Run the following on noco01:
[root@noco01 ~]# perf_main --send -trc -mbw -s 128000 -n 1000
********************************************
********* perf_main version 10.3 *********
********* CPU is: 2993.00 Mcps *********
********* Architecture X86 *********
********************************************
Run the following on noco02:
[root@noco02 ~]# perf_main -a 192.168.123.111 (noco01 ip)
********************************************
********* perf_main version 10.3 *********
********* CPU is: 2993.00 Mcps *********
********* Architecture X86 *********
********************************************
The test result then appears on the noco01 console:
************* RC BW Unidirection Test started for port 1 *********************
BW: 935.6 MBytes/sec [size: 128000 bytes, iter: 1000, total 128000000]
************* RC BW Unidirection Test Finished for port 1 *********************
In other words, the link sustains about 935 MB/s. For comparison, Gigabit Ethernet
delivers roughly 100 MB/s in practice (its line rate of 1 Gbit/s corresponds to 125 MB/s).
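The comparison can be worked through with a little arithmetic. This is only a sketch: the helper names mbytes_to_mbits and rate_ratio are made up for illustration, and 1 MB is taken as 10^6 bytes, which is an assumption about how perf_main reports MBytes.

```shell
# Convert MB/s to Mbit/s (1 byte = 8 bits).
mbytes_to_mbits() {
    awk -v v="$1" 'BEGIN { printf "%.1f\n", v * 8 }'
}

# Ratio of two rates, to one decimal place.
rate_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

mbytes_to_mbits 935.6     # perf_main result expressed in Mbit/s
rate_ratio 935.6 125      # versus the Gigabit Ethernet line rate (125 MB/s)
rate_ratio 935.6 100      # versus the ~100 MB/s practical figure
```

This prints 7484.8, 7.5, and 9.4: about 7.5 Gbit/s, or 7.5 to 9.4 times Gigabit Ethernet depending on which Gigabit figure is used as the baseline.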
– Measuring NPmpi, NPtcp, and NPib performance with NetPIPE
First, compile NetPIPE.
[root@noco01 infini]# tar xzvf NetPIPE_3.6.2.tar.tar
[root@noco01 infini]# cd NetPIPE_3.6.2
Edit the makefile to match the current MPICH environment.
[root@noco01 NetPIPE_3.6.2]# vi makefile
MPICC = /usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpicc
MTHOME = /usr/local/ibgd/driver/infinihost
Then run make as follows.
[root@noco01 NetPIPE_3.6.2]# make mpi
[root@noco01 NetPIPE_3.6.2]# make tcp
[root@noco01 NetPIPE_3.6.2]# make ib
Synchronize the compiled executables to all nodes.
[root@noco01 NetPIPE_3.6.2]# dua2 *
Now use NPmpi to measure the maximum network bandwidth available to MPI communication.
*** NPmpi (MPI bandwidth measured with the InfiniBand-enabled MPICH)
[root@noco01 NetPIPE_3.6.2]# /usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpirun_rsh -rsh -np 2 node01 node02 ./NPmpi
———————————————————————————————————————–
0: noco01
1: noco02
Now starting the main loop
0: 1 bytes 20491 times --> 1.80 Mbps in 4.24 usec
1: 2 bytes 23590 times --> 3.61 Mbps in 4.23 usec
2: 3 bytes 23655 times --> 5.41 Mbps in 4.23 usec
.
.
108: 1572867 bytes 40 times --> 7314.13 Mbps in 1640.66 usec
109: 2097149 bytes 20 times --> 7335.83 Mbps in 2181.07 usec
110: 2097152 bytes 22 times --> 7334.86 Mbps in 2181.36 usec
111: 2097155 bytes 22 times --> 7331.74 Mbps in 2182.30 usec
112: 3145725 bytes 22 times --> 7354.99 Mbps in 3263.09 usec
113: 3145728 bytes 20 times --> 7355.64 Mbps in 3262.80 usec
114: 3145731 bytes 20 times --> 7351.55 Mbps in 3264.62 usec
115: 4194301 bytes 10 times --> 7365.29 Mbps in 4344.70 usec
116: 4194304 bytes 11 times --> 7366.41 Mbps in 4344.04 usec
117: 4194307 bytes 11 times --> 7372.73 Mbps in 4340.32 usec
118: 6291453 bytes 11 times --> 7387.76 Mbps in 6497.23 usec
119: 6291456 bytes 10 times --> 7388.09 Mbps in 6496.95 usec
120: 6291459 bytes 10 times --> 7385.65 Mbps in 6499.09 usec
121: 8388605 bytes 5 times --> 7393.03 Mbps in 8656.80 usec
122: 8388608 bytes 5 times --> 7393.03 Mbps in 8656.80 usec
123: 8388611 bytes 5 times --> 7391.16 Mbps in 8659.00 usec
————————————————————————————————————————–
As shown above, the peak measured bandwidth is about 7,390 Mbps (roughly 920 MB/s), about 7 times that of Gigabit Ethernet.
*** NPtcp (ordinary TCP network bandwidth)
Run the following on node02:
[root@noco02 NetPIPE_3.6.2]# ./NPtcp
Run the following on node01:
[root@noco01 NetPIPE_3.6.2]# ./NPtcp -h node02
—————————————————————————–
.
.
109: 2097149 bytes 3 times --> 1055.77 Mbps in 15154.83 usec
110: 2097152 bytes 3 times --> 1060.92 Mbps in 15081.18 usec
111: 2097155 bytes 3 times --> 1056.31 Mbps in 15147.15 usec
112: 3145725 bytes 3 times --> 1059.30 Mbps in 22656.51 usec
113: 3145728 bytes 3 times --> 1061.99 Mbps in 22599.18 usec
114: 3145731 bytes 3 times --> 1057.49 Mbps in 22695.18 usec
115: 4194301 bytes 3 times --> 1058.45 Mbps in 30232.83 usec
116: 4194304 bytes 3 times --> 1056.31 Mbps in 30294.00 usec
117: 4194307 bytes 3 times --> 1061.43 Mbps in 30148.17 usec
118: 6291453 bytes 3 times --> 1063.24 Mbps in 45144.85 usec
119: 6291456 bytes 3 times --> 1061.07 Mbps in 45237.15 usec
120: 6291459 bytes 3 times --> 1060.43 Mbps in 45264.83 usec
121: 8388605 bytes 3 times --> 1060.78 Mbps in 60332.68 usec
122: 8388608 bytes 3 times --> 1061.87 Mbps in 60271.00 usec
123: 8388611 bytes 3 times --> 1062.30 Mbps in 60246.67 usec
——————————————————————————-
This is at the level of ordinary Gigabit Ethernet (about 1,060 Mbps).
*** NPib (native InfiniBand bandwidth)
[root@noco01 NetPIPE_3.6.2]# ./NPib -h node02
——————————————————————————-
.
.
108: 1572867 bytes 40 times --> 7299.72 Mbps in 1643.90 usec
109: 2097149 bytes 20 times --> 7327.83 Mbps in 2183.45 usec
110: 2097152 bytes 22 times --> 7326.93 Mbps in 2183.73 usec
111: 2097155 bytes 22 times --> 7329.99 Mbps in 2182.82 usec
112: 3145725 bytes 22 times --> 7354.01 Mbps in 3263.52 usec
113: 3145728 bytes 20 times --> 7354.85 Mbps in 3263.15 usec
114: 3145731 bytes 20 times --> 7353.85 Mbps in 3263.60 usec
115: 4194301 bytes 10 times --> 7367.33 Mbps in 4343.50 usec
116: 4194304 bytes 11 times --> 7368.11 Mbps in 4343.04 usec
117: 4194307 bytes 11 times --> 7368.11 Mbps in 4343.05 usec
118: 6291453 bytes 11 times --> 7382.49 Mbps in 6501.87 usec
119: 6291456 bytes 10 times --> 7382.23 Mbps in 6502.10 usec
120: 6291459 bytes 10 times --> 7382.06 Mbps in 6502.25 usec
121: 8388605 bytes 5 times --> 7388.34 Mbps in 8662.30 usec
122: 8388608 bytes 5 times --> 7388.76 Mbps in 8661.81 usec
123: 8388611 bytes 5 times --> 7388.18 Mbps in 8662.49 usec
——————————————————————————–
The peak is about 7,390 Mbps (roughly 920 MB/s), similar to the MPI result and about 7 times Gigabit Ethernet.
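NetPIPE reports megabits per second while perf_main reports megabytes per second, so the numbers are easier to compare after a unit conversion. A small sketch, where the helper name mbits_to_mbytes is made up for illustration and the three rates are the largest values in each listing above:

```shell
# Convert Mbit/s (as printed by NetPIPE) to MB/s (8 bits per byte).
mbits_to_mbytes() {
    awk -v v="$1" 'BEGIN { printf "%.1f\n", v / 8 }'
}

# Largest rate from each of the three NetPIPE listings.
for entry in NPmpi:7393.03 NPtcp:1063.24 NPib:7388.76; do
    name=${entry%%:*}
    mbps=${entry#*:}
    echo "$name $mbps Mbps = $(mbits_to_mbytes "$mbps") MB/s"
done
```

The InfiniBand runs come out at 924.1 and 923.6 MB/s, in line with the 935.6 MB/s reported by perf_main, while the TCP run is 132.9 MB/s.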
4. HPL Linpack performance test
The automated installer above also installs an MPICH built for InfiniBand with the Intel compiler.
It does not integrate correctly, however, so rebuild and install it manually.
[root@noco01 src]# tar xzvf mvapich-0.9.5.tar.gz
[root@noco01 src]# cd mvapich-0.9.5
[root@noco01 mvapich-0.9.5]# ./configure --prefix=/usr/local/mvapich-intel -cc=/opt/intel/cc/9.0/bin/icc -c++=/opt/intel/cc/9.0/bin/icc -fc=/opt/intel/fc/9.0/bin/ifort -f90=/opt/intel/fc/9.0/bin/ifort -f90linker=/opt/intel/fc/9.0/bin/ifort --with-arch=LINUX --enable-f77 --enable-f90modules --with-device=vapi --disable-weak-symbols
[root@noco01 mvapich-0.9.5]# make && make install
To use the mvapich installed by the bundled installer instead, run it as follows.
# /usr/local/ibgd/mpi/osu/intel/mvapich-0.9.5/bin/mpirun_rsh -rsh -np 4 node01 node01 node02 node02 ./xhpl
Then run the HPL test.
——————————————————————————————————————-
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 20000
NB : 104
PMAP : Row-major process mapping
P : 1
Q : 4
PFACT : Crout
NBMIN : 4
NDIV : 2
RFACT : Right
BCAST : 1ring
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
—————————————————————————-
– The matrix A is randomly generated for each test.
– The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
– The relative machine precision (eps) is taken to be 5.421011e-20
– Computational tests pass if scaled residuals are less than 16.0
============================================================================
T/V N NB P Q Time Gflops
—————————————————————————-
WR10R2C4 20000 104 1 4 736.74 7.240e+00
—————————————————————————-
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 44.6046954 …… FAILED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 43.1710652 …… FAILED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 8.2276427 …… PASSED
||Ax-b||_oo . . . . . . . . . . . . . . . . . = 0.000000
||A||_oo . . . . . . . . . . . . . . . . . . . = 5076.247098
||A||_1 . . . . . . . . . . . . . . . . . . . = 5074.832190
||x||_oo . . . . . . . . . . . . . . . . . . . = 5.419810
||x||_1 . . . . . . . . . . . . . . . . . . . = 20664.162508
============================================================================
Finished 1 tests with the following results:
0 tests completed and passed residual checks,
1 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
—————————————————————————-
End of Tests.
============================================================================
Average CPU usage: 99%
For reference, here are results of the same test in 100 Mbit and 1000 Mbit network environments.
**** 100 Mbit environment
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 20000
NB : 104
PMAP : Row-major process mapping
P : 1
Q : 4
PFACT : Crout
NBMIN : 4
NDIV : 2
RFACT : Right
BCAST : 1ring
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
—————————————————————————-
– The matrix A is randomly generated for each test.
– The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
– The relative machine precision (eps) is taken to be 5.421011e-20
– Computational tests pass if scaled residuals are less than 16.0
============================================================================
T/V N NB P Q Time Gflops
—————————————————————————-
WR10R2C4 20000 104 1 4 952.82 5.598e+00
—————————————————————————-
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 35.8555115 …… FAILED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 34.7030870 …… FAILED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 6.6137956 …… PASSED
||Ax-b||_oo . . . . . . . . . . . . . . . . . = 0.000000
||A||_oo . . . . . . . . . . . . . . . . . . . = 5076.247098
||A||_1 . . . . . . . . . . . . . . . . . . . = 5074.832190
||x||_oo . . . . . . . . . . . . . . . . . . . = 5.419810
||x||_1 . . . . . . . . . . . . . . . . . . . = 20664.162508
============================================================================
Finished 1 tests with the following results:
0 tests completed and passed residual checks,
1 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
—————————————————————————-
End of Tests.
============================================================================
Average CPU usage: 80%
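The Gflops column can be cross-checked from N and the run time. The sketch below assumes the usual LU operation count of 2/3 N^3 plus a lower-order 3/2 N^2 term. The exact lower-order coefficient is an assumption and does not change the result at three decimals for N = 20000. The helper name hpl_gflops is made up for illustration.

```shell
# Estimated Gflops for an N x N HPL run that took t seconds,
# using an operation count of 2/3*N^3 + 3/2*N^2 (assumed).
hpl_gflops() {
    awk -v n="$1" -v t="$2" \
        'BEGIN { printf "%.3f\n", (2.0 / 3.0 * n * n * n + 1.5 * n * n) / t / 1e9 }'
}

hpl_gflops 20000 736.74   # InfiniBand run
hpl_gflops 20000 952.82   # 100 Mbit run
# Speedup of the InfiniBand run, from the two wall times.
awk 'BEGIN { printf "%.2f\n", 952.82 / 736.74 }'
```

This prints 7.240 and 5.598, matching the two reports, and a speedup of 1.29 for InfiniBand over the 100 Mbit network at this problem size.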