Building an HPC Cluster Environment with InfiniBand (IBGD)

Author: 서진우

1. Setting Up the Basic Linux HPC Cluster Environment

        – Network configuration (ifcfg-eth0, hostname, ...)

        – rsh and ssh setup

        – Intel compiler installation (icc, ifc, ...)

2. Installing IBGD (InfiniBand Gold Distribution)

IBGD is a script tool that automates building an InfiniBand development environment. It applies the basic configuration, then builds and installs RPMs optimized for the system, in sequence. It provides the HCA card modules used by InfiniBand, an mpich that supports the InfiniBand protocol, and benchmark tools.

In addition to a basic development environment, glib-devel must be installed.

Unpack the IBGD-1.8.0.gz source and run install.sh.

[root@noco01 infini]# tar xzvf IBGD-1.8.0.gz

[root@noco01 infini]# cd IBGD-1.8.0

[root@noco01 IBGD-1.8.0]# ./install.sh

         InfiniBand Gold Distribution (IBGD) Software Installation Menu

          1) View IBGD Installation Guide

          2) Install IBGD Software

          3) Show Installed Software

          4) Configure IPoIB Network Interface, IBADM Server, and OpenSM Server

          5) Uninstall IBGD Software

          6) Build IBGD Software RPMs

          Q) Exit

Select Option [1-6]:

                           -> select option 2

          Select IBGD Software

          1) Typical (ib-verbs, ib-ipoib, opensm, ibadm and mpi)

          2) Minimal (ib-verbs only)

          3) All packages (ib-verbs, ib-ipoib, ib-cm, ib-sdp, ib-dapl, ib-srp, opensm, ibadm, mpi, pdsh)

          4) Customize

          Q) Exit

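                                ib-ipoib -> IP over InfiniBand (IPoIB) network device driver

                                ib-verbs -> InfiniBand verbs layer (HCA driver and API)

                                opensm   -> InfiniBand subnet manager

                                ibadm    -> administration commands

                                mpi      -> MPI for InfiniBand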

Select Option [1-4]:

                          -> select option 1

The following compiler(s) on your system can be used to build/install MPI:  gcc intel

Next you will be prompted to choose the compiler(s) with which to build/install the MPI RPM(s)

Do you wish to create/install an MPI RPM with gcc? [Y/n]:

Do you wish to create/install an MPI RPM with intel? [Y/n]:

Next you will be prompted to enter the number of CPUs in your cluster.

        small:  1 – 63 CPUs

        medium: 64 – 255 CPUs

        big:    256+ CPUs

Please select the size of your cluster [small/medium/big]:

                                                               -> small

Following is the list of IBGD packages that you have chosen

            (some may have been added by the installation program due to package dependencies):

ib-ipoib

ib-verbs

opensm

mpi_osu

ibadm

Preparing to build the IBGD RPMs:

RPM build process uses a temporary directory.

Please enter the temporary directory [/var/tmp/IBGD]:

Please enter IBGD installation directory [/usr/local/ibgd]:

The following compiler(s) will be used to build the MPI RPM(s): intel gcc

Checking dependencies. Please wait …

Building InfiniBand Software RPMs. Please wait…

Building ib RPMs. Please wait…

Running /tmp/ib-1.8.0/build_rpm.sh --prefix /usr/local/ibgd --build_root /var/tmp/IBGD \

--packages ib_ipoib ib_verbs -- -kver 2.6.9-11.EL.rootsmp --ksrc /lib/modules/2.6.9-11.EL.rootsmp/source

.

.

The installer builds RPMs optimized for the system. When the build finishes, it continues with the configuration steps:

Configuring IPoIB:

The default IPoIB interface configuration is based on a LAN interface configuration.

You may change this default configuration in the following steps.

Enter LAN interface to be used for setting ib0 interface [eth0]:ib0

Configuring IPoIB:

The default IPoIB interface configuration is based on a LAN interface configuration.

You may change this default configuration in the following steps.

Enter LAN interface to be used for setting ib0 interface [eth0]:ib1

ib0 configuration:

  Current IPOIB configuration for ib0

DEVICE=ib0

BOOTPROTO=static

IPADDR=193.168.123.111

NETMASK=255.255.255.0

NETWORK=193.168.123.0

BROADCAST=193.168.123.255

ONBOOT=yes

Do you want to change this configuration? [y/N]: n

IPOIB interface configured successfully

Configuring OpenSM:

Enter OpenSM Server IP Address [192.168.123.111]:

Configuring IBADM:

Please provide IBADM (Infiniband Administration Package) configuration:

Enter IBADM Name Server and In-Band Server IP Address (one per IB subnet) [192.168.123.111]:

Enter In-Band Server Hostname [H-1]:       ->   node01

Enter firmware work directory [/tmp]:

Creating FW directory to be used by IBADM server

Running tar xzvf /usr/local/src/infini/IBGD-1.8.0/SOURCES/ibgd1.8.0.fwrel.tgz

/usr/local/ibgd/FW directory updated with new FW release

Do you want to install IBGranite Cluster Verification Suite (ibgfvs-1.0.0 – beta version) [y/N]?

         InfiniBand Gold Distribution (IBGD) Software Installation Menu

          1) View IBGD Installation Guide

          2) Install IBGD Software

          3) Show Installed Software

          4) Configure IPoIB Network Interface, IBADM Server, and OpenSM Server

          5) Uninstall IBGD Software

          6) Build IBGD Software RPMs

          Q) Exit

Select Option [1-6]:

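At this point the installation is complete.

Starting openibd loads the modules and brings up the network automatically, but this path is full of bugs.

It is better to configure the network manually.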
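As a rough illustration of what a manual configuration involves, the sketch below derives the NETWORK and BROADCAST fields from an address and netmask and renders text in the same shape as the ib0 configuration printed by the installer. The function name render_ifcfg is hypothetical, and where the result should be written (usually the distribution's network-scripts directory) is an assumption to check on your own system.

```python
import ipaddress


def render_ifcfg(device: str, ip: str, netmask: str) -> str:
    """Render ifcfg-style settings shaped like the installer's ib0 output."""
    # ip_interface accepts "address/netmask" and exposes the enclosing network.
    iface = ipaddress.ip_interface(f"{ip}/{netmask}")
    net = iface.network
    lines = [
        f"DEVICE={device}",
        "BOOTPROTO=static",
        f"IPADDR={iface.ip}",
        f"NETMASK={net.netmask}",
        f"NETWORK={net.network_address}",
        f"BROADCAST={net.broadcast_address}",
        "ONBOOT=yes",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Address and netmask taken from the installer transcript in this post.
    print(render_ifcfg("ib0", "193.168.123.111", "255.255.255.0"), end="")
```

Computing the network and broadcast addresses instead of typing them avoids the most common mistake when the same file is adapted for each node.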

With the installation complete, configure ibadm.

[root@noco01 ~]# vi /etc/ibadm.hosts

——————————————————————————————

192.168.123.111

192.168.123.112

.

.

(list every node that is part of the InfiniBand fabric ...)
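For larger clusters the host list can be generated rather than typed. The sketch below assumes, purely for illustration, that the nodes use consecutive IPv4 addresses starting from the first node; the function name ibadm_hosts is hypothetical and the one-address-per-line layout simply mirrors the listing shown here.

```python
import ipaddress


def ibadm_hosts(first_ip: str, node_count: int) -> str:
    """Return one consecutive IPv4 address per line, starting at first_ip."""
    start = ipaddress.IPv4Address(first_ip)
    # Adding an integer to an IPv4Address yields the next addresses in order.
    return "".join(f"{start + offset}\n" for offset in range(node_count))


if __name__ == "__main__":
    # Two nodes, matching the first two entries of the listing.
    print(ibadm_hosts("192.168.123.111", 2), end="")
```

If the addresses are not consecutive, the same idea works with an explicit list of addresses instead of a start address and a count.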

Then copy the following files from the first node to all other nodes.

[root@noco01 ~]# dua2 /etc/ibadm.hosts

[root@noco01 ~]# dua2 /etc/ibadm.conf

[root@noco01 ~]# dua2 /etc/ibfw

Then run the following daemon commands in order.

[root@noco01 ~]# dush2 /etc/rc.d/init.d/ibadmd stop

[root@noco01 ~]# dush2 /etc/rc.d/init.d/opensmd stop

[root@noco01 ~]# dush2 /etc/rc.d/init.d/openibd restart

3. Basic Network Performance Tests (perf_main, NetPIPE)

– perf_main test:

Run the following on noco01:

[root@noco01 ~]# perf_main --send -trc -mbw -s 128000 -n 1000

********************************************

*********  perf_main version 10.3  *********

*********  CPU is: 2993.00 Mcps    *********

*********  Architecture X86     *********

********************************************

Run the following on noco02:

[root@noco02 ~]# perf_main -a 192.168.123.111 (noco01 ip)

********************************************

*********  perf_main version 10.3  *********

*********  CPU is: 2993.00 Mcps    *********

*********  Architecture X86     *********

********************************************

The test results then appear on the noco01 console:

************* RC BW Unidirection Test started for port 1  *********************

BW: 935.6 MBytes/sec [size: 128000 bytes, iter: 1000, total 128000000]

************* RC BW Unidirection Test Finished for port 1 *********************

In other words, the link delivers about 935 MB/sec of network bandwidth. For comparison, Gigabit Ethernet delivers roughly 100 MB/sec in practice (its theoretical maximum is 125 MB/sec).
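The comparison can be made explicit with a few lines of arithmetic. This is a minimal sketch: the helper names are hypothetical, the 935.6 figure is the perf_main result reported here, and the Gigabit Ethernet baselines are the nominal 1000 Mbit/s line rate and the roughly 100 MB/sec practical figure.

```python
def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbit_per_s / 8.0


def speedup(measured: float, baseline: float) -> float:
    """Ratio of a measured bandwidth to a baseline in the same unit."""
    return measured / baseline


if __name__ == "__main__":
    perf_main_bw = 935.6                    # MBytes/sec reported by perf_main
    gige_peak = mbit_to_mbyte(1000.0)       # nominal Gigabit Ethernet line rate
    gige_practical = 100.0                  # rough practical figure
    print(f"GigE peak: {gige_peak:.1f} MB/sec")
    print(f"vs peak: {speedup(perf_main_bw, gige_peak):.2f}x")
    print(f"vs practical: {speedup(perf_main_bw, gige_practical):.2f}x")
```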

– Measuring NPmpi, NPtcp, and NPib performance with NetPIPE

First, compile NetPIPE.

[root@noco01 infini]# tar xzvf NetPIPE_3.6.2.tar.tar

[root@noco01 infini]# cd NetPIPE_3.6.2

Edit the makefile to match the current mpich environment.

[root@noco01 NetPIPE_3.6.2]# vi makefile

MPICC       = /usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpicc

MTHOME  = /usr/local/ibgd/driver/infinihost

Then run make as follows.

[root@noco01 NetPIPE_3.6.2]# make mpi

[root@noco01 NetPIPE_3.6.2]# make tcp

[root@noco01 NetPIPE_3.6.2]# make ib

Synchronize the compiled executables to all nodes.

[root@noco01 NetPIPE_3.6.2]# dua2 *

Now use NPmpi to measure the maximum network bandwidth available to MPI communication.

*** NPmpi (MPI communication bandwidth, using the MPICH built with the InfiniBand driver)

[root@noco01 NetPIPE_3.6.2]# /usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpirun_rsh -rsh -np 2 node01 node02 ./NPmpi

———————————————————————————————————————–

0: noco01

1: noco02

Now starting the main loop

  0:       1 bytes  20491 times -->      1.80 Mbps in       4.24 usec

  1:       2 bytes  23590 times -->      3.61 Mbps in       4.23 usec

  2:       3 bytes  23655 times -->      5.41 Mbps in       4.23 usec

.

.

108: 1572867 bytes     40 times -->   7314.13 Mbps in    1640.66 usec

109: 2097149 bytes     20 times -->   7335.83 Mbps in    2181.07 usec

110: 2097152 bytes     22 times -->   7334.86 Mbps in    2181.36 usec

111: 2097155 bytes     22 times -->   7331.74 Mbps in    2182.30 usec

112: 3145725 bytes     22 times -->   7354.99 Mbps in    3263.09 usec

113: 3145728 bytes     20 times -->   7355.64 Mbps in    3262.80 usec

114: 3145731 bytes     20 times -->   7351.55 Mbps in    3264.62 usec

115: 4194301 bytes     10 times -->   7365.29 Mbps in    4344.70 usec

116: 4194304 bytes     11 times -->   7366.41 Mbps in    4344.04 usec

117: 4194307 bytes     11 times -->   7372.73 Mbps in    4340.32 usec

118: 6291453 bytes     11 times -->   7387.76 Mbps in    6497.23 usec

119: 6291456 bytes     10 times -->   7388.09 Mbps in    6496.95 usec

120: 6291459 bytes     10 times -->   7385.65 Mbps in    6499.09 usec

121: 8388605 bytes      5 times -->   7393.03 Mbps in    8656.80 usec

122: 8388608 bytes      5 times -->   7393.03 Mbps in    8656.80 usec

123: 8388611 bytes      5 times -->   7391.16 Mbps in    8659.00 usec

————————————————————————————————————————–

As shown above, the measured peak bandwidth is about 7,390 Mbps (over 900 MB/sec), roughly 7 times the Gigabit result.
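To read these result lines programmatically, a small parser is enough. The sketch below is an assumption-laden illustration rather than part of NetPIPE: the names Sample and parse_netpipe_line are hypothetical, and the pattern only targets the line shape shown in this post (it accepts any arrow token between "times" and the rate). It recomputes throughput from the byte count and elapsed time, which for the lines here shows that the Mbps column corresponds to bits divided by 2^20 per second.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches lines such as:
# "123: 8388611 bytes      5 times -->   7391.16 Mbps in    8659.00 usec"
_LINE = re.compile(
    r"\s*(\d+):\s+(\d+)\s+bytes\s+(\d+)\s+times\s+\S+\s+"
    r"([\d.]+)\s+Mbps\s+in\s+([\d.]+)\s+usec"
)


@dataclass
class Sample:
    size_bytes: int
    reported_mbps: float
    usec: float

    @property
    def mbytes_per_s(self) -> float:
        # Bytes per microsecond is numerically 10^6 bytes per second.
        return self.size_bytes / self.usec

    @property
    def recomputed_mbps(self) -> float:
        # Bits per microsecond, rescaled from 10^6 to 2^20 bits per "Mb".
        return self.size_bytes * 8 / self.usec / 1.048576


def parse_netpipe_line(line: str) -> Optional[Sample]:
    """Return a Sample for a result line, or None for any other line."""
    m = _LINE.match(line)
    if m is None:
        return None
    return Sample(int(m.group(2)), float(m.group(4)), float(m.group(5)))


if __name__ == "__main__":
    s = parse_netpipe_line(
        "123: 8388611 bytes      5 times -->   7391.16 Mbps in    8659.00 usec"
    )
    print(f"{s.mbytes_per_s:.1f} MB/sec, {s.recomputed_mbps:.2f} Mbps")
```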

*** NPtcp (plain TCP network bandwidth)

Run the following on node02:

[root@noco02 NetPIPE_3.6.2]# ./NPtcp

Run the following on node01:

[root@noco01 NetPIPE_3.6.2]# ./NPtcp -h node02

—————————————————————————–

.

.

109: 2097149 bytes      3 times -->   1055.77 Mbps in   15154.83 usec

110: 2097152 bytes      3 times -->   1060.92 Mbps in   15081.18 usec

111: 2097155 bytes      3 times -->   1056.31 Mbps in   15147.15 usec

112: 3145725 bytes      3 times -->   1059.30 Mbps in   22656.51 usec

113: 3145728 bytes      3 times -->   1061.99 Mbps in   22599.18 usec

114: 3145731 bytes      3 times -->   1057.49 Mbps in   22695.18 usec

115: 4194301 bytes      3 times -->   1058.45 Mbps in   30232.83 usec

116: 4194304 bytes      3 times -->   1056.31 Mbps in   30294.00 usec

117: 4194307 bytes      3 times -->   1061.43 Mbps in   30148.17 usec

118: 6291453 bytes      3 times -->   1063.24 Mbps in   45144.85 usec

119: 6291456 bytes      3 times -->   1061.07 Mbps in   45237.15 usec

120: 6291459 bytes      3 times -->   1060.43 Mbps in   45264.83 usec

121: 8388605 bytes      3 times -->   1060.78 Mbps in   60332.68 usec

122: 8388608 bytes      3 times -->   1061.87 Mbps in   60271.00 usec

123: 8388611 bytes      3 times -->   1062.30 Mbps in   60246.67 usec

——————————————————————————-

The result is at the level of ordinary Gigabit Ethernet.

*** NPib (native InfiniBand bandwidth)

[root@noco01 NetPIPE_3.6.2]# ./NPib -h node02

——————————————————————————-

.

.

108: 1572867 bytes     40 times -->   7299.72 Mbps in    1643.90 usec

109: 2097149 bytes     20 times -->   7327.83 Mbps in    2183.45 usec

110: 2097152 bytes     22 times -->   7326.93 Mbps in    2183.73 usec

111: 2097155 bytes     22 times -->   7329.99 Mbps in    2182.82 usec

112: 3145725 bytes     22 times -->   7354.01 Mbps in    3263.52 usec

113: 3145728 bytes     20 times -->   7354.85 Mbps in    3263.15 usec

114: 3145731 bytes     20 times -->   7353.85 Mbps in    3263.60 usec

115: 4194301 bytes     10 times -->   7367.33 Mbps in    4343.50 usec

116: 4194304 bytes     11 times -->   7368.11 Mbps in    4343.04 usec

117: 4194307 bytes     11 times -->   7368.11 Mbps in    4343.05 usec

118: 6291453 bytes     11 times -->   7382.49 Mbps in    6501.87 usec

119: 6291456 bytes     10 times -->   7382.23 Mbps in    6502.10 usec

120: 6291459 bytes     10 times -->   7382.06 Mbps in    6502.25 usec

121: 8388605 bytes      5 times -->   7388.34 Mbps in    8662.30 usec

122: 8388608 bytes      5 times -->   7388.76 Mbps in    8661.81 usec

123: 8388611 bytes      5 times -->   7388.18 Mbps in    8662.49 usec

——————————————————————————–

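At about 7,390 Mbps (over 900 MB/sec), the result is similar to the MPI measurement, roughly 7 times Gigabit.

4. HPL Linpack Performance Test

The automated installer above installs an MPICH built for InfiniBand with the Intel compiler. However, that build does not integrate correctly, so reinstall it manually.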

[root@noco01 src]# tar xzvf mvapich-0.9.5.tar.gz

[root@noco01 src]# cd mvapich-0.9.5

[root@noco01 mvapich-0.9.5]# ./configure --prefix=/usr/local/mvapich-intel -cc=/opt/intel/cc/9.0/bin/icc -c++=/opt/intel/cc/9.0/bin/icc -fc=/opt/intel/fc/9.0/bin/ifort -f90=/opt/intel/fc/9.0/bin/ifort -f90linker=/opt/intel/fc/9.0/bin/ifort --with-arch=LINUX --enable-f77 --enable-f90modules --with-device=vapi --disable-weak-symbols

[root@noco01 mvapich-0.9.5]# make && make install

To use the mvapich installed by the bundled installer instead, run it as follows.

# /usr/local/ibgd/mpi/osu/intel/mvapich-0.9.5/bin/mpirun_rsh -rsh -np 4 node01 node01 node02 node02 ./xhpl
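The host arguments in this command repeat each node once per process. The sketch below builds the same argument list from a node list and a per-node process count; the function name mpirun_rsh_argv is hypothetical, and only the options that already appear in the command above (-rsh and -np) are used.

```python
from typing import List


def mpirun_rsh_argv(
    mpirun: str, nodes: List[str], procs_per_node: int, program: str
) -> List[str]:
    """Build an mpirun_rsh argument list with each node repeated per process."""
    # node01 node01 node02 node02 ... : one entry per process, grouped by node.
    hosts = [node for node in nodes for _ in range(procs_per_node)]
    return [mpirun, "-rsh", "-np", str(len(hosts)), *hosts, program]


if __name__ == "__main__":
    argv = mpirun_rsh_argv(
        "/usr/local/ibgd/mpi/osu/intel/mvapich-0.9.5/bin/mpirun_rsh",
        ["node01", "node02"],
        2,
        "./xhpl",
    )
    print(" ".join(argv))
```

Deriving the process count from the host list keeps -np and the number of host entries from drifting apart when nodes are added.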

Then run the HPL test.

——————————————————————————————————————-

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   20000

NB     :     104

PMAP   : Row-major process mapping

P      :       1

Q      :       4

PFACT  :   Crout

NBMIN  :       4

NDIV   :       2

RFACT  :   Right

BCAST  :   1ring

DEPTH  :       1

SWAP   : Mix (threshold = 64)

L1     : transposed form

U      : transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

—————————————————————————-

– The matrix A is randomly generated for each test.

– The following scaled residual checks will be computed:

   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )

   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )

   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )

– The relative machine precision (eps) is taken to be          5.421011e-20

– Computational tests pass if scaled residuals are less than           16.0

============================================================================

T/V                N    NB     P     Q               Time             Gflops

—————————————————————————-

WR10R2C4       20000   104     1     4             736.74          7.240e+00

—————————————————————————-

||Ax-b||_oo / ( eps * ||A||_1  * N        ) =       44.6046954 …… FAILED

||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =       43.1710652 …… FAILED

||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        8.2276427 …… PASSED

||Ax-b||_oo  . . . . . . . . . . . . . . . . . =           0.000000

||A||_oo . . . . . . . . . . . . . . . . . . . =        5076.247098

||A||_1  . . . . . . . . . . . . . . . . . . . =        5074.832190

||x||_oo . . . . . . . . . . . . . . . . . . . =           5.419810

||x||_1  . . . . . . . . . . . . . . . . . . . =       20664.162508

============================================================================

Finished      1 tests with the following results:

              0 tests completed and passed residual checks,

              1 tests completed and failed residual checks,

              0 tests skipped because of illegal input values.

—————————————————————————-

End of Tests.

============================================================================

Average CPU usage: 99%

For reference, the following is the result of the same test in a 100Mbit/1000Mbit network environment.

**** 100M environment

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   20000

NB     :     104

PMAP   : Row-major process mapping

P      :       1

Q      :       4

PFACT  :   Crout

NBMIN  :       4

NDIV   :       2

RFACT  :   Right

BCAST  :   1ring

DEPTH  :       1

SWAP   : Mix (threshold = 64)

L1     : transposed form

U      : transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

—————————————————————————-

– The matrix A is randomly generated for each test.

– The following scaled residual checks will be computed:

   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )

   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )

   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )

– The relative machine precision (eps) is taken to be          5.421011e-20

– Computational tests pass if scaled residuals are less than           16.0

============================================================================

T/V                N    NB     P     Q               Time             Gflops

—————————————————————————-

WR10R2C4       20000   104     1     4             952.82          5.598e+00

—————————————————————————-

||Ax-b||_oo / ( eps * ||A||_1  * N        ) =       35.8555115 …… FAILED

||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =       34.7030870 …… FAILED

||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        6.6137956 …… PASSED

||Ax-b||_oo  . . . . . . . . . . . . . . . . . =           0.000000

||A||_oo . . . . . . . . . . . . . . . . . . . =        5076.247098

||A||_1  . . . . . . . . . . . . . . . . . . . =        5074.832190

||x||_oo . . . . . . . . . . . . . . . . . . . =           5.419810

||x||_1  . . . . . . . . . . . . . . . . . . . =       20664.162508

============================================================================

Finished      1 tests with the following results:

              0 tests completed and passed residual checks,

              1 tests completed and failed residual checks,

              0 tests skipped because of illegal input values.

—————————————————————————-

End of Tests.

============================================================================

Average CPU usage: 80%
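The two runs can be compared directly from N and the elapsed time. The sketch below uses the operation count commonly associated with HPL, (2/3)N^3 + (3/2)N^2; treat that as an assumption, although it does reproduce both Gflops values reported in the outputs here. The function name hpl_gflops is hypothetical.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Gflops for solving an N x N system, using (2/3)N^3 + (3/2)N^2 flops."""
    flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flops / seconds / 1e9


if __name__ == "__main__":
    n = 20000
    infiniband = hpl_gflops(n, 736.74)   # elapsed time from the InfiniBand run
    ethernet = hpl_gflops(n, 952.82)     # elapsed time from the 100M run
    print(f"InfiniBand: {infiniband:.3f} Gflops")
    print(f"100M:       {ethernet:.3f} Gflops")
    print(f"ratio:      {infiniband / ethernet:.2f}x")
```

With the same N, NB, and process grid in both runs, the ratio is simply the inverse ratio of the elapsed times, so the InfiniBand run is roughly 1.29 times faster.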

서진우

Clunix (a supercomputing company) / Managing Director (CTO) / Information Systems Auditor / operator of the Syszone blog
