#### PBS Pro 14.x Scheduler Installation and Configuration

Date: November 1, 2016
Author: Jinwoo Seo (alang@clunix.com)

PBS Pro, one of the major job schedulers alongside LSF and SGE, was
open-sourced in June 2016. Besides the commercial PBS Pro, the PBS family
has long had open-source variants: OpenPBS and Torque/Maui.

Now that PBS Pro, the official member of the PBS family, is open source,
users of the purely open-source OpenPBS and Torque will likely consider
switching to it.

Comparing the two environments in practice, some queueing features that were
unstable under Torque/Maui showed no such problems under PBS Pro.

Installation and configuration of PBS Pro are much like Torque and OpenPBS,
and the command set is largely the same, though not identical.

This document walks through installing and configuring PBS Pro.

1. Rebuilding the RPMs (building the packages)

Information and packages for the open-source PBS Pro are available at:

http://www.pbspro.org/

Packages are built for CentOS 7 by default, but they do not always install
cleanly even there. Since an SRPM is provided, rebuilding the RPMs for your
own CentOS 7 environment is recommended.

CentOS 6 is also supported. On CentOS 6, however, automake and autoconf must
be upgraded to versions close to those shipped with CentOS 7 before the RPM
rebuild.

- Pre-installing dependencies on RHEL 6.x

# cd /root/TRG_PKG_2016/scheduler/pbspro
# rpm -Uvh hwloc-* libedit-devel-* pciutils-devel-*
# rpm -Uvh postgresql-* unixODBC-* uuid-*
# rpm -Uvh tk-* expat-devel-*
# rpm -Uvh perl-Switch-2.16-1.el6.rf.noarch.rpm --force
# rpm -Uvh sendmail-* procmail-*
// sendmail is not strictly required if postfix is already present.

// On RHEL 6.x, autoconf and automake must be upgraded before rebuilding
// pbspro 14.

# rpm -Uvh autoconf-2.69-12.2.noarch.rpm automake-1.13.4-3.2.noarch.rpm

- Pre-installing dependencies on RHEL 7.x

# yum install hwloc-devel libedit-devel

# rpm -Uvh perl-Switch-2.16-7.el7.noarch.rpm
# rpm -Uvh postgresql-*
# rpm -Uvh sendmail-8.14.7-4.el7.x86_64.rpm
// sendmail is not strictly required if postfix is already present.

# rpm -ivh pbspro-server-14.1.0-13.1.clx.x86_64.rpm

- RPM rebuild

# rpm -ivh pbspro-14.1.0-13.1.src.rpm

# cd /root/rpmbuild/SPECS
# vi pbspro.spec
-------------------------------------------------------
.
%define pbs_prefix /engrid/enpbs
.
Release: 13.1.clx
-------------------------------------------------------
// The stock pbspro.spec can be used as-is; here the prefix is changed to
// match the GridCenter package installation policy.

# rpmbuild -ba pbspro.spec
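
If no spec changes are needed, the binary RPMs can also be rebuilt in a
single step straight from the SRPM (a minimal sketch using the package name
distributed above):

# rpmbuild --rebuild pbspro-14.1.0-13.1.src.rpm
# ls /root/rpmbuild/RPMS/x86_64/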

Once rpmbuild completes, the following packages are produced:

pbspro-server-14.1.0-13.1.x86_64.rpm :
installed on the management server only; HA pairing is possible (includes
the execution and client functionality)

pbspro-execution-14.1.0-13.1.x86_64.rpm :
installed on the compute servers (includes the client functionality)

pbspro-client-14.1.0-13.1.x86_64.rpm :
installed on job-submit hosts, typically a login server (qsub only)

pbspro-debuginfo-14.1.0-13.1.x86_64.rpm :
debugging package

For a typical HPC deployment (see the install sketch after this list):

Management server : pbspro-server
Compute servers   : pbspro-execution
Login server      : pbspro-client (if present)
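
In command form, a minimal sketch (the .clx release tag follows the rebuild
example above):

# on the management server
rpm -ivh pbspro-server-14.1.0-13.1.clx.x86_64.rpm

# on each compute server
rpm -ivh pbspro-execution-14.1.0-13.1.clx.x86_64.rpm

# on the login server, if present
rpm -ivh pbspro-client-14.1.0-13.1.clx.x86_64.rpm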

2. Installing PBS Pro

- Management server installation and configuration

# vi /etc/profile.d/enpbs.sh
-------------------------------------------------------
#!/bin/sh
export PBS_HOME=/var/spool/pbs
export ENPBS_HOME=/engrid/enhpc
export PATH=/engrid/enpbs/bin:/engrid/enpbs/sbin:$PATH
-------------------------------------------------------

# vi /var/spool/pbs/server_priv/nodes
-------------------------------------------------------
PBSN000 np=4
PBSN001 np=4
PBSN002 np=4
PBSN003 np=4
-------------------------------------------------------

# vi /etc/pbs.conf
-------------------------------------------------------
PBS_SERVER=PBSN000
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1 # set to 0 if the management server should not also run jobs
PBS_EXEC=/engrid/enpbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
-------------------------------------------------------

# /etc/init.d/pbs start

# source /etc/profile.d/enpbs.sh
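
At this point the server should be up; a quick sanity check (qstat -B is
covered in more detail in section 4):

# qstat -B    # the server should report Status = Active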

/// Create the gridcenter queue
qmgr -c "create queue gridcenter queue_type=execution"
qmgr -c "set queue gridcenter started=true"
qmgr -c "set queue gridcenter enabled=true"
qmgr -c "set server scheduling=true"
qmgr -c "set server default_queue=gridcenter"

/// Set the administrator mail address and allow root to submit jobs
qmgr -c "set server mail_from = admin@clunix.com"
qmgr -c "set server acl_roots+=root"

qmgr -c "set queue gridcenter resources_default.ncpus = 1"
qmgr -c "set queue gridcenter resources_default.nodect = 1"
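
The settings made so far can be verified with qmgr's print and list
subcommands:

qmgr -c "print server"             # dump all server settings as qmgr commands
qmgr -c "list queue gridcenter"    # show the attributes of the gridcenter queue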

//// Fine-grained resource limits on the queue
// When limiting resources, setting only the max values is usually most practical.
qmgr -c "set queue gridcenter max_running = 8"          // job count; total slots of the queue
qmgr -c "set queue gridcenter resources_max.ncpus = 4"  // max threads for OpenMP/SMP jobs
qmgr -c "set queue gridcenter resources_max.nodes = 2"  // max nodes for MPI jobs
qmgr -c "set server resources_max.walltime=1:20:00"

// To run only one job per node on SMP nodes
qmgr -c "set server node_pack = true"    // value: true or false

/// By default only the PBS management server is registered as an execution
/// host. Additional compute servers are added manually as below.
/// (Run this step only after PBS has been installed and configured on the
/// compute servers.)

qmgr -c "create node PBSN001"
qmgr -c "create node PBSN002"
qmgr -c "create node PBSN003"

/// Assign the added servers to their queues.

qmgr -c "set node PBSN000 queue=gridcenter"
qmgr -c "set node PBSN001 queue=gridcenter"
qmgr -c "set node PBSN002 queue=workq"
qmgr -c "set node PBSN003 queue=workq"

/// Set the maximum slot/core/task count per server (normally equal to the
/// physical core count).

qmgr -c "set node PBSN002 resources_available.ncpus=8"
qmgr -c "set node PBSN003 resources_available.ncpus=8"

/// Per-server sharing policy (exclusive vs. shared) - apply if needed
qmgr -c "set node PBSN003 resources_available.ngpus=1, sharing=default_excl"
qmgr -c "set node PBSN003 resources_available.ngpus=1, sharing=default_shared"
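
Node attributes set this way can be double-checked with pbsnodes (full
output is shown in section 4):

# pbsnodes PBSN003 | grep -E "sharing|ngpus|ncpus"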

- Compute server installation and configuration

# rpm -Uvh perl-Switch-2.16-7.el7.noarch.rpm
# rpm -Uvh pbspro-execution-14.1.0-13.1.clx.x86_64.rpm

# vi /etc/profile.d/enpbs.sh
-------------------------------------------------------------------------
#!/bin/sh
export PBS_HOME=/var/spool/pbs
export ENPBS_HOME=/engrid/enhpc
export PATH=/engrid/enpbs/bin:/engrid/enpbs/sbin:$PATH
-------------------------------------------------------------------------

# vi /etc/pbs.conf
-------------------------------------------------------------------------
PBS_SERVER=PBSN000
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_EXEC=/engrid/enpbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
-------------------------------------------------------------------------

# vi /var/spool/pbs/mom_priv/config
-------------------------------------------------------------------------
$clienthost PBSN000
$logevent 0x1ff
-------------------------------------------------------------------------
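
If the cluster shares /home over NFS, an optional $usecp directive in the
same file lets MoM stage job output with a local copy instead of scp (a
sketch under that assumption; adjust the path to your shared filesystem):

# map any host's /home to the local /home, avoiding scp for output staging
$usecp *:/home/ /home/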

# /etc/init.d/pbs start

3. Job submission tests

- Minimal job submission

# echo “sleep 50; hostname” | qsub
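
For a job piped into qsub like this, PBS names it STDIN by default, so the
result can be checked as follows (the job ID suffix will differ):

# qstat            # the job should move from state Q to R
# cat STDIN.o*     # job stdout; prints the execution host name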

- Submitting an MPI job

In a PBS Pro job script, do not pass the -np or -machinefile options to
mpirun; PBS supplies this information automatically.

The equivalent of SGE's PE (parallel environment) is specified as follows.

// MPI examples

4 nodes, 1 MPI process/node, 4 MPI processes total
#PBS -l nodes=4:ppn=1

4 nodes, 2 MPI processes/node, 8 MPI processes total
#PBS -l nodes=4:ppn=2

4 nodes, 4 MPI processes/node, 16 MPI processes total
#PBS -l nodes=4:ppn=4

// Sample job submission script

#PBS -N CPI_LOG
#PBS -l nodes=2:ppn=4
#PBS -S /bin/sh
#PBS -q gridcenter
#PBS -m abe
#PBS -M alang@clunix.com
#PBS -j oe
#PBS -o CPI_LOG.out
#PBS -V
#PBS -v MPI_HOME=/APP/enhpc/mpi/mpich2-gcc-hd

cd $PBS_O_WORKDIR
sleep 10
mpirun ./cpilog

# cat CPI_LOG.out
Process 0 running on PBSN000
Process 1 running on PBSN000
Process 2 running on PBSN000
Process 3 running on PBSN000
Process 4 running on PBSN001
Process 5 running on PBSN001
Process 6 running on PBSN001
Process 7 running on PBSN001

- Submitting a mixed OpenMP + MPI job

// Sample job submission script

#PBS -N CPI_LOG
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -S /bin/sh
#PBS -q workq
#PBS -m abe
#PBS -M alang@clunix.com
#PBS -j oe
#PBS -o CPI_LOG.out
#PBS -V
#PBS -v MPI_HOME=/APP/enhpc/mpi/mpich2-gcc-hd

cd $PBS_O_WORKDIR
echo $PBS_NODEFILE
${MPI_HOME}/bin/mpirun ./cpilog

// OpenMP (SMP) example

1 process, 16 OpenMP threads
#PBS -l select=1:ncpus=16

// OpenMP + MPI examples

1 MPI process/node, 16 OpenMP threads (total : 32)
#PBS -l select=2:ncpus=16

2 MPI processes/node, 8 OpenMP threads (total : 16)
#PBS -l select=2:ncpus=16:mpiprocs=2:ompthreads=8

16 MPI processes/node, no OpenMP
#PBS -l select=2:ncpus=16:mpiprocs=16

32 MPI processes/node (hyperthreading), no OpenMP
#PBS -l select=2:ncpus=32:mpiprocs=32

16 MPI processes/node, 16 OpenMP threads
#PBS -l select=2:ncpus=16:mpiprocs=16
export OMP_NUM_THREADS=16
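
Note that when ompthreads appears in the select statement, PBS Pro normally
exports OMP_NUM_THREADS into the job environment by itself, so the manual
export above is only needed when ompthreads is omitted, as in the last
example. A quick way to confirm this behavior (a sketch):

#PBS -l select=2:ncpus=16:mpiprocs=2:ompthreads=8
echo "OMP_NUM_THREADS : $OMP_NUM_THREADS"    # expected: 8, set from ompthreads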

- PBS Pro script environment variables

A test script for the main variables:
# vi pbs_env_value.sh
--------------------------------------------------------
#PBS -N PBS_ENVVAR
#PBS -S /bin/sh
#PBS -j oe
#PBS -o PBS_ENVVAR.out
#PBS -V

echo "HOME : $HOME"
echo "LOGNAME : $LOGNAME"
echo "PBS_JOBNAME : $PBS_JOBNAME"
echo "PBS_JOBID : $PBS_JOBID"
echo "PBS_QUEUE : $PBS_QUEUE"
echo "SHELL : $SHELL"
echo "USER : $USER"
echo "PBS_JOBCOOKIE : $PBS_JOBCOOKIE"
echo "PBS_NODENUM : $PBS_NODENUM"
echo "PBS_TASKNUM : $PBS_TASKNUM"
echo "PBS_MOMPORT : $PBS_MOMPORT"
echo "PBS_NODEFILE : $PBS_NODEFILE"
echo "PBS_NNODES : $PBS_NNODES"
echo "TMPDIR : $TMPDIR"
echo "PBS_VERSION : $PBS_VERSION"
echo "PBS_NUM_NODES : $PBS_NUM_NODES"
echo "PBS_NUM_PPN : $PBS_NUM_PPN"
echo "PBS_GPUFILE : $PBS_GPUFILE"
echo "PBS_NP : $PBS_NP"
echo "PBS_WALLTIME : $PBS_WALLTIME"
echo "PBS_O_HOME : $PBS_O_HOME"
echo "PBS_O_LANG : $PBS_O_LANG"
echo "PBS_O_LOGNAME : $PBS_O_LOGNAME"
echo "PBS_O_PATH : $PBS_O_PATH"
echo "PBS_O_MAIL : $PBS_O_MAIL"
echo "PBS_O_SHELL : $PBS_O_SHELL"
echo "PBS_O_TZ : $PBS_O_TZ"
echo "PBS_O_HOST : $PBS_O_HOST"
echo "PBS_O_QUEUE : $PBS_O_QUEUE"
echo "PBS_O_WORKDIR : $PBS_O_WORKDIR"
-----------------------------------------------------------

# qsub pbs_env_value.sh

Submit the job and check the result:

# cat PBS_ENVVAR.out
-----------------------------------------------------------
HOME : /root
LOGNAME : root
PBS_JOBNAME : PBS_ENVVAR
PBS_JOBID : 688.PBSN000
PBS_QUEUE : gridcenter
SHELL : /bin/sh
USER : root
PBS_JOBCOOKIE : 0000000023DCBB0E000000006C3089EE
PBS_NODENUM : 0
PBS_TASKNUM : 1
PBS_MOMPORT : 15003
PBS_NODEFILE : /var/spool/pbs/aux/688.PBSN000
PBS_NNODES :
TMPDIR : /var/tmp/pbs.688.PBSN000
PBS_VERSION :
PBS_NUM_NODES :
PBS_NUM_PPN :
PBS_GPUFILE :
PBS_NP :
PBS_WALLTIME :
PBS_O_HOME : /root
PBS_O_LANG : ko_KR.utf8
PBS_O_LOGNAME : root
PBS_O_PATH : /engrid/enpbs/bin:/engrid/enpbs/sbin:/usr/lib64/qt-3.3/bin:/root/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/engrid/enmcs/bin
PBS_O_MAIL : /var/spool/mail/root
PBS_O_SHELL : /bin/bash
PBS_O_TZ :
PBS_O_HOST : pbsn003
PBS_O_QUEUE : gridcenter
PBS_O_WORKDIR : /root
-----------------------------------------------------------

4. Job and queue monitoring

# qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
648.PBSN000 CPI_LOG root 00:00:00 R gridcenter
649.PBSN000 CPI_LOG root 00:00:00 R gridcenter
650.PBSN000 CPI_LOG root 0 Q gridcenter
651.PBSN000 CPI_LOG root 0 Q gridcenter
652.PBSN000 CPI_LOG root 0 Q gridcenter
653.PBSN000 CPI_LOG root 00:00:00 R workq
654.PBSN000 CPI_LOG root 00:00:00 R workq
655.PBSN000 CPI_LOG root 00:00:00 R workq
656.PBSN000 CPI_LOG root 00:00:00 R workq
657.PBSN000 CPI_LOG root 0 Q workq
658.PBSN000 CPI_LOG root 0 Q workq
659.PBSN000 CPI_LOG root 0 Q workq

# qstat -a

PBSN000:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
648.PBSN000     root     gridcent CPI_LOG     18140   2   4    --    --  R 00:00
649.PBSN000     root     gridcent CPI_LOG     18226   2   4    --    --  R 00:00
650.PBSN000     root     gridcent CPI_LOG        --   2   4    --    --  Q    --
651.PBSN000     root     gridcent CPI_LOG        --   2   4    --    --  Q    --
652.PBSN000     root     gridcent CPI_LOG        --   2   4    --    --  Q    --
653.PBSN000     root     workq    CPI_LOG      6592   2   4    --    --  R 00:00
654.PBSN000     root     workq    CPI_LOG      6678   2   4    --    --  R 00:00
655.PBSN000     root     workq    CPI_LOG      6762   2   4    --    --  R 00:00
656.PBSN000     root     workq    CPI_LOG      6848   2   4    --    --  R 00:00
657.PBSN000     root     workq    CPI_LOG        --   2   4    --    --  Q    --
658.PBSN000     root     workq    CPI_LOG        --   2   4    --    --  Q    --
659.PBSN000     root     workq    CPI_LOG        --   2   4    --    --  Q    --

# qstat -f 652

Job Id: 652.PBSN000
Job_Name = CPI_LOG
Job_Owner = root@pbsn003
.
job_state = R
queue = gridcenter
.
exec_host = pbsn000/0*2+PBSN001/0*2
exec_vnode = (pbsn000:ncpus=2)+(PBSN001:ncpus=2)
.
Resource_List.mpiprocs = 4
Resource_List.ncpus = 4
Resource_List.nodect = 2
Resource_List.nodes = 2:ppn=2

# qstat -s

PBSN000:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
862.PBSN000     root     gridcent CPI_LOG111  23486   2   4    --    --  R 00:08
   Job run at Sat Nov 19 at 19:14 on (pbsn000:ncpus=2)+(PBSN001:ncpus=2)
863.PBSN000     root     gridcent CPI_LOG111  23593   2   4    --    --  R 00:08
   Job run at Sat Nov 19 at 19:14 on (pbsn000:ncpus=2)+(PBSN001:ncpus=2)
868.PBSN000     root     workq    CPI_LOG222   8115   2   4    --    --  R 00:01
   Job run at Sat Nov 19 at 19:22 on (PBSN002:ncpus=2)+(PBSN003:ncpus=2)
869.PBSN000     root     workq    CPI_LOG222   8199   2   4    --    --  R 00:01
   Job run at Sat Nov 19 at 19:22 on (PBSN002:ncpus=2)+(PBSN003:ncpus=2)

# qstat -n

PBSN000:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
868.PBSN000     root     workq    CPI_LOG222   8115   2   4    --    --  R 00:05
   PBSN002/0*2+PBSN003/0*2
869.PBSN000     root     workq    CPI_LOG222   8199   2   4    --    --  R 00:05
   PBSN002/1*2+PBSN003/1*2
870.PBSN000     root     workq    CPI_LOG222   8283   2   4    --    --  R 00:05
   PBSN002/2*2+PBSN003/2*2
871.PBSN000     root     workq    CPI_LOG222   8367   2   4    --    --  R 00:05
   PBSN002/3*2+PBSN003/3*2
872.PBSN000     root     workq    CPI_LOG222     --   2   4    --    --  Q    --

# qstat -Q
Queue Max Tot Ena Str Que Run Hld Wat Trn Ext Type
---------------- ----- ----- --- --- ----- ----- ----- ----- ----- ----- ----
workq 0 8 yes yes 4 4 0 0 0 0 Exec
gridcenter 0 5 yes yes 3 2 0 0 0 0 Exec

# qstat -B
Server Max Tot Que Run Hld Wat Trn Ext Status
---------------- ----- ----- ----- ----- ----- ----- ----- ----- -----------
PBSN000 0 13 7 6 0 0 0 0 Active

# qstat -q

server: PBSN000

Queue            Memory CPU Time Walltime Node   Run   Que   Lm  State
---------------- ------ -------- -------- ----  ----- ----- ----  -----
workq              --     --       --      --       4     4   --   E R
gridcenter         --     --       --      --       2     3   --   E R
                                                 ----- -----
                                                     6     7

# qstat -Bf
Server: PBSN000
server_state = Active
server_host = pbsn000
scheduling = True
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun
:0
acl_hosts = *
acl_roots = root
default_queue = gridcenter
log_events = 511
mail_from = alang@clunix.com
query_other_jobs = True
resources_default.ncpus = 1
default_chunk.ncpus = 1
resources_assigned.mpiprocs = 0
resources_assigned.ncpus = 0
resources_assigned.nodect = 0
scheduler_iteration = 600
FLicenses = 2000000
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 31536000
license_count = Avail_Global:1000000 Avail_Local:1000000 Used:0 High_Use:0
Avail_Sockets:1000000 Unused_Sockets:1000000
pbs_version = 14.1.0
eligible_time_enable = False
max_concurrent_provision = 5

# qstat -Qf
Queue: workq
queue_type = Execution
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun
:0
resources_assigned.mpiprocs = 0
resources_assigned.ncpus = 0
resources_assigned.nodect = 0
hasnodes = True
enabled = True
started = True

Queue: gridcenter
queue_type = Execution
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun
:0
resources_default.ncpus = 1
resources_default.nodect = 1
resources_assigned.mpiprocs = 0
resources_assigned.ncpus = 0
resources_assigned.nodect = 0
hasnodes = True
enabled = True
started = True


# pbsnodes -a

pbsn000
Mom = pbsn000
state = job-busy
pcpus = 4
jobs = 662.PBSN000/0, 662.PBSN000/1, 663.PBSN000/2, 663.PBSN000/3
resources_available.host = pbsn000
resources_available.ncpus = 4
resources_assigned.ncpus = 4
queue = gridcenter

PBSN001
Mom = pbsn001
pcpus = 4
state = job-busy
jobs = 662.PBSN000/0, 662.PBSN000/1, 663.PBSN000/2, 663.PBSN000/3
resources_available.host = pbsn001
resources_available.ncpus = 4
resources_assigned.ncpus = 4
queue = gridcenter

PBSN002
Mom = pbsn002
state = job-busy
pcpus = 4
jobs = 669.PBSN000/0, 669.PBSN000/1, 670.PBSN000/2, 670.PBSN000/3
resources_available.host = pbsn002
resources_available.ncpus = 8
resources_assigned.ncpus = 4
queue = workq

PBSN003
Mom = pbsn003
state = job-busy
pcpus = 4
jobs = 669.PBSN000/0, 669.PBSN000/1, 670.PBSN000/2, 670.PBSN000/3, 671.PBSN000/4, 671.PBSN000/5, 672.PBSN000/6, 672.PBSN000/7
resources_available.ncpus = 8
resources_assigned.ncpus = 8
queue = workq

# pbsnodes -s PBSN000 -o PBSN001   // mark PBSN001 offline
# pbsnodes -s PBSN000 -r PBSN001   // clear the offline state (back to free)
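
When taking a node offline it helps to record the reason; pbsnodes can
attach a comment to the node (the comment text here is illustrative):

# pbsnodes -s PBSN000 -C "disk replacement" -o PBSN001
# pbsnodes PBSN001 | grep comment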

# printjob 673
---------------------------------------------------
jobid: 673.PBSN000
---------------------------------------------------
--attributes--
Job_Name = CPI_LOG
Job_Owner = root@pbsn003
queue = gridcenter
Priority = 0
Resource_List.mpiprocs = 4
Resource_List.ncpus = 4
Resource_List.nodect = 2
Resource_List.nodes = 2:ppn=2
Resource_List.select = 2:ncpus=2:mpiprocs=2
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.ncpus = 4
exec_host = pbsn000/1*2+PBSN001/1*2
exec_host2 = pbsn000:15002/1*2+pbsn001:15002/1*2
exec_vnode = (pbsn000:ncpus=2)+(PBSN001:ncpus=2)
schedselect = 2:ncpus=2:mpiprocs=2
comment = Job run at Fri Nov 18 at 14:38 on (pbsn000:ncpus=2)+(PBSN001:ncpus=2)

# printjob -s 685
---------------------------------------------------
Jobscript for jobid: 685.PBSN000
---------------------------------------------------
#PBS -N CPI_LOG
#PBS -l nodes=2:ppn=2
#PBS -S /bin/sh
#PBS -m abe
#PBS -M alang@clunix.com
#PBS -q workq
#PBS -j oe
#PBS -o CPI_LOG.out
#PBS -V
#PBS -v MPI_HOME=/APP/enhpc/mpi/mpich2-gcc-hd

cd $PBS_O_WORKDIR
sleep 120
${MPI_HOME}/bin/mpirun ./cpilog

# tracejob 708

Job: 708.PBSN000

11/18/2016 18:42:30 L Considering job to run
11/18/2016 18:42:30 S Job Queued at request of root@pbsn003, owner = root@pbsn003, job
name = CPI_LOG1, queue = workq
11/18/2016 18:42:30 S Job Run at request of Scheduler@pbsn000 on exec_vnode
(PBSN002:ncpus=1)
11/18/2016 18:42:30 S Job Modified at request of Scheduler@pbsn000
11/18/2016 18:42:30 L Job run
11/18/2016 18:42:30 S enqueuing into workq, state 1 hop 1
11/18/2016 18:42:30 A queue=workq
11/18/2016 18:42:30 A user=root group=root project=_pbs_project_default jobname=CPI_LOG1
queue=workq ctime=1479462150 qtime=1479462150 etime=1479462150
start=1479462150 exec_host=PBSN002/0 exec_vnode=(PBSN002:ncpus=1)
Resource_List.explicit=2000 Resource_List.ncpus=1
Resource_List.nodect=1 Resource_List.place=pack
Resource_List.select=1:ncpus=1 resource_assigned.ncpus=1

- qstat.ge // developed by Clunix

The main PBS monitoring commands are qstat and pbsnodes: qstat focuses on
jobs, while pbsnodes focuses on server information. SGE's qstat -f shows
server and job status together in a single view, something PBS somewhat
lacks.

qstat.ge is a command that reproduces SGE-style qstat -f output in a PBS
environment.

# qstat.ge -h

ex) qstat.ge [-h] [-a] [-f] [-q]
- qstat.ge    : queue status only
- qstat.ge -h : help
- qstat.ge -a : queue and all-job status
- qstat.ge -f : queue and all-job status
- qstat.ge -q : queue and waiting-job status

# qstat.ge

Hostname   Queue        Use/Tot   Cpu,Mem,Swp[%]   State
------------------------------------------------------------------------
pbsn000    gridcenter   4/4       1%,7%,0%         job-busy
------------------------------------------------------------------------
pbsn001    gridcenter   4/4       0%,3%,0%         job-busy
------------------------------------------------------------------------
pbsn002    workq        8/8       0%,6%,0%         job-busy
------------------------------------------------------------------------
pbsn003    workq        8/8       0%,3%,0%         job-busy
------------------------------------------------------------------------

# qstat.ge -q

Hostname   Queue        Use/Tot   Cpu,Mem,Swp[%]   State
------------------------------------------------------------------------
pbsn000    gridcenter   4/4       0%,7%,0%         job-busy
------------------------------------------------------------------------
pbsn001    gridcenter   4/4       0%,3%,0%         job-busy
------------------------------------------------------------------------
pbsn002    workq        8/8       1%,6%,0%         job-busy
------------------------------------------------------------------------
pbsn003    workq        8/8       0%,3%,0%         job-busy
------------------------------------------------------------------------

#########################################################################
Waiting Jobs : [jobid, jobname, ncpus, user, jobstat, queue]
#########################################################################
933.PBSN000 CPI_LOG111 4 root Q gridcent
934.PBSN000 CPI_LOG111 4 root Q gridcent
935.PBSN000 CPI_LOG111 4 root Q gridcent
936.PBSN000 CPI_LOG111 4 root Q gridcent
939.PBSN000 CPI_LOG222 8 root Q workq
940.PBSN000 CPI_LOG222 8 root Q workq
941.PBSN000 CPI_LOG222 8 root Q workq
942.PBSN000 CPI_LOG222 8 root Q workq
943.PBSN000 CPI_LOG222 8 root Q workq

# qstat.ge -a

Hostname   Queue        Use/Tot   Cpu,Mem,Swp[%]   State
------------------------------------------------------------------------
pbsn000    gridcenter   4/4       0%,7%,0%         job-busy
  +932 CPI_LOG111111111 2/4 root 0% 11-21 04:48 R
  +931 CPI_LOG111111111 2/4 root 0% 11-21 04:48 R

------------------------------------------------------------------------
pbsn001    gridcenter   4/4       0%,3%,0%         job-busy
  +932 CPI_LOG111111111 2/4 root 0% 11-21 04:48 R
  +931 CPI_LOG111111111 2/4 root 0% 11-21 04:48 R

------------------------------------------------------------------------
pbsn002    workq        8/8       0%,6%,0%         job-busy
  +937 CPI_LOG222222222 4/8 root 0% 11-21 04:48 R
  +938 CPI_LOG222222222 4/8 root 0% 11-21 04:48 R

------------------------------------------------------------------------
pbsn003    workq        8/8       0%,3%,0%         job-busy
  +937 CPI_LOG222222222 4/8 root 0% 11-21 04:48 R
  +938 CPI_LOG222222222 4/8 root 0% 11-21 04:48 R

------------------------------------------------------------------------

#########################################################################
Waiting Jobs : [jobid, jobname, ncpus, user, jobstat, queue]
#########################################################################
933.PBSN000 CPI_LOG111 4 root Q gridcent
934.PBSN000 CPI_LOG111 4 root Q gridcent
935.PBSN000 CPI_LOG111 4 root Q gridcent
936.PBSN000 CPI_LOG111 4 root Q gridcent
939.PBSN000 CPI_LOG222 8 root Q workq
940.PBSN000 CPI_LOG222 8 root Q workq
941.PBSN000 CPI_LOG222 8 root Q workq
942.PBSN000 CPI_LOG222 8 root Q workq
943.PBSN000 CPI_LOG222 8 root Q workq

5. Advanced features

- License management

Custom license resources are defined in resourcedef, exposed to the
scheduler in sched_config, and fed by an external query script:

# cd $PBS_HOME/server_priv/
# vi resourcedef

standard type=long
explicit type=long

# /etc/rc.d/init.d/pbs restart

# cd $PBS_HOME/sched_priv/
# vi sched_config

resources: "ncpus, mem, ..., standard, explicit"

server_dyn_res: "standard !/engrid/enpbs/enlic/bin/flx_licmon.pl -H RNTMGR01 -p 27000 -f standard --used"
server_dyn_res: "explicit !/engrid/enpbs/enlic/bin/flx_licmon.pl -H RNTMGR01 -p 27000 -f explicit --free"

qmgr -c "set server resources_available.standard=1024"
qmgr -c "set server resources_available.explicit=1024"

qmgr> active node PBSN000,PBSN001,PBSN002,PBSN003
qmgr> set node resources_available.standard=1024
qmgr> set node resources_available.explicit=1024
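
A server_dyn_res script simply prints one integer (the current value of the
resource) on stdout each time the scheduler runs. flx_licmon.pl is Clunix's
own tool; purely as an illustration, a hypothetical minimal stand-in using
FlexLM's lmutil could look like this (the lmutil path and lmstat output
layout are assumptions):

#!/bin/sh
# hypothetical replacement for flx_licmon.pl: query a FlexLM server and
# print a single integer (free "standard" licenses) for the scheduler
LMUTIL=/usr/local/flexlm/lmutil        # assumed install path of lmutil
$LMUTIL lmstat -c 27000@RNTMGR01 -f standard | \
    awk '/Users of standard/ {print $6 - $11}'   # issued minus in-use (assumed layout)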
