[Cluster] Installing Sun Grid Engine (SGE) 5.3
Installing SGE
1.1. Common steps
1) Create the administrative account
Add an account named sge on every node of the cluster you are installing.
# adduser sge
# passwd sge
2) Add the service port
On every node of the cluster, assign a port for sge_commd in /etc/services.
If possible, choose a port below 1024.
All nodes must use the same port number.
# vi /etc/services
…
sge_commd 536/tcp # Sun Grid Engine
:wq
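Because every node must carry the same entry, it can help to check the file with a command instead of by eye. The following is a minimal sketch and not part of SGE: the function name service_port is made up for this guide, and it assumes the usual services(5) layout of "name port/protocol".

```shell
# Print the TCP port assigned to a service in a services(5)-style file.
# Usage: service_port NAME [FILE]   (FILE defaults to /etc/services)
# NOTE: service_port is a hypothetical helper, not an SGE command.
service_port() {
    name=$1
    file=${2:-/etc/services}
    # Field 1 is the service name, field 2 is "port/protocol".
    # Commented-out entries have "#" in field 1 and never match.
    awk -v svc="$name" '
        $1 == svc && $2 ~ /\/tcp$/ { split($2, a, "/"); print a[1]; exit }
    ' "$file"
}
```

With the entry shown above, `service_port sge_commd` should print 536 on every node. Empty output means the entry is missing on that node.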
1.2. Installing on the master node
Caution>>
SGE must be installed on a file system that is shared over NFS.
Clusters usually share the /home partition,
so this guide installs SGE in /home/sge/SGE.
1) Download and unpack
The installation files can be downloaded from
http://wwws.sun.com/software/gridware/
http://gridengine.sunsource.net/
For x86 Linux with kernel 2.4 and glibc 2.1 or later, download
sge-5.3p2-common.tar.gz and sge-5.3p2-bin-glinux.tar.gz.
Log in as the sge account and unpack the archives.
# su sge
$ mkdir /home/sge/SGE
$ cd /home/sge/SGE
$ tar zxvf sge-5.3p2-common.tar.gz
$ tar zxvf sge-5.3p2-bin-glinux.tar.gz
$ exit
2) Run install_qmaster
Set the SGE_ROOT environment variable.
# export SGE_ROOT=/home/sge/SGE
SGE is installed with its own installation scripts:
install_qmaster on the master node and install_execd on the compute nodes.
Change to $SGE_ROOT and run install_qmaster.
# cd $SGE_ROOT
# ./install_qmaster
Welcome to the Grid Engine installation
—————————————
Grid Engine qmaster host installation
————————————-
Before you continue with the installation please read these hints:
– Your terminal window should have a size of at least
80×24 characters
– The INTR character is often bound to the key Ctrl-C.
The term >Ctrl-C< is used during the installation if you
have the possibility to abort the installation
The qmaster installation procedure will take approximately 5-10 minutes.
Hit <RETURN> to continue >> Enter
Confirm Grid Engine default installation settings
————————————————-
The following default settings can be used for an accelerated
installation procedure:
$SGE_ROOT = /home/sge/SGE
service = sge_commd
admin user account = sge
Do you want to use these configuration parameters (y/n) [y] >> Enter
Verifying and setting file permissions
————————————–
Did you install this version with >pkgadd< or did you already
verify and set the file permissions of your distribution (y/n) [y] >> Enter
We do not verify file permissions. Hit <RETURN> to continue >>
Verifying and setting file permissions and owner in >3rd_party<
Verifying and setting file permissions and owner in >bin<
Verifying and setting file permissions and owner in >ckpt<
Verifying and setting file permissions and owner in >examples<
Verifying and setting file permissions and owner in >install_execd<
Verifying and setting file permissions and owner in >install_qmaster<
Verifying and setting file permissions and owner in >mpi<
Verifying and setting file permissions and owner in >pvm<
Verifying and setting file permissions and owner in >qmon<
Verifying and setting file permissions and owner in >util<
Verifying and setting file permissions and owner in >utilbin<
Verifying and setting file permissions and owner in >catman<
Verifying and setting file permissions and owner in >doc<
Verifying and setting file permissions and owner in >man<
Verifying and setting file permissions and owner in >inst_sge<
Verifying and setting file permissions and owner in >inst_sgeee<
Verifying and setting file permissions and owner in >bin<
Verifying and setting file permissions and owner in >lib<
Verifying and setting file permissions and owner in >utilbin<
Your file permissions were set
Hit <RETURN> to continue >> Enter
Making directories
——————
creating directory: default
creating directory: default/common
creating directory: default/common/history
creating directory: default/common/local_conf
creating directory: /home/sge/SGE/default/spool/qmaster
creating directory: /home/sge/SGE/default/spool/qmaster/admin_hosts
creating directory: /home/sge/SGE/default/spool/qmaster/ckpt
creating directory: /home/sge/SGE/default/spool/qmaster/complexes
creating directory: /home/sge/SGE/default/spool/qmaster/exec_hosts
creating directory: /home/sge/SGE/default/spool/qmaster/job_scripts
creating directory: /home/sge/SGE/default/spool/qmaster/jobs
creating directory: /home/sge/SGE/default/spool/qmaster/pe
creating directory: /home/sge/SGE/default/spool/qmaster/queues
creating directory: /home/sge/SGE/default/spool/qmaster/submit_hosts
creating directory: /home/sge/SGE/default/spool/qmaster/usersets
Hit <RETURN> to continue >> Enter
Select default Grid Engine hostname resolving method
—————————————————-
Are all hosts of your cluster in one DNS domain? If this is
the case the hostnames
>hostA< and >hostA.foo.com<
would be treated as equal, because the DNS domain name >foo.com<
is ignored when comparing hostnames.
Are all hosts of your cluster in a single DNS domain (y/n) [y] >> Enter
Ignoring domainname when comparing hostnames.
Hit <RETURN> to continue >> Enter
Grid Engine group id range
————————–
When jobs are started under the control of Grid Engine an additional group id
is set on platforms which do not support jobs.
This additional UNIX group id range must be unused group id’s in your system.
The range must be big enough to provide enough numbers for the maximum number
of Grid Engine jobs running at a single moment on a single host. E.g. a range
like >20000-20100< means, that Grid Engine will use the group id’s from
20000-20100 and thus provides a range for 101 jobs running at the same time
on a single host.
You can change at any time the group id range in your cluster configuration.
Please enter a range >> 20000-20100
Using >20000-20100< as gid range. Hit <RETURN> to continue >> Enter
Creating local configuration
—————————-
Creating >act_qmaster< file
Adding default complexes >host< and >queue<
Adding default parallel environment (PE) for >qmake<
Adding >sge_aliases< path aliases file
Adding >qtask< qtcsh sample default request file
Adding >sge_request< default submit options file
Creating settings files for >.profile/.cshrc<
Hit <RETURN> to continue >> Enter
Grid Engine startup script
————————–
Your Grid Engine cluster wide startup script is installed as:
/home/sge/SGE/default/common/rcsge<
Hit <RETURN> to continue >> Enter
Grid Engine startup script
————————–
We can install the startup script that
Grid Engine is started at machine boot (y/n) [y] >> Enter
Installing startup script /etc/rc.d/rc3.d/S95rcsge
Hit <RETURN> to continue >> Enter
Grid Engine qmaster and scheduler startup
—————————————–
Starting qmaster and scheduler daemon. Please wait …
starting sge_qmaster
starting program: /home/sge/SGE/bin/glinux/sge_commd
using service “sge_commd”
bound to port 536
Reading in complexes:
Complex “host”.
Complex “queue”.
Reading in parallel environments:
PE “make”.
Reading in scheduler configuration
starting sge_schedd
Hit <RETURN> to continue >> Enter
Adding Grid Engine hosts
————————
Please now add the list of hosts, where you will later install your execution
daemons. These hosts will be also added as valid submit hosts.
Please enter a blank separated list of your execution hosts. You may
press <RETURN> if the line is getting too long. Once you are finished
simply press <RETURN> without entering a name.
You also may prepare a file with the hostnames of the machines where you plan
to install Grid Engine. This may be convenient if you are installing Grid
Engine on many hosts.
Do you want to use a file which contains the list of hosts (y/n) [n] >> Enter
Adding admin and submit hosts
—————————–
Please enter a blank seperated list of hosts.
Stop by entering <RETURN>. You may repeat this step until you are
entering an empty list. You will see messages from Grid Engine
when the hosts are added.
Host(s): sdd114
adminhost “sdd114.foo.bar.com” already exists
sdd114.foo.bar.com added to submit host list
Hit <RETURN> to continue >> Enter
Host(s): sdd105
sdd105.foo.bar.com added to administrative host list
sdd105.foo.bar.com added to submit host list
Hit <RETURN> to continue >> Enter
Host(s): sdd106
sdd106.foo.bar.com added to administrative host list
sdd106.foo.bar.com added to submit host list
Hit <RETURN> to continue >> Enter
Host(s): sdd107
sdd107.foo.bar.com added to administrative host list
sdd107.foo.bar.com added to submit host list
Hit <RETURN> to continue >> Enter
Host(s): Enter
Finished adding hosts. Hit <RETURN> to continue >> Enter
Using Grid Engine
—————–
You should now enter the command:
source /home/sge/SGE/default/common/settings.csh
if you are a csh/tcsh user or
# . /home/sge/SGE/default/common/settings.sh
if you are a sh/ksh user.
This will set or expand the following environment variables:
– $SGE_ROOT (always necessary)
– $SGE_CELL (if you are using a cell other than >default<)
– $COMMD_PORT (if you haven’t added the service >sge_commd<)
– $PATH/$path (to find the Grid Engine binaries)
– $MANPATH (to access the manual pages)
Hit <RETURN> to see where Grid Engine logs messages >> Enter
Grid Engine messages
——————–
Grid Engine messages can be found at:
/tmp/qmaster_messages (during qmaster startup)
/tmp/execd_messages (during execution daemon startup)
After startup the daemons log thier messages in their spool directories.
Qmaster: /home/sge/SGE/default/spool/qmaster/messages
Exec daemon: //messages
Do you want to see previous screen about using Grid Engine again (y/n) [n] >> Enter
Your Grid Engine qmaster installation is now completed
——————————————————
Please now login to all hosts where you want to run an execution daemon
and start the execution host installation procedure.
If you want to run an execution daemon on this host, please do not forget
to make the execution host installation in this host as well.
All execution hosts must be administrative hosts during the installation.
All hosts which you added to the list of administrative hosts during this
installation procedure can now be installed.
You may verify your administrative hosts with the command
# qconf -sh
and you may add new administrative hosts with the command
# qconf -ah
Check the running daemons
# ps -aux --cols=120 | grep sge
root 22398 0.0 0.0 1708 848 ? S 15:26 0:00 /home/sge/SGE/bin/glinux/sge_commd
sge 22402 0.0 0.1 3192 1756 ? S 15:27 0:00 /home/sge/SGE/bin/glinux/sge_qmaster
sge 22406 0.0 0.1 2716 1372 ? S 15:27 0:00 /home/sge/SGE/bin/glinux/sge_schedd
The output shows that sge_commd runs as root, while sge_qmaster and sge_schedd
run as the sge user.
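The same check can be scripted so that a missing daemon is reported by name. This is only a sketch: missing_daemons is a hypothetical helper, and it assumes the ps listing is not wrapped, so that each daemon path appears whole on one line.

```shell
# Read a ps listing on stdin and print each expected daemon that is absent.
# Usage: ps -aux --cols=120 | missing_daemons sge_commd sge_qmaster sge_schedd
# NOTE: missing_daemons is a hypothetical helper, not an SGE command.
missing_daemons() {
    listing=$(cat)
    for daemon in "$@"; do
        # Match the name as the last path component of the command,
        # followed by either a space (arguments) or the end of the line.
        if ! printf '%s\n' "$listing" | grep -Eq "/$daemon( |\$)"; then
            printf '%s\n' "$daemon"
        fi
    done
}
```

On the master node an empty result for sge_commd, sge_qmaster and sge_schedd means all three are running. On a compute node the names to pass would be sge_commd and sge_execd.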
1.3. Installing on the compute nodes
Repeat the following procedure identically on every compute node.
# rlogin sdd105
# cd /home/sge/SGE
Set the SGE_ROOT environment variable.
# export SGE_ROOT=/home/sge/SGE
Check that the port number has been added to /etc/services.
# vi /etc/services
…
sge_commd 536/tcp # Sun Grid Engine
:wq
Run install_execd.
# ./install_execd
Welcome to the Grid Engine execution host installation
——————————————————
If you haven’t installed the Grid Engine qmaster host yet, you must execute
this step (with >install_qmaster<) prior the execution host installation.
For a sucessfull installation you need a running Grid Engine qmaster. It is
also neccesary that this host is an administrative host.
You can verify your current list of administrative hosts with
the command:
# qconf -sh
You can add an administrative host with the command:
# qconf -ah
The execution host installation will take approximately 5 minutes.
Hit <RETURN> to continue >> Enter
Grid Engine admin user account
——————————
The current directory
/home/sge/SGE
is owned by user
sge
If user >root< does not have write permissions in this directory on *all*
of the machines where Grid Engine will be installed (NFS partitions not
exported for user >root< with read/write permissions) it is recommended to
install Grid Engine that all spool files will be created under the user id
of user >sge<.
IMPORTANT NOTE: The daemons still have to be started by user >root<.
Do you want to install Grid Engine as admin user >sge< (y/n) [y] >> Enter
Installing Grid Engine as admin user >sge<
Hit <RETURN> to continue >> Enter
Checking $SGE_ROOT directory
—————————-
Your $SGE_ROOT directory: /home/sge/SGE
Hit <RETURN> to continue >> Enter
Grid Engine cells
—————–
Please enter cell name which you used for the qmaster
installation or press <RETURN> to use default cell >default< >> Enter
Using cell: >default<
Hit <RETURN> to continue >> Enter
Confirm Grid Engine default installation settings
————————————————-
The following default settings can be used for an accelerated
installation procedure:
$SGE_ROOT = /home/sge/SGE
service = sge_commd
admin user account = sge
Do you want to use these configuration parameters (y/n) [y] >> Enter
Creating local configuration
—————————-
Creating local configuration for host >sdd105.foo.bar.com<
root@sdd105.foo.bar.com modified “sdd105.foo.bar.com” in configuration list
Local configuration for host >sdd105.foo.bar.com< created.
Hit <RETURN> to continue >> Enter
Grid Engine startup script
————————–
We can install the startup script that
Grid Engine is started at machine boot (y/n) [y] >> Enter
Installing startup script /etc/rc.d/rc3.d/S95rcsge
Hit <RETURN> to continue >> Enter
Grid Engine execution daemon startup
————————————
Starting execution daemon daemon. Please wait …
starting sge_execd
starting program: /home/sge/SGE/bin/glinux/sge_commd
using service “sge_commd”
bound to port 536
Hit <RETURN> to continue >> Enter
Adding a default Grid Engine queue for this host
————————————————
We can now add a sample queue for this host with following attributes:
– the queue has the name >sdd105.q<
– the queue provides 1 slot(s) for jobs
– the queue provides access for any user with an account on this machine
– the queue has no Unix resource limits
You do not need to add a queue now, but before running jobs on this host
need to add a queue with >qconf< or the GUI >qmon<.
Do you want to add a default queue for this host (y/n) [y] >> Enter
root@sdd105.foo.bar.com added “sdd105.q” to queue list
Hit <RETURN> to continue >> Enter
Using Grid Engine
—————–
You should now enter the command:
source /home/sge/SGE/default/common/settings.csh
if you are a csh/tcsh user or
# . /home/sge/SGE/default/common/settings.sh
if you are a sh/ksh user.
This will set or expand the following environment variables:
– $SGE_ROOT (always necessary)
– $SGE_CELL (if you are using a cell other than >default<)
– $COMMD_PORT (if you haven’t added the service >sge_commd<)
– $PATH/$path (to find the Grid Engine binaries)
– $MANPATH (to access the manual pages)
Hit <RETURN> to see where Grid Engine logs messages >> Enter
Grid Engine messages
——————–
Grid Engine messages can be found at:
/tmp/qmaster_messages (during qmaster startup)
/tmp/execd_messages (during execution daemon startup)
After startup the daemons log thier messages in their spool directories.
Qmaster: /home/sge/SGE/default/spool/qmaster/messages
Exec daemon: //messages
Do you want to see previous screen about using Grid Engine again (y/n) [n] >> Enter
Your execution daemon installation is now completed.
Check the running daemons
# ps -aux --cols=120 | grep sge
root 10531 0.0 0.0 1688 824 ? S 15:28 0:00 /home/sge/SGE/bin/glinux/sge_commd
sge 10533 0.0 0.1 2728 1376 ? S< 15:28 0:00 /home/sge/SGE/bin/glinux/sge_execd
The output shows that sge_commd runs as root and sge_execd runs as the sge user.
1.4. Environment setup
Either set the environment variables by hand or source settings.sh.
Note>> csh users must use settings.csh instead.
settings.sh sets the following variables.
– $SGE_ROOT (always necessary)
– $SGE_CELL (if you are using a cell other than >default<)
– $COMMD_PORT (if you haven’t added the service >sge_commd<)
– $PATH/$path (to find the Grid Engine binaries)
– $MANPATH (to access the manual pages)
$ vi ~/.bashrc
…
# SGE
export SGE_ROOT=/home/sge/SGE
export PATH=/home/sge/SGE/bin/glinux:$PATH
export MANPATH=`man --path`:/home/sge/SGE/man
# or
# . /home/sge/SGE/default/common/settings.sh
:wq
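The two alternatives above can be folded into one function that prefers settings.sh and falls back to the manual variables when the file is not readable. This is a sketch under the paths used in this guide. The name sge_env is hypothetical, and the function has to be defined in the login shell itself (for example in ~/.bashrc) so that the variables persist.

```shell
# Load the SGE environment into the current shell.
# Usage: sge_env [SGE_ROOT]   (defaults to this guide's /home/sge/SGE)
# NOTE: sge_env is a hypothetical helper, not shipped with SGE.
sge_env() {
    SGE_ROOT=${1:-/home/sge/SGE}
    settings=$SGE_ROOT/default/common/settings.sh
    if [ -r "$settings" ]; then
        # Preferred: the settings file generated during installation.
        . "$settings"
    else
        # Fallback: set the same three variables by hand.
        default_manpath=$(man --path 2>/dev/null) || default_manpath=
        PATH=$SGE_ROOT/bin/glinux:$PATH
        MANPATH=${default_manpath:+$default_manpath:}$SGE_ROOT/man
        export SGE_ROOT PATH MANPATH
    fi
}
```

Calling `sge_env` with no argument uses /home/sge/SGE. The glinux directory in the fallback matches the x86 Linux binaries of this guide and would differ on other architectures.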
1.5. Verifying the installation
Use qconf, the queue configuration command, to verify the installation.
Note>> man qconf
$ qconf -sel # execution host list
sdd105.foo.bar.com
sdd106.foo.bar.com
sdd107.foo.bar.com
$ qconf -se sdd105 # execution host definition
hostname sdd105.foo.bar.com
load_scaling NONE
complex_list NONE
complex_values NONE
load_values load_avg=0.000000,load_short=0.000000,load_medium=0.000000,load_long=0.000000,arch=glinux,num_proc=1,mem_free=915.554688M,swap_free=1004.023438M,virtual_free=1919.578125M,mem_total=1004.511719M,swap_total=1004.023438M,virtual_total=2008.535156M,mem_used=88.957031M,swap_used=0.000000M,virtual_used=88.957031M,cpu=0.000000,np_load_avg=0.000000,np_load_short=0.000000,np_load_medium=0.000000,np_load_long=0.000000
processors 1
user_lists NONE
xuser_lists NONE
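The load_values entry is a comma-separated list of name=value pairs, so a single value can be picked out with awk. The sketch below assumes that qconf prints load_values on one unwrapped line. The function name load_value is hypothetical.

```shell
# Print one load value from `qconf -se HOST` output read on stdin.
# Usage: qconf -se sdd105 | load_value mem_free
# NOTE: load_value is a hypothetical helper; it assumes the whole
# load_values entry sits on a single line.
load_value() {
    awk -v key="$1" '
        $1 == "load_values" {
            n = split($2, pairs, ",")
            for (i = 1; i <= n; i++) {
                split(pairs[i], kv, "=")
                if (kv[1] == key) { print kv[2]; exit }
            }
        }
    '
}
```

For the host definition shown above, `qconf -se sdd105 | load_value arch` would print glinux.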
$ qconf -secl # event client list
ID NAME HOST
————————————————–
1 scheduler sdd114.foo.bar.com
$ qconf -sep
# the number of licenced processors per execution host and in total
HOST PROCESSOR ARCH
===============================================
sdd105.foo.bar.com 1 glinux
sdd106.foo.bar.com 1 glinux
sdd107.foo.bar.com 1 glinux
===============================================
SUM 3
$ qconf -sh # administrative host
sdd114.foo.bar.com
sdd105.foo.bar.com
sdd106.foo.bar.com
sdd107.foo.bar.com
$ qconf -ss # submit host list
sdd114.foo.bar.com
sdd105.foo.bar.com
sdd106.foo.bar.com
sdd107.foo.bar.com
$ qconf -sm # managers list
sge
root
$ qconf -so # operator list
root
$ qconf -sql # list of all currently defined queues
sdd105.q
sdd106.q
sdd107.q
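With more than a handful of nodes it is easy to miss a host whose default queue was never created. The sketch below compares saved `qconf -sel` and `qconf -sql` output. It assumes the naming used by install_execd in this guide (short host name plus ".q"), and hosts_without_queue is a hypothetical name.

```shell
# Print every execution host that has no queue named <short-host>.q.
# $1: file holding `qconf -sel` output, $2: file holding `qconf -sql` output.
# NOTE: hosts_without_queue is a hypothetical helper, not an SGE command.
hosts_without_queue() {
    while IFS= read -r host; do
        [ -n "$host" ] || continue
        short=${host%%.*}    # sdd105.foo.bar.com -> sdd105
        # -x: whole-line match, -F: treat the queue name as a fixed string
        grep -qxF "$short.q" "$2" || printf '%s\n' "$host"
    done < "$1"
}
```

A possible use is `qconf -sel > hosts.txt; qconf -sql > queues.txt; hosts_without_queue hosts.txt queues.txt`. No output means every execution host has its default queue.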
1.6. Restarting the daemons
To restart the SGE daemons, use the rcsge script that the installation
placed in /etc/rc.d/init.d/.
The procedure is the same on the master node and the compute nodes.
# /etc/rc.d/init.d/rcsge stop
Shutting down Grid Engine execution daemon
Shutting down Grid Engine communication daemon
# /etc/rc.d/init.d/rcsge start
starting sge_execd
starting program: /home/sge/SGE/bin/glinux/sge_commd
using service “sge_commd”
bound to port 536
1.7. A simple job example
Run one of the simple examples in the examples directory.
$ cd /home/sge/SGE/examples/jobs/
$ cat simple.sh
#!/bin/sh
…
# request Bourne shell as shell for job
#$ -S /bin/sh
# print date and time
date
# Sleep for 20 seconds
sleep 20
# print date and time again
date
$ qsub simple.sh
your job 1 ("simple.sh") has been submitted
$ qstat
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
1 0 simple.sh sge t 10/18/2002 17:25:12 sdd105.q MASTER
Note>> The job state codes mean
d(eletion), t(ransferring), r(unning), R(estarted), s(uspended), S(uspended),
T(hreshold), w(aiting) or h(old).
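When post-processing qstat output in a script, the one-letter codes can be mapped back to the words listed above. This is a sketch only. The function name sge_state_name is hypothetical, and it handles a single letter at a time.

```shell
# Translate a one-letter qstat state code into the word listed above.
# Usage: sge_state_name t
# NOTE: sge_state_name is a hypothetical helper, not an SGE command.
sge_state_name() {
    case $1 in
        d)   echo deletion ;;
        t)   echo transferring ;;
        r)   echo running ;;
        R)   echo restarted ;;
        s|S) echo suspended ;;
        T)   echo threshold ;;
        w)   echo waiting ;;
        h)   echo hold ;;
        *)   echo unknown; return 1 ;;
    esac
}
```

For the job shown above, `sge_state_name t` prints transferring. An unrecognized code prints unknown and returns a non-zero status so that a calling script can detect it.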
The output files are created in the home directory.
Note>> To choose where the output files go, add the directives
#$ -e path/to/err.file
#$ -o path/to/out.file
to the job script.
$ cd ~/
$ ls -la simple*
-rw-r--r-- 1 sge sge 0 Oct 18 17:25 simple.sh.e1
-rw-r--r-- 1 sge sge 58 Oct 18 17:25 simple.sh.o1
$ cat simple.sh.o1
Fri Oct 18 17:25:10 KST 2002
Fri Oct 18 17:25:30 KST 2002
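As the listing shows, the default output files are named after the job name and job ID: <name>.o<id> for standard output and <name>.e<id> for standard error. A small sketch that builds both paths follows. The function name job_output_files is hypothetical, and it covers only the default case where no -o or -e directive is given.

```shell
# Print the default stdout and stderr file paths of a job, one per line.
# Usage: job_output_files JOB_NAME JOB_ID [DIR]   (DIR defaults to $HOME)
# NOTE: job_output_files is a hypothetical helper, not an SGE command.
job_output_files() {
    dir=${3:-$HOME}
    printf '%s/%s.o%s\n' "$dir" "$1" "$2"   # standard output
    printf '%s/%s.e%s\n' "$dir" "$1" "$2"   # standard error
}
```

For the job above, `job_output_files simple.sh 1 /home/sge` prints /home/sge/simple.sh.o1 followed by /home/sge/simple.sh.e1.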
The SGE-related environment variables can be seen by running the following job.
See man qsub for details.
$ cat env.sh
#!/bin/sh
hostname
pwd
echo PID=$$
echo SGE_O_HOME=$SGE_O_HOME
echo SGE_O_HOST=$SGE_O_HOST
echo SGE_O_LOGNAME=$SGE_O_LOGNAME
echo SGE_O_MAIL=$SGE_O_MAIL
echo SGE_O_PATH=$SGE_O_PATH
echo SGE_O_SHELL=$SGE_O_SHELL
echo SGE_O_WORKDIR=$SGE_O_WORKDIR
echo ARC=$ARC
echo SGE_STDERR_PATH=$SGE_STDERR_PATH
echo SGE_STDOUT_PATH=$SGE_STDOUT_PATH
echo SGE_JOB_SPOOL_DIR=$SGE_JOB_SPOOL_DIR
echo SGE_TASK_ID=$SGE_TASK_ID
echo ENVIRONMENT=$ENVIRONMENT
echo HOME=$HOME
echo HOSTNAME=$HOSTNAME
echo JOB_ID=$JOB_ID
echo JOB_NAME=$JOB_NAME
echo LOGNAME=$LOGNAME
echo NHOSTS=$NHOSTS
echo NQUEUES=$NQUEUES
echo NSLOTS=$NSLOTS
echo PATH=$PATH
echo QUEUE=$QUEUE
echo REQUEST=$REQUEST
echo RESTARTED=$RESTARTED
echo SHELL=$SHELL
echo TMPDIR=$TMPDIR
echo TMP=$TMP
echo USER=$USER
$ qsub env.sh
your job 1 ("env.sh") has been submitted
$ cat ~/env.sh.o1
sdd107.foo.bar.com
/home/sangwan
PID=10346
SGE_O_HOME=/home/sangwan
SGE_O_HOST=sdd114
SGE_O_LOGNAME=sangwan
SGE_O_MAIL=/var/spool/mail/sangwan
SGE_O_PATH=/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/sge/SGE/bin/glinux
SGE_O_SHELL=/bin/bash
SGE_O_WORKDIR=/home/sangwan/sge
ARC=glinux
SGE_STDERR_PATH=/home/sangwan/env.sh.e1
SGE_STDOUT_PATH=/home/sangwan/env.sh.o1
SGE_JOB_SPOOL_DIR=/home/sge/SGE/default/spool/sdd107/active_jobs/1.1
SGE_TASK_ID=undefined
ENVIRONMENT=BATCH
HOME=/home/sangwan
HOSTNAME=sdd107.foo.bar.com
JOB_ID=1
JOB_NAME=env.sh
LOGNAME=sangwan
NHOSTS=1
NQUEUES=1
NSLOTS=1
PATH=/tmp/1.1.sdd107.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin:/usr/X11R6/bin
QUEUE=sdd107.q
REQUEST=env.sh
RESTARTED=0
SHELL=/bin/csh
TMPDIR=/tmp/1.1.sdd107.q
TMP=/tmp/1.1.sdd107.q
USER=sangwan
References
1. Sun Grid Engine 5.3 Administration and User's Guide
http://wwws.sun.com/software/gridware/
2. Sun Grid Engine 5.3p2 man pages