Shadow Master Host Requirements

To configure a host as a shadow master server, the following requirements must be met.

- The shadow master server must be running the sge_shadowd daemon.
- The shadow master server needs read/write access to sge_qmaster's status information, job configuration, and queue configuration logged to disk; in particular, to the master server's spool directory and to the sge-root/cell/common directory.
- Either the Berkeley DB RPC server or classic grid engine system spooling must be used for sge_qmaster spooling. (When Berkeley DB is used, RPC must be running: all shadow servers and master servers must run portmap.) For details, see "Database Server and Spooling Host" in the N1 Grid Engine 6 Installation Guide.
- The "shadow master host" name must be defined in the shadow-master-hostname file.

Shadow Master Hosts File

The shadow master hosts are defined in the file sge-root/cell/common/shadow_masters. The format of the shadow master hostname file is as follows:

- The first line of the file defines the primary master host.
- The following lines define the shadow master hosts, one host per line.
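The file format described above can be illustrated with a small shell sketch. It writes a two-line shadow_masters file into a temporary directory (not into a real sge-root) and reads the primary master and the shadow masters back. The host names server and file001 are the ones used later in this walkthrough; the helper names primary_master and shadow_hosts are made up for this illustration and are not part of Grid Engine.

```shell
# Work in a scratch directory so that no real Grid Engine file is touched.
demo_dir=$(mktemp -d)
shadow_file="$demo_dir/shadow_masters"

# Line 1: primary master host. Following lines: shadow master hosts, one per line.
printf '%s\n' server file001 > "$shadow_file"

# Hypothetical helper: print the primary master (first line of the file).
primary_master() {
    sed -n '1p' "$1"
}

# Hypothetical helper: print the shadow masters (every line after the first).
shadow_hosts() {
    sed -n '2,$p' "$1"
}

echo "primary master: $(primary_master "$shadow_file")"
for host in $(shadow_hosts "$shadow_file"); do
    echo "shadow master: $host"
done
```

On a real installation the same helpers could be pointed at sge-root/cell/common/shadow_masters instead of the scratch file.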
Starting Shadow Master Hosts

Run the sge_shadowd daemon on each shadow master server.

In order to start a shadow sge_qmaster, the system must be sure either that the old sge_qmaster has terminated, or that it will terminate without performing actions that interfere with the newly started shadow sge_qmaster.

In very rare circumstances it might be impossible to determine that the old sge_qmaster has terminated or that it will terminate. In such cases, an error message is logged to the messages log file of the sge_shadowd daemons on the shadow master hosts. See Chapter 8. Also, any attempts to open a TCP connection to a sge_qmaster daemon permanently fail. If this occurs, make sure that no master daemon is running, and then restart sge_qmaster manually on any of the shadow master machines. See "Restarting Daemons From the Command Line" on page 39.
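The manual check mentioned above ("make sure that no master daemon is running") could be scripted roughly as follows. This is only a sketch under assumptions: it inspects the local host only, so it would have to be run on the master and on every shadow master, and it assumes a Linux-style ps where `-o comm=` prints the bare command name. The function name qmaster_running is made up for this example.

```shell
# Hypothetical helper: succeeds if the list of command names on stdin
# contains a line that is exactly "sge_qmaster".
qmaster_running() {
    grep -qx 'sge_qmaster'
}

# Check the local host only. "ps -e -o comm=" prints one command name per
# line without a header (assumption: procps-style ps, as on Linux).
if ps -e -o comm= | qmaster_running; then
    echo "sge_qmaster is still running on this host; do not start a second one"
else
    echo "no sge_qmaster process found on this host"
fi
```

Only when the check comes back negative on every candidate host would sge_qmaster be restarted manually, as described in the section referenced above.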
Configuring Shadow Master Hosts Environment Variables

There are three environment variables which affect the takeover time for a shadow master:

- SGE_DELAY_TIME - This variable controls how long sge_shadowd pauses if a takeover bid fails. This value is used only when there are multiple sge_shadowd instances and they are contending to be the master. (The default is 600 seconds.)
- SGE_CHECK_INTERVAL - This variable controls the interval at which sge_shadowd checks the heartbeat file. (The default is 60 seconds.)
- SGE_GET_ACTIVE_INTERVAL - This variable controls the interval after which a sge_shadowd instance tries to take over when the heartbeat file has not changed.

These variables interact in the following way:
1. The master host updates the heartbeat file every 30 seconds.
2. The sge_shadowd daemon checks for changes to the heartbeat file every SGE_CHECK_INTERVAL seconds. So, this value must be greater than 30 seconds.
3. If the sge_shadowd daemon notices that the heartbeat file has been updated, it starts waiting again until it is once more time to check the heartbeat file.
4. If the sge_shadowd daemon notices that the heartbeat file has not been updated, it waits for the number of seconds defined by the SGE_GET_ACTIVE_INTERVAL variable to expire. This step lets you make sure that the sge_shadowd daemon is not too aggressive in trying to take over and allows the master host some leeway in updating the heartbeat file.
5. When the SGE_GET_ACTIVE_INTERVAL has expired, the sge_shadowd daemon takes over if the heartbeat file is still not updated.

A reasonable configuration might be to set SGE_CHECK_INTERVAL to 45 seconds and SGE_GET_ACTIVE_INTERVAL to 90 seconds. So, after about 2 minutes, the takeover will occur.
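As a rough illustration of the arithmetic behind "about 2 minutes", the following sketch exports the two variables with the example values and adds them up. It assumes that the takeover delay is simply the sum of both intervals, which is a simplification of the steps above. The helper names check_interval_ok and estimate_takeover_seconds and the HEARTBEAT_PERIOD variable are made up for this example and are not Grid Engine settings.

```shell
# Example values from the text; exported so that a sge_shadowd started
# later from this shell would inherit them.
SGE_CHECK_INTERVAL=45
SGE_GET_ACTIVE_INTERVAL=90
export SGE_CHECK_INTERVAL SGE_GET_ACTIVE_INTERVAL

# The master updates the heartbeat file every 30 seconds (from the text).
# HEARTBEAT_PERIOD is a name invented for this sketch.
HEARTBEAT_PERIOD=30

# Hypothetical helper: the check interval must exceed the heartbeat period.
check_interval_ok() {
    [ "$1" -gt "$HEARTBEAT_PERIOD" ]
}

# Hypothetical helper: simplified estimate, one check interval plus the
# additional wait before the takeover.
estimate_takeover_seconds() {
    echo $(( $1 + $2 ))
}

if check_interval_ok "$SGE_CHECK_INTERVAL"; then
    echo "estimated takeover delay: $(estimate_takeover_seconds "$SGE_CHECK_INTERVAL" "$SGE_GET_ACTIVE_INTERVAL") seconds"
else
    echo "SGE_CHECK_INTERVAL must be greater than $HEARTBEAT_PERIOD seconds" >&2
fi
```

With 45 and 90 seconds the estimate is 135 seconds, which matches the "about 2 minutes" stated above.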
To check the operation of the shadow host after you have configured these environment variables, you will have to pull out the master host's network cable to simulate a failure.

@@ Setup

[root@shadow-servers common]# pwd
/usr/N1/hoho/common
[root@shadow-servers common]# df -h
Filesystem                         Size  Used Avail Use% Mounted on
/dev/hdc1                          2.4G  2.3G   34M  99% /
/dev/hdc2                           19G  1.3G   17G   8% /opt
/dev/hdc3                           19G  143M   18G   1% /usr/local
/dev/shm                           252M     0  252M   0% /dev/shm
spoolserver:/usr/N1/hoho/spooldb/   30G  2.3G   28G   8% /usr/N1/hoho/spooldb
spoolserver:/usr/N1/hoho/common/    30G  2.3G   28G   8% /usr/N1/hoho/common

@ Register the shadow servers.

The common directory is shared by all servers. Only the shadow servers need the shadow_masters file, but it will affect the master server as well: normally only the sge_qmaster and sge_schedd daemons should be running on the master server, but the sge_shadowd daemon will come up there too. That does not matter, lol.

[root@shadow-servers common]# vi shadow_masters
[root@shadow-servers common]# cat shadow_masters
server    <-- master server
file001   <-- shadow server
[root@shadow-servers common]# ./sgemaster -shadowd
starting sge_shadowd
[root@shadow-servers common]# ps axf |grep sge
 6982 pts/0    S+
 3106 ?        S
 6947 ?        S
[root@file001 common]# ps axf |grep portmap
 7077 pts/0    S+
 2775 ?        Ss

------ Let's submit jobs from another host...

[root@file002 jobs]# qsub ./pascal.sh
[root@file002 N1]# ./bin/lx24-x86/qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@file001                  BIP   0/1       0.00     lx24-x86
----------------------------------------------------------------------------
all.q@file002                  BIP   0/2       0.00     lx24-x86      E
----------------------------------------------------------------------------
all.q@server                   BIP   0/2       0.00     lx24-x86
############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     77 0.55500 pascal.sh  root         qw    08/04/2006
     78 0.55500 pascal.sh  root         qw    08/04/2006
     79 0.55500 pascal.sh  root         qw    08/04/2006
     80 0.55500 pascal.sh  root         qw    08/04/2006
     81 0.55500 pascal.sh  root         qw    08/04/2006
     82 0.55500 pascal.sh  root         qw    08/04/2006
. . .

----- Let's kill the daemons on the master server.

[root@server job_scripts]# /etc/init.d/sgemaster stop
Shutting down Grid Engine scheduler
Shutting down Grid Engine qmaster

----- Trying to check the jobs with qstat now reports that it cannot connect to server. Of course, lol: we just killed the master server daemons above.

[root@file002 N1]# ./bin/lx24-x86/qstat -f
error: commlib error: can't connect to service (Connection refused)
unable to contact qmaster using port 536 on host "server"

----- After a short while, the master daemons will come up on the host configured as the shadow server.

[root@file001 N1]# ps axf |grep sge
 7349 pts/1    S+
 3106 ?        S
 6947 ?        S
 7327 ?        Sl
 7342 ?        Sl

----- Check whether the act_qmaster file has been updated.

[root@file002 common]# pwd
/usr/N1/hoho/common

This file used to contain server, but now it contains file001, the name of the shadow host that took over. The execution hosts will therefore be able to find the master server.

[root@file002 common]# cat act_qmaster
file001

----- What could not be found a moment ago is found again thanks to the act_qmaster update. All the jobs were retrieved intact from the spooling server's database files (because the spooling server is shared).

[root@file002 N1]# ./bin/lx24-x86/qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@file001                  BIP   0/1       0.00     lx24-x86
----------------------------------------------------------------------------
all.q@file002                  BIP   0/2       0.00     lx24-x86      E
----------------------------------------------------------------------------
all.q@server                   BIP   0/2       0.00     lx24-x86
############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     77 0.55500 pascal.sh  root         qw    08/04/2006
     78 0.55500 pascal.sh  root         qw    08/04/2006
     79 0.55500 pascal.sh  root         qw    08/04/2006
     80 0.55500 pascal.sh  root         qw    08/04/2006
     81 0.55500 pascal.sh  root         qw    08/04/2006
     82 0.55500 pascal.sh  root         qw    08/04/2006
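
The act_qmaster check performed by hand in this walkthrough could also be scripted. The sketch below works on a scratch copy of the file in a temporary directory and replays the change from server to file001 seen above. The helper names current_master and has_failed_over are made up for this illustration; on a real cluster the path would be the act_qmaster file in the shared common directory.

```shell
# Hypothetical helper: print the host name stored in an act_qmaster-style
# file (first line, with surrounding whitespace removed).
current_master() {
    head -n 1 "$1" | tr -d '[:space:]'
}

# Hypothetical helper: succeeds if the active master differs from the
# expected primary master given as the second argument.
has_failed_over() {
    [ "$(current_master "$1")" != "$2" ]
}

# Replay the situation from the walkthrough on a scratch file.
demo_dir=$(mktemp -d)
act_file="$demo_dir/act_qmaster"

echo server > "$act_file"
if has_failed_over "$act_file" server; then
    echo "failover detected: active master is now $(current_master "$act_file")"
else
    echo "active master is still server"
fi

echo file001 > "$act_file"
if has_failed_over "$act_file" server; then
    echo "failover detected: active master is now $(current_master "$act_file")"
else
    echo "active master is still server"
fi
```

The first check reports that server is still the active master; after the file is rewritten, the second check reports the failover to file001.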