Shadow Master Host Requirements

 

To set up a host as a shadow master server, the following requirements must be met.

 

The shadow master server must be running the sge_shadowd daemon.

The shadow master server must share sge_qmaster's status information, job configuration, and queue configuration logged to disk. In particular, it needs read/write access to the master server's spool directory and to the sge-root/cell/common directory.

Either a Berkeley DB RPC server or classic grid engine system spooling must be used for sge_qmaster spooling. (This setup uses Berkeley DB, so RPC must be running. All shadow servers and master servers must run portmap.)

(For details, see "Database Server and Spooling Host" in the N1 Grid Engine 6 Installation Guide.)

The shadow master host name must be defined in the shadow master hostname file.
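One way to sanity-check the read/write requirement is to try creating a probe file in each shared directory from the shadow host. The sketch below is only an illustration: the helper name can_write_dir and the probe file name are invented here and are not part of Grid Engine.

```shell
# can_write_dir DIR: succeed if a file can be created (and removed) in DIR.
# Hypothetical helper for illustration; not a Grid Engine command.
can_write_dir() {
    probe="$1/.shadow_probe.$$"
    # Run the redirection in a subshell so a failure cannot abort the caller.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}
```

On a shadow host this could be called as can_write_dir /usr/N1/hoho/common (the common directory used later in this post) and again for the qmaster spool directory. An actual write attempt is used instead of a permission test because permission bits can be misleading on NFS mounts.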

 

Shadow Master Hosts File

The shadow master hosts are defined in the sge-root/cell/common/shadow_masters file.

 

The format of the shadow master hostname file is as follows:

- The first line of the file defines the primary master host

- The following lines define the shadow master hosts, one host per line

 

In other words, the first line names the master, and every line after it names one shadow master.
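The file layout described above (master on line 1, one shadow host per following line) can be read back with two small shell functions. This is a minimal sketch: master_host and shadow_hosts are hypothetical names, and skipping blank lines is a choice made here, not something the guide specifies.

```shell
# master_host FILE: print the primary master host (line 1 of FILE).
master_host() {
    sed -n '1p' "$1"
}

# shadow_hosts FILE: print the shadow master hosts (line 2 onward).
# Blank lines are dropped; that is an assumption of this sketch.
shadow_hosts() {
    sed -n '2,$p' "$1" | sed '/^[[:space:]]*$/d'
}
```

For example, master_host /usr/N1/hoho/common/shadow_masters would print the master name for the cell used in this post.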

 

 

Starting Shadow Master Hosts

 

Simply run the sge_shadowd daemon on the shadow master server.

 

In order to start a shadow sge_qmaster, the system must be sure either that the old sge_qmaster has terminated, or that it will terminate without performing actions that interfere with the newly started shadow sge_qmaster.

In very rare circumstances it might be impossible to determine that the old sge_qmaster has terminated or that it will terminate. In such cases, an error message is logged to the messages log file of the sge_shadowd daemons on the shadow master hosts. See Chapter 8. Also, any attempts to open a TCP connection to an sge_qmaster daemon permanently fail. If this occurs, make sure that no master daemon is running, and then restart sge_qmaster manually on any of the shadow master machines. See "Restarting Daemons From the Command Line" on page 39.

 

 

Configuring Shadow Master Hosts Environment Variables

There are three environment variables which affect the takeover time for a shadow master:

- SGE_DELAY_TIME - This variable controls the interval in which sge_shadowd pauses if a takeover bid fails. This value is used only when there are multiple sge_shadowd instances and they are contending to be the master. (The default is 600 seconds.)

- SGE_CHECK_INTERVAL - This variable controls the interval in which the sge_shadowd checks the heartbeat file. (The default is 60 seconds.)

- SGE_GET_ACTIVE_INTERVAL - This variable controls the interval when a sge_shadowd instance tries to take over when the heartbeat file has not changed.

 

These variables interact in the following way.

1. The master host updates the heartbeat file every 30 seconds.

2. The sge_shadowd daemon checks for changes to the heartbeat file every number of seconds defined by the SGE_CHECK_INTERVAL variable. So, this value must be greater than 30 seconds.

3. If the sge_shadowd daemon notices that the heartbeat file has been updated, it starts waiting again until it is once more time to check the heartbeat file.

4. If the sge_shadowd daemon notices that the heartbeat file has not been updated, it waits for the number of seconds defined by the SGE_GET_ACTIVE_INTERVAL variable to expire. This step lets you make sure that the sge_shadowd daemon is not too aggressive in trying to take over and allows the master host some leeway in updating the heartbeat file.

5. When the SGE_GET_ACTIVE_INTERVAL has expired, the sge_shadowd daemon takes over if the heartbeat file is still not updated.
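The decision described in the steps above can be sketched as a small function. This is only a model of the described behaviour, not the real sge_shadowd code: the function name decide is made up, and the heartbeat values are treated as opaque strings that change whenever the master updates the file.

```shell
# decide PREV NOW AFTER_GRACE
#   PREV        heartbeat value seen at the previous check
#   NOW         heartbeat value seen at the current check
#   AFTER_GRACE heartbeat value seen once SGE_GET_ACTIVE_INTERVAL has expired
# Prints "wait" or "takeover". Hypothetical model for illustration only.
decide() {
    prev=$1
    now=$2
    after_grace=$3
    if [ "$now" != "$prev" ]; then
        echo "wait"        # step 3: heartbeat moved, master is alive
    elif [ "$after_grace" != "$prev" ]; then
        echo "wait"        # master caught up during the grace period
    else
        echo "takeover"    # step 5: still unchanged after the grace period
    fi
}
```

With these inputs, decide 10 11 11 and decide 10 10 11 both print wait, while decide 10 10 10 prints takeover.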

 

A reasonable configuration might be to set SGE_CHECK_INTERVAL to 45 seconds and SGE_GET_ACTIVE_INTERVAL to 90 seconds. So, after about 2 minutes, the takeover will occur. If you want to check the operation of the shadow host after you have configured these environment variables, you will have to pull out the master host's network cable to simulate a failure.
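The "about 2 minutes" figure is simply the two intervals added together. The sketch below sets the suggested values and prints that rough estimate. The helper takeover_estimate is a made-up name, and the assumption here is that the variables must be exported in the environment from which sge_shadowd is started.

```shell
# Values suggested in the text above.
SGE_CHECK_INTERVAL=45
SGE_GET_ACTIVE_INTERVAL=90
export SGE_CHECK_INTERVAL SGE_GET_ACTIVE_INTERVAL

# takeover_estimate CHECK GET_ACTIVE: rough takeover delay in seconds,
# taken as one check interval plus the grace period. Hypothetical helper.
takeover_estimate() {
    echo $(( $1 + $2 ))
}

takeover_estimate "$SGE_CHECK_INTERVAL" "$SGE_GET_ACTIVE_INTERVAL"   # prints 135
```

135 seconds is a little over two minutes, which matches the figure quoted above.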

 

 

 

 

 

@@ Configuration

 

[root@shadow-servers common]# pwd

/usr/N1/hoho/common

 

[root@shadow-servers common]# df -h

Filesystem            Size  Used Avail Use% Mounted on

/dev/hdc1             2.4G  2.3G   34M  99% /

/dev/hdc2              19G  1.3G   17G   8% /opt

/dev/hdc3              19G  143M   18G   1% /usr/local

/dev/shm              252M     0  252M   0% /dev/shm

spoolserver:/usr/N1/hoho/spooldb/

                       30G  2.3G   28G   8% /usr/N1/hoho/spooldb

spoolserver:/usr/N1/hoho/common/

                       30G  2.3G   28G   8% /usr/N1/hoho/common

 

@ Register the shadow servers.

The common directory is shared by all servers.

Only the shadow servers need the shadow_masters file, but because the directory is shared it will affect the master server as well. Normally only the sge_qmaster and sge_schedd daemons should be running on the master server, but now the sge_shadowd daemon will start there too.

That does not really matter, though.

 

[root@shadow-servers common]# vi shadow_masters

 

[root@shadow-servers common]# cat shadow_masters

server <-- master server

file001 <-- shadow server

 

[root@shadow-servers common]# ./sgemaster -shadowd

   starting sge_shadowd

 

[root@shadow-servers common]# ps axf |grep sge

 6982 pts/0    S+     0:00  |       \_ grep sge

 3106 ?        S      0:00 /usr/N1/bin/lx24-x86/sge_execd

 6947 ?        S      0:00 /usr/N1/bin/lx24-x86/sge_shadowd

 

[root@file001 common]# ps axf |grep portmap

 7077 pts/0    S+     0:00  |       \_ grep portmap

 2775 ?        Ss     0:00 portmap

 

------

Let's submit some jobs from another host...

 

[root@file002 jobs]# qsub ./pascal.sh

 

[root@file002 N1]# ./bin/lx24-x86/qstat -f

queuename                      qtype used/tot. load_avg arch          states

----------------------------------------------------------------------------

all.q@file001                  BIP   0/1       0.00     lx24-x86     

----------------------------------------------------------------------------

all.q@file002                  BIP   0/2       0.00     lx24-x86      E

----------------------------------------------------------------------------

all.q@server                   BIP   0/2       0.00     lx24-x86     

 

############################################################################

 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS

############################################################################

     77 0.55500 pascal.sh  root         qw    08/04/2006 09:24:23     1       

     78 0.55500 pascal.sh  root         qw    08/04/2006 09:24:23     1       

     79 0.55500 pascal.sh  root         qw    08/04/2006 09:24:24     1       

     80 0.55500 pascal.sh  root         qw    08/04/2006 09:24:24     1       

     81 0.55500 pascal.sh  root         qw    08/04/2006 09:24:24     1       

     82 0.55500 pascal.sh  root         qw    08/04/2006 09:24:24     1

.

.

.

-----

Let's kill the daemons on the master server.

[root@server job_scripts]# /etc/init.d/sgemaster stop

   Shutting down Grid Engine scheduler

   Shutting down Grid Engine qmaster

 

-----

When I try to check the jobs with qstat, it reports that it cannot connect to server. Of course, since the master server daemons were just killed above.

[root@file002 N1]# ./bin/lx24-x86/qstat -f

error: commlib error: can't connect to service (Connection refused)

unable to contact qmaster using port 536 on host "server"

 

-----

After a short while, the master daemons will come up on the host configured as the shadow server.

[root@file001 N1]# ps axf |grep sge

 7349 pts/1    S+     0:00          \_ grep sge

 3106 ?        S      0:00 /usr/N1/bin/lx24-x86/sge_execd

 6947 ?        S      0:00 /usr/N1/bin/lx24-x86/sge_shadowd

 7327 ?        Sl     0:00 /usr/N1/bin/lx24-x86/sge_qmaster

 7342 ?        Sl     0:00 /usr/N1/bin/lx24-x86/sge_schedd

 

-----

Check whether the act_qmaster file has been updated.

 

[root@file002 common]# pwd

/usr/N1/hoho/common

 

Previously this file contained server, but now it contains file001, the name of the shadow host that took over.

This is how the execution hosts are able to find the master server.

 

[root@file002 common]# cat act_qmaster

file001
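The same check can be scripted by comparing act_qmaster with the first line of shadow_masters. This is a minimal sketch: failover_state is a hypothetical helper, and it assumes both files spell the host names the same way (for example, both short names rather than one fully qualified).

```shell
# failover_state ACT_QMASTER_FILE SHADOW_MASTERS_FILE
# Prints "primary" if the active master is the host on line 1 of
# shadow_masters, otherwise "failover:<active host>". Hypothetical helper.
failover_state() {
    active=$(sed -n '1p' "$1")
    primary=$(sed -n '1p' "$2")
    if [ "$active" = "$primary" ]; then
        echo "primary"
    else
        echo "failover:$active"
    fi
}
```

Run from the common directory in this post, failover_state act_qmaster shadow_masters would report failover:file001 at this point.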

 

-----

The qmaster that could not be found a moment ago can now be found, thanks to the updated act_qmaster file.

All the jobs were recovered intact from the spooling server's database files (since the spool database is shared from the spooling server).

 

[root@file002 N1]# ./bin/lx24-x86/qstat -f

queuename                      qtype used/tot. load_avg arch          states

----------------------------------------------------------------------------

all.q@file001                  BIP   0/1       0.00     lx24-x86     

----------------------------------------------------------------------------

all.q@file002                  BIP   0/2       0.00     lx24-x86      E

----------------------------------------------------------------------------

all.q@server                   BIP   0/2       0.00     lx24-x86     

 

############################################################################

 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS

############################################################################

     77 0.55500 pascal.sh  root         qw    08/04/2006 09:24:23     1       

     78 0.55500 pascal.sh  root         qw    08/04/2006 09:24:23     1       

     79 0.55500 pascal.sh  root         qw    08/04/2006 09:24:24     1       

     80 0.55500 pascal.sh  root         qw    08/04/2006 09:24:24     1       

     81 0.55500 pascal.sh  root         qw    08/04/2006 09:24:24     1       

     82 0.55500 pascal.sh  root         qw    08/04/2006 09:24:24     1