SGE Shadow Master Installation and Configuration

How to Install the Shadow Master Host

Shadow master hosts are machines in the cluster that can detect a failure of the master daemon and take over its role as master host. When the shadow master daemon detects that the master daemon has failed abnormally, it starts up a new master daemon on the host where the shadow master daemon is running.

The shadow master host file, $SGE_ROOT/$SGE_CELL/common/shadow_masters, contains the name of the primary master host, which is the machine where the master daemon initially runs, followed by the names of the shadow master hosts. The order of the shadow master hosts is significant. The primary master host is the first line in the file. If the primary master host fails, the shadow master defined in the second line takes over. If this shadow master also fails, the shadow master defined in the third line takes over, and so forth. You can influence this order by installing shadow master daemons first on the hosts that you want at the top of this list.
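
As a rough illustration of the ordering described above, the following sketch prints the takeover order from a shadow_masters file. The function name takeover_order is hypothetical and not part of Grid Engine; the sketch only assumes the one-host-per-line layout described in the text.

```shell
# Print the failover order from a shadow_masters file.
# Line 1 is the primary master host; later lines are shadow masters,
# tried in the order in which they are listed.
takeover_order() {
    # $1: path to the shadow_masters file
    awk 'NF {
        n++
        role = (n == 1) ? "primary master" : "shadow master"
        printf "%d %s (%s)\n", n, $1, role
    }' "$1"
}

# Usage (path taken from the text above):
# takeover_order "$SGE_ROOT/$SGE_CELL/common/shadow_masters"
```

If the order is not the one you want, it is probably safer to compare the output against your intended priority before relying on failover.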

Steps
Log in to the shadow master host as root.

If the $SGE_ROOT environment variable is not set, set it by typing:

# SGE_ROOT=<path_to_installation_directory>; export SGE_ROOT

The installation directory must contain all SGE files, such as the SGE binaries.

To confirm that you have set the $SGE_ROOT environment variable, type:

# echo $SGE_ROOT

Go to the installation directory.

If the directory where the installation files reside is visible from the shadow master host, change directory (cd) to the installation directory sge-root, and then proceed to the next step.
If the directory is not visible and cannot be made visible, do the following:
Create a local installation directory, sge-root, on the shadow master host.

Copy the installation files to the local installation directory sge-root across the network, for example, by using ftp or rcp.
Change directory (cd) to the local sge-root directory.

Type the inst_sge -sm command.

This command starts the shadow master host installation procedure. You are asked several questions, and you might be required to perform some administrative actions.
For a complete installation example, see Example Shadow Master Host Installation.

# ./inst_sge -sm
See Steps 1-4 in the Example Shadow Master Host Installation.

Choose an administrative account owner.

See Step 5 in the Example Shadow Master Host Installation. Use the same administrative user as in the qmaster installation.

Verify the $SGE_ROOT directory setting.

See Step 6 in the Example Shadow Master Host Installation. The value of $SGE_ROOT in the example is /sge.

Type the name of your cell.
See Step 7 in the Example Shadow Master Host Installation.

Confirm that the host is known by the qmaster host.
See Step 8 in the Example Shadow Master Host Installation.

(Optional) Specify JMX MBean Server values.
This prompt is presented only if you installed qmaster with the JMX MBean Server. See Step 9 in the Example Shadow Master Host Installation.
Enter the following information:
JAVA_HOME
Additional JVM arguments (optional)


Confirm creation of the local configuration.
See Step 10 in the Example Shadow Master Host Installation.

Specify whether you want to start the shadow master daemon when the system is booted.
See Step 11 in the Example Shadow Master Host Installation. 

Installation is now complete. 
See Step 12 in the Example Shadow Master Host Installation.
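
After the installation finishes, one quick sanity check is to confirm that the host now appears in the shadow_masters file. The sketch below is a minimal example; the function name is_shadow_master is hypothetical, and it assumes that the file lists host names in the same form that you pass in (short name or fully qualified).

```shell
# Return 0 if the given host name appears as a whole line in shadow_masters.
is_shadow_master() {
    # $1: host name, $2: path to the shadow_masters file
    grep -Fqx -- "$1" "$2"
}

# Usage (the cell name "default" is the installer default shown in Step 7):
# is_shadow_master "$(hostname)" "$SGE_ROOT/${SGE_CELL:-default}/common/shadow_masters" \
#     && echo "this host is listed as a shadow master"
```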

Starting a Shadow Master Host Manually

To start a shadow master host manually, the system must be sure either that the old master daemon has terminated, or that it will terminate without performing actions that interfere with the newly started shadow master.

In very rare circumstances, you might not be able to determine whether the old master daemon has terminated or if it will terminate. In such cases, an error message is logged to the messages log file of the sge_shadowd daemons on the shadow master hosts.

If attempts to open a TCP connection to a master daemon fail permanently, make sure that no master daemon is running, and then restart the master daemon manually on any of the shadow master machines. See How to Restart Daemons From the Command Line for further details.
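
One way to check that no master daemon is running on a host is to look for sge_qmaster in the process list. The sketch below is only an illustration: the function name qmaster_in_list is hypothetical, it inspects only the list that you feed it, and you would need to repeat the check on every host that could be running the master daemon.

```shell
# Read process names (or full command paths) on stdin, one per line.
# Return 0 if any entry is sge_qmaster, 1 otherwise.
qmaster_in_list() {
    awk -F/ '$NF == "sge_qmaster" { found = 1 } END { exit !found }'
}

# Usage on one candidate host (POSIX ps output format):
# if ps -e -o comm= | qmaster_in_list; then
#     echo "sge_qmaster is still running on this host"
# fi
```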

Configuring Shadow Master Host Environment Variables

Three environment variables affect the takeover time for a shadow master:

SGE_DELAY_TIME:
This variable controls the interval in which sge_shadowd pauses if a takeover bid fails. This value is used only when there are multiple sge_shadowd instances that are contending to be the master (the default is 600 seconds).

SGE_CHECK_INTERVAL:
This variable controls the interval in which the sge_shadowd checks the heartbeat file (the default is 60 seconds).

SGE_GET_ACTIVE_INTERVAL:
This variable controls how long the heartbeat file must remain unchanged before a sge_shadowd instance tries to take over (the default is 240 seconds).

These variables interact in the following ways:

The master host updates the heartbeat file every 30 seconds.

The sge_shadowd daemon checks for changes to the heartbeat file at an interval defined by the SGE_CHECK_INTERVAL variable.

This value must be greater than 30 seconds.

If the heartbeat file has been updated, the sge_shadowd daemon restarts the waiting clock.
If the heartbeat file has not been updated, the sge_shadowd daemon continues to wait until the designated interval defined by the SGE_CHECK_INTERVAL variable expires.

This action ensures that the sge_shadowd daemon is not too aggressive in trying to take over and allows the master host some leeway in updating the heartbeat file.

When the SGE_GET_ACTIVE_INTERVAL has expired, the sge_shadowd daemon then takes over if the heartbeat file has still not been updated.
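
The heartbeat check described above can be sketched as two reads of the heartbeat file separated by a wait. This is not how sge_shadowd is implemented, only an illustration of the idea. The function name heartbeat_state is hypothetical, and the heartbeat file location used in the usage comment (the qmaster spool directory) is an assumption that the text does not state.

```shell
# Read the heartbeat file twice, waiting in between, and report whether
# its content changed. Prints "alive" or "stale"; returns 2 if unreadable.
heartbeat_state() {
    # $1: heartbeat file, $2: seconds to wait between the two reads
    hb_before=$(cat "$1") || return 2
    sleep "$2"
    hb_after=$(cat "$1") || return 2
    if [ "$hb_before" != "$hb_after" ]; then
        echo alive
    else
        echo stale
    fi
}

# Usage (ASSUMPTION: the heartbeat file lives in the qmaster spool directory):
# heartbeat_state "$SGE_ROOT/$SGE_CELL/spool/qmaster/heartbeat" "${SGE_CHECK_INTERVAL:-60}"
```

A "stale" result after a wait longer than 30 seconds suggests that the master daemon has stopped updating the file.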

A reasonable configuration might be to set SGE_CHECK_INTERVAL to 45 seconds and SGE_GET_ACTIVE_INTERVAL to 90 seconds. With these values, the takeover occurs after about two minutes. Until then, you get an error message whenever a Grid Engine system command is run. If you want to check the operation of the shadow host after you have configured these environment variables, you will have to disconnect the master host's network cable to simulate a failure.
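
The values suggested above could be set as follows. This is a sketch under assumptions: the variables are taken to be read from the environment of sge_shadowd when it starts, and the helper functions takeover_delay and check_interval_ok are hypothetical. Adding the two intervals is one plausible reading of the "about two minutes" estimate, not a documented formula.

```shell
# Values suggested in the text; export them before sge_shadowd is started.
SGE_CHECK_INTERVAL=45
SGE_GET_ACTIVE_INTERVAL=90
export SGE_CHECK_INTERVAL SGE_GET_ACTIVE_INTERVAL

# The check interval must be greater than the 30-second heartbeat update.
check_interval_ok() {
    # $1: check interval in seconds
    [ "$1" -gt 30 ]
}

# Rough takeover delay: one check interval plus the active interval.
takeover_delay() {
    echo $(( $SGE_CHECK_INTERVAL + $SGE_GET_ACTIVE_INTERVAL ))
}

check_interval_ok "$SGE_CHECK_INTERVAL" || echo "SGE_CHECK_INTERVAL is too small"
takeover_delay   # prints 135 with the values above
```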

Note
The file $SGE_ROOT/$SGE_CELL/common/act_qmaster contains the name of the host that is actually running the sge_qmaster daemon.

If the master daemon is shut down gracefully, the shadow master daemon does not start up. If you want the shadow master daemon to take over after you shut down the master daemon gracefully, remove the lock file that is located in the sge_qmaster spool directory. The default location of this spool directory is $SGE_ROOT/$SGE_CELL/spool/qmaster.
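
The two notes above can be sketched as small shell helpers. The function names are hypothetical. The name of the lock file ("lock") is an assumption, since the text gives only its directory; check the contents of your qmaster spool directory before removing anything.

```shell
# Print the host recorded in act_qmaster.
current_master() {
    # $1: SGE_ROOT directory, $2: cell name
    cat "$1/$2/common/act_qmaster"
}

# Remove the qmaster lock file so that a shadow master may take over
# after a graceful shutdown. ASSUMPTION: the file is named "lock".
remove_qmaster_lock() {
    # $1: qmaster spool directory
    if [ -f "$1/lock" ]; then
        rm -f "$1/lock" && echo "removed $1/lock"
    else
        echo "no lock file in $1"
    fi
}

# Usage (default spool location as given in the text):
# current_master "$SGE_ROOT" "${SGE_CELL:-default}"
# remove_qmaster_lock "$SGE_ROOT/${SGE_CELL:-default}/spool/qmaster"
```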


Example Shadow Master Host Installation

The following example shows a complete Sun Grid Engine shadow master host installation. Remember that this is only an optional step in the entire Sun Grid Engine installation process. The steps in this example correspond to the steps in How to Install the Shadow Master Host.


Steps 1-4

% su -
# cd /sge
# ./inst_sge -sm

Shadow Master Host Setup
------------------------

Make sure, that the host, you wish to configure as a shadow host,
has read/write permissions to the qmaster spool and SGE_ROOT/<cell>/common
directory! For using a shadow master it is recommended to set up a
Berkeley DB Spooling Server

Hit <RETURN> to continue >>  
 

Step 5

Grid Engine admin user account
------------------------------

The current directory

   /sge

is owned by user

   sgeadmin

If user >root< does not have write permissions in this directory on *all*
of the machines where Grid Engine will be installed (NFS partitions not
exported for user >root< with read/write permissions) it is recommended to
install Grid Engine that all spool files will be created under the user id
of user >sgeadmin<.

IMPORTANT NOTE: The daemons still have to be started by user >root<.

Do you want to install Grid Engine as admin user >sgeadmin< (y/n) [y] >>

Installing Grid Engine as admin user >sgeadmin<
Hit <RETURN> to continue >>


Step 6

Checking $SGE_ROOT directory
----------------------------

The Grid Engine root directory is not set!
Please enter a correct path for SGE_ROOT.

If this directory is not correct (e.g. it may contain an automounter
prefix) enter the correct path to this directory or hit <RETURN>
to use default [/sge] >>
Your $SGE_ROOT directory: /sge

Hit <RETURN> to continue >>

Step 7

Please enter your SGE_CELL directory or use the default [default] >>


Step 8

Checking hostname resolving
---------------------------

This hostname is known at qmaster as an administrative host.

Hit <RETURN> to continue >>

Step 9

Grid Engine JMX MBean server
----------------------------

In order to use the SGE Inspect or the Service Domain Manager (SDM)
SGE adapter you need to configure a JMX server in qmaster. Qmaster
will then load a Java Virtual Machine through a shared library.

Please give some basic parameters for JMX MBean server
We may ask for
   - JAVA_HOME
   - additional JVM arguments (optional)

Detecting suitable JAVA ...
Please enter JAVA_HOME or press enter [/usr/jdk/latest] >>
Please enter additional JVM arguments (optional, default is [-Xmx256m]) >>

Using the following JMX MBean server settings.
   libjvm_path              >/usr/jdk/latest/jre/lib/amd64/server/libjvm.so<
   Additional JVM arguments >-Xmx256m<

Do you want to use these data (y/n) [y] >>


Hit <RETURN> to continue >>

Step 10

Creating local configuration
----------------------------
sgeadmin@shadow1 modified "shadow1" in configuration list
Local configuration for host >shadow1< created.

Hit <RETURN> to continue >>

Step 11

shadow startup script
---------------------

Do you want to start shadowd automatically at machine boot?
NOTE: If you select "n" SMF will be not used at all! (y/n) [y] >> y


Hit <RETURN> to continue >>
 

Step 12

Starting sge_shadowd on host shadow1

Shadowhost installation completed!

서진우

Clunix, a supercomputing company / Managing Director (CTO) / Certified Information Systems Auditor / Operator of the Syszone blog


2 Responses

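  1. 서진우 says:

    The configuration itself is simple, but the system requirements for implementing the shadow feature are not clearly stated.
    For a realistic shadow setup, the environment needs three servers before any configuration is done: qmaster, qslave,
    and a spooldb server. In addition, there is no adequate redundancy plan for the case
    where the spooldb server goes down.

  2. 서진우 says:

    If you understand open HA technologies such as Heartbeat and the underlying mechanism by which SGE transfers the master role, you can build an SGE redundancy setup that is more realistic and works better than the shadow approach with only some simple script development. The key element of SGE redundancy is guaranteeing the integrity of the spool DB contents.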
