SGE Tuning guide
Excerpt: http://arc.liv.ac.uk/SGE/howto/tuning.html
Grid Engine is a full-function, general-purpose Distributed Resource Management (DRM) tool. The scheduler component in Grid Engine supports a wide range of different compute farm scenarios. To get the maximum performance from your compute environment it can be worthwhile to review which features are enabled and which are really needed to solve your load management problem. Disabling or enabling these features can have a performance benefit for the throughput of your cluster. Each feature notes in parentheses the version in which it was introduced. Unless otherwise stated, it is also available in later versions.
overall cluster tuning (V5.3 + V6.0)
Experience has shown that the use of NFS or similar shared file systems for distributing files required by Grid Engine can contribute significantly to both overall network load and file server load. Keeping such files local is therefore always beneficial for overall cluster throughput. The HOWTO "Reducing and Eliminating NFS usage by Grid Engine" shows different common choices for accomplishing this.
scheduler monitoring (V5.3 + V6.0)
Scheduler monitoring can be helpful to find out why certain jobs are not dispatched (displayed via qstat). However, providing this information for all jobs at all times can consume significant resources (memory and CPU time) and is usually not needed. To disable scheduler monitoring set schedd_job_info to false in the scheduler configuration. See sched_conf(5).
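A sketch of the change, assuming the usual interactive qconf workflow:

```shell
qconf -msconf          # opens the scheduler configuration in an editor
# In the editor, change the line:
#   schedd_job_info              true
# to:
#   schedd_job_info              false
```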
finished jobs (V5.3 + V6.0)
In case of array jobs the finished job list in qmaster can become quite big. Switching it off will save memory and speed up qstat commands because qstat also fetches the finished jobs list. Set finished_jobs to 0 in global configuration. See sge_conf(5).
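The corresponding change in the global cluster configuration can be sketched as:

```shell
qconf -mconf           # opens the global configuration in an editor (sge_conf(5))
# In the editor, set:
#   finished_jobs                0
```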
job verification (V5.3 + V6.0)
Forcing validation at job submission time can be a valuable tool to prevent non-dispatchable jobs from remaining in pending state forever. However, validating jobs can be time consuming, especially in heterogeneous environments with a variety of different execution nodes and consumable resources, and where every user has his own job profile. In homogeneous environments with only a couple of different job types, a general job validation usually can be omitted. Job verification is disabled by default and should only be enabled (qsub(1): -w [v|e|w]) when needed. (It is enabled by default with DRMAA.)
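For illustration, the verification level can be chosen per submission with the -w switch (myjob.sh is a placeholder script name):

```shell
# Warning-only verification: report problems but submit anyway
qsub -w w myjob.sh

# Explicitly disable verification (the qsub default)
qsub -w n myjob.sh
```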
load thresholds and suspend thresholds (V5.3 + V6.0)
Load thresholds are needed if you deliberately oversubscribe your machines, and you need a mechanism to prevent excessive system load. Suspend thresholds are also used for this. The other case in which load thresholds are needed is when the execution node is open for interactive load which is not under control of Grid Engine, and you want to prevent the node from being overloaded. If a compute farm is more single-purpose, e. g., each CPU at a compute node is represented by only one queue slot, and no interactive load is expected at these nodes, then load_thresholds can be omitted. To disable both thresholds set load_thresholds to none and suspend_thresholds to none. See queue_conf(5).
Starting with V6.0, load_thresholds are applicable to consumable resources as well (see queue_conf(5)). Using this feature will have a negative impact on scheduler performance.
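Disabling both thresholds for a queue can be sketched as follows ("all.q" is a placeholder queue name; queue_conf(5) conventionally uses the keyword NONE):

```shell
qconf -mq all.q        # opens the queue configuration in an editor
# In the editor, set:
#   load_thresholds              NONE
#   suspend_thresholds           NONE
```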
load adjustments (V5.3 + V6.0)
Load adjustments are used to virtually increase the measured load after a job has been dispatched. This mechanism is helpful in the case of oversubscribed machines in order to align with load thresholds. Load adjustments should be switched off if they are not needed, because they impose additional work on the scheduler in connection with sorting hosts and verifying load thresholds. To disable load adjustments set job_load_adjustments to none and load_adjustment_decay_time to 0 in the scheduler configuration. See sched_conf(5).
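A sketch of the scheduler configuration change (the decay time takes an h:m:s value):

```shell
qconf -msconf          # opens the scheduler configuration in an editor
# In the editor, set:
#   job_load_adjustments         NONE
#   load_adjustment_decay_time   0:0:0
```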
scheduling-on-demand (V5.3 + V6.0)
The default for Grid Engine is to start scheduling runs at a fixed scheduling interval (see schedule_interval in schedd_conf(5)). The good thing about fixed intervals is that they limit the CPU time consumption of the qmaster/scheduler. The bad thing is that they throttle the scheduler artificially, resulting in limited throughput. In many compute farms there are machines specifically dedicated to qmaster/scheduler, and in such setups there is no reason for throttling the scheduler. How many seconds one should use for flush times is difficult to say. It depends on the time the scheduler needs for a single run and the number of jobs in the system. A couple of test runs with scheduler profiling (add profile=1 to the params entry in schedd_conf(5)) should give one enough data to select a good value.
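Enabling the profiling output mentioned above can be sketched as:

```shell
qconf -msconf          # opens the scheduler configuration in an editor
# In the editor, enable per-run timing statistics:
#   params                       profile=1
# The timing data is written to the scheduler's messages file in the
# qmaster spool area (exact path depends on your installation).
```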
Scheduling-on-demand can be configured using the FLUSH_SUBMIT_SEC and FLUSH_FINISH_SEC settings in the schedd_params section of the global cluster configuration. See sge_conf(5).
If scheduling-on-demand is activated, the throughput of a compute farm is only limited by the power of the machine hosting qmaster/scheduler.
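As an illustration, assuming the schedd_params mechanism described above and an example flush delay of 4 seconds:

```shell
qconf -mconf           # opens the global configuration in an editor (sge_conf(5))
# In the editor, set (4 is an assumed example value in seconds):
#   schedd_params                FLUSH_SUBMIT_SEC=4,FLUSH_FINISH_SEC=4
```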
scheduler priority information (V6.0)
After every scheduling interval, the scheduler sends the calculated priority information (tickets, job priority, urgency) to the qmaster. This information is used to order the job output in “qstat -ext”, “-urg”, and “-pri”. The transfer of the information can be turned off by setting report_pjob_tickets to false in schedd_conf(5).
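The transfer can be switched off as follows:

```shell
qconf -msconf          # opens the scheduler configuration in an editor
# In the editor, set:
#   report_pjob_tickets          false
# Note: "qstat -ext", "-urg" and "-pri" will then show stale or empty
# priority columns.
```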
The scheduler contains different policy modules (See sge_priority(5)) to compute the importance of a job:
posix priority policy
waiting time policy
All policies are turned on by default. If one or two of them are not used, it is preferable to turn the policy off by setting its weighting factor to 0 in schedd_conf(5).
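For example, the two policies listed above could be disabled by zeroing their weights (parameter names as in sched_conf(5)):

```shell
qconf -msconf          # opens the scheduler configuration in an editor
# In the editor, set:
#   weight_priority              0.0
#   weight_waiting_time          0.0
```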
resource reservation (V6.0)
A new feature in version 6 is resource reservation to prevent the starvation of jobs with large resource requests. The scheduler configuration allows one to enable or disable this feature as well as to limit the number of jobs which will get a reservation. Turning this feature off, by setting max_reservation to 0 in schedd_conf(5), will have a positive impact on the scheduler run time.
If the resource reservation is needed, the number of jobs which will get a reservation from the scheduler should be as small as possible. This is done by setting a small number for max_reservation in schedd_conf(5).
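Both variants can be sketched as (32 is an assumed example value):

```shell
qconf -msconf          # opens the scheduler configuration in an editor
# Disable resource reservation entirely:
#   max_reservation              0
# Or, if reservation is needed, keep the count small:
#   max_reservation              32
```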
optimization of qmaster memory consumption
In clusters with large quantities of jobs a limiting factor is often the memory footprint required to store all job properties. Experience shows large parts of the memory occupied by the qmaster are used to store each job’s environment as specified via “-v variable_list” or “-V”. End users sometimes perceive it as convenient to simply use “-V”, even though it would have been entirely sufficient to inherit a handful of specific environment variables from the submission environment. Conscious and sparing use of job environment variables has been shown to greatly increase the maximum number of jobs that can be processed with a given amount of main memory by Grid Engine.
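For example, instead of exporting the full environment with "-V", only the variables the job actually needs can be forwarded (the variable list here is illustrative):

```shell
# Heavyweight: copies the entire submission environment into the job
qsub -V myjob.sh

# Lean alternative: forward only the variables the job needs
qsub -v PATH,LD_LIBRARY_PATH,MY_APP_HOME myjob.sh
```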
intentional use of “-b y” to disburden qmaster (V6.0)
By default Grid Engine qsub job submission sends the job script together with the job itself. Since version 6 the -b y option can be used to prevent job scripts from being sent, instead simply sending the path to the executable along with the job. This technique requires that the script be made available elsewhere, but in many cases the script is already available or can easily be made available by means of shared file systems. Use of -b y has a beneficial impact on cluster scalability because job scripts do not need to be stored on disk by the qmaster at submission time or be packed with the job when it is delivered to the execd.
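A minimal sketch; the path is a placeholder and must be reachable on the execution hosts, e.g. via a shared file system:

```shell
# Submit by path instead of transferring the script with the job
qsub -b y /share/apps/myjob.sh
```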
EXPERIMENTAL: job filter based on job classes (V6.0u1)
The job filter can be enabled by adding JC_FILTER=1 to the params field in schedd_conf(5). This feature is not documented and, if enabled, can lead to some minor problems in the system.
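The setting can be sketched as:

```shell
qconf -msconf          # opens the scheduler configuration in an editor
# In the editor, add to the "params" entry:
#   params                       JC_FILTER=1
```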
If enabled, the scheduler limits the number of jobs it looks at during a scheduling run. At the beginning of the scheduling run it assigns each job a specific category, based on the job's requests, priority settings, and the job owner. All scheduling policies assign the same importance to each job in a category. The jobs within a category therefore have a FIFO order, and the number of jobs examined per category can be limited to the number of free slots in the system.
An exception is jobs which request a resource reservation. They are included regardless of the number of jobs in a category.
This setting is turned off by default, because in very rare cases the scheduler can make a wrong decision. It is also advisable to turn report_pjob_tickets off when this feature is used; otherwise “qstat -ext” can report outdated ticket amounts. The information shown by “qstat -j <job_id>” for a job that was excluded from a scheduling run is very limited.
switching scheduler configurations (V6.0)
A new feature with Grid Engine V6.0 is the ability to store scheduler profiles, e. g. “qconf -ssconf > file”, such as those used during Grid Engine installation. The profiles are not stored internally. By combining dynamic changes of the scheduler configuration, loading a new profile with “qconf -Msconf <file>”, with a cron job, one can switch to a leaner configuration over night and return to a user-friendly configuration during the day.
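A sketch of such a setup; file names and switch times are illustrative:

```shell
# Save the current (user-friendly) profile once, then prepare a leaner
# variant with the tuning settings described in this guide:
qconf -ssconf > /opt/sge/profiles/day.sconf
cp /opt/sge/profiles/day.sconf /opt/sge/profiles/night.sconf
# (edit night.sconf: schedd_job_info false, max_reservation 0, ...)

# crontab entries: lean profile at 20:00, full profile at 08:00
# 0 20 * * * qconf -Msconf /opt/sge/profiles/night.sconf
# 0  8 * * * qconf -Msconf /opt/sge/profiles/day.sconf
```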