Most users know how to check the status of their CPUs, see how much memory is free or find out how much disk space is free. In contrast, keeping tabs on the health and status of GPUs has historically been more difficult. If you don’t know where to look, it can even be difficult to determine the type and capabilities of the GPUs in a system. Thankfully, NVIDIA’s latest hardware and software tools have made good improvements in this respect.
The tool is NVIDIA’s System Management Interface (nvidia-smi). Depending on the generation of your card, various levels of information can be gathered. Additionally, GPU configuration options (such as ECC memory capability) may be enabled and disabled.
As an aside, if you find that you’re having trouble getting your NVIDIA GPUs to run GPGPU code, nvidia-smi can be handy. For example, on some systems the proper NVIDIA devices in /dev are not created at boot. Running a simple nvidia-smi query as root will initialize all the cards and create the proper devices in /dev. Other times, it’s just useful to make sure all the GPU cards are visible and communicating properly. Here’s the default output from a recent version on a system with one Tesla K80 card (a dual-GPU board, so it is listed as two GPUs):
Tue Apr 7 12:56:41 2015
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:05:00.0     Off |                  Off |
| N/A   32C    P8    26W / 149W |     56MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:06:00.0     Off |                  Off |
| N/A   29C    P8    29W / 149W |     56MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Persistence Mode
On Linux, you can set GPUs to persistence mode to keep the NVIDIA driver loaded even when no applications are accessing the cards. This is particularly useful when you have a series of short jobs running. Persistence mode uses more power, but prevents the fairly long delays that occur each time a GPU application is started. It is also necessary if you’ve assigned specific clock speeds or power limits to the GPUs (as those changes are lost when the NVIDIA driver is unloaded). Enable persistence mode on all GPUs by running:
nvidia-smi -pm 1
On Windows, nvidia-smi is not able to set persistence mode. Instead, you need to set your computational GPUs to TCC mode. This should be done through NVIDIA’s graphical GPU device management panel.
GPUs supported by nvidia-smi
NVIDIA’s SMI tool supports essentially any NVIDIA GPU released since 2011. These include the Tesla, Quadro, GRID and GeForce devices from the Fermi and later architecture families (Kepler, Maxwell, etc.).
Supported products include:
Tesla: S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80
Quadro: 4000, 5000, 6000, 7000, M2070-Q, K-series, M-series
GeForce: varying levels of support, with fewer metrics available than on the Tesla and Quadro products
Querying GPU Status
Microway’s GPU Test Drive cluster, which we provide as a benchmarking service to our customers, contains a group of NVIDIA’s latest Tesla GPUs. These are NVIDIA’s high-performance compute GPUs and provide a good deal of health and status information. The examples below are taken from this internal cluster.
To list all available NVIDIA devices, run:
[root@md ~]# nvidia-smi -L
GPU 0: Tesla K40m (UUID: GPU-d0e093a0-c3b3-f458-5a55-6eb69fxxxxxx)
GPU 1: Tesla K40m (UUID: GPU-d105b085-7239-3871-43ef-975ecaxxxxxx)
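For scripted health checks, the device listing can be turned into structured data. The following is a minimal Python sketch, not part of nvidia-smi itself: the helper name parse_gpu_list and the regular expression are illustrative assumptions based only on the output format shown above, which may differ between driver versions.

```python
import re

# Pattern inferred from the sample output above (an assumption, not a
# documented format): "GPU <index>: <name> (UUID: <uuid>)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (GPU-[^)]+)\)$")


def parse_gpu_list(text):
    """Return a list of (index, name, uuid) tuples from `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


# Sample copied from the listing above.
SAMPLE_LIST = """\
GPU 0: Tesla K40m (UUID: GPU-d0e093a0-c3b3-f458-5a55-6eb69fxxxxxx)
GPU 1: Tesla K40m (UUID: GPU-d105b085-7239-3871-43ef-975ecaxxxxxx)
"""

gpus = parse_gpu_list(SAMPLE_LIST)
print(len(gpus), "GPUs visible")
```

On a live system, the text would come from running nvidia-smi -L (for example with subprocess.check_output) rather than from the embedded sample; a count lower than expected suggests a card that is not communicating.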
To list certain details about each GPU, try:
[root@md ~]# nvidia-smi --query-gpu=index,name,uuid,serial --format=csv
0, Tesla K40m, GPU-d0e093a0-c3b3-f458-5a55-6eb69fxxxxxx, 0323913xxxxxx
1, Tesla K40m, GPU-d105b085-7239-3871-43ef-975ecaxxxxxx, 0324214xxxxxx
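The CSV form is convenient for inventory scripts. As a rough sketch (the function name parse_query_csv is hypothetical, and the field list simply mirrors the query above), Python's standard csv module can turn each row into a dictionary. The sketch assumes the rows carry no header line, as in the sample shown; if your version prints one, skip the first row.

```python
import csv
import io

# Field names mirror the --query-gpu list used above.
FIELDS = ["index", "name", "uuid", "serial"]


def parse_query_csv(text, fields=FIELDS):
    """Convert header-less CSV rows from --query-gpu into dictionaries."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(fields, row)) for row in reader if row]


# Sample rows copied from the output above.
SAMPLE_CSV = """\
0, Tesla K40m, GPU-d0e093a0-c3b3-f458-5a55-6eb69fxxxxxx, 0323913xxxxxx
1, Tesla K40m, GPU-d105b085-7239-3871-43ef-975ecaxxxxxx, 0324214xxxxxx
"""

rows = parse_query_csv(SAMPLE_CSV)
for row in rows:
    print(row["index"], row["name"], row["serial"])
```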
Monitoring and Managing GPU Boost
The GPU Boost feature that NVIDIA includes with more recent GPUs allows the GPU clocks to vary with load, reaching maximum performance so long as power and thermal headroom are available. However, the amount of available headroom will vary by application (and even by input file!), so users and administrators should keep their eyes on the status of the GPUs.
A listing of available clock speeds can be shown for each GPU (in this case, the Tesla K80):
nvidia-smi -q -d SUPPORTED_CLOCKS
GPU 0000:04:00.0
    Supported Clocks
        Memory                  : 2505 MHz
            Graphics            : 875 MHz
            Graphics            : 862 MHz
            Graphics            : 849 MHz
            Graphics            : 836 MHz
            Graphics            : 823 MHz
            Graphics            : 810 MHz
            Graphics            : 797 MHz
            Graphics            : 784 MHz
            Graphics            : 771 MHz
            Graphics            : 758 MHz
            Graphics            : 745 MHz
            Graphics            : 732 MHz
            Graphics            : 719 MHz
            Graphics            : 705 MHz
            Graphics            : 692 MHz
            Graphics            : 679 MHz
            Graphics            : 666 MHz
            Graphics            : 653 MHz
            Graphics            : 640 MHz
            Graphics            : 627 MHz
            Graphics            : 614 MHz
            Graphics            : 601 MHz
            Graphics            : 588 MHz
            Graphics            : 575 MHz
            Graphics            : 562 MHz
        Memory                  : 324 MHz
            Graphics            : 324 MHz
The above output indicates that only two memory clock speeds are supported (2505 MHz and 324 MHz). With the memory running at 2505 MHz, there are 25 supported GPU clock speeds. With the memory running at 324 MHz, only a single GPU clock speed is supported (which is the idle GPU state). On the Tesla K80, GPU Boost automatically manages these speeds and runs as fast as possible. On other models, such as Tesla K40, the administrator must specifically select the desired GPU clock speed.
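The grouping of graphics clocks under each memory clock can be recovered programmatically. Here is a small Python sketch under the assumption that the output keeps the "Memory : N MHz" / "Graphics : N MHz" layout shown above; the function name parse_supported_clocks is made up for this example, and the sample is an abbreviated excerpt of the real listing.

```python
def parse_supported_clocks(text):
    """Return {memory_mhz: [graphics_mhz, ...]} from SUPPORTED_CLOCKS output."""
    clocks = {}
    current_memory = None
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not value.endswith("MHz"):
            continue  # skips headings and the "GPU 0000:04:00.0" line
        mhz = int(value.split()[0])
        if key == "Memory":
            # Each Memory line starts a new group of graphics clocks.
            current_memory = mhz
            clocks[current_memory] = []
        elif key == "Graphics" and current_memory is not None:
            clocks[current_memory].append(mhz)
    return clocks


# Abbreviated excerpt of the Tesla K80 listing above (most lines omitted).
SAMPLE_SUPPORTED_CLOCKS = """\
GPU 0000:04:00.0
    Supported Clocks
        Memory                  : 2505 MHz
            Graphics            : 875 MHz
            Graphics            : 862 MHz
            Graphics            : 562 MHz
        Memory                  : 324 MHz
            Graphics            : 324 MHz
"""

clocks = parse_supported_clocks(SAMPLE_SUPPORTED_CLOCKS)
for memory_mhz, graphics in clocks.items():
    print(memory_mhz, "MHz memory:", len(graphics), "graphics clock(s)")
```

Run against the full listing, the 2505 MHz entry would hold all 25 graphics clocks and the 324 MHz entry the single idle clock.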
To review the current GPU clock speed, default clock speed, and maximum possible clock speed, run:
nvidia-smi -q -d CLOCK
GPU 0000:04:00.0
    Clocks
        Graphics                : 875 MHz
        SM                      : 875 MHz
        Memory                  : 2505 MHz
    Applications Clocks
        Graphics                : 875 MHz
        Memory                  : 2505 MHz
    Default Applications Clocks
        Graphics                : 562 MHz
        Memory                  : 2505 MHz
    Max Clocks
        Graphics                : 875 MHz
        SM                      : 875 MHz
        Memory                  : 2505 MHz
    SM Clock Samples
        Duration                : 3730.56 sec
        Number of Samples       : 8
        Max                     : 875 MHz
        Min                     : 324 MHz
        Avg                     : 873 MHz
    Memory Clock Samples
        Duration                : 3730.56 sec
        Number of Samples       : 8
        Max                     : 2505 MHz
        Min                     : 324 MHz
        Avg                     : 2500 MHz
    Clock Policy
        Auto Boost              : On
        Auto Boost Default      : On
Ideally, you’d like all clocks to be running at the highest speed all the time. However, this will not be possible for all applications. To review the current state of each GPU and any reasons for clock slowdowns, use the PERFORMANCE flag:
nvidia-smi -q -d PERFORMANCE
GPU 0000:04:00.0
    Performance State           : P0
    Clocks Throttle Reasons
        Idle                    : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
If any of the GPU clocks is running at a slower speed, one or more of the above Clocks Throttle Reasons will be marked as active. The most concerning condition is when HW Slowdown or Unknown is active, as either most likely indicates a power or cooling issue. The remaining conditions typically indicate that the card is idle or has been manually set into a slower mode by a system administrator.
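A monitoring script can apply the same rule automatically. The sketch below is illustrative only: the function names and the CONCERNING set are assumptions that encode the guidance above, and the parser relies on the "Clocks Throttle Reasons" layout shown in the PERFORMANCE output. The sample is taken from an idle Tesla K80.

```python
# Reasons the text above singles out as likely power or cooling problems.
CONCERNING = {"HW Slowdown", "Unknown"}


def active_throttle_reasons(text):
    """List the throttle reasons marked 'Active' in PERFORMANCE output."""
    reasons = []
    in_section = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Clocks Throttle Reasons":
            in_section = True
            continue
        if in_section:
            name, sep, state = stripped.partition(":")
            if not sep:
                break  # a line without ':' starts the next section
            if state.strip() == "Active":
                reasons.append(name.strip())
    return reasons


def needs_attention(reasons):
    """True if any active reason points to a hardware-level problem."""
    return any(reason in CONCERNING for reason in reasons)


# Throttle section of an idle Tesla K80.
SAMPLE_PERFORMANCE = """\
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Active
        Applications Clocks Setting : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    FB Memory Usage
"""

reasons = active_throttle_reasons(SAMPLE_PERFORMANCE)
print(reasons, "-> investigate" if needs_attention(reasons) else "-> ok")
```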
Reviewing System Topology
To properly take advantage of more advanced NVIDIA GPU features (such as GPUDirect), it is often vital that the system topology be properly configured. The topology refers to how the PCI-Express devices (GPUs, InfiniBand HCAs, storage controllers, etc.) connect to each other and to the system’s CPUs. If the topology is not correct, certain features may slow down or even stop working altogether. To help tackle such questions, recent versions of nvidia-smi include an experimental system topology view:
nvidia-smi topo --matrix
        GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0     X      PIX     PHB     PHB     PHB     0-11
GPU1    PIX      X      PHB     PHB     PHB     0-11
GPU2    PHB     PHB      X      PIX     PHB     0-11
GPU3    PHB     PHB     PIX      X      PHB     0-11
mlx4_0  PHB     PHB     PHB     PHB      X
Legend:
  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch
Reviewing this section will take some getting used to, but can be very valuable. The above configuration shows two Tesla K80 cards (four GPUs, each pair sharing a PCIe switch) and one Mellanox FDR InfiniBand HCA, all connected to the first CPU of a server. Because the CPUs are 12-core Xeons, the topology tool recommends that jobs be assigned to the first 12 CPU cores (although this will vary by application). Get in touch with one of our HPC GPU experts if you have questions on this topic.
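To make the matrix easier to reason about, it can be loaded into a lookup table. This is only a sketch: the topology view is experimental and its layout may change, and the helpers parse_topology and same_switch_pairs are hypothetical names. The parser assumes the device columns end where the "CPU Affinity" heading begins.

```python
def parse_topology(text):
    """Return {(device_a, device_b): link_type} from `topo --matrix` output."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = rows[0]
    # Device columns stop at the two-word "CPU Affinity" heading.
    devices = header[:header.index("CPU")] if "CPU" in header else header
    links = {}
    for row in rows[1:]:
        if row[0] not in devices:
            break  # reached the legend
        for other, link in zip(devices, row[1:]):
            links[(row[0], other)] = link
    return links


def same_switch_pairs(links):
    """Device pairs whose path crosses only a single PCIe switch (PIX)."""
    pairs = {tuple(sorted(pair)) for pair, link in links.items() if link == "PIX"}
    return sorted(pairs)


# Matrix copied from the output above.
SAMPLE_TOPOLOGY = """\
        GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0     X      PIX     PHB     PHB     PHB     0-11
GPU1    PIX      X      PHB     PHB     PHB     0-11
GPU2    PHB     PHB      X      PIX     PHB     0-11
GPU3    PHB     PHB     PIX      X      PHB     0-11
mlx4_0  PHB     PHB     PHB     PHB      X
Legend:
  X   = Self
"""

links = parse_topology(SAMPLE_TOPOLOGY)
print(same_switch_pairs(links))
```

For this system the PIX pairs correspond to the two GPUs on each K80 card, while every path to the InfiniBand HCA crosses the host bridge (PHB).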
Printing all GPU Details
To list all available data on a particular GPU, specify the ID of the card with -i. Here’s the output from an older Tesla GPU card:
nvidia-smi -i 0 -q
==============NVSMI LOG==============
Timestamp                       : Mon Dec 5 22:05:49 2011
Driver Version                  : 270.41.19
Attached GPUs                   : 2
GPU 0:2:0
    Product Name                : Tesla M2090
    Display Mode                : Disabled
    Persistence Mode            : Disabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 032251100xxxx
    GPU UUID                    : GPU-2b1486407f70xxxx-98bdxxxx-660cxxxx-1d6cxxxx-9fbd7e7cd9bf55a7cfb2xxxx
    Inforom Version
        OEM Object              : 1.1
        ECC Object              : 2.0
        Power Management Object : 4.0
    PCI
        Bus                     : 2
        Device                  : 0
        Domain                  : 0
        Device Id               : 109110DE
        Bus Id                  : 0:2:0
    Fan Speed                   : N/A
    Memory Usage
        Total                   : 5375 Mb
        Used                    : 9 Mb
        Free                    : 5365 Mb
    Compute Mode                : Default
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
            Double Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
        Aggregate
            Single Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
            Double Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
    Temperature
        Gpu                     : N/A
    Power Readings
        Power State             : P12
        Power Management        : Supported
        Power Draw              : 31.57 W
        Power Limit             : 225 W
    Clocks
        Graphics                : 50 MHz
        SM                      : 100 MHz
        Memory                  : 135 MHz
The above example shows an idle card. Here is an excerpt for a card running GPU-accelerated AMBER:
nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,COMPUTE
==============NVSMI LOG==============
Timestamp                       : Mon Dec 5 22:32:00 2011
Driver Version                  : 270.41.19
Attached GPUs                   : 2
GPU 0:2:0
    Memory Usage
        Total                   : 5375 Mb
        Used                    : 1904 Mb
        Free                    : 3470 Mb
    Compute Mode                : Default
    Utilization
        Gpu                     : 67 %
        Memory                  : 42 %
    Power Readings
        Power State             : P0
        Power Management        : Supported
        Power Draw              : 109.83 W
        Power Limit             : 225 W
    Clocks
        Graphics                : 650 MHz
        SM                      : 1301 MHz
        Memory                  : 1848 MHz
Unfortunately, as the output shows, the earlier passively-cooled M-series Tesla GPUs do not report temperatures to nvidia-smi. More recent Quadro and Tesla GPUs report a wider range of metrics:
==============NVSMI LOG==============
Timestamp                       : Tue Apr 7 13:01:34 2015
Driver Version                  : 346.46
Attached GPUs                   : 2
GPU 0000:05:00.0
    Product Name                : Tesla K80
    Product Brand               : Tesla
    Display Mode                : Disabled
    Display Active              : Disabled
    Persistence Mode            : Enabled
    Accounting Mode             : Enabled
    Accounting Mode Buffer Size : 128
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 0324614xxxxxx
    GPU UUID                    : GPU-81dexxxx-87xx-4axx-79xx-3ddf4dxxxxxx
    Minor Number                : 0
    VBIOS Version               : 80.21.1B.00.01
    MultiGPU Board              : Yes
    Board ID                    : 0x300
    Inforom Version
        Image Version           : 2080.0200.00.04
        OEM Object              : 1.1
        ECC Object              : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current                 : N/A
        Pending                 : N/A
    PCI
        Bus                     : 0x05
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x102D10DE
        Bus Id                  : 0000:05:00.0
        Sub System Id           : 0x106C10DE
        GPU Link Info
            PCIe Generation
                Max             : 3
                Current         : 1
            Link Width
                Max             : 16x
                Current         : 16x
        Bridge Chip
            Type                : PLX
            Firmware            : 0xF0472900
        Replays since reset     : 0
        Tx Throughput           : N/A
        Rx Throughput           : N/A
    Fan Speed                   : N/A
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Active
        Applications Clocks Setting : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    FB Memory Usage
        Total                   : 12287 MiB
        Used                    : 56 MiB
        Free                    : 12231 MiB
    BAR1 Memory Usage
        Total                   : 16384 MiB
        Used                    : 2 MiB
        Free                    : 16382 MiB
    Compute Mode                : Default
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
        Encoder                 : 0 %
        Decoder                 : 0 %
    Ecc Mode
        Current                 : Disabled
        Pending                 : Disabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Texture Memory  : N/A
                Total           : N/A
            Double Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Texture Memory  : N/A
                Total           : N/A
        Aggregate
            Single Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Texture Memory  : N/A
                Total           : N/A
            Double Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Texture Memory  : N/A
                Total           : N/A
    Retired Pages
        Single Bit ECC          : 0
        Double Bit ECC          : 0
        Pending                 : No
    Temperature
        GPU Current Temp        : 34 C
        GPU Shutdown Temp       : 93 C
        GPU Slowdown Temp       : 88 C
    Power Readings
        Power Management        : Supported
        Power Draw              : 25.65 W
        Power Limit             : 149.00 W
        Default Power Limit     : 149.00 W
        Enforced Power Limit    : 149.00 W
        Min Power Limit         : 100.00 W
        Max Power Limit         : 175.00 W
    Clocks
        Graphics                : 324 MHz
        SM                      : 324 MHz
        Memory                  : 324 MHz
    Applications Clocks
        Graphics                : 875 MHz
        Memory                  : 2505 MHz
    Default Applications Clocks
        Graphics                : 562 MHz
        Memory                  : 2505 MHz
    Max Clocks
        Graphics                : 875 MHz
        SM                      : 875 MHz
        Memory                  : 2505 MHz
    Clock Policy
        Auto Boost              : On
        Auto Boost Default      : On
    Processes                   : None
Of course, we haven’t covered all the possible uses of the nvidia-smi tool. To read the full list of options, run nvidia-smi -h (it’s fairly lengthy). If you need to change settings on your cards, you’ll want to look at the device modification section:
    -pm,  --persistence-mode=   Set persistence mode: 0/DISABLED, 1/ENABLED
    -e,   --ecc-config=         Toggle ECC support: 0/DISABLED, 1/ENABLED
    -p,   --reset-ecc-errors=   Reset ECC error counts: 0/VOLATILE, 1/AGGREGATE
    -c,   --compute-mode=       Set MODE for compute applications:
                                0/DEFAULT, 1/EXCLUSIVE_THREAD,
                                2/PROHIBITED, 3/EXCLUSIVE_PROCESS
          --gom=                Set GPU Operation Mode:
                                0/ALL_ON, 1/COMPUTE, 2/LOW_DP
    -r    --gpu-reset           Trigger reset of the GPU.
                                Can be used to reset the GPU HW state in situations
                                that would otherwise require a machine reboot.
                                Typically useful if a double bit ECC error has
                                occurred.
                                Reset operations are not guarenteed to work in
                                all cases and should be used with caution.
                                --id= switch is mandatory for this switch
    -ac   --applications-clocks= Specifies <memory,graphics> clocks as a
                                pair (e.g. 2000,800) that defines GPU's
                                speed in MHz while running applications on a GPU.
    -rac  --reset-applications-clocks
                                Resets the applications clocks to the default values.
    -acp  --applications-clocks-permission=
                                Toggles permission requirements for -ac and -rac commands:
                                0/UNRESTRICTED, 1/RESTRICTED
    -pl   --power-limit=        Specifies maximum power management limit in watts.
    -am   --accounting-mode=    Enable or disable Accounting Mode: 0/DISABLED, 1/ENABLED
    -caa  --clear-accounted-apps
                                Clears all the accounted PIDs in the buffer.
          --auto-boost-default= Set the default auto boost policy to 0/DISABLED
                                or 1/ENABLED, enforcing the change only after the
                                last boost client has exited.
          --auto-boost-permission=
                                Allow non-admin/root control over auto boost mode:
                                0/UNRESTRICTED, 1/RESTRICTED
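As an example of using these options from a script, the sketch below builds (but does not run) an nvidia-smi command line for the -ac option, refusing any clock pair that the card does not list as supported. The function name build_app_clock_command and the shape of the supported dictionary (memory clock mapped to a list of graphics clocks, as reported by nvidia-smi -q -d SUPPORTED_CLOCKS) are assumptions made for illustration.

```python
def build_app_clock_command(gpu_id, memory_mhz, graphics_mhz, supported):
    """Return the argv list for `nvidia-smi -i <id> -ac <memory,graphics>`.

    `supported` maps each memory clock (MHz) to the graphics clocks (MHz)
    the GPU accepts at that memory clock.
    """
    if graphics_mhz not in supported.get(memory_mhz, []):
        raise ValueError(
            "unsupported clock pair: %d,%d" % (memory_mhz, graphics_mhz))
    return ["nvidia-smi", "-i", str(gpu_id),
            "-ac", "%d,%d" % (memory_mhz, graphics_mhz)]


# Hypothetical subset of the Tesla K80 clocks listed earlier in this post.
SUPPORTED = {2505: [875, 862, 562], 324: [324]}

command = build_app_clock_command(0, 2505, 875, SUPPORTED)
print(" ".join(command))
```

The resulting list can be passed to subprocess.check_call. Depending on the -acp setting, changing application clocks may require root, and the setting is lost when the driver unloads unless persistence mode is enabled.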
With this tool, checking the status and health of NVIDIA GPUs is simple. If you’re looking to monitor the cards over time, then nvidia-smi might be more resource-intensive than you’d like. For that, have a look at NVIDIA’s GPU Management Library (NVML), which offers C, Perl and Python bindings. Commonly-used cluster tools, such as Ganglia, use these bindings to query GPU status.
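As a starting point for NVML-based monitoring, here is a hedged Python sketch. It assumes the pynvml bindings are installed separately, and that their function names mirror the NVML C API as used below. The helper sample_gpus takes the module as an argument so that the logic can be exercised without a GPU; both sample_gpus and main are names invented for this example.

```python
def sample_gpus(nvml):
    """Take one reading per GPU using an NVML binding module such as pynvml."""
    nvml.nvmlInit()
    try:
        readings = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            readings.append({
                "index": index,
                "temp_c": nvml.nvmlDeviceGetTemperature(
                    handle, nvml.NVML_TEMPERATURE_GPU),
                # NVML reports power draw in milliwatts.
                "power_w": nvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,
                "gpu_util_pct": nvml.nvmlDeviceGetUtilizationRates(handle).gpu,
            })
        return readings
    finally:
        # Always release NVML, even if a query fails.
        nvml.nvmlShutdown()


def main():
    """Print one sample per GPU; requires pynvml and an NVIDIA driver."""
    import pynvml  # third-party binding, assumed to be installed
    for reading in sample_gpus(pynvml):
        print(reading)
```

Calling main() on a GPU system prints one dictionary per device. Some metrics are not available on every card (as with the M-series temperatures above), in which case the binding raises an error that a production script would need to catch.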
This post was last updated on 2015-07-07