XenServer Hosts with Broadcom 10GbE Crash because of SWIOTLB Exhaustion

Symptoms or Error

On XenServer hosts with Broadcom NetXtreme II 10GbE NICs (bnx2x driver) and jumbo frames (MTU=9000) enabled, the control domain (dom0) might crash because of Software IOMMU Translation Lookaside Buffer (SWIOTLB) exhaustion.

Example:
In /var/log/kern.log, when a Virtual Machine (VM) was being started (domain ID 174 in this case), dom0 ran out of SWIOTLB space and triggered an unrecoverable kernel panic, crashing the host.

The "Out of SW-IOMMU space" errors originated from the following bnx2x PCI devices. The two NICs formed a bond dedicated to storage traffic, and jumbo frames (MTU 9000) were enabled on that network:

  1. 0000:04:00.0
  2. 0000:04:00.1
Sep 17 05:00:53 bn2-3c16-ix-4104 kernel: [2777292.770268] /local/domain/174/device/vif/0: Initialising
Sep 17 05:00:53 bn2-3c16-ix-4104 kernel: [2777292.770364] /local/domain/174/device/vif/0: Initialising
Sep 17 05:00:53 bn2-3c16-ix-4104 kernel: [2777293.185673] device vif174.0 entered promiscuous mode
Sep 17 05:00:54 bn2-3c16-ix-4104 kernel: [2777293.656203] device tap174.0 entered promiscuous mode
Sep 17 05:01:08 bn2-3c16-ix-4104 kernel: [2777308.547867] blkback: event-channel 6
Sep 17 05:01:08 bn2-3c16-ix-4104 kernel: [2777308.548169] blkback: ring-ref 8
Sep 17 05:01:08 bn2-3c16-ix-4104 kernel: [2777308.548451] blkback: protocol 1 (x86_32-abi)
Sep 17 05:01:09 bn2-3c16-ix-4104 kernel: [2777308.567878] blkback: event-channel 7
Sep 17 05:01:09 bn2-3c16-ix-4104 kernel: [2777308.568295] blkback: ring-ref 9
Sep 17 05:01:09 bn2-3c16-ix-4104 kernel: [2777308.568538] blkback: protocol 1 (x86_32-abi)
Sep 17 05:01:17 bn2-3c16-ix-4104 kernel: [2777317.259514] /local/domain/174/device/vif/0: Initialising
Sep 17 05:01:17 bn2-3c16-ix-4104 kernel: [2777317.262401] /local/domain/174/device/vif/0: Closing
Sep 17 05:01:17 bn2-3c16-ix-4104 kernel: [2777317.404232] /local/domain/174/device/vif/0: Closed
Sep 17 05:01:17 bn2-3c16-ix-4104 kernel: [2777317.520978] /local/domain/174/device/vif/0: Initialising
Sep 17 05:01:17 bn2-3c16-ix-4104 kernel: [2777317.520983] frontend_changed: backend/vif/174/0: prepare for reconnect
Sep 17 05:01:17 bn2-3c16-ix-4104 kernel: [2777317.531581] /local/domain/174/device/vif/0: Connected
Sep 17 05:01:18 bn2-3c16-ix-4104 kernel: [2777317.679414] device vif174.0 entered promiscuous mode
Sep 17 05:05:06 bn2-3c16-ix-4104 kernel: [2777545.972710] PCI-DMA: Out of SW-IOMMU space for 25158 bytes at device 0000:04:00.0
Sep 17 05:05:14 bn2-3c16-ix-4104 kernel: [2777553.832648] PCI-DMA: Out of SW-IOMMU space for 16918 bytes at device 0000:04:00.0
Sep 17 05:07:29 bn2-3c16-ix-4104 kernel: [2777688.580119] PCI-DMA: Out of SW-IOMMU space for 17574 bytes at device 0000:04:00.0
Sep 17 05:24:16 bn2-3c16-ix-4104 kernel: [2778695.616522] PCI-DMA: Out of SW-IOMMU space for 24086 bytes at device 0000:04:00.1
Sep 17 05:32:52 bn2-3c16-ix-4104 kernel: [2779211.511111] PCI-DMA: Out of SW-IOMMU space for 14654 bytes at device 0000:04:00.1
Sep 17 05:40:30 bn2-3c16-ix-4104 kernel: [2779670.428243] PCI-DMA: Out of SW-IOMMU space for 30714 bytes at device 0000:04:00.1
Sep 17 05:57:48 bn2-3c16-ix-4104 kernel: [2780708.325217] PCI-DMA: Out of SW-IOMMU space for 17062 bytes at device 0000:04:00.1
Sep 17 06:00:38 bn2-3c16-ix-4104 kernel: [2780877.454820] PCI-DMA: Out of SW-IOMMU space for 17014 bytes at device 0000:04:00.1
Sep 17 06:00:38 bn2-3c16-ix-4104 kernel: [2780877.618497] PCI-DMA: Out of SW-IOMMU space for 21954 bytes at device 0000:04:00.1
Sep 17 06:01:05 bn2-3c16-ix-4104 kernel: [2780905.206857] PCI-DMA: Out of SW-IOMMU space for 16918 bytes at device 0000:04:00.1
Sep 17 06:10:01 bn2-3c16-ix-4104 kernel: [2781441.110785] PCI-DMA: Out of SW-IOMMU space for 19034 bytes at device 0000:04:00.1
Sep 17 06:12:22 bn2-3c16-ix-4104 kernel: [2781581.611895] PCI-DMA: Out of SW-IOMMU space for 15910 bytes at device 0000:04:00.1
Sep 17 06:22:00 bn2-3c16-ix-4104 kernel: [2782159.541557] PCI-DMA: Out of SW-IOMMU space for 25206 bytes at device 0000:04:00.1
Sep 17 06:22:02 bn2-3c16-ix-4104 kernel: [2782161.784505] PCI-DMA: Out of SW-IOMMU space for 17014 bytes at device 0000:04:00.1
Sep 17 06:27:52 bn2-3c16-ix-4104 kernel: [2782511.767750] PCI-DMA: Out of SW-IOMMU space for 17578 bytes at device 0000:04:00.1
Sep 17 06:28:14 bn2-3c16-ix-4104 kernel: [2782534.142746] PCI-DMA: Out of SW-IOMMU space for 17062 bytes at device 0000:04:00.1
Sep 17 06:30:30 bn2-3c16-ix-4104 kernel: [2782669.639121] PCI-DMA: Out of SW-IOMMU space for 19034 bytes at device 0000:04:00.1
Sep 17 06:50:17 bn2-3c16-ix-4104 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 17 06:50:17 bn2-3c16-ix-4104 kernel: [    0.000000] Reserving virtual address space above 0xfb400000
Sep 17 06:50:17 bn2-3c16-ix-4104 kernel: [    0.000000] Linux version 2.6.32.43-0.4.1.xs1.8.0.853.170791xen (geeko@buildhost) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-51)) #1 SMP Mon Mar 3 06:36:39 EST 2014
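To check whether a host is already logging this condition, search the kernel log for the exhaustion message shown above (a minimal check, assuming the default XenServer log location):

    grep "Out of SW-IOMMU space" /var/log/kern.log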

Similar information can be found in /var/crash/<timestamp>/dom0.log from the crash dump: starting the guest (domain ID 174) triggered the SWIOTLB exhaustion, which in turn caused the kernel panic that crashed the host.

<4>[2777292.770268] /local/domain/174/device/vif/0: Initialising
<4>[2777292.770364] /local/domain/174/device/vif/0: Initialising
<6>[2777293.185673] device vif174.0 entered promiscuous mode
<6>[2777293.656203] device tap174.0 entered promiscuous mode
<6>[2777308.547867] blkback: event-channel 6
<6>[2777308.548169] blkback: ring-ref 8
<6>[2777308.548451] blkback: protocol 1 (x86_32-abi)
<6>[2777308.567878] blkback: event-channel 7
<6>[2777308.568295] blkback: ring-ref 9
<6>[2777308.568538] blkback: protocol 1 (x86_32-abi)
<4>[2777317.259514] /local/domain/174/device/vif/0: Initialising
<4>[2777317.262401] /local/domain/174/device/vif/0: Closing
<4>[2777317.404232] /local/domain/174/device/vif/0: Closed
<4>[2777317.520978] /local/domain/174/device/vif/0: Initialising
<6>[2777317.520983] frontend_changed: backend/vif/174/0: prepare for reconnect
<4>[2777317.531581] /local/domain/174/device/vif/0: Connected
<6>[2777317.679414] device vif174.0 entered promiscuous mode
<3>[2777545.972710] PCI-DMA: Out of SW-IOMMU space for 25158 bytes at device 0000:04:00.0
<3>[2777553.832648] PCI-DMA: Out of SW-IOMMU space for 16918 bytes at device 0000:04:00.0
<3>[2777688.580119] PCI-DMA: Out of SW-IOMMU space for 17574 bytes at device 0000:04:00.0
<3>[2778695.616522] PCI-DMA: Out of SW-IOMMU space for 24086 bytes at device 0000:04:00.1
<3>[2779211.511111] PCI-DMA: Out of SW-IOMMU space for 14654 bytes at device 0000:04:00.1
<3>[2779670.428243] PCI-DMA: Out of SW-IOMMU space for 30714 bytes at device 0000:04:00.1
<3>[2780708.325217] PCI-DMA: Out of SW-IOMMU space for 17062 bytes at device 0000:04:00.1
<3>[2780877.454820] PCI-DMA: Out of SW-IOMMU space for 17014 bytes at device 0000:04:00.1
<3>[2780877.618497] PCI-DMA: Out of SW-IOMMU space for 21954 bytes at device 0000:04:00.1
<3>[2780905.206857] PCI-DMA: Out of SW-IOMMU space for 16918 bytes at device 0000:04:00.1
<3>[2781441.110785] PCI-DMA: Out of SW-IOMMU space for 19034 bytes at device 0000:04:00.1
<3>[2781581.611895] PCI-DMA: Out of SW-IOMMU space for 15910 bytes at device 0000:04:00.1
<3>[2782159.541557] PCI-DMA: Out of SW-IOMMU space for 25206 bytes at device 0000:04:00.1
<3>[2782161.784505] PCI-DMA: Out of SW-IOMMU space for 17014 bytes at device 0000:04:00.1
<3>[2782511.767750] PCI-DMA: Out of SW-IOMMU space for 17578 bytes at device 0000:04:00.1
<3>[2782534.142746] PCI-DMA: Out of SW-IOMMU space for 17062 bytes at device 0000:04:00.1
<3>[2782669.639121] PCI-DMA: Out of SW-IOMMU space for 19034 bytes at device 0000:04:00.1
<3>[2783556.069976] PCI-DMA: Out of SW-IOMMU space for 33302 bytes at device 0000:04:00.1
<0>[2783556.069986] Kernel panic - not syncing: DMA: Random memory could be DMA read
<0>[2783556.069987]
<4>[2783556.069994] Pid: 1196, comm: netback/2 Not tainted 2.6.32.43-0.4.1.xs1.8.0.853.170791xen #1
<4>[2783556.069996] Call Trace:
<4>[2783556.070003] [<c01346cb>] panic+0x4b/0x150
<4>[2783556.070007] [<c026c228>] swiotlb_full+0x68/0x80
<4>[2783556.070009] [<c026c518>] swiotlb_map_page+0x108/0x120
<4>[2783556.070012] [<c026c410>] ? swiotlb_map_page+0x0/0x120
<4>[2783556.070040] [<f3d7247f>] bnx2x_start_xmit+0x1bf/0x1a70 [bnx2x]
<4>[2783556.070045] [<c03534e0>] ? skb_gso_segment+0xc0/0x250
<4>[2783556.070049] [<c02fa08d>] ? netbk_wake_queue+0x2d/0x60
<4>[2783556.070054] [<c02fa1ab>] ? netbk_p0_event+0x1b/0x20
<4>[2783556.070056] [<c02f8e8f>] ? netbk_int+0x4f/0x70
<4>[2783556.070060] [<c0105e01>] ? show_interrupts+0x481/0x510
<4>[2783556.070063] [<c03538b6>] dev_hard_start_xmit+0x246/0x490
<4>[2783556.070077] [<c0366c9d>] sch_direct_xmit+0x17d/0x200
<4>[2783556.070080] [<c02f4804>] ? netif_idx_release+0x54/0x70
<4>[2783556.070082] [<c03570c1>] dev_queue_xmit+0x291/0x4b0
<4>[2783556.070089] [<f555b62e>] netdev_send+0x5e/0x350 [openvswitch_mod]
<4>[2783556.070093] [<c0193386>] ? free_hot_page+0x26/0x50
<4>[2783556.070099] [<f5558d02>] ovs_vport_send+0x12/0x50 [openvswitch_mod]
<4>[2783556.070103] [<f5550371>] do_output+0x21/0x40 [openvswitch_mod]
<4>[2783556.070107] [<f5550853>] do_execute_actions+0x4c3/0x710 [openvswitch_mod]
<4>[2783556.070109] [<c034a32d>] ? __kfree_skb+0x3d/0x90
<4>[2783556.070124] [<f3d68178>] ? bnx2x_free_tx_pkt+0x1e8/0x2b0 [bnx2x]
<4>[2783556.070131] [<f555bfc1>] ? flex_array_get+0x51/0x70 [openvswitch_mod]
<4>[2783556.070135] [<f5550b74>] ovs_execute_actions+0x74/0xd0 [openvswitch_mod]
<4>[2783556.070140] [<f5552394>] ovs_dp_process_received_packet+0x54/0xf0 [openvswitch_mod]
<4>[2783556.070145] [<c0125635>] ? __wake_up+0x45/0x60
<4>[2783556.070151] [<f5559525>] ovs_vport_receive+0x75/0x90 [openvswitch_mod]
<4>[2783556.070157] [<f555b33f>] netdev_frame_hook+0x4f/0x90 [openvswitch_mod]
<4>[2783556.070160] [<c035290b>] netif_receive_skb+0x1bb/0x6a0
<4>[2783556.070163] [<c016a9d5>] ? handle_IRQ_event+0x55/0x180
<4>[2783556.070165] [<c016dd7d>] ? move_masked_irq+0x1d/0xc0
<4>[2783556.070170] [<c0125635>] ? __wake_up+0x45/0x60
<4>[2783556.070173] [<c0356427>] process_backlog+0x97/0xf0
<4>[2783556.070176] [<c0356155>] net_rx_action+0x155/0x260
<4>[2783556.070180] [<c013a8e1>] __do_softirq+0xd1/0x220
<4>[2783556.070183] [<c019775a>] ? put_page+0x3a/0xf0
<4>[2783556.070186] [<c013aaa5>] do_softirq+0x75/0x80
<4>[2783556.070188] [<c03533cf>] netif_rx_ni+0x1f/0x30
<4>[2783556.070190] [<c02f5a22>] net_tx_action+0xfe2/0x1870
<4>[2783556.070199] [<c0126087>] ? update_curr+0x77/0x140
<4>[2783556.070205] [<c02f703e>] netbk_action_thread+0x9e/0x210
<4>[2783556.070208] [<c014e8f0>] ? autoremove_wake_function+0x0/0x50
<4>[2783556.070211] [<c02f6fa0>] ? netbk_action_thread+0x0/0x210
<4>[2783556.070213] [<c014e604>] kthread+0x74/0x80
<4>[2783556.070216] [<c014e590>] ? kthread+0x0/0x80
<4>[2783556.070218] [<c01048ab>] kernel_thread_helper+0x7/0x10

Solution

To resolve this issue, enable Generic Receive Offload (GRO) for Broadcom NetXtreme II 10GbE (bnx2x) NICs.

Note: If bnx2x NICs are bonded, you must be running XenServer 6.2.0 SP1 with hotfix XS62ESP1004 or later. In addition, apply the changes to the NICs comprising the bond rather than directly to the bond.
In the following example, the bond dedicated to storage traffic comprises eth4 and eth5 across the pool.

  1. Before proceeding, ensure that you change the one-liners so that they apply to the correct NICs.
  2. Run the following commands on the pool master (preferred). You can confirm the changes with the verification commands shown after this list:
    for pifs in $(xe pif-list device=eth4 VLAN=-1 params=uuid | grep uuid | awk '{ print $5 }'); do xe pif-param-set other-config:ethtool-gro="on" uuid=$pifs; done
    for pifs in $(xe pif-list device=eth5 VLAN=-1 params=uuid | grep uuid | awk '{ print $5 }'); do xe pif-param-set other-config:ethtool-gro="on" uuid=$pifs; done
  3. Schedule a reboot of the hosts in the pool for the change to take effect.
  4. After rebooting, check whether GRO is enabled:
    ethtool -k <interface>
    ethtool -k <bond>

    If GRO is on, you should see:
    generic-receive-offload: on
    Note: Citrix does not recommend the following approach: using ethtool -K <interface> gro on to apply the change directly to the NICs and using /etc/rc.local to make it persistent across reboots.
    Note: If you are planning to upgrade to XenServer 6.5, ensure that you revert the changes by clearing the "other-config" key, because GRO is on by default in XenServer 6.5.
  5. (Optional) Increase the SWIOTLB size from 64 MB (default) to 128 MB by passing swiotlb=128 as a dom0 boot parameter, then reboot:
    # /opt/xensource/libexec/xen-cmdline --set-dom0 swiotlb=128
    Note: Increasing the SWIOTLB size does not fix the issue; it only delays exhaustion to some extent.
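After applying the changes, you can read the values back from the CLI. A minimal verification sketch, assuming the same eth4/eth5 example as above (xen-cmdline's --get-dom0 option is used here on the assumption that it mirrors --set-dom0):

    # Confirm the ethtool-gro key on each PIF of a bond member
    for pifs in $(xe pif-list device=eth4 VLAN=-1 params=uuid | grep uuid | awk '{ print $5 }'); do xe pif-param-list uuid=$pifs | grep ethtool-gro; done
    # Confirm the dom0 boot parameter from step 5 (if set)
    /opt/xensource/libexec/xen-cmdline --get-dom0 swiotlb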

For a list of drivers that support jumbo frames on XenServer 6.2, contact Citrix Technical Support.

Problem Cause

In XenServer 6.2 and earlier, GRO is not enabled by default. The combination of jumbo frames and certain network drivers results in excessive dom0 SWIOTLB usage.

Excessive SWIOTLB usage increases dom0 load, reduces performance for 10GbE NICs, and may exhaust the SWIOTLB, which in turn triggers a kernel panic and crashes dom0.

The SWIOTLB is a software implementation of an IOMMU. It hides from dom0 drivers, which expect buffers to be contiguous as seen by the hardware, the fact that the dom0 kernel physical address space (PFNs) does not match the machine address space (MFNs). It does this through a combination of hooking the kernel DMA mapping API in the right places and bounce buffering, which involves copying data into and out of a special machine-contiguous region created at boot time. Bounce buffering is only needed when there is a mismatch between the PFN and MFN mapping, which occurs when a kernel buffer is larger than 4 KB (the page size).

Jumbo frames are up to 9 KB in size, and certain drivers allocate contiguous buffers for receiving jumbo-frame-sized packet data. This means SWIOTLB bounce buffering must be used to create a machine-contiguous region that the PCI hardware can DMA to. It also means dom0 copies the data out of the SWIOTLB region for every packet received, which adds latency and consumes dom0 CPU.

If a driver maps a large number of buffers greater than 4 KB for receiving data, it can exhaust the SWIOTLB region (typically 64 MiB). This causes either the mapping driver, or other drivers that also need to map buffers greater than 4 KB, to crash or error.
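For rough intuition about how quickly exhaustion can occur, a back-of-the-envelope sketch (the ~20 KiB figure is an assumption based on the 14-33 KB bounce-buffer sizes in the log above, and SWIOTLB allocation granularity is ignored):

    # 64 MiB SWIOTLB region divided by a ~20 KiB average bounce buffer
    echo $(( 64 * 1024 * 1024 / (20 * 1024) ))   # => at most ~3276 concurrent mappings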

GRO can be used to offload some network processing to the NICs (whose drivers must support GRO), improving network performance and reducing SWIOTLB usage.
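As a concrete instance of the check in step 4 of the Solution, GRO status can be read for every member of the example bond in one pass (a sketch assuming the eth4/eth5 bond from above and a bond device named bond0):

    for dev in eth4 eth5 bond0; do echo -n "$dev: "; ethtool -k $dev | grep generic-receive-offload; done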
