What the hell is IOMMU?

I learned some awesome stuff from a colleague this week! Eager to share.

Backstory: an I/O-intensive workload was performing below expectations. An initial analysis showed that, even under heavy load, the CPUs were almost 25% idle or in I/O wait.

After some investigation, it turned out that a function, “alloc_and_insert_iova_range”, was being called far more often under load once the program had been running for a while. Why? IOMMU.

IOMMU is the Input/Output Memory Management Unit. It is a memory management unit that connects a Direct Memory Access (DMA) capable I/O bus to main memory. A regular MMU maps virtual addresses in a process's address space to physical memory locations in RAM. The IOMMU does something similar for I/O devices (things like disks and graphics cards), mapping the addresses a device uses to physical memory.

IOMMU has some advantages:

  1. It can keep the I/O address space contiguous, even when the underlying memory pages that make up that address space are not.
  2. It protects against DMA attacks, where a malicious device like a USB key tries to get access to system memory it does not own.
  3. And it allows VMs to access system I/O devices that they would not normally have access to.
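
To make the first two points concrete, here is a toy model of what the IOMMU does. It is only a sketch: the `ToyIommu` class, the 4 KiB page size, and the page numbers are all made up for illustration, and real hardware uses multi-level tables rather than a flat dictionary.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


class ToyIommu:
    """Hypothetical model: maps I/O virtual pages to physical pages."""

    def __init__(self):
        self.table = {}  # I/O virtual page number -> physical page number

    def map_range(self, iova_base, phys_pages):
        """Expose scattered physical pages as one contiguous I/O range."""
        if iova_base % PAGE_SIZE:
            raise ValueError("iova_base must be page aligned")
        first = iova_base // PAGE_SIZE
        for i, ppn in enumerate(phys_pages):
            self.table[first + i] = ppn

    def translate(self, iova):
        """Return the physical address for a device access, or fault."""
        vpn, offset = divmod(iova, PAGE_SIZE)
        if vpn not in self.table:
            raise PermissionError(f"DMA fault: unmapped I/O address {iova:#x}")
        return self.table[vpn] * PAGE_SIZE + offset


iommu = ToyIommu()
# Three physical pages that are nowhere near each other in RAM...
iommu.map_range(0x10000, [907, 12, 455])
# ...look like one contiguous 12 KiB buffer to the device.
print(hex(iommu.translate(0x10000)), hex(iommu.translate(0x11004)))
```

The device sees addresses 0x10000 through 0x12fff as a single buffer, and any access outside a mapped range raises a fault instead of reaching memory.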

However, it comes with some disadvantages too. The biggest disadvantage is performance degradation over time. The address translations for the “working set” of pages the IOMMU is handling are cached in its Translation Lookaside Buffer (TLB), which maps I/O addresses to physical memory addresses. So if that “working set” of pages ever outgrows the size of the TLB, you will start to see cache misses, requiring (I think) entire pages to be flushed from memory and new pages loaded. What that looks like is that, over time, you see more and more TLB misses, and the IOMMU causes significant performance overhead as memory pages are loaded and unloaded more often than they should be.
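
The cliff when the working set outgrows the TLB can be seen in a small simulation. This is a sketch under assumptions: the `miss_rate` function is my own, it assumes least-recently-used replacement (real TLBs may use other policies), and it uses a cyclic access pattern, which is the worst case for LRU.

```python
from collections import OrderedDict


def miss_rate(tlb_entries, working_set_pages, accesses):
    """Fraction of lookups that miss in an LRU TLB of the given size.

    Pages are touched in a repeating cycle 0, 1, ..., working_set_pages - 1.
    """
    tlb = OrderedDict()  # page number -> True, ordered oldest to newest
    misses = 0
    for i in range(accesses):
        page = i % working_set_pages
        if page in tlb:
            tlb.move_to_end(page)  # mark as most recently used
        else:
            misses += 1
            tlb[page] = True
            if len(tlb) > tlb_entries:
                tlb.popitem(last=False)  # evict the least recently used
    return misses / accesses


for pages in (32, 64, 65, 128):
    print(pages, miss_rate(tlb_entries=64, working_set_pages=pages, accesses=10_000))
```

As long as the working set fits, the only misses are the first touch of each page. One page too many and, with this access pattern, every single lookup misses.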

The “solution” to this problem is to forgo the benefits of IOMMU and allow DMA devices to access memory directly. The kernel parameter iommu.passthrough=1 does that on Linux. I’m still unclear why this is more of an issue on Arm64 than it is on x86. Are TLB entries larger or more numerous with larger memory pages? Is the cost of a cache miss higher?
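
If you want to check whether a machine was booted with that parameter, a small parser over the kernel command line will do. This is a sketch: the helper name `iommu_passthrough` and the sample command line are made up, and I am assuming that if the parameter appears more than once, the last occurrence is the one that counts.

```python
def iommu_passthrough(cmdline):
    """Return the value of iommu.passthrough in a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "iommu.passthrough" and sep:
            value = val  # assumption: the last occurrence wins
    return value


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro iommu.passthrough=1"
print(iommu_passthrough(sample))
```

On a real Linux system the string to pass in would come from reading /proc/cmdline. A result of None only means the parameter was not given at boot, so the kernel's built-in default applies.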

I would have thought the opposite would be true: that larger pages would mean fewer cache misses and less TLB pressure. Or is it that if you are using larger pages, the cost of a cache miss is higher, aggravating the cost side when the TLB fills up?

Any insight welcome, I am a newbie at all this low-level memory management stuff!


This article provides some information that might be helpful:
Deep Diving Neoverse N1 – Chips and Cheese.

For example:

Zen 2 has a larger L1 data TLB, with 64 entries compared to Neoverse N1’s 48. The L2 TLB on N1 is also smaller, with 1280 entries compared to Zen 2’s 2048. But N1’s has lower latency (5 cycles) than Zen 2’s (7 cycles).
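
One way to put those entry counts in perspective is “TLB reach”: the number of entries multiplied by the page size, which is how much memory the TLB can cover without a miss. The sketch below uses the entry counts quoted above. The 4 KiB and 64 KiB page sizes are my own assumptions for illustration, and as far as I can tell these figures describe the CPU cores' TLBs, so the IOMMU's own TLB may be sized differently.

```python
KIB = 1024


def tlb_reach_kib(entries, page_size_bytes):
    """Memory covered by a full TLB, in KiB, if every entry maps one page."""
    return entries * page_size_bytes // KIB


# Entry counts from the quoted article.
tlbs = {
    "Neoverse N1 L1 dTLB": 48,
    "Zen 2 L1 dTLB": 64,
    "Neoverse N1 L2 TLB": 1280,
    "Zen 2 L2 TLB": 2048,
}

for name, entries in tlbs.items():
    for page_size in (4 * KIB, 64 * KIB):  # assumed page sizes
        reach = tlb_reach_kib(entries, page_size)
        print(f"{name}: {page_size // KIB} KiB pages -> {reach} KiB reach")
```

With the same number of entries, a 64 KiB page size gives 16 times the reach of 4 KiB pages, which is why I expected larger pages to reduce TLB pressure rather than add to it.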


It has also been pointed out to me that my characterization of pages getting flushed from cache when there is a TLB cache miss is not accurate. The TLB only holds mappings from virtual pages to physical pages. When a TLB lookup misses, no pages are flushed or loaded. The cost is a page table walk: traversing a tree of page tables in memory to find which physical address a virtual address maps to.
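
Here is a rough sketch of that walk, to show where the cost of a miss comes from. Everything in it is an assumption for illustration: a 4-level tree with 9 index bits per level and a 12-bit page offset (a common layout for 4 KiB pages), nested dictionaries standing in for the tables, and made-up addresses.

```python
LEVELS = 4           # assumed depth of the page table tree
BITS_PER_LEVEL = 9   # assumed index bits per level
OFFSET_BITS = 12     # assumed 4 KiB pages
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def indices(vaddr):
    """Split a virtual address into one table index per level, top first."""
    vpn = vaddr >> OFFSET_BITS
    mask = (1 << BITS_PER_LEVEL) - 1
    return [(vpn >> (BITS_PER_LEVEL * lvl)) & mask
            for lvl in reversed(range(LEVELS))]


def map_page(root, vaddr, ppn):
    """Install a mapping, creating intermediate tables as needed."""
    *upper, leaf = indices(vaddr)
    node = root
    for idx in upper:
        node = node.setdefault(idx, {})
    node[leaf] = ppn


def translate(root, tlb, vaddr):
    """Return (physical address, number of table lookups needed)."""
    vpn, offset = vaddr >> OFFSET_BITS, vaddr & OFFSET_MASK
    if vpn in tlb:                      # TLB hit: no walk at all
        return (tlb[vpn] << OFFSET_BITS) | offset, 0
    node, lookups = root, 0
    for idx in indices(vaddr):          # TLB miss: one lookup per level
        node = node[idx]
        lookups += 1
    tlb[vpn] = node                     # cache the translation
    return (node << OFFSET_BITS) | offset, lookups


root, tlb = {}, {}
map_page(root, 0x7F1234567000, ppn=42)
print(translate(root, tlb, 0x7F1234567ABC))  # miss: walks all four levels
print(translate(root, tlb, 0x7F1234567ABC))  # hit: served from the TLB
```

In this model a miss costs four dependent table lookups before the actual access can happen, while a hit costs none. The page itself stays in memory the whole time.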