What can I do to make sure I’m getting the best possible speed and performance when I’m using these processors for heavy workloads?
When you say throughput, are you referring to disk read/write I/O or network I/O - or some combination of both?
Do you have an example of what you mean by “heavy workloads”?
Generally speaking, throughput and latency are in conflict. To improve throughput, you can increase various network and I/O buffer sizes until RAM is your primary constraint, pin cores for network I/O so that context switches don't eat into bandwidth, and put the CPU governor in performance mode so the CPU is always available (this essentially disables the CPU's power saving features). The trade-off is tail latency: bigger buffers mean longer processing queues, so individual requests can wait longer for a response.
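As a rough sketch of doing this by hand (the core range and service name below are placeholders, and cpupower/taskset come from your distribution's kernel-tools / util-linux packages):

# Put all cores into the performance governor (disables frequency scaling).
sudo cpupower frequency-set -g performance

# Pin a hypothetical network-heavy service to cores 0-3 so its threads stay
# on those cores and avoid cross-core migrations and context-switch overhead.
taskset -c 0-3 ./my_network_service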
One way to configure a lot of this in the OS is to use tuned and select the “network-throughput” profile, and to make sure you have enough RAM per core on your cloud instances. For throughput-intensive workloads it is also important to have enough CPUs to handle the compute requirements of the application; otherwise you will never reach the point where network I/O and RAM are the constraints.
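Applying the profile is a one-liner with tuned-adm (assuming the tuned package is installed and the daemon is running):

# List available profiles, switch to network-throughput, then confirm it is active.
tuned-adm list
sudo tuned-adm profile network-throughput
tuned-adm active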
Looking at the network-throughput tuned profile, which includes the throughput-performance profile, the highlights are:
[cpu]
governor=performance
energy_perf_bias=performance
min_perf_pct=100
energy_performance_preference=performance
This puts the CPU in performance mode and turns off power management features.
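To double-check that the governor change took effect after switching profiles, you can read it back from sysfs (shown here for cpu0; glob over all CPUs if you want to check every core):

# Should print "performance" once the profile is active.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor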
[disk]
# Set disk readahead to 4 MiB (the value is in KiB) - a large disk I/O buffer.
readahead=>4096
[sysctl]
# Avoid swapping processes out of physical memory aggressively (value is 0-100)
vm.swappiness=10
# Increase the max socket listen queue length from the kernel default (128 on older
# kernels, 4096 since kernel 5.4).
net.core.somaxconn=>2048
# Increase kernel network buffer size maximums.
#
# The buffer tuning values below do not account for any potential hugepage allocation.
# Ensure that you do not oversubscribe system memory.
net.ipv4.tcp_rmem="4096 131072 16777216"
net.ipv4.tcp_wmem="4096 16384 16777216"
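Once the profile is applied, you can confirm the kernel picked the values up with sysctl (exact output formatting varies slightly by distribution):

# Print the current values to confirm they match the profile.
sysctl net.core.somaxconn vm.swappiness net.ipv4.tcp_rmem net.ipv4.tcp_wmem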
The comments are pretty self-explanatory here - and you can use these profiles as a basis for your own, adding extra configuration or overriding options with your preferred values.
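For example, a minimal custom child profile (the profile name and the overridden value here are just illustrative) lives under /etc/tuned and pulls in network-throughput via include:

# Create a custom profile that inherits network-throughput and overrides one setting.
sudo mkdir -p /etc/tuned/my-throughput
sudo tee /etc/tuned/my-throughput/tuned.conf <<'EOF'
[main]
include=network-throughput

[sysctl]
vm.swappiness=1
EOF
sudo tuned-adm profile my-throughput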
Due to the high core density of Ampere processors, workloads should be well parallelized to take full advantage of the many cores, for example by adding more worker threads or running multiple workload instances simultaneously. Tools like htop provide an overview of how workload threads are distributed across the cores.
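One simple sketch of scaling out this way and then eyeballing the distribution (the binary name, instance count, and core ranges are placeholders):

# Start 4 instances of a workload, each pinned to its own 16-core range.
for i in 0 1 2 3; do
  start=$((i * 16))
  end=$((start + 15))
  taskset -c "${start}-${end}" ./my_workload &
done

# Watch how the load spreads across cores (mpstat is part of sysstat).
mpstat -P ALL 2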