Waking Threads up as Fast as Possible


I am currently working on a C++ task scheduler, and like most of its kind, it consists of multiple worker threads which have to be supplied with tasks. Lock-free multi-producer/multi-consumer data structures are a vast topic in and of themselves, but assume we have one of those in place and humming along.

The thing left to decide is what a worker will do while no work is available. Intuitively, a minuscule amount of time should pass before re-querying the task queue, so a Sleep(1) seems reasonable (note: our system is portable, but I'll write in terms of the Win32 equivalents in this post). I went with that and never thought about it again - until I used Remotery to take a look.

As it turns out, this carries a heavy latency penalty: the amount of time that passes between work submission and worker wakeup is in the milliseconds. This is barely noticeable when dispatching a large batch of work for a throughput-bound algorithm, but costs critical time in a soft-realtime context, i.e. work to be done each frame. So, let's start with a blank slate.

Benchmark Setup

As far as I can tell, it is almost impossible to "properly" benchmark this setting. Microbenchmarking userspace C++ on a single core is complicated (and often questionable) enough as it is, given the complexity of modern CPUs, but here we are additionally dealing with multiple cores, potential false sharing, Ring 3 / Ring 0 switches, non-monotonic timers, and the OS scheduler.

Nonetheless, I attempt it as follows: our main thread spawns multiple "workers", one for each physical core (excluding itself), and every thread is pinned to a single core using SetThreadAffinityMask.

The workers spin until a single global std::atomic_bool becomes true. Immediately afterwards, they execute rdtsc, an instruction that queries the CPU's timestamp counter (in cycles), and store the result.

The main thread waits for a while, then sets the bool to true and records its own rdtsc result afterwards. This “wakes” the other threads up, causing their rdtsc recording in turn. The difference is then calculated per thread, and ultimately averaged over 1000 runs (some additional precautions and details omitted here).

Reckless Spinning

The issue seems to have an easy fix: simply remove the Sleep(1) call and spin-wait. And indeed, this helps a lot - our tasks seem to launch instantly. The cost, however, becomes apparent as soon as the CPU fans spin up: our threads never stop working and keep the cores locked at 100%. In the benchmark setting, this performs extremely well, with a median latency of about 140 cycles.

| Run | Mean | Median | Min | Max     | SD        |
|-----|------|--------|-----|---------|-----------|
| 1   | 1134 | 146    | 58  | 4869510 | 68864.138 |
| 2   | 136  | 139    | 60  | 512     | 27.695    |
| 3   | 1268 | 147    | 60  | 5597959 | 79165.394 |
| 4   | 136  | 122    | 46  | 44495   | 705.821   |
| 5   | 140  | 142    | 60  | 8563    | 120.955   |

As you can see, these numbers have such wild outliers that they are barely presentable; we will nonetheless be able to make out a pattern in them.

SSE2 Pause Instruction

With SSE2, Intel added an instruction, PAUSE, specifically to hint to the CPU that a spin-wait is in progress. It can be emitted using the _mm_pause() intrinsic, and can significantly improve performance for spinlocks, as well as slightly reduce power consumption. It is entirely unrelated to the OS scheduler, however, and as such does not change the main problem - we still have 100% usage. I did not measure a significant latency difference between using _mm_pause() and an empty loop body.

Sleeping for 1ms

Back to our initial setup: Sleep(1). The argument is an integer in milliseconds; however, as we will see, it is not necessarily respected. The benchmark gives us a median latency of about 26.4 million cycles - worlds apart from the empty spin-wait. The CPU this runs on, an i7-8750H, has a sustained boost clock of about 4 GHz, so this amount of cycles corresponds to roughly 6.6 ms.

| Run | Mean     | Median   | Min      | Max      | SD          |
|-----|----------|----------|----------|----------|-------------|
| 1   | 24796282 | 26438794 | 10646    | 58488660 | 6368942.400 |
| 2   | 26591017 | 26475796 | 13945664 | 53040635 | 1804528.456 |
| 3   | 26594131 | 26438770 | 2488662  | 99810600 | 2793445.766 |
| 4   | 26536047 | 26453647 | 2979794  | 57778170 | 2244278.955 |
| 5   | 26539596 | 26474311 | 8885666  | 50242202 | 1673861.776 |

A thread calling Sleep(1) gives up its current time slice and is marked as "unrunnable" within the OS scheduler for one quantum (which is longer than the 1 ms we asked for). This means that our thread, even in the best case, has to wait until the end of the current quantum to be swapped back in.

So latency is rather high, but at least the CPU properly idles when there is no work to do.

Manipulating the OS Scheduler

I was a little shocked to find out that you can actually reconfigure the OS scheduler in Win32 to use smaller quanta - no elevated privileges required - with the inconspicuously named timeBeginPeriod. Simply call it with an argument of 1 (millisecond) at startup, and if it returns TIMERR_NOERROR, you're good to go. And yes, this setting is global: Windows uses the lowest (most granular) setting across all processes. (Linux does not seem to allow anything similar.)

From now on, Sleep(1) performs much better, with a median latency of ~2.8 million cycles, or about 0.7ms.

| Run | Mean    | Median  | Min | Max      | SD          |
|-----|---------|---------|-----|----------|-------------|
| 1   | 2431736 | 2817122 | 139 | 60854504 | 1753174.486 |
| 2   | 2429558 | 2837240 | -60 | 5587950  | 1503830.150 |
| 3   | 2430186 | 2826913 | 200 | 15420182 | 1541328.899 |
| 4   | 2400616 | 2774255 | 869 | 7431026  | 1462985.834 |
| 5   | 2382802 | 2794147 | 600 | 5854380  | 1502142.995 |

Note that you should undo the changes made by calling timeEndPeriod when shutting down your application.

Sleeping for 0ms

Sleeping for short amounts is well and good, but what about sleeping for no time at all? It turns out that Sleep(0) is a special case: the thread still gives up its current time slice, but is not marked unrunnable. If no other thread of sufficient priority is waiting, the call returns without preempting the caller (the STL equivalent of this call is std::this_thread::yield()).

So, with the system generally idle, cores are still locked at 100%, but if other processes are contending for CPU time, they will be able to interleave a little better.

As for latency: this was benchmarked on an idle system, meaning that the vast majority of calls to Sleep(0) returned immediately. It is still a call into the kernel, however, and we end up with a median latency of 1250 cycles.

| Run | Mean | Median | Min  | Max      | SD         |
|-----|------|--------|------|----------|------------|
| 1   | 5303 | 1248   | -255 | 12948039 | 209795.153 |
| 2   | 1222 | 1258   | -27  | 14676    | 636.369    |
| 3   | 3190 | 1254   | -142 | 7158861  | 107734.677 |
| 4   | 1262 | 1258   | -260 | 94536    | 1686.480   |
| 5   | 1293 | 1262   | -152 | 102082   | 1873.649   |

Note that Sleep(0) returns immediately rather than stalling until the next scheduler quantum begins. As such, the timeBeginPeriod trick does not change these numbers.

Waitable Objects

None of the options so far is particularly satisfying: we either lock cores at 100%, or have to accept ~0.7 ms of latency. Our last choice is less ergonomic than the ones before it, but involves no polling at all, and properly puts our threads to sleep:

In Win32, WaitForSingleObject is a blocking call that can wait on an "event" object, which is either signalled or unsignalled. We replace our spin-wait with that call, and use SetEvent to signal the event from the main thread. The median latency with this technique sits at about 90 thousand cycles:

| Run | Mean   | Median | Min  | Max       | SD          |
|-----|--------|--------|------|-----------|-------------|
| 1   | 110487 | 89757  | 3964 | 4453934   | 121634.381  |
| 2   | 108167 | 89063  | 1196 | 1062258   | 70150.576   |
| 3   | 130906 | 88968  | 5008 | 27285256  | 660781.715  |
| 4   | 312856 | 90147  | -260 | 166682280 | 5204398.526 |
| 5   | 115779 | 89154  | 207  | 47079847  | 671127.082  |

This seems to be by far the best option, although it does not map trivially onto the use case: events have to be managed and cycled through somehow. Still, latency is comparatively low (around 0.0225 ms), and we use the least possible amount of CPU in the idle case.

Wrapping Up

The considered options fit nicely into a total order by latency.

| Option                           | Median Latency | Allows Idling                        |
|----------------------------------|----------------|--------------------------------------|
| - (spin-wait)                    | 140 cycles     | no                                   |
| _mm_pause                        | 140 cycles     | no (slightly lower power use)        |
| Sleep(0)                         | 1250 cycles    | no (better interleaving under load)  |
| WaitForSingleObject              | 90k cycles     | yes                                  |
| Sleep(1) with timeBeginPeriod(1) | 2.8M cycles    | yes                                  |
| Sleep(1)                         | 26.4M cycles   | yes                                  |

The takeaway: use timeBeginPeriod(1) if Sleep(1) is used anyway, use waitable objects if idling is required (and the events are manageable), and spin with _mm_pause() if not. Please take all of the exact figures in this post with a metric ton of salt - their only real purpose is to illustrate the orders of magnitude.