jkunstwald

Waking Threads up as Fast as Possible

Sleep(1) is not enough

  ·   7 min read

I am currently working on a C++ task scheduler, and like most of its kind, it consists of multiple worker threads which have to be supplied with tasks. Lock-free multi-producer/multi-consumer data structures are a vast topic in and of themselves, but assume we have one of those in place and humming along.

The thing left to decide is what a worker will do while no work is available. Intuitively, a miniscule amount of time should pass before re-querying the task queue, so a Sleep(1) seems reasonable (Note: Our system is portable, but i’ll write in terms of the Win32 equivalents in this post), so i went with that and never thought about it again - until i used Remotery to take a look.

As it turns out, this involves a heavy penalty when it comes to latency, the amount of time that passes between work submission and worker wakeup ranges in the milliseconds. This is barely noticeable when dispatching a large amount of work for a throughput-bound algorithm, but costs critical time in a soft-realtime context, i.e. work to be done for each frame. So, let’s start out with a blank slate.

Benchmark Setup

As far as i can tell, it is almost impossible to “properly” benchmark this setting. Microbenchmarking userspace C++ on a single core is complicated (and often questionable) enough as it is given the complexity of modern CPUs, but here we’re additionally dealing with multiple cores, potential false sharing, Ring 3 / Ring 0 switches, non-monotonous timers, and the OS scheduler.

Nonetheless, i attempt it as such: Our main thread spawns multiple “workers”, one for each physical core (excluding itself), and all threads are locked to a single core using SetThreadAffinityMask.

The workers are spinning until a single global std::atomic_bool is true. Immediately afterwards, they execute rdtsc, an intrinsic to query the CPUs timestamp (in cycles), and store the result.

The main thread waits for a while, then sets the bool to true and records its own rdtsc result afterwards. This “wakes” the other threads up, causing their rdtsc recording in turn. The difference is then calculated per thread, and ultimately averaged over 1000 runs (some additional precautions and details omitted here).

Reckless Spinning

The issue seems to be an easy fix: Simply remove the Sleep(1) call and spin-wait. And indeed, this helps a lot - our tasks seem to launch instantly. The cost however is apparent as soon as the CPU fans spin up - our threads never stop working and keep the cores locked at 100%. In the benchmark setting, this performs extremely well, with a median latency of about 140 cycles.

RunMeanMedianMinMaxSD
#1113414658486951068864.138
#21361396051227.695
#3126814760559795979165.394
#41361224644495705.821
#5140142608563120.955

As you can see, these numbers have such wild outliers that they are barely presentable, however we will be able to make out a pattern in these nonetheless.

SSE2 Pause Instruction

In SSE2, Intel added an instruction specifically to hint an ongoing spin-wait to the CPU, PAUSE. It can be inserted using the _mm_pause() intrinsic, and can significantly improve latency for spinlocks, as well as slightly reduce power consumption. It is entirely unrelated to the OS scheduler however, and as such won’t change the main problem - we still have 100% usage. I did not measure a significant latency difference between using _mm_pause() over not doing anything in the loop body.

Sleeping for 1ms

Back to our initial setup - Sleep(1). The argument is an integer in milliseconds, however as seen before, it is not necessarily respected. The benchmark gives us a median latency of about 26.4 million cycles - worlds apart from the empty spin-wait. The CPU this runs on, the i7-8750H, has a sustained boost clock of about 4Ghz, so this amount of cycles should correspond to ~6.6ms.

RunMeanMedianMinMaxSD
#1247962822643879410646584886606368942.400
#2265910172647579613945664530406351804528.456
#326594131264387702488662998106002793445.766
#426536047264536472979794577781702244278.955
#526539596264743118885666502422021673861.776

A thread calling Sleep(1) gives up its current time slice, and is marked as “unrunnable” within the OS scheduler for one quantum (which is longer than the 1ms we asked for). This means that our thread, in the best case, has to wait until the end of the current quantum to be swapped back in.

So latency is rather high, but at least the CPU properly idles when there is no work to do.

Manipulating the OS Scheduler

I was a little shocked finding out that you can actually reconfigure the OS scheduler in Win32 to use smaller quanta - no elevated privileges required - with the inconspicuously named timeBeginPeriod. Simply call it with an argument of 1 (millisecond) at startup, and if it returns TIMERR_NOERROR, you’re good to go. And yes, this setting is global, Windows uses the lowest (most granular) setting accross all processes (Linux does not seem to allow anything similar).

From now on, Sleep(1) performs much better, with a median latency of ~2.8 million cycles, or about 0.7ms.

RunMeanMedianMinMaxSD
#124317362817122139608545041753174.486
#224295582837240-6055879501503830.150
#324301862826913200154201821541328.899
#42400616277425586974310261462985.834
#52382802279414760058543801502142.995

Note that you should undo the changes made by calling timeEndPeriod when shutting down your application.

Sleeping for 0ms

Sleeping for short amounts is well and good, but what about sleeping for no time at all? It turns out that Sleep(0) is a special case: The thread still gives up its current time slice, but is not marked unrunnable. And if no other thread has sufficient priority, the call returns without preempting the thread (the STL version of this call is std::this_thread::yield()).

So, with the system generally idle, cores are still locked at 100%, but if other processes are contending for CPU time, they will be able to interleave a little better.

As for latency, this was benchmarked with an idle system, meaning for the vast majority of calls to Sleep(0), it immediately returned. It still is a call to the kernel, and we end up with a median latency of 1250 cycles.

RunMeanMedianMinMaxSD
#153031248-25512948039209795.153
#212221258-2714676636.369
#331901254-1427158861107734.677
#412621258-260945361686.480
#512931262-1521020821873.649

Note that Sleep(0) is asynchronous, and does not stall until the next scheduler quantum begins. As such, the timeBeginPeriod trick does not change these numbers.

Waitable Objects

None of the options so far are particularly satisfying, we either lock cores at 100%, or have to accept ~0.7ms of latency. Our last choice is less ergonomic than the ones before it, but involves no polling at all, and properly puts our threads to sleep:

In Win32, WaitForSingleObject is a blocking call used to wait for “events”, which are signalled or unsignalled. We replace our spin-wait with that call, and use SetEvent to signal it from the main thread. The latency median with this technique sits at about 90 thousand cycles:

RunMeanMedianMinMaxSD
#11104878975739644453934121634.381
#2108167890631196106225870150.576
#313090688968500827285256660781.715
#431285690147-2601666822805204398.526
#51157798915420747079847671127.082

This seems to be by far the best option, although it does not map trivially to the use case. Events have to be managed and cycled through somehow. Still, latency is comparatively low (at around 0.0225ms), and we use the least possible amount of CPU in the idle case.

Wrapping Up

The considered options fit nicely into a total order by latency.

OptionMedian LatencyAllows Idling
- (spin-wait)140 cyclesno
_mm_pause140 cyclesno (slightly lower power use)
Sleep(0)1250 cyclesno (better interleaving under load)
WaitForSingleObject90k cyclesyes
Sleep(1) with timeBeginPeriod(1)2.8M cyclesyes
Sleep(1)26.4M cyclesyes

The takeaway: Use timeBeginPeriod(1) if Sleep(1) is used anyway, use waitable objects if idling is required (and the events are managable), and spin with _mm_pause() if not. Please take all of the exact figures in this post with a metric ton of salt, their only real purpose is to illustrate the orders of magnitude.