Home > Posts > Waking Threads up as Fast as Possible

Waking Threads up as Fast as Possible

Sleep(1) is not enough

February 14, 2020 · 7 min read

I am currently working on a C++ task scheduler, and like most of its kind, it consists of multiple worker threads which have to be supplied with tasks. Lock-free multi-producer/multi-consumer data structures are a vast topic in and of themselves, but assume we have one of those in place and humming along.

The thing left to decide is what a worker will do while no work is available. Intuitively, a miniscule amount of time should pass before re-querying the task queue, so a Sleep(1) seems reasonable (Note: Our system is portable, but i’ll write in terms of the Win32 equivalents in this post), so i went with that and never thought about it again - until i used Remotery to take a look.

As it turns out, this involves a heavy penalty when it comes to latency, the amount of time that passes between work submission and worker wakeup ranges in the milliseconds. This is barely noticeable when dispatching a large amount of work for a throughput-bound algorithm, but costs critical time in a soft-realtime context, i.e. work to be done for each frame. So, let’s start out with a blank slate.

Benchmark Setup

As far as i can tell, it is almost impossible to “properly” benchmark this setting. Microbenchmarking userspace C++ on a single core is complicated (and often questionable) enough as it is given the complexity of modern CPUs, but here we’re additionally dealing with multiple cores, potential false sharing, Ring 3 / Ring 0 switches, non-monotonous timers, and the OS scheduler.

Nonetheless, i attempt it as such: Our main thread spawns multiple “workers”, one for each physical core (excluding itself), and all threads are locked to a single core using SetThreadAffinityMask.

The workers are spinning until a single global std::atomic_bool is true. Immediately afterwards, they execute rdtsc, an intrinsic to query the CPUs timestamp (in cycles), and store the result.

The main thread waits for a while, then sets the bool to true and records its own rdtsc result afterwards. This “wakes” the other threads up, causing their rdtsc recording in turn. The difference is then calculated per thread, and ultimately averaged over 1000 runs (some additional precautions and details omitted here).

Reckless Spinning

The issue seems to be an easy fix: Simply remove the Sleep(1) call and spin-wait. And indeed, this helps a lot - our tasks seem to launch instantly. The cost however is apparent as soon as the CPU fans spin up - our threads never stop working and keep the cores locked at 100%. In the benchmark setting, this performs extremely well, with a median latency of about 140 cycles.

Run	Mean	Median	Min	Max	SD
#1	1134	146	58	4869510	68864.138
#2	136	139	60	512	27.695
#3	1268	147	60	5597959	79165.394
#4	136	122	46	44495	705.821
#5	140	142	60	8563	120.955

As you can see, these numbers have such wild outliers that they are barely presentable, however we will be able to make out a pattern in these nonetheless.

SSE2 Pause Instruction

In SSE2, Intel added an instruction specifically to hint an ongoing spin-wait to the CPU, PAUSE. It can be inserted using the _mm_pause() intrinsic, and can significantly improve latency for spinlocks, as well as slightly reduce power consumption. It is entirely unrelated to the OS scheduler however, and as such won’t change the main problem - we still have 100% usage. I did not measure a significant latency difference between using _mm_pause() over not doing anything in the loop body.

Sleeping for 1ms

Back to our initial setup - Sleep(1). The argument is an integer in milliseconds, however as seen before, it is not necessarily respected. The benchmark gives us a median latency of about 26.4 million cycles - worlds apart from the empty spin-wait. The CPU this runs on, the i7-8750H, has a sustained boost clock of about 4Ghz, so this amount of cycles should correspond to ~6.6ms.

Run	Mean	Median	Min	Max	SD
#1	24796282	26438794	10646	58488660	6368942.400
#2	26591017	26475796	13945664	53040635	1804528.456
#3	26594131	26438770	2488662	99810600	2793445.766
#4	26536047	26453647	2979794	57778170	2244278.955
#5	26539596	26474311	8885666	50242202	1673861.776

A thread calling Sleep(1) gives up its current time slice, and is marked as “unrunnable” within the OS scheduler for one quantum (which is longer than the 1ms we asked for). This means that our thread, in the best case, has to wait until the end of the current quantum to be swapped back in.

So latency is rather high, but at least the CPU properly idles when there is no work to do.

Manipulating the OS Scheduler

I was a little shocked finding out that you can actually reconfigure the OS scheduler in Win32 to use smaller quanta - no elevated privileges required - with the inconspicuously named timeBeginPeriod. Simply call it with an argument of 1 (millisecond) at startup, and if it returns TIMERR_NOERROR, you’re good to go. And yes, this setting is global, Windows uses the lowest (most granular) setting accross all processes (Linux does not seem to allow anything similar).

From now on, Sleep(1) performs much better, with a median latency of ~2.8 million cycles, or about 0.7ms.

Run	Mean	Median	Min	Max	SD
#1	2431736	2817122	139	60854504	1753174.486
#2	2429558	2837240	-60	5587950	1503830.150
#3	2430186	2826913	200	15420182	1541328.899
#4	2400616	2774255	869	7431026	1462985.834
#5	2382802	2794147	600	5854380	1502142.995

Note that you should undo the changes made by calling timeEndPeriod when shutting down your application.

Sleeping for 0ms

Sleeping for short amounts is well and good, but what about sleeping for no time at all? It turns out that Sleep(0) is a special case: The thread still gives up its current time slice, but is not marked unrunnable. And if no other thread has sufficient priority, the call returns without preempting the thread (the STL version of this call is std::this_thread::yield()).

So, with the system generally idle, cores are still locked at 100%, but if other processes are contending for CPU time, they will be able to interleave a little better.

As for latency, this was benchmarked with an idle system, meaning for the vast majority of calls to Sleep(0), it immediately returned. It still is a call to the kernel, and we end up with a median latency of 1250 cycles.

Run	Mean	Median	Min	Max	SD
#1	5303	1248	-255	12948039	209795.153
#2	1222	1258	-27	14676	636.369
#3	3190	1254	-142	7158861	107734.677
#4	1262	1258	-260	94536	1686.480
#5	1293	1262	-152	102082	1873.649

Note that Sleep(0) is asynchronous, and does not stall until the next scheduler quantum begins. As such, the timeBeginPeriod trick does not change these numbers.

Waitable Objects

None of the options so far are particularly satisfying, we either lock cores at 100%, or have to accept ~0.7ms of latency. Our last choice is less ergonomic than the ones before it, but involves no polling at all, and properly puts our threads to sleep:

In Win32, WaitForSingleObject is a blocking call used to wait for “events”, which are signalled or unsignalled. We replace our spin-wait with that call, and use SetEvent to signal it from the main thread. The latency median with this technique sits at about 90 thousand cycles:

Run	Mean	Median	Min	Max	SD
#1	110487	89757	3964	4453934	121634.381
#2	108167	89063	1196	1062258	70150.576
#3	130906	88968	5008	27285256	660781.715
#4	312856	90147	-260	166682280	5204398.526
#5	115779	89154	207	47079847	671127.082

This seems to be by far the best option, although it does not map trivially to the use case. Events have to be managed and cycled through somehow. Still, latency is comparatively low (at around 0.0225ms), and we use the least possible amount of CPU in the idle case.

Wrapping Up

The considered options fit nicely into a total order by latency.

Option	Median Latency	Allows Idling
`-` (spin-wait)	140 cycles	no
`_mm_pause`	140 cycles	no (slightly lower power use)
`Sleep(0)`	1250 cycles	no (better interleaving under load)
`WaitForSingleObject`	90k cycles	yes
`Sleep(1)` with `timeBeginPeriod(1)`	2.8M cycles	yes
`Sleep(1)`	26.4M cycles	yes

The takeaway: Use timeBeginPeriod(1) if Sleep(1) is used anyway, use waitable objects if idling is required (and the events are managable), and spin with _mm_pause() if not. Please take all of the exact figures in this post with a metric ton of salt, their only real purpose is to illustrate the orders of magnitude.

Parallel CSG Operations on GPU

→