mastodontech.de ist einer von vielen unabhängigen Mastodon-Servern, mit dem du dich im Fediverse beteiligen kannst.
Offen für alle (über 16) und bereitgestellt von Markus'Blog

Serverstatistik:

1,5 Tsd.
aktive Profile

#mutex

0 Beiträge0 Beteiligte0 Beiträge heute

More interesting progress trying to make #swad suitable for very busy sites!

I realized that #TLS (both with #OpenSSL and #LibreSSL) is a *major* bottleneck. With TLS enabled, I couldn't cross 3000 requests per second, with somewhat acceptable response times (most below 500ms). Disabling TLS, I could really see the impact of a #lockfree queue as opposed to one protected by a #mutex. With the mutex, up to around 8000 req/s could be reached on the same hardware. And with a lockfree design, that quickly went beyond 10k req/s, but crashed. 😆

So I read some scientific papers 🙈 ... and redesigned a lot (*). And now it finally seems to work. My latest test reached a throughput of almost 25k req/s, with response times below 10ms for most requests! I really didn't expect to see *this* happen. 🤩 Maybe it could do even more, didn't try yet.

Open issue: Can I do something about TLS? There *must* be some way to make it perform at least a *bit* better...

(*) edit: Here's the design I finally used, with a much simplified "dequeue" because the queues in question are guaranteed to have only a single consumer: dl.acm.org/doi/10.1145/248052.

I recently took a dive into #C11 #atomics to come up with alternative queue implementations not requiring locking some #mutex.

TBH, I have a hard time understanding the #memory #ordering constraints defined by C11. I mean, I code #assembler on a #mos6502 (for the #c64), so caches, pipelines and all that modern crap is kind of alien rocket science anyways 😆.

But seriously, they try to abstract from what the hardware provides (different kinds of memory barrier instructions, IMHO somewhat easier to understand), so the compiler can pick the appropriate one depending on the target CPU. But wrapping your head around their definition really hurts the brain 🙈.

Yesterday, I found a source telling me that #amd64 (or #x86 in general?) always has strong ordering for reads, so no matter which oderding constraint you put in your atomic_load and friends, the compiler will generate the same code and it will work. Oh boy, how should I ever verify my code works on e.g. aarch64 without owning such hardware?

Fortgeführter Thread

Nice, #threadpool overhaul done. Removed two locks (#mutex) and two condition variables, replaced by a single lock and a single #semaphore. 😎 Simplifies the overall structure a lot, and it's probably safe to assume slightly better performance in contended situations as well. And so far, #valgrind's helgrind tool doesn't find anything to complain about. 🙃

Looking at the screenshot, I should probably make #swad default to *two* threads per CPU and expose the setting in the configuration file. When some thread jobs are expected to block, having more threads than CPUs is probably better.

github.com/Zirias/poser/commit