April 2009 Archives

The Password is Courage

I’m not a big anime watcher, usually only falling back on it when I’m in a coding slump and feeling really bored. It was on one such occasion that I picked up Soul Eater. It’s rare to get such a character-driven story out of mainstream anime. The characters, music, art design, style… all of it clicked with me. I fell in love with it from the first episode.

I stopped watching about a month before the story ended so that I could see the final five episodes all in one go. Well, it ended late last month and I only just got around to watching them. Maybe because I was busy, maybe because I didn’t want to admit to myself that it was over. A little of both, probably.

In the end it was a disappointment. Soul Eater could have been so much more, if only they had focused on the characters instead of devolving into a huge action battle. I guess such an ending is to be expected from something based on a shōnen manga, though from what I’ve heard it deviated quite a bit from the source, which is still ongoing. Maybe I’ll pick that up.

And another…

BSG got a shoutout in last night’s 30 Rock! Elisa (Salma Hayek) was in a shirt, and Frank (Judah Friedlander) said “What the frak?!”.

Salma Hayek in her “What the frak?!” shirt

User Mode Scheduling in Windows 7

Don’t use threads. Or more precisely, don’t over-use them. It’s one of the first things fledgling programmers learn after they start using threads. This is because threading involves a lot of overhead. In short, using more threads may improve concurrency, but it will give you less overall throughput as more processing is put into simply managing the threads instead of letting them run. So programmers learn to use threads sparingly.

When normal threads run out of time, or block on something like a mutex or I/O, they hand off control to the operating system kernel. The kernel then finds a new thread to run, and switches back to user-mode to run the thread. This context switching is what User Mode Scheduling looks to alleviate.

User Mode Scheduling can be thought of as a cross between threads and thread pools. An application creates one or more UMS scheduler threads—typically one for each processor. It then creates several UMS worker threads for each scheduler thread. The worker threads are the ones that run your actual code. Whenever a worker thread runs out of time, it is put on the end of its scheduler thread’s queue. If a worker thread blocks, it is put on a waiting list to be re-queued by the kernel when whatever it was waiting on finishes. The scheduler thread then takes the worker thread from the top of the queue and starts running it. Like the name suggests, this happens entirely in user-mode, avoiding the expensive user->kernel->user-mode transitions. Letting each thread run for exactly as long as it needs helps to solve the throughput problem. Work is only put into managing threads when absolutely necessary instead of in ever smaller time slices, leaving more time to run your actual code.
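For the curious, here’s roughly what the new API looks like. This is only a minimal sketch of a single scheduler thread—the ready-queue bookkeeping is my own, and worker creation and error handling are left out entirely—but it shows how the completion list, the scheduler callback, and ExecuteUmsThread fit together.

```cpp
#define _WIN32_WINNT 0x0601   // UMS is new in Windows 7 and is x64-only
#include <windows.h>
#include <deque>

// The kernel hands ready workers back to us through this completion list.
static PUMS_COMPLETION_LIST g_completionList;
// Our own ready queue; being global, it survives across scheduler activations.
static std::deque<PUMS_CONTEXT> g_readyWorkers;

// The kernel calls this whenever the scheduler thread needs to pick work:
// at startup, when a worker blocks in the kernel, or when a worker yields.
VOID CALLBACK SchedulerProc(UMS_SCHEDULER_REASON reason,
                            ULONG_PTR activationPayload,
                            PVOID schedulerParam)
{
    for (;;) {
        if (g_readyWorkers.empty()) {
            // Block until at least one worker is runnable again, then move
            // the whole batch onto our ready queue.
            PUMS_CONTEXT list = NULL;
            if (!DequeueUmsCompletionListItems(g_completionList, INFINITE, &list))
                return;
            while (list != NULL) {
                PUMS_CONTEXT next = GetNextUmsListItem(list);
                g_readyWorkers.push_back(list);
                list = next;
            }
        }

        PUMS_CONTEXT worker = g_readyWorkers.front();
        g_readyWorkers.pop_front();

        // Switch to the worker entirely in user mode. On success this never
        // returns: when the worker blocks or yields, the kernel re-enters
        // SchedulerProc from the top. It only returns on failure (e.g. the
        // worker was suspended or has terminated), so just pick another.
        ExecuteUmsThread(worker);
    }
}

int main()
{
    CreateUmsCompletionList(&g_completionList);

    // Workers are ordinary threads created with CreateRemoteThreadEx and a
    // PROC_THREAD_ATTRIBUTE_UMS_THREAD attribute that binds each one to a
    // UMS context (CreateUmsThreadContext) and to g_completionList. Omitted.

    UMS_SCHEDULER_STARTUP_INFO info = {};
    info.UmsVersion     = UMS_VERSION;
    info.CompletionList = g_completionList;
    info.SchedulerProc  = SchedulerProc;
    info.SchedulerParam = NULL;

    // Turns the calling thread into a UMS scheduler thread; SchedulerProc is
    // invoked right away with the UmsSchedulerStartup reason.
    EnterUmsSchedulingMode(&info);
    return 0;
}
```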

A good side effect of this is that UMS threads also help to alleviate the cache thrashing problems typical in heavily-threaded applications. Setting aside your data sharing patterns, each thread still needs its own storage for stack space, processor context, and thread-local storage. Every time a context switch happens, some data may need to be pushed out of the caches in order to load some kernel-mode code and the next thread’s data. By switching between threads less often, the cache can be put to better use for the task at hand.

If you have ever had a chance to use some of the more esoteric APIs included with Windows, you might be wondering why we need UMS threads when we have fibers, which offer similar co-operative multitasking. Fibers have a lot of special exceptions. There are things that aren’t safe to do with them. Libraries that rely on thread-local storage, for instance, will likely walk all over themselves if used from within fibers. A UMS thread, on the other hand, is a full-fledged thread—it supports TLS and has no real special cases to keep in mind while using it.

I still wouldn’t count out thread pools just yet. UMS threads are still more expensive than a thread pool and the large memory requirements of a thread still apply here, so things like per-client threads in internet daemons are still out of the question if you want to be massively scalable. More likely, UMS threads will be most useful for building thread pools. Most thread pools launch two or three threads per CPU to help stay busy when given blocking tasks, and UMS threads will at least help keep their time slice usage optimal.

From what I understand, the team behind Microsoft’s Concurrency Runtime, to be included with Visual C++ 2010, was one of the primary forces behind UMS threads. They worked very closely with the kernel folks to find the most scalable way to enable the super-parallel code that will be possible with the CR.

QRJDQ Lives!

Quake Rocket Jumping done Quick is something new. Think Quake done Quick, but for rocket jumping. Simple huh? The first video, rjxtreme, is up and waiting to be improved!

CSI Cameos

Lots of BSG people in last week’s sci-fi focused CSI (A Space Oddity). I spotted Grace Park (Boomer/Athena/Eight), Kate Vernon (Ellen), Rekha Sharma (Tory), and Ron Moore (series creator). They also mentioned taking a decades-old TV show and updating it to make a hit. Hehe.

Optimizing IP range searching in PeerGuardian

I was working on something completely different last night, when an elegant idea came to mind on how to significantly speed up PeerGuardian’s IP searching. It’s funny how an idea can just pop into your head about a problem you haven’t thought about in a long time.

Right now PeerGuardian uses a binary search to match IPs. This is already pretty efficient, running in ⌈log₂ N⌉ comparisons—so for 100,000 IP ranges, about 17 tests need to be done. This has the additional advantage of having no memory overhead.
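Something like this, in other words—hypothetical structures rather than PG’s actual code, and it assumes the list is sorted by start address with no overlapping ranges:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical range record; PeerGuardian's real structures differ.
struct IpRange {
    uint32_t start;   // first IP in the range (host byte order)
    uint32_t end;     // last IP in the range
};

// Classic binary search over ranges sorted by start address: roughly
// ceil(log2 N) iterations and no memory beyond the sorted list itself.
bool IsBlocked(const std::vector<IpRange>& ranges, uint32_t ip)
{
    std::size_t lo = 0, hi = ranges.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (ip < ranges[mid].start)
            hi = mid;
        else if (ip > ranges[mid].end)
            lo = mid + 1;
        else
            return true;   // ip falls inside ranges[mid]
    }
    return false;
}
```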

My idea is to use a structure similar to a B+tree, packing as many IPs and branch pointers into a cache line as possible. On today’s architectures, a cache line is typically 64 bytes, so 8 IPs and 9 branch pointers would fit on each node, meaning only about ⌈log₉ N⌉ nodes need to be read to find a match. So in order to find a match in 100,000 IP ranges, only about 6 nodes would need to be read.
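The node layout I have in mind is something like this—illustrative only. I’m using 16-bit child indices here so that eight keys and nine children really do squeeze into a 64-byte line; the full start/end pairs would live in the leaves, details I’d work out when actually coding it.

```cpp
#include <cstdint>

// One node of the hypothetical cache-line tree, sized to a single 64-byte
// line. Unused key slots are padded with the maximum value so they always
// compare greater than any query.
struct alignas(64) Node {
    uint32_t key[8];     // sorted range-start IPs              (32 bytes)
    uint16_t child[9];   // child node indices, one per gap     (18 bytes)
    uint16_t count;      // number of keys actually in use      (2 bytes)
                         // remaining 12 bytes are padding
};
static_assert(sizeof(Node) == 64, "a node should fill exactly one cache line");
```

With 16-bit indices the tree tops out at 65,536 nodes—over half a million ranges—which is plenty for today’s block lists.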

CPUs always read and cache data in blocks (a cache line), so an algorithm that keeps this in mind to minimize memory reads and maximize cache usage should be incredibly fast. Even though this introduces significant overhead for the branch pointers (about 2x the storage would be required), it should be far more efficient overall.

But this algorithm improves on binary search in another way too: branching. I’m talking in terms of branch instructions, not the branch pointers mentioned above. The fewer branches code takes, the faster a superscalar or pipelined CPU will be able to run your code. For this algorithm, an entire node could be processed (that is, comparing the IPs and determining which branch node to go into) with zero branches, using integer SSE2 (PCMPGTD, PMOVMSKB) and bit-scan forward (BSF).
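Here’s a rough sketch of what I mean, reusing the node layout from above. The keys would be stored with their sign bit flipped so the signed PCMPGTD gives an unsigned ordering, and the final select typically compiles down to a conditional move rather than a jump.

```cpp
#include <emmintrin.h>   // SSE2 intrinsics (PCMPGTD, PMOVMSKB)
#include <intrin.h>      // _BitScanForward (BSF)
#include <cstdint>

struct alignas(64) Node {   // same layout as sketched above
    uint32_t key[8];        // sorted range starts, stored sign-flipped
    uint16_t child[9];
    uint16_t count;
};

// Decide which child of `node` to descend into for `ip`, comparing all
// eight keys at once instead of branching on each comparison.
unsigned SelectChild(const Node& node, uint32_t ip)
{
    // Bias the query the same way the keys were biased when the node was
    // built, so the signed compare below orders them as unsigned values.
    __m128i needle = _mm_set1_epi32(static_cast<int>(ip ^ 0x80000000u));

    __m128i keysLo = _mm_load_si128(reinterpret_cast<const __m128i*>(&node.key[0]));
    __m128i keysHi = _mm_load_si128(reinterpret_cast<const __m128i*>(&node.key[4]));

    // 0xFFFFFFFF for every key greater than ip (PCMPGTD)...
    __m128i gtLo = _mm_cmpgt_epi32(keysLo, needle);
    __m128i gtHi = _mm_cmpgt_epi32(keysHi, needle);

    // ...collapsed to one bit per byte (PMOVMSKB).
    unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(gtLo))
                  | (static_cast<unsigned>(_mm_movemask_epi8(gtHi)) << 16);

    // The lowest set bit (BSF) divided by 4 is the index of the first key
    // greater than ip, which is the child to follow; if no key is greater,
    // take the rightmost child.
    unsigned long bit;
    return _BitScanForward(&bit, mask) ? static_cast<unsigned>(bit / 4) : 8;
}
```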

I can’t be sure how much of a speed difference this would make until I code it up, but I bet it would be at least 200% faster. I’ve been too busy to work on PeerGuardian for quite a while, so I don’t know if this will ever make it into PG. We’re looking for a new coder with more time on their hands.

Fast & Furious

Fast & Furious

It’s a video game (Need 4 Speed) made into a movie (The Fast and the Furious) made into a video game movie (Fast & Furious).

Okay, maybe that’s not the official story, but that’s sure how it felt. Their weird use of GPS felt like an in-game HUD. It even had an “n miles left” announcer voice you might find in time trials. Various scenes felt like the third-person view so common in games like Need 4 Speed.

Adequately fun action, but overall silly story and plenty of bad lines.

New site

New site with blog & projects merged, plus a nicer design.

Kendo and Ramen

Cherry Blossom Festival

Today I went down to Little Tokyo to grab some ramen for lunch, and happened upon the Cherry Blossom Festival. I only stayed a little while, to see some sumo and kendo performances.

I got there late for the sumo unfortunately, but I did get to see a few matches where Dan Kalbfleisch wiped the floor with some other guys.

The kendo demonstration started with sensei Cary Yoshio Mizobe performing tameshigiri—cutting a tatami omote with a katana. His students went on from there to show off their moves with shinai. Sensei Mizobe explained one of the moves: tsuki, a stab to the throat that is apparently difficult enough that he only lets his black-belt students perform it, to lessen the risk of an imprecise strike injuring an opponent. He said he was hired to train Brittany Murphy to perform it for her new movie, The Ramen Girl. The only problem is, they wanted him to train her on this advanced move in eight hours. His only advice was to totally fake it with camera tricks, or risk injury. Thought that was funny :)

Bear McCreary to score Capcom’s Dark Void

8-bit Dark Void

Bear McCreary, composer of Battlestar Galactica, revealed today that he is scoring Capcom’s new game, Dark Void. I remember some videos of this coming out a while ago, from E3 maybe, and it looked like a pretty fun game, but after so much time I had forgotten about it. These new videos make it look as awesome as ever, though, and with Bear scoring it, you know the music will be amazing!

If you like his music, consider posting at Capcom where Bear is trying to get authorization to release a soundtrack CD.