Coding, Page 2

I/O completion ports made easy

I described the basics of I/O completion ports in my last post, but there is still the question of what the easiest way to use them is. Here I’ll show a callback-based application design that I’ve found makes a fully asynchronous program really simple to write.

I touched briefly on attaching our own context data to the OVERLAPPED structure we pass along with I/O operations. It’s this same idea that I’ll expand on here. This time we define a generic structure to use with all our operations, and show how our threads will handle them while dequeuing packets:

struct io_context
{
  OVERLAPPED ovl;
  void (*on_completion)(DWORD error, DWORD transferred,
                        struct io_context *ctx);
};

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

BOOL ret = GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

if(ret)
{
  struct io_context *ctx = (struct io_context*)ovl;
  ctx->on_completion(ERROR_SUCCESS, transferred, ctx);
}
else if(ovl)
{
  DWORD err = GetLastError();

  struct io_context *ctx = (struct io_context*)ovl;
  ctx->on_completion(err, transferred, ctx);
}
else
{
  // error out
}

With this, all our I/O operations will have a callback associated with them. When a completion packet is dequeued, the dequeuing thread picks up the error information, if any, and runs the callback. Having every I/O operation use a single callback mechanism greatly simplifies the design of the entire program.

Let’s say our app is reading a file and sending out its contents. We also want it to prefetch the next buffer so we can start sending right away. Here’s our connection context:

struct connection_context
{
  HANDLE file;
  SOCKET sock;

  WSABUF readbuf;
  WSABUF sendbuf;

  struct io_context readctx;
  struct io_context sendctx;
};

A structure like this is nice because initiating an I/O operation will need no allocations. Note that we need two io_context members because we’re doing a read and send concurrently.

Now the code to use it:

#define BUFFER_SIZE 4096

void begin_read(struct connection_context *ctx)
{
  if(ctx->readbuf.buf)
  {
    // only begin a read if one isn't already running.
    return;
  }

  ctx->readbuf.buf = malloc(BUFFER_SIZE);
  ctx->readbuf.len = 0;

  // zero out io_context structure.
  memset(&ctx->readctx, 0, sizeof(ctx->readctx));

  // set completion callback.
  ctx->readctx.on_completion = read_finished;

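  // note: overlapped file handles have no file position, so a real app
  // would also track an offset here and set readctx.ovl.Offset before
  // each read.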
  ReadFile(ctx->file, ctx->readbuf.buf, BUFFER_SIZE,
           NULL, &ctx->readctx.ovl);
}

void read_finished(DWORD error, DWORD transferred,
                   struct io_context *ioctx)
{
  // get our connection context.
  struct connection_context *ctx =
     (struct connection_context*)((char*)ioctx -
        offsetof(struct connection_context, readctx));

  if(error != ERROR_SUCCESS)
  {
    // handle error.
    return;
  }

  if(!transferred)
  {
    // reached end of file, close out connection.
    free(ctx->readbuf.buf);
    ctx->readbuf.buf = 0;
    return;
  }

  // send out however much we read from the file.

  ctx->readbuf.len = transferred;

  begin_send(ctx);
}

This gives us a very obvious chain of events: read_finished is called when a read completes. Since we only get an io_context structure in our callback, we need to adjust the pointer to get our full connection_context.
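As an aside, winnt.h has a CONTAINING_RECORD macro that does exactly this kind of pointer adjustment, so the cast could equivalently be written as:

struct connection_context *ctx =
   CONTAINING_RECORD(ioctx, struct connection_context, readctx);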

Sending is easy too:

void begin_send(struct connection_context *ctx)
{
  if(ctx->sendbuf.buf)
  {
    // only begin a send if one
    // isn't already running.
    return;
  }

  if(!ctx->readbuf.len)
  {
    // only begin a send if the
    // read buffer has something.
    return;
  }

  // switch buffers.
  ctx->sendbuf = ctx->readbuf;

  // clear read buffer.
  ctx->readbuf.buf = NULL;
  ctx->readbuf.len = 0;

  // zero out io_context structure.
  memset(&ctx->sendctx, 0, sizeof(ctx->sendctx));

  // set completion callback.
  ctx->sendctx.on_completion = send_finished;

  WSASend(ctx->sock, &ctx->sendbuf, 1, NULL, 0,
          &ctx->sendctx.ovl, NULL);

  // start reading next buffer.
  begin_read(ctx);
}

void send_finished(DWORD error, DWORD transferred,
                   struct io_context *ioctx)
{
  // get our connection context.
  struct connection_context *ctx =
     (struct connection_context*)((char*)ioctx -
        offsetof(struct connection_context, sendctx));

  if(error != ERROR_SUCCESS)
  {
    // handle error.
    return;
  }

  // success, clear send buffer and start next send.

  free(ctx->sendbuf.buf);
  ctx->sendbuf.buf = NULL;

  begin_send(ctx);
}

Pretty much more of the same. Again, for brevity I’m leaving out some error checking code and assuming the buffer gets sent out in full. I’m also assuming a single-threaded design—the socket and file functions themselves are thread-safe, so there’s nothing to worry about there, but the buffer management code here would need some extra locking since it could be run concurrently. But the idea should be clear.
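To give a rough idea of what that locking might look like, here’s a minimal sketch of begin_send guarded by a critical section. The cs member is hypothetical: it would be a CRITICAL_SECTION added to connection_context and initialized when the connection is set up, and begin_read would need the same treatment for the read buffer.

void begin_send(struct connection_context *ctx)
{
  EnterCriticalSection(&ctx->cs);

  if(ctx->sendbuf.buf || !ctx->readbuf.len)
  {
    // a send is already running, or there's nothing to send yet.
    LeaveCriticalSection(&ctx->cs);
    return;
  }

  // claim the read buffer as the new send buffer while holding the lock.
  ctx->sendbuf = ctx->readbuf;
  ctx->readbuf.buf = NULL;
  ctx->readbuf.len = 0;

  LeaveCriticalSection(&ctx->cs);

  // from here on only this thread touches sendbuf until the send completes.
  memset(&ctx->sendctx, 0, sizeof(ctx->sendctx));
  ctx->sendctx.on_completion = send_finished;

  WSASend(ctx->sock, &ctx->sendbuf, 1, NULL, 0,
          &ctx->sendctx.ovl, NULL);

  begin_read(ctx);
}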

Update: this subject continued in Tips for efficient I/O.

Visual Studio 2010 Beta 1 in four days

Visual Studio 2010 Beta 1 is being launched for MSDN subscribers in four days on the 18th, and for the general public on the 20th!

High Performance I/O on Windows

I/O completion ports were first introduced in Windows NT 4.0 as a highly scalable, multi-processor capable alternative to the then-typical practices of using select, WSAWaitForMultipleEvents, WSAAsyncSelect, or even running a single thread per client. The most optimal way to perform I/O on Windows—short of writing a kernel-mode driver—is to use I/O completion ports.

A recent post on Slashdot claimed sockets have run their course, which I think is absolutely not true! The author seems to believe the Berkeley sockets API is the only way to perform socket I/O, which is nonsense. Much more modern, scalable and high performance APIs exist today such as I/O completion ports on Windows, epoll on Linux, and kqueue on FreeBSD. In light of this I thought I’d write a little about completion ports here.

The old ways of multiplexing I/O still work pretty well for scenarios with a low number of concurrent connections, but when writing a server daemon to handle a thousand or even tens of thousands of concurrent clients at once, we need to use something different. In this sense the old de facto standard Berkeley sockets API has run its course, because the overhead of managing so many connections is simply too great and makes using multiple processors hard.

An I/O completion port is a multi-processor aware queue. You create a completion port, bind file or socket handles to it, and start asynchronous I/O operations. When they complete—either successfully or with an error—a completion packet is queued up on the completion port, which the application can dequeue from multiple threads. The completion port uses some special voodoo to make sure only a specific number of threads can run at once—if one thread blocks in kernel-mode, it will automatically start up another one.

First you need to create a completion port with CreateIoCompletionPort:

HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE,
   NULL, 0, 0);

It’s important to note that NumberOfConcurrentThreads is not the total number of threads that can access the completion port—you can have as many as you want. This instead controls the number of threads it will allow to run concurrently. Once you’ve reached this number, it will block all other threads. If one of the running threads blocks for any reason in kernel-mode, the completion port will automatically start up one of the waiting threads. Specifying 0 for this is equivalent to the number of logical processors in the system.

Next is creating and associating a file or socket handle, using CreateFile, WSASocket, and CreateIoCompletionPort:

#define OPERATION_KEY 1

HANDLE file = CreateFile(L"file.txt", GENERIC_READ,
   FILE_SHARE_READ, NULL, OPEN_EXISTING,
   FILE_FLAG_OVERLAPPED, NULL);

SOCKET sock = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP,
   NULL, 0, WSA_FLAG_OVERLAPPED);

CreateIoCompletionPort(file, iocp, OPERATION_KEY, 0);
CreateIoCompletionPort((HANDLE)sock, iocp, OPERATION_KEY, 0);

Files and sockets must be opened with the FILE_FLAG_OVERLAPPED and WSA_FLAG_OVERLAPPED flags before they are used asynchronously.

Reusing CreateIoCompletionPort for associating file/socket handles is a weird design choice from Microsoft, but that’s how it’s done. The CompletionKey parameter can be anything you want; it is a value provided back to us when packets are dequeued. I define an OPERATION_KEY here to use as the CompletionKey, the significance of which I’ll get to later.

Next we have to start up some I/O operations. I’ll skip setting up the socket and go right to sending data. We start these operations using ReadFile and WSASend:

OVERLAPPED readop, sendop;
WSABUF sendwbuf;
char readbuf[256], sendbuf[256];

memset(&readop, 0, sizeof(readop));
memset(&sendop, 0, sizeof(sendop));

sendwbuf.len = sizeof(sendbuf);
sendwbuf.buf = sendbuf;

BOOL readstatus = ReadFile(file, readbuf,
   sizeof(readbuf), NULL, &readop);

if(!readstatus)
{
  DWORD readerr = GetLastError();

  if(readerr != ERROR_IO_PENDING)
  {
    // error reading.
  }
}

int writestatus = WSASend(sock, &sendwbuf, 1, NULL,
   0, &sendop, NULL);

if(writestatus)
{
  int writeerr = WSAGetLastError();

  if(writeerr != WSA_IO_PENDING)
  {
    // error sending.
  }
}

Every I/O operation using a completion port has an OVERLAPPED structure associated with it. Windows uses this internally in unspecified ways, only saying we need to zero them out before starting an operation. The OVERLAPPED structure will be given back to us when we dequeue the completion packets, and can be used to pass along our own context data.

I have left out the standard error checking up until now for brevity’s sake, but this one doesn’t work quite like one might expect so it is important here. When starting an I/O operation, an error might not really be an error. If the function succeeds all is well, but if the function fails, it is important to check the error code with GetLastError or WSAGetLastError.

If these functions return ERROR_IO_PENDING or WSA_IO_PENDING, the call actually still succeeded. All these error codes mean is that an asynchronous operation has been started and its completion is pending, as opposed to completing immediately. A completion packet will be queued up regardless of whether the operation completed asynchronously or not.

Dequeuing packets from a completion port is handled by the GetQueuedCompletionStatus function.
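A minimal call, with the three possible outcomes stubbed out, might look something like this (reusing the iocp handle from earlier):

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

BOOL ret = GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

if(ret)
{
  // a packet for a successful I/O was dequeued.
}
else if(ovl)
{
  // a packet for a failed I/O was dequeued;
  // GetLastError() gives the operation's error code.
}
else
{
  // the call itself failed (bad handle, timeout, and so on).
}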

This function dequeues completion packets, consisting of the completion key we specified in CreateIoCompletionPort and the OVERLAPPED structure we gave while starting the I/O. If the I/O transferred any data, it will retrieve how many bytes were transferred too. Again the error handling is a bit weird on this one, having three error states.

This can be run from as many threads as you like, even a single one. It is common practice to run a pool of twice the number of threads as there are logical processors available, to keep the CPU active with the aforementioned functionality of starting a new thread if a running one blocks.

Unless you are going for a single-threaded design, I recommend starting two threads per logical CPU. Even if your app is designed to be 100% asynchronous, you will still run into blocking when locking shared data and even get unavoidable hidden blocking I/Os like reading in paged out memory. Keeping two threads per logical CPU will keep the processor busy without overloading the OS with too much context switching.
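For illustration, creating such a pool might look something like the sketch below. The worker_thread function is a hypothetical stand-in for whatever loops on GetQueuedCompletionStatus (one possible version is sketched near the end of this post):

SYSTEM_INFO si;
GetSystemInfo(&si);

DWORD threadcount = si.dwNumberOfProcessors * 2;
HANDLE threads[64];
DWORD i;

// clamp to the fixed-size array used in this example.
if(threadcount > 64)
  threadcount = 64;

for(i = 0; i < threadcount; ++i)
{
  // each worker gets the completion port handle as its parameter.
  threads[i] = CreateThread(NULL, 0, worker_thread, iocp, 0, NULL);
}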

This is all well and good, but two I/O operations were initiated—a file read and a socket send. We need a way to tell their completion packets apart. This is why we need to attach some state to the OVERLAPPED structure when we call those functions:

struct my_context
{
  OVERLAPPED ovl;
  int isread;
};

struct my_context readop, sendop;

memset(&readop.ovl, 0, sizeof(readop.ovl));
memset(&sendop.ovl, 0, sizeof(sendop.ovl));

readop.isread = 1;
sendop.isread = 0;

ReadFile(file, readbuf, sizeof(readbuf), NULL, &readop.ovl);
WSASend(sock, &sendwbuf, 1, NULL, 0, &sendop.ovl, NULL);

Now we can tell the operations apart when we dequeue them:

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

struct my_context *ctx = (struct my_context*)ovl;

if(ctx->isread)
{
  // read completed.
}
else
{
  // send completed.
}

The last important thing to know is how to queue up your own completion packets. This is useful if you want to split an operation up to be run on the thread pool, or if you want to exit a thread waiting on a call to GetQueuedCompletionStatus. To do this, we use the PostQueuedCompletionStatus function:

#define EXIT_KEY 0

struct my_context ctx;

PostQueuedCompletionStatus(iocp, 0, OPERATION_KEY, &ctx.ovl);
PostQueuedCompletionStatus(iocp, 0, EXIT_KEY, NULL);

Here we post two things onto the queue. With the first, we post our own structure onto the queue, to be processed by our thread pool. With the second, we give a new completion key: EXIT_KEY. The thread that processes this packet can test whether the completion key is EXIT_KEY to know when it needs to stop dequeuing packets and shut down.

Other than the completion port handle, Windows does not use any of the parameters given to PostQueuedCompletionStatus. They are entirely for our use, to be dequeued with GetQueuedCompletionStatus.
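Tying it together, here is one possible worker_thread along with a shutdown sequence. Both are just sketches, reusing the hypothetical threads array and threadcount from the pool example above and leaving out error checking as before:

DWORD WINAPI worker_thread(void *param)
{
  HANDLE iocp = (HANDLE)param;

  for(;;)
  {
    OVERLAPPED *ovl;
    ULONG_PTR completionkey;
    DWORD transferred;

    GetQueuedCompletionStatus(iocp, &transferred,
       &completionkey, &ovl, INFINITE);

    if(completionkey == EXIT_KEY)
      break;

    // otherwise dispatch the packet as usual.
  }

  return 0;
}

// at shutdown: wake every worker once, then wait for them all to exit.
for(i = 0; i < threadcount; ++i)
  PostQueuedCompletionStatus(iocp, 0, EXIT_KEY, NULL);

WaitForMultipleObjects(threadcount, threads, TRUE, INFINITE);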

That’s all I have to write for now, and it should be everything one would need to get started learning these high performance APIs! I will make another post shortly detailing some good patterns for completion port usage, and some optimization tips to ensure you use these I/O APIs efficiently.

Update: this subject continued in I/O completion ports made easy.

User Mode Scheduling in Windows 7

Don’t use threads. Or more precisely, don’t over-use them. It’s one of the first things fledgling programmers learn after they start using threads. This is because threading involves a lot of overhead. In short, using more threads may improve concurrency, but it will give you less overall throughput as more processing is put into simply managing the threads instead of letting them run. So programmers learn to use threads sparingly.

When normal threads run out of time, or block on something like a mutex or I/O, they hand off control to the operating system kernel. The kernel then finds a new thread to run, and switches back to user-mode to run the thread. This context switching is what User Mode Scheduling looks to alleviate.

User Mode Scheduling can be thought of as a cross between threads and thread pools. An application creates one or more UMS scheduler threads—typically one for each processor. It then creates several UMS worker threads for each scheduler thread. The worker threads are the ones that run your actual code. Whenever a worker thread runs out of time, it is put on the end of its scheduler thread’s queue. If a worker thread blocks, it is put on a waiting list to be re-queued by the kernel when whatever it was waiting on finishes. The scheduler thread then takes the worker thread from the top of the queue and starts running it. Like the name suggests, this happens entirely in user-mode, avoiding the expensive user->kernel->user-mode transitions. Letting each thread run for exactly as long as it needs helps to solve the throughput problem. Work is only put into managing threads when absolutely necessary instead of in ever smaller time slices, leaving more time to run your actual code.

A good side effect of this is that UMS threads also help to alleviate the cache thrashing problems typical in heavily-threaded applications. Regardless of your data sharing patterns, each thread still needs its own storage for stack space, processor context, and thread-local storage. Every time a context switch happens, some data may need to be pushed out of caches in order to load some kernel-mode code and the next thread’s data. By switching between threads less often, cache can be put to better use for the task at hand.

If you have ever had a chance to use some of the more esoteric APIs included with Windows, you might be wondering why we need UMS threads when we have fibers, which offer similar co-operative multitasking. Fibers have a lot of special exceptions. There are things that aren’t safe to do with them. Libraries that rely on thread-local storage, for instance, will likely walk all over themselves if used from within fibers. A UMS thread, on the other hand, is a full-fledged thread: it supports TLS and has no real special caveats to keep in mind while using it.

I still wouldn’t count out thread pools just yet. UMS threads are still more expensive than a thread pool and the large memory requirements of a thread still apply here, so things like per-client threads in internet daemons are still out of the question if you want to be massively scalable. More likely, UMS threads will be most useful for building thread pools. Most thread pools launch two or three threads per CPU to help stay busy when given blocking tasks, and UMS threads will at least help keep their time slice usage optimal.

From what I understand, the team behind Microsoft’s Concurrency Runtime, to be included with Visual C++ 2010, was one of the primary forces behind UMS threads. They worked very closely with the kernel folks to find the most scalable way to enable the super-parallel code that will be possible with the CR.

Optimizing IP range searching in PeerGuardian

I was working on something completely different last night, when an elegant idea came to mind on how to significantly speed up PeerGuardian’s IP searching. It’s funny how an idea can just pop into the mind about a problem that hasn’t been thought of in a long time.

Right now PeerGuardian uses a binary search to match IPs. This is already pretty efficient, running in ⌈log₂ N⌉ tests—so for 100,000 IP ranges, about 16 tests need to be done. This has the additional advantage of having no memory overhead.
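To make that concrete, the lookup is essentially a binary search over a sorted, non-overlapping array of ranges; something like this sketch, where the ip_range layout is just an assumption for illustration:

struct ip_range
{
  unsigned int start; // inclusive bounds, host byte order.
  unsigned int end;
};

// ranges must be sorted by start and non-overlapping.
int match_ip(const struct ip_range *ranges, unsigned int count,
             unsigned int ip)
{
  unsigned int lo = 0, hi = count;

  while(lo < hi)
  {
    unsigned int mid = lo + (hi - lo) / 2;

    if(ip < ranges[mid].start)
      hi = mid;
    else if(ip > ranges[mid].end)
      lo = mid + 1;
    else
      return 1; // ip falls inside ranges[mid].
  }

  return 0;
}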

My idea is to use a structure similar to a B+tree, packing as many IPs and branch pointers into a cache line as possible. On today’s architectures, a cache line is typically 64 bytes, so 8 IPs and 9 branch pointers would fit on each node, meaning only about ⌈log₉ N⌉ nodes need to be read to find a match. So in order to find a match in 100,000 IP ranges, only about 5 nodes would need to be read.

CPUs always read and cache data in blocks (a cache line), so an algorithm that keeps this in mind to minimize memory reads and maximize cache usage should be incredibly fast. Even though this introduces significant overhead for branch pointers (about 2x the storage would be required), it should be far more efficient overall.

But this algorithm improves in another way too: branching. I’m talking in terms of branch instructions, not the branch pointers mentioned above. The fewer branches code takes, the faster a superscalar or pipelined CPU will be able to run your code. For this algorithm, an entire node could be processed (that is, comparing the IPs and determining which branch node to go into) with zero branches using integer SSE2 (PCMPGTD, PMOVMSKB), and bit-scan forward (BSF).
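Here’s a rough sketch of how one node might be processed this way with SSE2 intrinsics; the node layout is an assumption for illustration. PCMPGTD is a signed compare, so the stored keys and the search key are both biased by 0x80000000 to make it behave like an unsigned compare, and a real implementation could use a sentinel key to drop the final branch entirely:

#include <emmintrin.h> // SSE2 intrinsics
#include <intrin.h>    // _BitScanForward

struct node
{
  // 8 range-start IPs, sorted ascending, each stored XORed with
  // 0x80000000 so the signed PCMPGTD acts like an unsigned compare.
  unsigned int keys[8];
};

// returns which of the node's 9 branches to follow for this ip.
unsigned int find_branch(const struct node *n, unsigned int ip)
{
  __m128i key = _mm_set1_epi32((int)(ip ^ 0x80000000u));

  // nodes are assumed to be 16-byte (ideally cache line) aligned.
  __m128i lo = _mm_load_si128((const __m128i*)&n->keys[0]);
  __m128i hi = _mm_load_si128((const __m128i*)&n->keys[4]);

  // PCMPGTD: 0xFFFFFFFF wherever keys[i] > ip, else 0.
  // PMOVMSKB packs one bit per byte into a mask, 4 bits per key.
  unsigned int mask =
     (unsigned int)_mm_movemask_epi8(_mm_cmpgt_epi32(lo, key)) |
     ((unsigned int)_mm_movemask_epi8(_mm_cmpgt_epi32(hi, key)) << 16);

  unsigned long bit;
  if(!_BitScanForward(&bit, mask)) // BSF: first key greater than ip.
    return 8;                      // ip >= every key: take the last branch.

  return bit / 4;                  // 4 mask bits per 32-bit key.
}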

I can’t be sure how much of a speed difference this would make until I code it up, but I bet it would be at least 200% faster. I’ve been too busy to work on PeerGuardian for quite a while, so I don’t know if this will ever make it into PG. We’re looking for a new coder with more time on their hands.

Alex Filo has the right idea, UniformWrapPanel

WPF’s WrapPanel is missing a key feature: the ability to create vertical columns, but still fill the maximum amount of width available. One less annoyance in WPF, thanks to Alex Filo’s UniformWrapPanel.

Qt 4.5 released, still using three year old GCC

Qt 4.5 is out, along with Qt Creator. It’s still using GCC 3.4.5, from a January 2006 codebase. Sigh.

Databinding TextBlocks with XAML

I’ve become frustrated lately with trying to databind a list of strings to a TextBlock. That is, if I have:

string[] mytext = new string[] { "a", "b", "c" };

And I want the effective end result to be:

<TextBlock>a, b, c</TextBlock>

Or maybe I want something more complex like hyperlinks:

<TextBlock>
   <Hyperlink>a</Hyperlink>,
   <Hyperlink>b</Hyperlink>,
   <Hyperlink>c</Hyperlink>
</TextBlock>

Basically what I’m looking for is an ItemsControl that works on TextBlocks (or Spans, etc.), with a Separator template to insert those “, ” in between the items.

Your first instinct might be to use a StackPanel instead, but StackPanel wasn’t made for displaying text. It might fool you initially, but you quickly notice it doesn’t behave anything like text: there’s no proper kerning, RTL, wrapping, or anything else the text classes were made to support.

I’m pretty surprised that WPF doesn’t have anything like this, as it seems like displaying lists as text would be a common enough thing for any app. Unfortunately I’m still not well versed enough in WPF to create such a custom control for myself, and haven’t had a whole lot of time to investigate it.

Using HSL colors in WPF

One thing that has always irritated me about WPF is you’re still stuck specifying colors in RGB. HSL just feels so much more natural from a design standpoint. Well, we’re not completely out of luck—we’ve got markup extensions.

So I created my own HslColor and HslBrush extensions which are fairly simple to use:

<Window xmlns:e="clr-namespace:WpfExtensions"
        Background="{e:HslBrush H=300,S=50,L=75,A=80}"/>

Hue is specified in degrees from 0 to 360, while Saturation, Lightness, and Alpha are from 0 to 100. The parameters are all doubles and it converts to scRGB behind the scenes, which means you actually get a much higher color precision than if you had just used the equivalent RGB hex. With Windows 7 having native support for scRGB, this will future-proof your application to make good use of upcoming monitors with Deep Color support.
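For reference, the standard HSL to RGB math (ignoring the scRGB conversion the extension does behind the scenes) goes roughly like this. The helper below is purely illustrative, not the extension’s actual code; it just mirrors the 0 to 360 and 0 to 100 parameter ranges:

#include <math.h>

// standard HSL -> RGB: h in degrees (0-360), s and l as 0-100 percentages,
// outputs in the 0.0-1.0 range.
void hsl_to_rgb(double h, double s, double l,
                double *r, double *g, double *b)
{
  s /= 100.0;
  l /= 100.0;

  double c = (1.0 - fabs(2.0 * l - 1.0)) * s;        // chroma
  double hp = fmod(h, 360.0) / 60.0;                 // hue sextant, 0-6
  double x = c * (1.0 - fabs(fmod(hp, 2.0) - 1.0));
  double m = l - c / 2.0;

  double rp = 0, gp = 0, bp = 0;
  if(hp < 1)      { rp = c; gp = x; }
  else if(hp < 2) { rp = x; gp = c; }
  else if(hp < 3) { gp = c; bp = x; }
  else if(hp < 4) { gp = x; bp = c; }
  else if(hp < 5) { rp = x; bp = c; }
  else            { rp = c; bp = x; }

  *r = rp + m;
  *g = gp + m;
  *b = bp + m;
}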

My Windows Vista/7/8 Wishlist

These are some changes I’ve been trying to get made since Vista entered beta. Now 7’s beta has begun and chances still look bleak. Maybe I’ll have more luck in 8?