ClearType in Windows 7

Posted in Coding, Microsoft on November 4th, 2009 by Cory – Comments Off

One of my big pet peeves with ClearType prior to Windows 7 was that it only anti-aliased horizontally with sub-pixels. This is great for small fonts, because at such a small scale traditional anti-aliasing has a smudging effect, reducing clarity and increasing the font’s weight. For large fonts however, it introduces some very noticeable aliasing on curves, as best seen in the ‘6′ and ‘g’ here:

int64_gdi

You’ve probably noticed this on websites everywhere, but have come to accept it. Depending on your browser and operating system, you can probably see it in the title here. This problem is solved in Windows 7 with the introduction of DirectWrite, which combines ClearType’s horizontal anti-aliasing with regular vertical anti-aliasing when using large font sizes:

int64_dw

Of course, DirectWrite affects more than just Latin characters. Any glyphs with very slight angles will see a huge benefit, such as hiragana:

makoto

Unfortunately, this isn’t a free upgrade. For whatever reason, Microsoft didn’t make all the old GDI functions use DirectWrite’s improvements so to make use of this, all your old GDI and DrawText code will need to be upgraded to use Direct2D and DirectWrite directly, so an old WM_PAINT procedure like this:

PAINTSTRUCT ps;
HDC hdc = BeginPaint(hwnd, &ps);

HFONT font = CreateFont(-96, 0, 0, 0, FW_NORMAL,
                        0, 0, 0, 0, 0, 0, 0, 0, L"Calibri");

SelectObject(hdc, (HGDIOBJ)font);

RECT rc;
GetClientRect(hwnd, &rc);

DrawText(hdc, L"Int64.org", 9, &rc,
         DT_SINGLELINE | DT_CENTER | DT_VCENTER);

EndPaint(hwnd, &ps);

Will turn into this:

ID2D1Factory *d2df;

D2D1CreateFactory(D2D1_FACTORY_TYPE_SINGLE_THREADED,
   __uuidof(ID2D1Factory), 0, (void**)&d2df);

IDWriteFactory *dwf;

DWriteCreateFactory(DWRITE_FACTORY_TYPE_SHARED,
   __uuidof(IDWriteFactory), (IUnknown**)&dwf);

IDWriteTextFormat *dwfmt;

dwf->CreateTextFormat(L"Calibri", 0, DWRITE_FONT_WEIGHT_REGULAR,
   DWRITE_FONT_STYLE_NORMAL, DWRITE_FONT_STRETCH_NORMAL,
   96.0f, L"en-us", &dwfmt);

dwfmt->SetTextAlignment(DWRITE_TEXT_ALIGNMENT_CENTER);
dwfmt->SetParagraphAlignment(DWRITE_PARAGRAPH_ALIGNMENT_CENTER);

RECT rc;
GetClientRect(hwnd, &rc);

D2D1_SIZE_U size = D2D1::SizeU(rc.right - rc.left,
                               rc.bottom - rc.top);

ID2D1HwndRenderTarget *d2drt;

d2df->CreateHwndRenderTarget(D2D1::RenderTargetProperties(),
   D2D1::HwndRenderTargetProperties(hwnd, size), &d2drt);

ID2D1SolidColorBrush *d2db;

d2drt->CreateSolidColorBrush(D2D1::ColorF(D2D1::ColorF::Black),
   &d2db);

D2D1_SIZE_F layoutSize = d2drt->GetSize();
D2D1_RECT_F layoutRect = D2D1::RectF(0.0, 0.0,
   layoutSize.width, layoutSize.height);

d2drt->BeginDraw();
d2drt->DrawText(L"Int64.org", 9, dwfmt, layoutRect, d2db);
d2drt->EndDraw();

This is no small change, and considering this API won’t work on anything but Vista and Windows 7, you’ll be cutting out a lot of users if you specialize for it. While you could probably make a clever DrawText wrapper, Direct2D and DirectWrite are really set up to get you the most benefit if you’re all in. Hopefully general libraries like Pango and Cairo will get updated backends for it.

DirectWrite has other benefits too, like sub-pixel rendering. When you render text in GDI, glyphs will always get snapped to pixels. If you have two letters side by side, it will choose to always start the next letter 1 or 2 pixels away from the last — but what if the current font size says it should actually be a 1.5 pixel distance? In GDI, this will be rounded to 1 or 2. This is also noticeable with kerning, which tries to remove excessive space between specific glyphs such as “Vo”. Because of this, most of the text you see in GDI is very slightly warped. It’s much more apparent when animating, where it causes the text to have a wobbling effect as it constantly snaps from one pixel to the next instead of smoothly transitioning between the two.

DirectWrite’s sub-pixel rendering helps to alleviate this by doing exactly that: glyphs can now start rendering at that 1.5 pixel distance, or any other point in between. Here you can see the differing space between the ‘V’ and ‘o’, as well as a slight difference between the ‘o’s with DirectWrite on the right side, because they are being rendered on sub-pixel offsets:

volcano_subpixel

The difference between animating with sub-pixel rendering and without is staggering when we view it in motion:

volcano_anim

Prior to DirectWrite the normal way to animate like this was to render to a texture with monochrome anti-aliasing (that is, without ClearType), and transform the texture while rendering. The problem with that is the transform will introduce a lot of imperfections without expensive super-sampling, and of course it won’t be able to use ClearType. With DirectWrite you get pixel-perfect ClearType rendering every time.

Apparently WPF 4 is already using Direct2D and DirectWrite to some degree, hopefully there will be high-quality text integrated in Flash’s future. Firefox has also been looking at adding DirectWrite support, but I haven’t seen any news of Webkit (Chrome/Safari) or Opera doing the same. It looks like Firefox might actually get it in before Internet Explorer. Edit: looks like Internet Explorer 9 will use DirectWrite — wonder which will go gold with the feature first?

Direct2D and DirectWrite are included in Windows 7, but Microsoft has backported them in the Platform Update for Windows Server 2008 and Windows Vista so there’s no reason people who are sticking with Vista should be left out. Are there people sticking with Vista?

Efficient stream parsing in C++

Posted in Coding, Feature Article, Scalability on September 1st, 2009 by Cory – Comments Off

A while ago I wrote about creating a good parser and while the non-blocking idea was spot-on, the rest of it really isn’t very good in C++ where we have the power of templates around to help us.

I’m currently finishing up a HTTP library and have been revising my views on stream parsing because of it. I’m still not entirely set on my overall implementation, but I’m nearing completion and am ready to share my ideas. First, I’ll list my requirements:

  • I/O agnostic: the parser does not call any I/O functions and does not care where the data comes from.
  • Pull parsing: expose a basic stream of parsed elements that the program reads one at a time.
  • Non-blocking: when no more elements can be parsed from the input stream, it must immediately return something indicating that instead of waiting for more data.
  • In-situ reuse: for optimal performance and scalability the parser should avoid copying and allocations, instead re-using data in-place from buffers.
  • A simple, easy to follow parser: having the parser directly handle buffers can easily lead to spaghetti code, so I’m simply getting rid of that. The core parser must operate on a single iterator range.

To accomplish this I broke this out into three layers: a core parser, a buffer, and a buffer parser.

The core parser

Designing the core parser was simple. I believe I already have a solid C++ parser design in my XML library, so I’m reusing that concept. This is fully in-situ pull parser that operates on a range of bidirectional iterators and returns back a sub-range of those iterators. The pull function returns ok when it parses a new element, done when it has reached a point that could be considered an end of the stream, and need_more when an element can’t be extracted from the passed in iterator range. Using this parser is pretty simple:

typedef std::deque<char> buffer_type;
typedef http::parser<buffer_type::iterator> parser_type;

buffer_type buffer;

parser_type p;
parser_type::node_type n;
parser_type::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  std::deque<char>::iterator first = buffer.begin();

  while((r = p.parse(first, buffer.end(), n)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
    }
  }

  buffer.erase(buffer.begin(), first); // remove all the used
                                       // data from the buffer.
} while(r == http::result_types::need_more);

By letting the user pass in a new range of iterators to parse each time, we have the option of updating the stream with more data when need_more is returned. The parse() function also updates the first iterator so that we can pop any data prior to it from the data stream.

By default the parser will throw an exception when it encounters an error. This can be changed by calling an overload and handling the error result type:

typedef std::deque<char> buffer_type;
typedef http::parser<buffer_type::iterator> parser_type;

buffer_type buffer;

parser_type p;
parser_type::node_type n;
parser_type::error_type err;
parser_type::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  std::deque<char>::iterator first = buffer.begin();

  while((r = p.parse(first, buffer.end(), n, err)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
    }
  }

  buffer.erase(buffer.begin(), first); // remove all the used
                                       // data from the buffer.
} while(r == http::result_types::need_more);

if(r == http::result_types::error)
{
  std::cerr
    << "an error occured at "
    << std::distance(buffer.begin(), err.position())
    << ": "
    << err.what()
    << std::endl;
}

The buffer

Initially I was testing my parser with a deque<char> like above. This let me test the iterator-based parser very easily by incrementally pushing data on, parsing some of it, and popping off what was used. Unfortunately, using a deque means we always have an extra copy, from an I/O buffer into the deque. Iterating a deque is also a lot slower than iterating a range of pointers because of the way deque is usually implemented. This inefficiency is acceptable for testing, but just won't work in a live app.

My buffer class is I/O- and parsing-optimized, operating on pages of data. It allows pages to be inserted directly from I/O without copying. Ones that weren't filled entirely can still be filled later, allowing the user to commit more bytes of a page as readable. One can use scatter/gather I/O to make operations span multiple pages contained in a buffer.

The buffer exposes two types of iterators. The first type is what we are used to in deque: just a general byte stream iterator. But this incurs the same cost as deque: each increment to the iterator must check if it's at the end of the current page and move to the next. A protocol like HTTP can fit a lot of elements into a single 4KiB page, so it doesn't make sense to have this cost. This is where the second iterator comes in: the page iterator. A page can be thought of as a Range representing a subset of the data in the full buffer. Overall the buffer class looks something like this:

struct page
{
  const char *first;    // the first byte of the page.
  const char *last;     // one past the last byte of the page.
  const char *readpos;  // the first readable byte of the page.
  const char *writepos; // the first writable byte of the page,
                        // one past the last readable byte.
};

class buffer
{
public:
  typedef ... size_type;
  typedef ... iterator;
  typedef ... page_iterator;

  void push(page *p); // pushes a page into the buffer.  might
                      // be empty, semi-full, or full.

  page* pop(); // pops the first fully read page from from the buffer.

  void commit_write(size_type numbytes); // merely moves writepos
                                         // by some number of bytes.

  void commit_read(size_type numbytes); // moves readpos by
                                        // some number of bytes.

  iterator begin() const;
  iterator end() const;

  page_iterator pages_begin() const;
  page_iterator pages_end() const;
};

One thing you may notice is it expects you to push() and pop() pages directly onto it, instead of allocating its own. I really hate classes that allocate memory – in terms of scalability the fewer places that allocate memory, the easier it will be to optimize. Because of this I always try to design my code to – if it makes sense – have the next layer up do allocations. When it doesn't make sense, I document it. Hidden allocations are the root of evil.

The buffer parser

Unlike the core parser, the buffer parser isn't a template class. The buffer parser exposes the same functionality as a core parser, but using a buffer instead of iterator ranges.

This is where C++ gives me a big advantage. The buffer parser is actually implemented with two core parsers. The first is a very fast http::parser<const char*>. It uses this to parse as much of a single page as possible, stopping when it encounters need_more and no more data can be added to the page. The second is a http::parser<buffer::iterator>. This gets used when the first parser stops, which will happen very infrequently – only when a HTTP element spans multiple pages.

This is fairly easy to implement, but required a small change to my core parser concept. Because each has separate internal state, I needed to make it so I could move the state between two parsers that use different iterators. The amount of state is actually very small, making this a fast operation.

The buffer parser works with two different iterator types internally, so I chose to always return a buffer::iterator range. The choice was either that or silently copy elements spanning multiple pages, and this way lets the user of the code decide how they want to handle it.

Using the buffer parser is just as easy as before:

http::buffer buffer;
http::buffer_parser p;
http::buffer_parser::node_type n;
http::buffer_parser::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  while((r = p.parse(buffer, n)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
    }
  }

  pop_used(buffer); // remove all the used
                    // data from the buffer.
} while(r == http::result_types::need_more)

The I/O layer

I'm leaving out an I/O layer for now. I will probably write several small I/O systems for it once I'm satisfied with the parser. Perhaps one using asio, one using I/O completion ports, and one using epoll. I've designed this from the start to be I/O agnostic but with optimizations that facilitate efficient forms of all I/O, so I think it could be an good benchmark of the various I/O subsystems that different platforms provide.

One idea I've got is to use Winsock Kernel to implement a kernel-mode HTTPd. Not a very good idea from a security standpoint, but would still be interesting to see the effects on performance. Because the parser performs no allocation, no I/O calls, and doesn't force the use of exceptions, it should actually be very simple to use in kernel-mode.

C++1x loses Concepts

Posted in Coding on July 22nd, 2009 by Cory – Comments Off

Bjarne Stroustrup and Herb Sutter have both reported on the ISO C++ meeting in Frankfurt a week ago, in which the much-heralded feature “concepts” were removed from C++1x.

Concepts are a powerful feature aimed at improving overloading (basically, removing the extra work in using things like iterator categories) and moving type checking up the ladder so that more reasonable error messages can be produced when a developer passes in the wrong type (think a single error line instead of multiple pages of template crap). Apparently the feature was a lot less solid than most of us thought, with a huge amount of internal arguing within the committee on a lot of the fundamental features of it. It seems that while most agreed concepts were a good idea, nobody could agree on how to implement them.

I’m definitely disappointed by this, but I’m also glad they chose to remove concepts instead of further delaying the standard, or worse putting out a poorly designed one. Instead, it seems like there is hope for a smaller C++ update to come out in 4-5 years that adds a more well thought out concepts feature. There are plenty of other C++1x language features to be happy about for though, like variadic templates, rvalue references, and lambda functions!

You may notice I’ve been saying C++1x here instead of C++0x — that’s because it’s pretty obvious to everyone now that we won’t be getting the next C++ standard in 2009, but more likely 2011 or 2012. Just in time for the end of the world!

I/O Improvements in Windows Vista

Posted in Coding, Feature Article, Scalability on May 30th, 2009 by Cory – Comments Off

My tips for efficient I/O are relevant all the way back to coding for Windows 2000. A lot of time has passed since then though, and for all the criticism it got, Windows Vista actually brought in a few new ways to make I/O even more performant than before.

This will probably be my last post on user-mode I/O until something new and interesting comes along, completing what started a couple weeks ago with High Performance I/O on Windows.

Synchronous completion

In the past, non-blocking I/O was a great way to reduce the stress on a completion port. An unfortunate side-effect of this was an increased amount of syscalls — the last non-blocking call you make will do nothing, only returning WSAEWOULDBLOCK. You would still need to call an asynchronous version to wait for more data.

Windows Vista solved this elegantly with SetFileCompletionNotificationModes. This function lets you tell a file or socket that you don’t want a completion packet queued up when an operation completes synchronously (that is, a function returned success immediately and not ERROR_IO_PENDING). Using this, the last I/O call will always be of some use — either it completes immediately and you can continue processing, or it starts an asynchronous operation and a completion packet will be queued up when it finishes.

Like the non-blocking I/O trick, continually calling this can starve other operations in a completion port if a socket or file feeds data faster than you can process it. Care should be taken to limit the number of times you continue processing synchronous completions.

Reuse memory with file handles

If you want to optimize even more for throughput, you can associate a range of memory with an unbuffered file handle using SetFileIoOverlappedRange. This tells the OS that a block of memory will be re-used, and should be kept locked in memory until the file handle is closed. Of course if you won’t be performing I/O with a handle very often, it might just waste memory.

Dequeue multiple completion packets at once

A new feature to further reduce the stress on a completion port is GetQueuedCompletionStatusEx, which lets you dequeue multiple completion packets in one call.

If you read the docs for it, you’ll eventually realize it only returns error information if the function itself fails — not if an async operation fails. Unfortunately this important information is missing from all the official docs I could find, and searching gives nothing. So how do you get error information out of GetQueuedCompletionStatusEx? Well, after playing around a bit I discovered that you can call GetOverlappedResult or WSAGetOverlappedResult to get it, so not a problem.

This function should only be used if your application has a single thread or handles a high amount of concurrent I/O operations, or you might end up defeating the multithreading baked in to completion ports by not letting it spread completion notifications around. If you can use it, it’s a nice and easy way to boost the performance of your code by lowering contention on a completion port.

Bandwidth reservation

One large change in Windows Vista was I/O scheduling and prioritization. If you have I/O that is dependant on steady streaming like audio or video, you can now use SetFileBandwidthReservation to help ensure it will never be interrupted by something else hogging a disk.

Cancel specific I/O requests

A big pain pre-Vista was the inability to cancel individual I/O operations. The only option was to cancel all operations for a handle, and only from the thread which initiated them.

If it turns out some I/O operation is no longer required, it is now possible to cancel individual I/Os using CancelIoEx. This much needed function replaces the almost useless CancelIo, and opens the doors to sharing file handles between separate operations.

Visual C++ 2010 Beta 1

Posted in Coding, Microsoft on May 19th, 2009 by Cory – Comments Off

Visual Studio 2010 Beta 1 was released yesterday for MSDN subscribers. Probably the most anticipated release in a while for C++ developers, 2010 is Microsoft’s attempt to give C++ first-class support, something which hasn’t been seen since Visual Studio 6.0.

Update: downloads are now available for non-MSDN subscribers.

On the compiler side of things, we get partial C++0x support in the form of lambda expressions, rvalue references, auto, decltype, and static assert. The features are piled on with an improved TR1 library — finally including the much requested stdint.h and cstdint headers, but still lacking inttypes.h.

Also included is the Parallel Patterns Library, a new task-based concurrency library that makes heavy use of the C++0x features for a nice modern design. I mentioned before that on Windows 7 this will make use of a User-mode scheduled thread pool so it should be really efficient. Unfortunately given its proprietary nature I’m not sure how much use it will get.

The first thing you will notice on the IDE side is the inline error checking. Something we’ve enjoyed while editing C# for some time, we now get the red squiggly lines when an error is found. It works fairly well, but support for lambda expressions has not been written yet.

Intellisense has markedly improved since 2008. Using advanced C++ or a Boost library no longer guarantees it breaking. It has worked with nearly all the C++ I’ve thrown at it so far.

You can also see an External Dependencies virtual folder added to your project source, which is dynamically filled with all the files Intellisense will scan. I’ve found it is not terribly useful, though, because even with small projects the header count increases rapidly enough to make the virtual folder become an unintelligible mess.

The problem is only aggravated by libraries like Boost, which have hundreds of headers organized nicely in folders. Putting them into a single virtual folder just doesn’t work.

This release also marks the move to the extensible MSBuild system for C++ projects, which aims to provide functionality similar to GNU make in an XML format.

Perhaps the most obvious change for the overall IDE is that the main UI is now done entirely in WPF. It sounded like a decent plan at first but I’m not too happy with it now. Minor differences from the way native controls behave can be pretty annoying, and the five to twenty second load time makes it less useful for opening random .cpp files, when 2008 would load them in one or two seconds.

C++ ORM framework for SQLite

Posted in Coding on May 18th, 2009 by Cory – Comments Off

Over the past week I’ve been rewriting my rather dated SQLite wrapper to have an efficient, modern C++ feel. The basic wrapper is there, but I was looking for something a little more this time.

While looking at the problem I decided I was spending too much time writing boilerplate SQL for all my types so I decided to look at existing ORM frameworks. I’m pretty picky about my C++ though, and couldn’t find anything I liked so I started writing my own. Instead of creating a tool to generate C++, I wanted to take a pure approach using native C++ types and template metaprogramming.

What I ended up with is not a full ORM framework, and I’m not particularly interested in making it one. All I’m aiming for is removing boilerplate code while leaving it easy to extend it for more complex queries. Here’s what I’ve got so far:

struct my_object
{
  int id;
  std::string value;
  boost::posix_time::ptime time;
};

typedef boost::mpl::vector<
  sqlite3x::column<
    my_object, int, &my_object::id,
    sqlite3x::primary_key, sqlite3x::auto_increment
  >,
  sqlite3x::column<
    my_object, std::string, &my_object::value,
    sqlite3x::unique_key
  >,
  sqlite3x::column<
    my_object, boost::posix_time::ptime, &my_object::time
  >
> my_object_columns;

typedef sqlite3x::table<
  my_object,
  my_object_columns
> my_object_table;

Using it is pretty simple. It uses the primary key as expected, generating the proper WHERE conditions and even extracting the type to let find() and others specify only the primary key:

sqlite3x::connection con("test.db3");

my_object_table my_objects(con, "t_objects");

my_objects.add(my_object());
my_objects.edit(my_object());
my_objects.remove(int());
my_objects.exists(int());
my_objects.find(int());

One benefit of the approach taken is it makes working with single- and multiple-inheritance just as easy:

struct my_derived :
  my_object
{
  float extra;
};

typedef boost::mpl::copy<
  boost::mpl::vector<
    sqlite3x::column<my_derived, float, &my_object::extra>
  >,
  boost::mpl::back_inserter<my_object_columns>
> my_derived_columns;

typedef sqlite3x::table<
  my_derived,
  my_derived_columns
> my_object_table;

The next thing on the list was supporting types not known natively to sqlite3x. I did not want to have the headache of sub-tables, so I took the easy route and implemented basic serialization support:

struct my_derived :
  my_object
{
  std::vector<boost::uuid> uuids;
};

struct uuids_serializer
{
  static void serialize(std::vector<boost::uint8_t> &buffer,
     const std::vector<boost::uuid> &uuids);

  template<typename Iterator>
  static Iterator deserialize(std::vector<boost::uuid> &uuids,
     Iterator first, Iterator last);
};

typedef boost::mpl::copy<
  boost::mpl::vector<
    sqlite3x::column<
      my_derived, float, &my_object::extra,
      sqlite3x::serializer<uuids_serializer>
    >
  >,
  boost::mpl::back_inserter<my_object_columns>
> my_derived_columns;

A few things aren’t finished, like specifying indexes and support for multi-column primary keys.

Overall though, I’m pretty happy with it. The majority of what I use SQLite for doesn’t require many complex queries, so this should greatly help lower the amount of code I have to manage.

Best of all this ORM code is in an entirely isolated header file — if you don’t want it, just don’t include it and you’ll still have access to all the basic SQLite functions. Even with it included I kept to the C++ mantra of “dont pay for what you don’t use” — as it is entirely template-driven, code will only be generated if you actually use it.

Once I’m finished the code will replace what I have up on the SQLite wrapper page, but until then it will exist in the subversion repository only.

Tips for efficient I/O

Posted in Coding, Feature Article, Scalability on May 15th, 2009 by Cory – Comments Off

There are a few things to keep in mind for I/O that can have pretty incredible effects on performance and scalability. It’s not really any new API, but how you use it.

Reduce blocking

The whole point of I/O completion ports is to do operations asynchronously, to keep the CPU busy doing work while waiting for completion. Blocking defeats this by stalling the thread, so it should be avoided whenever possible. Completion ports have a mechanism built in to make blocking less hurtful by starting up a waiting thread when another one blocks, but it is still better to avoid it all together.

This includes memory allocation. Standard system allocators usually have several points where it needs to lock to allow concurrent use, so applications should make use of custom allocators like arenas and pools where possible.

This means I/O should always be performed asynchronously, lock-free algorithms used when available, and any remaining locks should be as optimally placed as possible. Careful application design planning goes a long way here. The toughest area I’ve discovered is database access — as far as I have seen, there have been zero well designed database client libraries. Every one that I’ve seen has forced you to block at some point.

Ideally, the only place the application would block is when retrieving completion packets.

Buffer size and alignment

I/O routines like to lock the pages of the buffers you pass in. That is, it will pin them in physical memory so that they can’t be paged out to a swap file.

The result is if you pass in a 20 byte buffer, on most systems it will lock a full 4096 byte page in memory. Even worse, if the 20 byte buffer has 10 bytes in one page and 10 bytes in another, it will lock both pages — 8192 bytes of memory for a 20 byte I/O. If you have a large number of concurrent operations this can waste a lot of memory!

Because of this, I/O buffers should get special treatment. Functions like malloc() and operator new() should not be used because they have no obligation to provide such optimal alignment for I/O. I like to use VirtualAlloc to allocate large blocks of 1MiB, and divide this up into smaller fixed-sized (usually page-sized, or 4KiB) blocks to be put into a free list.

If buffers are not a multiple of the system page size, extra care should be taken to allocate buffers in a way that keeps them in separate pages from non-buffer data. This will make sure page locking will incur the least amount of overhead while performing I/O.

Limit the number of I/Os

System calls and completion ports have some expense, so limiting the number of I/O calls you do can help to increase throughput. We can use scatter/gather operations to chain several of those page-sized blocks into one single I/O.

A scatter operation is taking data from one source, like a socket, and scattering it into multiple buffers. Inversely a gather operation takes data from multiple buffers and gathers it into one destination.

For sockets, this is easy — we just use the normal WSASend and WSARecv functions, passing in multiple WSABUF structures.

For files it is a little more complex. Here the WriteFileGather and ReadFileScatter functions are used, but some special care must be taken. These functions require page-aligned and -sized buffers to be used, and the number of bytes read/written must be a multiple of the disk’s sector size. The sector size can be obtained with GetDiskFreeSpace.

Non-blocking I/O

Asynchronous operations are key here, but non-blocking I/O still has a place in the world of high performance.

Once an asynchronous operation completes, we can issue non-blocking requests to process any remaining data. Following this pattern reduces the amount of strain on the completion port and helps to keep your operation context hot in the cache for as long as possible.

Care should be taken to not let non-blocking operations continue on for too long, though. If data is received on a socket fast enough and we take so long to process it that there is always more, it could starve other completion notifications waiting to be dequeued.

Throughput or concurrency

A kernel has a limited amount of non-paged memory available to it, and locking one or more pages for each I/O call is a real easy way use it all up. Sometimes if we expect an extreme number of concurrent connections or if we want to limit memory usage, it can be beneficial to delay locking these pages until absolutely required.

An undocumented feature of WSARecv is that you can request a 0-byte receive, which will complete when data has arrived. Then you can issue another WSARecv or use non-blocking I/O to pull out whatever is available. This lets us get notified when data can be received without actually locking memory.

Doing this is very much a choice of throughput or concurrency — it will use more CPU, reducing throughput. But it will allow your application to handle a larger number of concurrent operations. It is most beneficial in a low memory system, or on 32-bit Windows when an extreme number of concurrent operations will be used. 64-bit Windows has a much larger non-paged pool, so it shouldn’t present a problem if you feed it enough physical memory.

Unbuffered I/O

If you are using the file system a lot, your application might be waiting on the synchronous operating system cache. In this case, enabling unbuffered I/O will let file operations bypass the cache and complete more asynchronously.

This is done by calling CreateFile with the FILE_FLAG_NO_BUFFERING flag. Note that with this flag, all file access must be sector aligned — read/write offsets and sizes. Buffers must also be sector aligned.

Operating system caching can have good effects, so this isn’t always advantageous. But if you’ve got a specific I/O pattern or if you do your own caching, it can give a significant performance boost. This is an easy thing to turn on and off, so take some benchmarks.

Update: this subject continued in I/O Improvements in Windows Vista.

Visual Studio 2010 Beta 1 in four days

Posted in Coding, Microsoft on May 14th, 2009 by Cory – Comments Off

Visual Studio 2010 Beta 1 is being launched for MSDN subscribers in four days on the 18th, and for the general public on the 20th!

I/O completion ports made easy

Posted in Coding, Feature Article, Scalability on May 14th, 2009 by Cory – Comments Off

I described the basics of I/O completion ports in my last post, but there is still the question of what the easiest way to use them. Here I’ll show a callback-based application design that I’ve found makes a fully asynchronous program really simple to do.

I touched briefly on attaching our own context data to the OVERLAPPED structure we pass along with I/O operations. It’s this same idea that I’ll expand on here. This time, we define a generic structure to use with all our operations, and how our threads will handle them while dequeuing packets:

struct io_context
{
  OVERLAPPED ovl;
  void (*on_completion)(DWORD error, DWORD transferred,
                        struct io_context *ctx);
};

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

BOOL ret = GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

if(ret)
{
  struct io_context *ctx = (struct io_context*)ovl;
  ctx->on_completion(ERROR_SUCCESS, transferred, ctx);
}
else if(ovl)
{
  DWORD err = GetLastError();

  struct io_context *ctx = (struct io_context*)ovl;
  ctx->on_completion(err, transferred, ctx);
}
else
{
  // error out
}

With this, all our I/O operations will have a callback associated with them. When a completion packet is dequeued it gets the error information, if any, and runs the callback. Having every I/O operation use a single callback mechanism greatly simplifies the design of the entire program.

Lets say our app was reading a file and sending out it’s contents. We also want it to prefetch the next buffer so we can start sending right away. Here’s our connection context:

struct connection_context
{
  HANDLE file;
  SOCKET sock;

  WSABUF readbuf;
  WSABUF sendbuf;

  struct io_context readctx;
  struct io_context sendctx;
};

A structure like this is nice because initiating an I/O operation will need no allocations. Note that we need two io_context members because we’re doing a read and send concurrently.

Now the code to use it:

#define BUFFER_SIZE 4096

void begin_read(struct connection_context *ctx)
{
  if(ctx->readbuf.buf)
  {
    // only begin a read if one isn't already running.
    return;
  }

  ctx->readbuf.buf = malloc(BUFFER_SIZE);
  ctx->readbuf.len = 0;

  // zero out io_context structure.
  memset(&ctx->readctx, 0, sizeof(ctx->readctx));

  // set completion callback.
  ctx->readctx.on_completion = read_finished;

  ReadFile(ctx->file, ctx->readbuf.buf, BUFFER_SIZE,
           NULL, &ctx->readctx.ovl);
}

void read_finished(DWORD error, DWORD transferred,
                   struct io_context *ioctx)
{
  // get our connection context.
  struct connection_context *ctx =
     (struct connection_context*)((char*)ioctx -
        offsetof(struct connection_context, readctx));

  if(error != ERROR_SUCCESS)
  {
    // handle error.
    return;
  }

  if(!transferred)
  {
    // reached end of file, close out connection.
    free(ctx->readbuf.buf);
    ctx->readbuf.buf = 0;
    return;
  }

  // send out however much we read from the file.

  ctx->readbuf.len = transferred;

  begin_send(ctx);
}

This gives us a very obvious chain of events: read_finished is called when a read completes. Since we only get an io_context structure in our callback, we need to adjust the pointer to get our full connection_context.

Sending is easy too:

void begin_send(struct connection_context *ctx)
{
  if(ctx->sendbuf.buf)
  {
    // only begin a send if one
    // isn't already running.
    return;
  }

  if(!ctx->recvbuf.len)
  {
    // only begin a send if the
    // read buffer has something.
    return;
  }

  // switch buffers.
  ctx->sendbuf = ctx->readbuf;

  // clear read buffer.
  ctx->readbuf.buf = NULL;
  ctx->readbuf.len = 0;

  // zero out io_context structure.
  memset(&ctx->sendctx, 0, sizeof(ctx->sendctx));

  // set completion callback.
  ctx->sendctx.on_completion = send_finished;

  WSASend(ctx->sock, &ctx->sendbuf, 1, NULL, 0,
          &ctx->sendctx.ovl, NULL);

  // start reading next buffer.
  begin_read(ctx);
}

void send_finished(DWORD error, DWORD transferred,
                   struct io_context *ioctx)
{
  // get our connection context.
  struct connection_context *ctx =
     (struct connection_context*)((char*)ioctx -
        offsetof(struct connection_context, sendctx));

  if(error != ERROR_SUCCESS)
  {
    // handle error.
    return;
  }

  // success, clear send buffer and start next send.

  free(ctx->sendbuf.buf);
  ctx->sendbuf.buf = NULL;

  begin_send(ctx);
}

Pretty much more of the same. Again for brevity I’m leaving out some error checking code and assuming the buffer gets sent out in full. I’m also assuming a single-threaded design — socket and file functions themselves are thread-safe and have nothing to worry about, but the buffer management code here would need some extra locking since it could be run concurrently. But the idea should be clear.

Update: this subject continued in Tips for efficient I/O.

High Performance I/O on Windows

Posted in Coding, Feature Article, Scalability on May 13th, 2009 by Cory – Comments Off

I/O completion ports were first introduced in Windows NT 4.0 as a highly scalable, multi-processor capable alternative to the then-typical practices of using select, WSAWaitForMultipleEvents, WSAAsyncSelect, or even running a single thread per client. The most optimal way to perform I/O on Windows — short of writing a kernel-mode driver — is to use I/O completion ports.

A recent post on Slashdot claimed sockets have run their course, which I think is absolutely not true! The author seems to believe the Berkeley sockets API is the only way to perform socket I/O, which is nonsense. Much more modern, scalable and high performance APIs exist today such as I/O completion ports on Windows, epoll on Linux, and kqueue on FreeBSD. In light of this I thought I’d write a little about completion ports here.

The old ways of multiplexing I/O still work pretty well for scenarios with a low number of concurrent connections, but when writing a server daemon to handle a thousand or even tens of thousands of concurrent clients at once, we need to use something different. In this sense the old de facto standard Berkeley sockets API has run its course, because the overhead of managing so many connections is simply too great and makes using multiple processors hard.

An I/O completion port is a multi-processor aware queue. You create a completion port, bind file or socket handles to it, and start asynchronous I/O operations. When they complete — either successfully or with an error — a completion packet is queued up on the completion port, which the application can dequeue from multiple threads. The completion port uses some special voodoo to make sure only a specific number of threads can run at once — if one thread blocks in kernel-mode, it will automatically start up another one.

First you need to create a completion port with CreateIoCompletionPort:

HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE,
   NULL, 0, 0);

It’s important to note that NumberOfConcurrentThreads is not the total number of threads that can access the completion port — you can have as many as you want. This instead controls the number of threads it will allow to run concurrently. Once you’ve reached this number, it will block all other threads. If one of the running threads blocks for any reason in kernel-mode, the completion port will automatically start up one of the waiting threads. Specifying 0 for this is equivalent to the number of logical processors in the system.

Next is creating and associating a file or socket handle, using CreateFile, WSASocket, and CreateIoCompletionPort:

#define OPERATION_KEY 1

HANDLE file = CreateFile(L"file.txt", GENERIC_READ,
   FILE_SHARE_READ, NULL, OPEN_EXISTING,
   FILE_FLAG_OVERLAPPED, NULL);

SOCKET sock = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP,
   NULL, 0, WSA_FLAG_OVERLAPPED);

CreateIoCompletionPort(file, iocp, OPERATION_KEY, 0);
CreateIoCompletionPort((HANDLE)sock, iocp, OPERATION_KEY, 0);

Files and sockets must be opened with the FILE_FLAG_OVERLAPPED and WSA_FLAG_OVERLAPPED flags before they are used asynchronously.

Reusing CreateIoCompletionPort for associating file/socket handles is weird design choice from Microsoft but that’s how it’s done. The CompletionKey parameter can be anything you want, it is a value provided when packets are dequeued. I define a OPERATION_KEY here to use as the CompletionKey, the significance of which I’ll get to later.

Next we have to start up some I/O operations. I’ll skip setting up the socket and go right to sending data. We start these operations using ReadFile and WSASend:

OVERLAPPED readop, sendop;
WSABUF sendwbuf;
char readbuf[256], sendbuf[256];

memset(&readop, 0, sizeof(readop));
memset(&sendop, 0, sizeof(sendop));

sendwbuf.len = sizeof(sendbuf);
sendwbuf.buf = sendbuf;

BOOL readstatus = ReadFile(file, readbuf,
   sizeof(readbuf), NULL, &readop);

if(!readstatus)
{
  DWORD readerr = GetLastError();

  if(readerr != ERROR_IO_PENDING)
  {
    // error reading.
  }
}

int writestatus = WSASend(sock, &sendwbuf, 1, NULL,
   0, &sendop, NULL);

if(writestatus)
{
  int writeerr = WSAGetLastError();

  if(writeerr != WSA_IO_PENDING)
  {
    // error sending.
  }
}

Every I/O operation using a completion port has an OVERLAPPED structure associated with it. Windows uses this internally in unspecified ways, only saying we need to zero them out before starting an operation. The OVERLAPPED structure will be given back to us when we dequeue the completion packets, and can be used to pass along our own context data.

I have left out the standard error checking up until now for brevity’s sake, but this one doesn’t work quite like one might expect so it is important here. When starting an I/O operation, an error might not really be an error. If the function succeeds all is well, but if the function fails, it is important to check the error code with GetLastError or WSAGetLastError.

If these functions return ERROR_IO_PENDING or WSA_IO_PENDING, the function actually still completed successfully. All these error codes mean is an asynchronous operation has been started and completion is pending, as opposed to completing immediately. A completion packet will be queued up regardless of the operation completing asynchronously or not.

Dequeuing packets from a completion port is handled by the GetQueuedCompletionStatus function:

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

BOOL ret = GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

if(ret)
{
  // I/O completed successfully.
}
else if(ovl)
{
  // dequeued successfully but the I/O operation
  // failed, get extended information.
  DWORD err = GetLastError();
}
else
{
  // error dequeuing a packet.
}

This function dequeues completion packets, consisting of the completion key we specified in CreateIoCompletionPort and the OVERLAPPED structure we gave while starting the I/O. If the I/O transferred any data, it will retrieve how many bytes were transferred too. Again the error handling is a bit weird on this one, having three error states.

This can be run from as many threads as you like, even a single one. It is common practice to run a pool of twice the number of threads as there are logical processors available, to keep the CPU active with the aforementioned functionality of starting a new thread if a running one blocks.

Unless you are going for a single-threaded design, I recommend starting two threads per logical CPU. Even if your app is designed to be 100% asynchronous, you will still run into blocking when locking shared data and even get unavoidable hidden blocking I/Os like reading in paged out memory. Keeping two threads per logical CPU will keep the processor busy without overloading the OS with too much context switching.

This is all well and good, but two I/O operations were initiated — a file read and a socket send. We need a way to tell their completion packets apart. This is why we need to attach some state to the OVERLAPPED structure when we call those functions:

struct my_context
{
  OVERLAPPED ovl;
  int isread;
};

struct my_context readop, sendop;

memset(&readop.ovl, 0, sizeof(readop.ovl));
memset(&sendop.ovl, 0, sizeof(sendop.ovl));

readop.isread = 1;
sendop.isread = 0;

ReadFile(file, readbuf, sizeof(readbuf), NULL, &readop.ovl);
WSASend(sock, &sendwbuf, 1, NULL, 0, &sendop.ovl, NULL);

Now we can tell the operations apart when we dequeue them:

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

struct my_context *ctx = (struct my_context*)ovl;

if(ctx->isread)
{
  // read completed.
}
else
{
  // send completed.
}

The last important thing to know is how to queue up your own completion packets. This is useful if you want to split an operation up to be run on the thread pool, or if you want to exit a thread waiting on a call to GetQueuedCompletionStatus. To do this, we use the PostQueuedCompletionStatus function:

#define EXIT_KEY 0

struct my_context ctx;

PostQueuedCompletionStatus(iocp, 0, OPERATION_KEY, &ctx.ovl);
PostQueuedCompletionStatus(iocp, 0, EXIT_KEY, NULL);

Here we post two things onto the queue. The first, we post our own structure onto the queue, to be processed by our thread pool. The second, we give a new completion key: EXIT_KEY. The thread which processes this packet can test if the completion key is EXIT_KEY to know when it needs to stop dequeuing packets and shut down.

Other than the completion port handle, Windows does not use any of the parameters given to PostQueuedCompletionStatus. They are entirely for our use, to be dequeued with GetQueuedCompletionStatus.

That’s all I have to write for now, and should be everything one would need to get started learning these high performance APIs! I will make another post shortly detailing some good patterns for completion port usage, and some optimization tips to ensure efficient usage of these I/O APIs.

Update: this subject continued in I/O completion ports made easy.