Limiting asynchrony with task parallelism

When I started re-writ­ing this web­site, I wanted to make good use of my multi-core CPU. Gen­er­at­ing hun­dreds of pages using XSL trans­forms and plenty of pre-pro­cess­ing in C#, there's a lot of par­al­lelism to be had.

I began by using the TPL's data par­al­lelism fea­tures: mainly Parallel.​ForEach and Parallel.​Invoke. These are super easy to use, and made an im­me­di­ate huge dif­fer­ence.

Then the Visual Studio 11 developer preview came out, and I felt compelled to make use of its new async features. This meant ditching the Parallel methods altogether and writing for task parallelism.

There are still parts of the .NET Frame­work which don't sup­port async, and XML is one of them. Be­cause I'm read­ing rel­a­tively small doc­u­ments, I was able to work around these lim­i­ta­tions by asyn­chro­nously fill­ing a Mem­o­ryS­tream from a file and feed­ing the Mem­o­ryS­tream to the XML classes:

Task<FileStream> OpenReadAsync(string fileName)
{
    return Task.Factory.StartNew(state =>
        new FileStream((string)state, FileMode.Open, FileAccess.Read,
                       FileShare.Read, 4096, true), fileName);
}

async Task<XmlReader> CreateXmlReader(string fileName,
                                      XmlReaderSettings settings = null)
{
    MemoryStream ms = new MemoryStream();
    
    using (FileStream fs = await OpenReadAsync(fileName))
    {
        await fs.CopyToAsync(ms);
    }

    ms.Position = 0;
    return XmlReader.Create(ms, settings, fileName);
}

But I had one more problem to solve. For efficiency, Parallel.ForEach partitions its items into ranges which will be operated on concurrently. A side effect of this that I was relying on was that only so many I/O operations could happen at once. In my new code I'm simply launching all these tasks at once rather than partitioning them—this absolutely killed performance, as potentially hundreds of concurrent I/Os caused my disk to seek like crazy.

What I ended up doing here was cre­at­ing a ticket sys­tem which can be used to allow only a lim­ited num­ber of I/Os to hap­pen con­cur­rently: es­sen­tially a safe task-based sem­a­phore.

sealed class AsyncLimiter
{
    public AsyncLimiter(int max);
    public Task<IDisposable> Lock();
}

The full implementation is available in Subversion under a 2-clause BSD license. Using it is very simple:

AsyncLimiter limiter = new AsyncLimiter(4);

async Task<FileStream> OpenReadAsync(string fileName)
{
    using (IDisposable limiterlock = await limiter.Lock())
    {
        return await Task.Factory.StartNew(state =>
            new FileStream((string)state, FileMode.Open, FileAccess.Read,
                           FileShare.Read, 4096, true), fileName);
    }
}

async Task<XmlReader> CreateXmlReader(string fileName,
                                      XmlReaderSettings settings = null)
{
    MemoryStream ms = new MemoryStream();

    using (FileStream fs = await OpenReadAsync(fileName))
    using (IDisposable limiterlock = await limiter.Lock())
    {
        await fs.CopyToAsync(ms);
    }

    ms.Position = 0;
    return XmlReader.Create(ms, settings, fileName);
}

When the lock gets dis­posed, it'll let the next op­er­a­tion in line progress. This was sim­ple to im­ple­ment ef­fi­ciently using In­ter­locked meth­ods and a Con­cur­ren­tQueue.

Some op­er­a­tions—file open­ing and ex­is­tence test­ing, di­rec­tory cre­ation, etc.—have no asyn­chro­nous ana­log. For these there is no good so­lu­tion, so I sim­ply wrapped them in a task as in the OpenReadAsync ex­am­ple above. They're rare enough that it hasn't been a prob­lem.

The end re­sult? Ac­tu­ally about 50% bet­ter per­for­mance than using the Par­al­lel meth­ods. When all the files are in cache, I'm able to gen­er­ate this en­tire web­site from scratch in about 0.7 sec­onds.

Everything old is new again

I moved to Word­press when I got sick of lim­i­ta­tions in an XSLT-based site. I've now moved back to an XSLT-based site due to Word­press' heavy SQL usage—Source­Forge's SQL servers are pretty over­loaded, and it was mak­ing this site run pretty slow.

It took me a while to finish the transition and there are still a few things I need to iron out, but I think it's ready enough for prime time. It's been a great move overall: I finally got to rewrite that old theme using clean HTML and CSS, and all the limitations I hated in my old XSLT-based site are gone as well, albeit with a good deal more pre-processing involved.

One of the things I've changed is how images are scaled: when there isn't enough screen space (such as when the window is narrowed), they'll all shrink to fit. This matters to me because I've grown to use Windows 7's snapping feature, and sites should still work when using only half the screen. This actually revealed a bug in Google Chrome, and perhaps other WebKit-based browsers, so hopefully that gets fixed soon.

An­other thing I've started try­ing is to size float­ing im­ages based on megapix­els in­stead of sim­ply a max­i­mum width/height. This was sim­ple to do and will in­crease aes­thet­ics by en­sur­ing no im­ages ap­pear ab­nor­mally big com­pared to other ones. So far I like the re­sults.

Now that I'm mostly done with this, I should be able to write a lot more. Those promised re­sam­pling ar­ti­cles are com­ing, I swear!

The art of resampling

Years ago when I was a fledg­ling still learn­ing to code, one of the first things I tried cre­at­ing was an image re­sizer. I pre­ferred (and given time, still do) to just have a go at things with­out re­search so while I suc­ceeded in mak­ing a re­sizer, the re­sults were (pre­dictably) poor. I soon got side­tracked and left the code to rot, but the idea re­mained.

It’s taken me over 10 years to find a rea­son to re­visit the sub­ject, but I fi­nally have: gamma com­pres­sion.

In HTML, the color #777 will have roughly half the in­ten­sity of #FFF. This matches our per­cep­tion and makes work­ing with color fairly easy, but the way light re­ally works is much dif­fer­ent. Our eyes per­ceive light on a log­a­rith­mic scale—twice the pho­tons won’t ap­pear twice as bright to us. Trans­form­ing the ac­tual lin­ear in­ten­sity into our fa­mil­iar rep­re­sen­ta­tion is called gamma com­pres­sion.

When we blend two gamma-compressed colors, the result is not the same as if we blended two linear colors. To correctly blend colors, we must first uncompress them into their linear values. Take the gamma-compressed values 0.1 and 0.9. If we just add 0.1 to 0.9, we’ll of course get 1.0: an 11% change of value. Doing it the correct way, we first decompress them into the linear values 0.01 and 0.79. Add 0.01 to 0.79, re-compress, and the result will be 0.905: a 0.5% change of value. Gamma-ignorant processing gave us a result that’s way off!
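To see the arithmetic spelled out, here’s a small C++ sketch of that blend, assuming the values use the standard sRGB transfer function (which is what the 0.01, 0.79, and 0.905 figures above work out to):

#include <cmath>
#include <cstdio>

// Exact sRGB transfer functions: decode a gamma-compressed value into
// linear light, and encode linear light back into the compressed form.
double srgb_to_linear(double c)
{
    return c <= 0.04045 ? c / 12.92 : std::pow((c + 0.055) / 1.055, 2.4);
}

double linear_to_srgb(double c)
{
    return c <= 0.0031308 ? c * 12.92 : 1.055 * std::pow(c, 1.0 / 2.4) - 0.055;
}

int main()
{
    // Gamma-ignorant blend: just add the compressed values.
    double naive = 0.1 + 0.9;                                               // 1.0

    // Gamma-correct blend: decompress (~0.01 and ~0.79), add the linear
    // intensities, then re-compress.
    double correct = linear_to_srgb(srgb_to_linear(0.1) + srgb_to_linear(0.9)); // ~0.905

    std::printf("naive = %.3f, correct = %.3f\n", naive, correct);
}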

For an example, let’s take a look at NASA’s “Earth’s City Lights”:

NASA’s “Earth’s City Lights”

Down­siz­ing this 4800×2400 image pre­sents a worst-case sce­nario for a gamma-ig­no­rant re­sizer. The sharp con­trast be­tween the lights and sur­round­ing dark­ness makes the blend­ing error very promi­nent, and the mas­sive down­siz­ing gives it a chance to do a lot of blend­ing.

Gamma-corrected “Earth’s City Lights”

At the top we see the result of gamma-correct resizing. This is how it's supposed to look—you can still see the lights along the African and Australian coasts. The western USA is clearly still very awake, as are Europe and parts of Asia.

Gamma-ignorant “Earth’s City Lights”

On the bot­tom we see the re­sult of gamma-ig­no­rant re­siz­ing. The fainter lights have been com­pletely drowned out. Aus­tralia and Africa now barely reg­is­ter, and all the other con­ti­nents look far darker over­all. Big dif­fer­ence! The un­for­tu­nate thing is that a ma­jor­ity of re­siz­ers will look like the image on the bot­tom. The in­cor­rect re­sult is often good enough that they ei­ther don’t no­tice or don’t care.

One of these in­cor­rect re­siz­ers is in Avisynth, a scripted video proces­sor used for every­thing from sim­ple DVD dein­ter­lac­ing all the way to heavy restora­tion of old 8mm film. A while ago I was using it to re­size a Quake video and dis­cov­ered that sim­i­lar to the lights of the image above, all the starry tele­porters lost their stars: a clear sign of gamma-​ig­no­rant pro­cess­ing.

I de­cided to make a high-​qual­ity re­siz­ing plu­gin for Avisynth that would use a fully lin­ear, high bit depth pipeline. No short­cuts. Work­ing on it has been a lot of fun and chal­lenge, so I’ll be writ­ing about it here over the next few days.

Asynchronous page faults

With I/O, we’ve got some choices:

  1. Synchronous, copying from OS cache (fread). This is the simplest form of I/O, but isn’t very scalable.
  2. Synchronous, reading directly from OS cache (memory mapping). This is wicked fast and efficient once memory is filled, but aside from some cases with read-ahead, your threads will still block with page faults.
  3. Asynchronous, copying from OS cache (ReadFile). Much more scalable than fread, but each read still involves duplicating data from the OS cache into your buffer. Fine if you’re reading some data only to modify the buffer in place, but still not very great when you’re treating it as read only (such as to send over a socket).
  4. Asynchronous, maintaining your own cache (FILE_FLAG_NO_BUFFERING). More scalable still than ReadFile, but you need to do your own caching and it’s not shared with other processes.

Note that there’s one im­por­tant choice miss­ing: mem­ory map­ping with asyn­chro­nous page faults. As far as I know there are no op­er­at­ing sys­tems that ac­tu­ally offer this—it’s kind of a dream fea­ture of mine. There are two APIs that will help sup­port this:

HANDLE CreateMemoryManager();
BOOL MakeResident(HANDLE, LPVOID, SIZE_T, LPOVERLAPPED);

CreateMemoryManager opens a han­dle to the Win­dows mem­ory man­ager, and MakeResident will fill the pages you spec­ify (re­turn­ing true for syn­chro­nous com­ple­tion, false for error/async like every­thing else). The best of both worlds: fast, easy ac­cess through mem­ory, a full asyn­chro­nous work­flow, and shared cache usage. This would be es­pe­cially use­ful on mod­ern CPUs that offer gi­gan­tic ad­dress spaces.
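To make the idea more concrete, here is a purely hypothetical sketch of how I imagine it being used. Remember, neither of these functions exists today, and the completion-port wiring is just my guess at how it would fit in:

#include <windows.h>

void prefetch(HANDLE iocp, void *mapping, SIZE_T length, OVERLAPPED *ov)
{
    // Hypothetical: open the memory manager and tie it to an existing
    // completion port, like any other overlapped handle.
    HANDLE mm = CreateMemoryManager();
    CreateIoCompletionPort(mm, iocp, 0, 0);

    if(MakeResident(mm, mapping, length, ov))
    {
        // Pages were already resident: touch the mapping immediately.
    }
    else if(GetLastError() == ERROR_IO_PENDING)
    {
        // Pages are being faulted in asynchronously; a completion packet
        // will arrive when the range can be read without blocking.
    }
    // else: a real error.
}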

The mem­ory man­ager al­ready has sim­i­lar func­tion­al­ity in there some­where, so it might not be dif­fi­cult to pull into user-​mode. Just an ed­u­cated guess. Maybe it’d be ter­ri­bly dif­fi­cult. Dream fea­ture!

Is C# the Boost of C-family languages?

For all the cons of giv­ing a sin­gle en­tity con­trol over C#, one pro is that it gives the lan­guage an un­matched agility to try new things in the C fam­ily of lan­guages. LINQ—both its lan­guage in­te­gra­tion and its back­ing APIs—is an in­cred­i­bly pow­er­ful tool for query­ing and trans­form­ing data with very con­cise code. I re­ally can’t ex­press how much I’ve come to love it.

The new async sup­port an­nounced at PDC10 is ba­si­cally the holy grail of async cod­ing, let­ting you focus on what your task is and not how you’re going to im­ple­ment a com­plex async code path for it. It’s an old idea that many async coders have come up with, but, as far as I know, has never been suc­cess­fully im­ple­mented sim­ply be­cause it re­quired too much lan­guage sup­port.

The lack of peer review and a standards committee for .NET shows—there's a pretty high rate of turnover as Microsoft tries to nail down the right way to tackle problems, and it results in a very large library with lots of redundant functionality. As much as this might hurt .NET, I'm starting to view C# as a sort of Boost for the C language family. Some great ideas are getting real-world use, and if other languages eventually feel the need for something similar, they will have a bounty of experience to pull from.

C++, at least, is a ter­ri­fy­ingly com­plex lan­guage. Get­ting new fea­tures into it is an up­hill bat­tle, even when they ad­dress a prob­lem that every­one is frus­trated with. Get­ting com­plex new fea­tures like these into it would be a very long process, with a lot of ar­gu­ing and years of delay. Any extra in­cu­ba­tion time we can give them is a plus.

Justifying the Nook: A case for PDF

I’ve had my Nook for about two months now, and have read about 8 books on it so far. It’s a great lit­tle de­vice that I’ve been un­able to put down in my spare time. My only real gripe is that, like al­most every other e-​reader out there, it has such poor type­set­ting that you have to won­der if it was de­signed by soft­ware en­gi­neers who aren’t big read­ers—it cer­tainly pro­vides the paper look, but it’s still got a long way to go to pro­vide a read­ing ex­pe­ri­ence equal to a dead tree book.

To give an example, there are currently two types of books you can buy on the Nook: those that are left-aligned, and those that are justified.

Left aligned

Here’s the left align­ment. It’s read­able, but it wastes a lot of space on the right edge and isn’t very aes­thetic. Next we have the jus­ti­fied ver­sion:

Bad justify

This kind of very basic jus­ti­fi­ca­tion is pretty poor look­ing—in at­tempt­ing to cre­ate a nice straight mar­gin, it adds a bunch of ugly space be­tween words. This is the same kind of jus­ti­fi­ca­tion web browsers do, and al­most no­body uses it be­cause it looks so bad. I knew about this going in. Sci­ence fic­tion au­thor Robert J. Sawyer wrote a very in­for­ma­tive post about it, and I set about find­ing a way to solve this prob­lem be­fore my nook even ar­rived in the mail.

One great fea­ture about the nook is that it sup­ports view­ing PDFs, with and with­out re-​flow­ing text. With re-​flow­ing turned off, you’re free to man­u­ally type­set the doc­u­ment any way you want, and it will ap­pear ex­actly how you want on the screen. This is what I was look­ing for. With this you can use Mi­crosoft Word to cre­ate a great look­ing PDF. If you’re feel­ing brave and want even bet­ter look­ing text, pro­fes­sional type­set­ting soft­ware like Adobe In­De­sign pro­vides more ad­vanced fea­tures that will give fan­tas­tic re­sults.

Good justify

Here we can see proper hy­phen­ation and jus­ti­fi­ca­tion. A good jus­ti­fi­ca­tion al­go­rithm won’t just add more space be­tween words, but will also be will­ing to shrink it just slightly if it will make things look bet­ter. You can see this in the first line of text, where it fits “adip­isc­ing” in to re­move the white­space that plagued the text in the pre­vi­ous image. It also evenly spaces the en­tire para­graph at once in­stead of just a sin­gle line, and fits more text into each line by hy­phen­at­ing words on line breaks.

It’s look­ing pretty good now, and is al­most like a real book. But there’s a lit­tle more that can be done:

Best justify

Can you spot the difference? I admit, without a keen eye and a little knowledge of typography, I wouldn’t expect most people to. Here’s an animation to help show the differences better:

Best justify + good justify comparison animation

There are two things hap­pen­ing here, one more sub­tle than the other. The first is op­ti­cal mar­gin ad­just­ment. This im­proves aes­thet­ics by ad­just­ing each line’s mar­gins ever so slightly to re­duce empty space, giv­ing a more flat look to edges. You can see it on the fifth line, where it com­pen­sates for the empty space under the V’s left edge by mov­ing it out a lit­tle bit. It’s even more no­tice­able with punc­tu­a­tion on the right edge, where it pushes the pe­ri­ods and hy­phens out into the mar­gin.

The sec­ond thing hap­pen­ing is lig­a­ture sub­sti­tu­tion. Cer­tain com­bi­na­tions of char­ac­ters have sim­i­lar fine de­tails in the same spots and can look a lit­tle awk­ward to­gether, and lig­a­tures can make them look bet­ter by com­bin­ing them into a sin­gle spe­cial­ized glyph. You can see this in the mid­dle left of the text, where “of­fi­cae” is slightly al­tered—look closely at the “ffi” and you will see the three let­ters merged to­gether with the first F be­com­ing a lit­tle smaller and the dot over the I merg­ing with the sec­ond F to cre­ate a slightly larger over­hang. Look in your fa­vorite dead tree book, and you’ll prob­a­bly find it in “ff” and “fi” com­bi­na­tions—it’s pretty hard to no­tice with­out look­ing for it, but it is used to sub­tly im­prove leg­i­bil­ity.

There is nothing about EPUBs that prevents e-readers from performing this kind of typesetting automatically; the fault lies solely with the e-reader’s software. With any luck, one day we’ll get this nice look in all the e-books we can download, but until e-readers start to take typesetting seriously, the only way we’ll get it is with PDFs. Unfortunately, most e-books aren’t legally available without DRM, making this kind of dramatic alteration impossible with most of the stuff you can buy.

It’s easy to say this isn’t very im­por­tant. After all, it doesn’t af­fect func­tion­al­ity, right? You can still read the first pic­ture! When you’re doing a lot of read­ing, it is im­por­tant. Proper text flow re­duces eye move­ment and takes less work for your brain to process, let­ting you read longer and faster. It also hap­pens to get sig­nif­i­cantly more text onto each page, which means fewer de­lays from page turns—in my ex­pe­ri­ence, it re­duces a book’s page count by around 15%.

There are a lot of com­pelling rea­sons to get an e-​reader. They can store thou­sands of books at once and re­mem­ber your place in each one of them. You can look up un­known words in a dic­tio­nary, and book­mark stuff for later ref­er­ence with­out draw­ing all over a page with a high­lighter. You can browse through mas­sive book stores and read bought books in­stantly with­out ever leav­ing your home. It baf­fles me that they don’t bother to im­prove the sin­gle most im­por­tant part of the de­vice—read­ing!

ClearType in Windows 7

One of my big pet peeves with ClearType prior to Windows 7 was that it only anti-aliased horizontally, with sub-pixels. This is great for small fonts, because at such a small scale traditional anti-aliasing has a smudging effect, reducing clarity and increasing the font’s weight. For large fonts, however, it introduces some very noticeable aliasing on curves, as best seen in the ‘6’ and ‘g’ here:

"Int64.org" rendered with GDI

You’ve prob­a­bly no­ticed this on web­sites every­where, but have come to ac­cept it. De­pend­ing on your browser and op­er­at­ing sys­tem, you can prob­a­bly see it in the title here. This prob­lem is solved in Win­dows 7 with the in­tro­duc­tion of Di­rectWrite, which com­bines ClearType’s hor­i­zon­tal anti-alias­ing with reg­u­lar ver­ti­cal anti-alias­ing when using large font sizes:

"Int64.org" rendered with DirectWrite

Of course, Di­rectWrite af­fects more than just Latin char­ac­ters. Any glyphs with very slight an­gles will see a huge ben­e­fit, such as hi­ra­gana:

"まこと" rendered with GDI and DirectWrite

Unfortunately, this isn't a free upgrade. For whatever reason, Microsoft didn't make the old GDI functions use DirectWrite's improvements, so to take advantage of them, your old GDI and DrawText code will need to be upgraded to use Direct2D and DirectWrite directly. An old WM_PAINT procedure like this:

PAINTSTRUCT ps;
HDC hdc = BeginPaint(hwnd, &ps);

HFONT font = CreateFont(-96, 0, 0, 0, FW_NORMAL,
                        0, 0, 0, 0, 0, 0, 0, 0, L"Calibri");

SelectObject(hdc, (HGDIOBJ)font);

RECT rc;
GetClientRect(hwnd, &rc);

DrawText(hdc, L"Int64.org", 9, &rc,
         DT_SINGLELINE | DT_CENTER | DT_VCENTER);

EndPaint(hwnd, &ps);

Will turn into this:

ID2D1Factory *d2df;

D2D1CreateFactory(D2D1_FACTORY_TYPE_SINGLE_THREADED,
   __uuidof(ID2D1Factory), 0, (void**)&d2df);

IDWriteFactory *dwf;

DWriteCreateFactory(DWRITE_FACTORY_TYPE_SHARED,
   __uuidof(IDWriteFactory), (IUnknown**)&dwf);

IDWriteTextFormat *dwfmt;

dwf->CreateTextFormat(L"Calibri", 0, DWRITE_FONT_WEIGHT_REGULAR,
   DWRITE_FONT_STYLE_NORMAL, DWRITE_FONT_STRETCH_NORMAL,
   96.0f, L"en-us", &dwfmt);

dwfmt->SetTextAlignment(DWRITE_TEXT_ALIGNMENT_CENTER);
dwfmt->SetParagraphAlignment(DWRITE_PARAGRAPH_ALIGNMENT_CENTER);

RECT rc;
GetClientRect(hwnd, &rc);

D2D1_SIZE_U size = D2D1::SizeU(rc.right - rc.left,
                               rc.bottom - rc.top);

ID2D1HwndRenderTarget *d2drt;

d2df->CreateHwndRenderTarget(D2D1::RenderTargetProperties(),
   D2D1::HwndRenderTargetProperties(hwnd, size), &d2drt);

ID2D1SolidColorBrush *d2db;

d2drt->CreateSolidColorBrush(D2D1::ColorF(D2D1::ColorF::Black),
   &d2db);

D2D1_SIZE_F layoutSize = d2drt->GetSize();
D2D1_RECT_F layoutRect = D2D1::RectF(0.0, 0.0,
   layoutSize.width, layoutSize.height);

d2drt->BeginDraw();
d2drt->DrawText(L"Int64.org", 9, dwfmt, layoutRect, d2db);
d2drt->EndDraw();

This is no small change, and con­sid­er­ing this API won’t work on any­thing but Vista and Win­dows 7, you’ll be cut­ting out a lot of users if you spe­cial­ize for it. While you could prob­a­bly make a clever Draw­Text wrap­per, Di­rec­t2D and Di­rectWrite are re­ally set up to get you the most ben­e­fit if you’re all in. Hope­fully gen­eral li­braries like Pango and Cairo will get up­dated back­ends for it.

Di­rectWrite has other ben­e­fits too, like sub-pixel ren­der­ing. When you ren­der text in GDI, glyphs will al­ways get snapped to pix­els. If you have two let­ters side by side, it will choose to al­ways start the next let­ter 1 or 2 pix­els away from the last—but what if the cur­rent font size says it should ac­tu­ally be a 1.5 pixel dis­tance? In GDI, this will be rounded to 1 or 2. This is also no­tice­able with kern­ing, which tries to re­move ex­ces­sive space be­tween spe­cific glyphs such as “Vo”. Be­cause of this, most of the text you see in GDI is very slightly warped. It’s much more ap­par­ent when an­i­mat­ing, where it causes the text to have a wob­bling ef­fect as it con­stantly snaps from one pixel to the next in­stead of smoothly tran­si­tion­ing be­tween the two.

Di­rectWrite’s sub-pixel ren­der­ing helps to al­le­vi­ate this by doing ex­actly that: glyphs can now start ren­der­ing at that 1.5 pixel dis­tance, or any other point in be­tween. Here you can see the dif­fer­ing space be­tween the ‘V’ and ‘o’, as well as a slight dif­fer­ence be­tween the ‘o’s with Di­rectWrite on the right side, be­cause they are being ren­dered on sub-pixel off­sets:

"Volcano" close-up comparison with GDI and DirectWrite

The dif­fer­ence be­tween an­i­mat­ing with sub-pixel ren­der­ing and with­out is stag­ger­ing when we view it in mo­tion:

"Volcano" animation comparison with GDI and DirectWrite

Prior to Di­rectWrite the nor­mal way to an­i­mate like this was to ren­der to a tex­ture with mono­chrome anti-alias­ing (that is, with­out ClearType), and trans­form the tex­ture while ren­der­ing. The prob­lem with that is the trans­form will in­tro­duce a lot of im­per­fec­tions with­out ex­pen­sive su­per-sam­pling, and of course it won’t be able to use ClearType. With Di­rectWrite you get pixel-per­fect ClearType ren­der­ing every time.

Apparently WPF 4 is already using Direct2D and DirectWrite to some degree; hopefully high-quality text is in Flash's future as well. Firefox has also been looking at adding DirectWrite support, but I haven't seen any news of WebKit (Chrome/Safari) or Opera doing the same. It looks like Firefox might actually get it in before Internet Explorer. Edit: it looks like Internet Explorer 9 will use DirectWrite—I wonder which will go gold with the feature first?

Di­rec­t2D and Di­rectWrite are in­cluded in Win­dows 7, but Mi­crosoft has back­ported them in the Plat­form Up­date for Win­dows Server 2008 and Win­dows Vista so there’s no rea­son peo­ple who are stick­ing with Vista should be left out. Are there peo­ple stick­ing with Vista?

Efficient stream parsing in C++

A while ago I wrote about cre­at­ing a good parser and while the non-block­ing idea was spot-on, the rest of it re­ally isn’t very good in C++ where we have the power of tem­plates around to help us.

I’m cur­rently fin­ish­ing up a HTTP li­brary and have been re­vis­ing my views on stream pars­ing be­cause of it. I’m still not en­tirely set on my over­all im­ple­men­ta­tion, but I’m near­ing com­ple­tion and am ready to share my ideas. First, I’ll list my re­quire­ments:

To ac­com­plish this I broke this out into three lay­ers: a core parser, a buffer, and a buffer parser.

The core parser

Designing the core parser was simple. I believe I already have a solid C++ parser design in my XML library, so I'm reusing that concept. This is a fully in-situ pull parser that operates on a range of bidirectional iterators and returns a sub-range of those iterators. The pull function returns ok when it parses a new element, done when it has reached a point that could be considered an end of the stream, and need_more when an element can't be extracted from the passed-in iterator range. Using this parser is pretty simple:

typedef std::deque<char> buffer_type;
typedef http::parser<buffer_type::iterator> parser_type;

buffer_type buffer;

parser_type p;
parser_type::node_type n;
parser_type::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  std::deque<char>::iterator first = buffer.begin();

  while((r = p.parse(first, buffer.end(), n)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
        // handle the parsed element indicated by n.type here.
        break;
    }
  }

  buffer.erase(buffer.begin(), first); // remove all the used
                                       // data from the buffer.
} while(r == http::result_types::need_more);

By let­ting the user pass in a new range of it­er­a­tors to parse each time, we have the op­tion of up­dat­ing the stream with more data when need_more is re­turned. The parse() func­tion also up­dates the first it­er­a­tor so that we can pop any data prior to it from the data stream.

By de­fault the parser will throw an ex­cep­tion when it en­coun­ters an error. This can be changed by call­ing an over­load and han­dling the error re­sult type:

typedef std::deque<char> buffer_type;
typedef http::parser<buffer_type::iterator> parser_type;

buffer_type buffer;

parser_type p;
parser_type::node_type n;
parser_type::error_type err;
parser_type::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  std::deque<char>::iterator first = buffer.begin();

  while((r = p.parse(first, buffer.end(), n, err)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
        // handle the parsed element indicated by n.type here.
        break;
    }
  }

  buffer.erase(buffer.begin(), first); // remove all the used
                                       // data from the buffer.
} while(r == http::result_types::need_more);

if(r == http::result_types::error)
{
  std::cerr
    << "an error occured at "
    << std::distance(buffer.begin(), err.position())
    << ": "
    << err.what()
    << std::endl;
}

The buffer

Ini­tially I was test­ing my parser with a deque<char> like above. This let me test the it­er­a­tor-based parser very eas­ily by in­cre­men­tally push­ing data on, pars­ing some of it, and pop­ping off what was used. Un­for­tu­nately, using a deque means we al­ways have an extra copy, from an I/O buffer into the deque. It­er­at­ing a deque is also a lot slower than it­er­at­ing a range of point­ers be­cause of the way deque is usu­ally im­ple­mented. This in­ef­fi­ciency is ac­cept­able for test­ing, but just won't work in a live app.

My buffer class is I/O- and pars­ing-op­ti­mized, op­er­at­ing on pages of data. It al­lows pages to be in­serted di­rectly from I/O with­out copy­ing. Ones that weren't filled en­tirely can still be filled later, al­low­ing the user to com­mit more bytes of a page as read­able. One can use scat­ter/gather I/O to make op­er­a­tions span mul­ti­ple pages con­tained in a buffer.

The buffer ex­poses two types of it­er­a­tors. The first type is what we are used to in deque: just a gen­eral byte stream it­er­a­tor. But this in­curs the same cost as deque: each in­cre­ment to the it­er­a­tor must check if it's at the end of the cur­rent page and move to the next. A pro­to­col like HTTP can fit a lot of el­e­ments into a sin­gle 4KiB page, so it doesn't make sense to have this cost. This is where the sec­ond it­er­a­tor comes in: the page it­er­a­tor. A page can be thought of as a Range rep­re­sent­ing a sub­set of the data in the full buffer. Over­all the buffer class looks some­thing like this:

struct page
{
  const char *first;    // the first byte of the page.
  const char *last;     // one past the last byte of the page.
  const char *readpos;  // the first readable byte of the page.
  const char *writepos; // the first writable byte of the page,
                        // one past the last readable byte.
};

class buffer
{
public:
  typedef ... size_type;
  typedef ... iterator;
  typedef ... page_iterator;

  void push(page *p); // pushes a page into the buffer.  might
                      // be empty, semi-full, or full.

  page* pop(); // pops the first fully read page from from the buffer.

  void commit_write(size_type numbytes); // merely moves writepos
                                         // by some number of bytes.

  void commit_read(size_type numbytes); // moves readpos by
                                        // some number of bytes.

  iterator begin() const;
  iterator end() const;

  page_iterator pages_begin() const;
  page_iterator pages_end() const;
};

One thing you may no­tice is it ex­pects you to push() and pop() pages di­rectly onto it, in­stead of al­lo­cat­ing its own. I re­ally hate classes that al­lo­cate mem­ory – in terms of scal­a­bil­ity the fewer places that al­lo­cate mem­ory, the eas­ier it will be to op­ti­mize. Be­cause of this I al­ways try to de­sign my code to – if it makes sense – have the next layer up do al­lo­ca­tions. When it doesn't make sense, I doc­u­ment it. Hid­den al­lo­ca­tions are the root of evil.

The buffer parser

Un­like the core parser, the buffer parser isn't a tem­plate class. The buffer parser ex­poses the same func­tion­al­ity as a core parser, but using a buffer in­stead of it­er­a­tor ranges.

This is where C++ gives me a big advantage. The buffer parser is actually implemented with two core parsers. The first is a very fast http::parser<const char*>, used to parse as much of a single page as possible, stopping when it encounters need_more and no more data can be added to the page. The second is an http::parser<buffer::iterator>. This gets used when the first parser stops, which will happen very infrequently – only when an HTTP element spans multiple pages.

This is fairly easy to im­ple­ment, but re­quired a small change to my core parser con­cept. Be­cause each has sep­a­rate in­ter­nal state, I needed to make it so I could move the state be­tween two parsers that use dif­fer­ent it­er­a­tors. The amount of state is ac­tu­ally very small, mak­ing this a fast op­er­a­tion.

The buffer parser works with two dif­fer­ent it­er­a­tor types in­ter­nally, so I chose to al­ways re­turn a buffer::iterator range. The choice was ei­ther that or silently copy el­e­ments span­ning mul­ti­ple pages, and this way lets the user of the code de­cide how they want to han­dle it.

Using the buffer parser is just as easy as be­fore:

http::buffer buffer;
http::buffer_parser p;
http::buffer_parser::node_type n;
http::buffer_parser::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  while((r = p.parse(buffer, n)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
        // handle the parsed element indicated by n.type here.
        break;
    }
  }

  pop_used(buffer); // remove all the used
                    // data from the buffer.
} while(r == http::result_types::need_more);

The I/O layer

I'm leaving out an I/O layer for now. I will probably write several small I/O systems for it once I'm satisfied with the parser: perhaps one using asio, one using I/O completion ports, and one using epoll. I've designed this from the start to be I/O-agnostic, but with optimizations that facilitate efficient forms of all I/O, so I think it could be a good benchmark of the various I/O subsystems that different platforms provide.

One idea I've got is to use Winsock Ker­nel to im­ple­ment a ker­nel-mode HTTPd. Not a very good idea from a se­cu­rity stand­point, but would still be in­ter­est­ing to see the ef­fects on per­for­mance. Be­cause the parser per­forms no al­lo­ca­tion, no I/O calls, and doesn't force the use of ex­cep­tions, it should ac­tu­ally be very sim­ple to use in ker­nel-mode.

C++1x loses Concepts

Bjarne Stroustrup and Herb Sutter have both reported on the ISO C++ meeting in Frankfurt a week ago, in which the much-heralded "concepts" feature was removed from C++1x.

Con­cepts are a pow­er­ful fea­ture aimed at im­prov­ing over­load­ing (ba­si­cally, re­mov­ing the extra work in using things like it­er­a­tor cat­e­gories) and mov­ing type check­ing up the lad­der so that more rea­son­able error mes­sages can be pro­duced when a de­vel­oper passes in the wrong type (think a sin­gle error line in­stead of mul­ti­ple pages of tem­plate crap). Ap­par­ently the fea­ture was a lot less solid than most of us thought, with a huge amount of in­ter­nal ar­gu­ing within the com­mit­tee on a lot of the fun­da­men­tal fea­tures of it. It seems that while most agreed con­cepts were a good idea, no­body could agree on how to im­ple­ment them.

I'm definitely disappointed by this, but I'm also glad they chose to remove concepts instead of further delaying the standard, or worse, putting out a poorly designed one. Instead, it seems like there is hope for a smaller C++ update to come out in 4-5 years that adds a more well-thought-out concepts feature. There are plenty of other C++1x language features to be happy about in the meantime, though, like variadic templates, rvalue references, and lambda functions!

You may no­tice I've been say­ing C++1x here in­stead of C++0x—that's be­cause it's pretty ob­vi­ous to every­one now that we won't be get­ting the next C++ stan­dard in 2009, but more likely 2011 or 2012. Just in time for the end of the world!

I/O Improvements in Windows Vista

My tips for ef­fi­cient I/O are rel­e­vant all the way back to cod­ing for Win­dows 2000. A lot of time has passed since then though, and for all the crit­i­cism it got, Win­dows Vista ac­tu­ally brought in a few new ways to make I/O even more per­for­mant than be­fore.

This will prob­a­bly be my last post on user-mode I/O until some­thing new and in­ter­est­ing comes along, com­plet­ing what started a cou­ple weeks ago with High Per­for­mance I/O on Win­dows.

Synchronous completion

In the past, non-block­ing I/O was a great way to re­duce the stress on a com­ple­tion port. An un­for­tu­nate side-ef­fect of this was an in­creased amount of syscalls -- the last non-block­ing call you make will do noth­ing, only re­turn­ing WSAE­WOULD­BLOCK. You would still need to call an asyn­chro­nous ver­sion to wait for more data.

Win­dows Vista solved this el­e­gantly with Set­File­Com­ple­tion­No­ti­fi­ca­tion­Modes. This func­tion lets you tell a file or socket that you don't want a com­ple­tion packet queued up when an op­er­a­tion com­pletes syn­chro­nously (that is, a func­tion re­turned suc­cess im­me­di­ately and not ER­ROR_IO_PEND­ING). Using this, the last I/O call will al­ways be of some use -- ei­ther it com­pletes im­me­di­ately and you can con­tinue pro­cess­ing, or it starts an asyn­chro­nous op­er­a­tion and a com­ple­tion packet will be queued up when it fin­ishes.
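Here's a minimal sketch of the pattern, assuming the handle was opened for overlapped I/O and is already associated with a completion port:

#include <windows.h>

void enable_skip_on_success(HANDLE file)
{
    // One-time setup: don't queue a completion packet when an operation
    // on this handle completes synchronously.
    SetFileCompletionNotificationModes(file, FILE_SKIP_COMPLETION_PORT_ON_SUCCESS);
}

void read_more(HANDLE file, char *buf, DWORD len, OVERLAPPED *ov)
{
    DWORD bytes;

    if(ReadFile(file, buf, len, &bytes, ov))
    {
        // Completed synchronously: no completion packet will be queued,
        // so process `bytes` here and issue the next read right away.
    }
    else if(GetLastError() == ERROR_IO_PENDING)
    {
        // Started asynchronously: a completion packet will arrive on the
        // port when it finishes, as before.
    }
    else
    {
        // The operation failed outright.
    }
}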

Like the non-block­ing I/O trick, con­tin­u­ally call­ing this can starve other op­er­a­tions in a com­ple­tion port if a socket or file feeds data faster than you can process it. Care should be taken to limit the num­ber of times you con­tinue pro­cess­ing syn­chro­nous com­ple­tions.

Reuse memory with file handles

If you want to op­ti­mize even more for through­put, you can as­so­ci­ate a range of mem­ory with an un­buffered file han­dle using Set­FileIoOver­lappe­dRange. This tells the OS that a block of mem­ory will be re-used, and should be kept locked in mem­ory until the file han­dle is closed. Of course if you won't be per­form­ing I/O with a han­dle very often, it might just waste mem­ory.
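For example, something along these lines (a sketch; the buffer is assumed to live as long as the handle does):

#include <windows.h>

// Pin a long-lived buffer to an unbuffered, overlapped file handle so the
// memory manager keeps it locked until the handle is closed.
BOOL pin_io_buffer(HANDLE file, void *buffer, ULONG length)
{
    return SetFileIoOverlappedRange(file, (PUCHAR)buffer, length);
}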

Dequeue multiple completion packets at once

A new fea­ture to fur­ther re­duce the stress on a com­ple­tion port is GetQueued­Com­ple­tion­Sta­tu­sEx, which lets you de­queue mul­ti­ple com­ple­tion pack­ets in one call.

If you read the docs for it, you'll even­tu­ally re­al­ize it only re­turns error in­for­ma­tion if the func­tion it­self fails—not if an async op­er­a­tion fails. Un­for­tu­nately this im­por­tant in­for­ma­tion is miss­ing from all the of­fi­cial docs I could find, and search­ing gives noth­ing. So how do you get error in­for­ma­tion out of GetQueued­Com­ple­tion­Sta­tu­sEx? Well, after play­ing around a bit I dis­cov­ered that you can call GetOver­lappe­dResult or WSAGe­tOver­lappe­dResult to get it, so not a prob­lem.
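Here's a rough sketch of a dequeue loop doing exactly that. It assumes the handle was used as the completion key when it was associated with the port:

#include <windows.h>

// Dequeue up to 64 completion packets in a single call, then recover
// per-operation error information with GetOverlappedResult.
void drain_completions(HANDLE iocp)
{
    OVERLAPPED_ENTRY entries[64];
    ULONG count = 0;

    if(!GetQueuedCompletionStatusEx(iocp, entries, 64, &count, INFINITE, FALSE))
        return; // the dequeue itself failed.

    for(ULONG i = 0; i != count; ++i)
    {
        // Assumes the file/socket handle was stashed in the completion key.
        HANDLE h = (HANDLE)entries[i].lpCompletionKey;
        DWORD bytes = 0;

        if(GetOverlappedResult(h, entries[i].lpOverlapped, &bytes, FALSE))
        {
            // The operation succeeded and transferred `bytes`.
        }
        else
        {
            // The operation failed; GetLastError() has the reason.
        }
    }
}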

This function should only be used if your application has a single thread or handles a large amount of concurrent I/O; otherwise you might end up defeating the multithreading baked into completion ports by not letting the port spread completion notifications around. If you can use it, it's a nice and easy way to boost the performance of your code by lowering contention on a completion port.

Bandwidth reservation

One large change in Windows Vista was I/O scheduling and prioritization. If you have I/O that is dependent on steady streaming, like audio or video, you can now use SetFileBandwidthReservation to help ensure it will never be interrupted by something else hogging the disk.

Cancel specific I/O requests

A big pain pre-Vista was the in­abil­ity to can­cel in­di­vid­ual I/O op­er­a­tions. The only op­tion was to can­cel all op­er­a­tions for a han­dle, and only from the thread which ini­ti­ated them.

If it turns out some I/O op­er­a­tion is no longer re­quired, it is now pos­si­ble to can­cel in­di­vid­ual I/Os using Can­ce­lIoEx. This much needed func­tion re­places the al­most use­less Can­ce­lIo, and opens the doors to shar­ing file han­dles be­tween sep­a­rate op­er­a­tions.
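
A quick sketch of how that looks; passing NULL for the OVERLAPPED instead cancels every outstanding request on the handle:

#include <windows.h>

// Cancel one specific outstanding request, identified by its OVERLAPPED.
void cancel_request(HANDLE file, OVERLAPPED *ov)
{
    if(!CancelIoEx(file, ov) && GetLastError() != ERROR_NOT_FOUND)
    {
        // ERROR_NOT_FOUND just means the request had already completed;
        // anything else is a real failure.
    }

    // The cancelled request still produces a completion packet (typically
    // with ERROR_OPERATION_ABORTED), which is when its OVERLAPPED and
    // buffer can safely be reused.
}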