Efficient stream parsing in C++

A while ago I wrote about creating a good parser and while the non-blocking idea was spot-on, the rest of it really isn’t very good in C++ where we have the power of templates around to help us.

I’m currently finishing up a HTTP library and have been revising my views on stream parsing because of it. I’m still not entirely set on my overall implementation, but I’m nearing completion and am ready to share my ideas. First, I’ll list my requirements:

To accomplish this I broke this out into three layers: a core parser, a buffer, and a buffer parser.

The core parser

Designing the core parser was simple. I believe I already have a solid C++ parser design in my XML library, so I’m reusing that concept. This is fully in-situ pull parser that operates on a range of bidirectional iterators and returns back a sub-range of those iterators. The pull function returns ok when it parses a new element, done when it has reached a point that could be considered an end of the stream, and need_more when an element can’t be extracted from the passed in iterator range. Using this parser is pretty simple:

typedef std::deque<char> buffer_type;
typedef http::parser<buffer_type::iterator> parser_type;

buffer_type buffer;

parser_type p;
parser_type::node_type n;
parser_type::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  std::deque<char>::iterator first = buffer.begin();

  while((r = p.parse(first, buffer.end(), n)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
    }
  }

  buffer.erase(buffer.begin(), first); // remove all the used
                                       // data from the buffer.
} while(r == http::result_types::need_more);

By letting the user pass in a new range of iterators to parse each time, we have the option of updating the stream with more data when need_more is returned. The parse() function also updates the first iterator so that we can pop any data prior to it from the data stream.

By default the parser will throw an exception when it encounters an error. This can be changed by calling an overload and handling the error result type:

typedef std::deque<char> buffer_type;
typedef http::parser<buffer_type::iterator> parser_type;

buffer_type buffer;

parser_type p;
parser_type::node_type n;
parser_type::error_type err;
parser_type::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  std::deque<char>::iterator first = buffer.begin();

  while((r = p.parse(first, buffer.end(), n, err)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
    }
  }

  buffer.erase(buffer.begin(), first); // remove all the used
                                       // data from the buffer.
} while(r == http::result_types::need_more);

if(r == http::result_types::error)
{
  std::cerr
    << "an error occured at "
    << std::distance(buffer.begin(), err.position())
    << ": "
    << err.what()
    << std::endl;
}

The buffer

Initially I was testing my parser with a deque<char> like above. This let me test the iterator-based parser very easily by incrementally pushing data on, parsing some of it, and popping off what was used. Unfortunately, using a deque means we always have an extra copy, from an I/O buffer into the deque. Iterating a deque is also a lot slower than iterating a range of pointers because of the way deque is usually implemented. This inefficiency is acceptable for testing, but just won't work in a live app.

My buffer class is I/O- and parsing-optimized, operating on pages of data. It allows pages to be inserted directly from I/O without copying. Ones that weren't filled entirely can still be filled later, allowing the user to commit more bytes of a page as readable. One can use scatter/gather I/O to make operations span multiple pages contained in a buffer.

The buffer exposes two types of iterators. The first type is what we are used to in deque: just a general byte stream iterator. But this incurs the same cost as deque: each increment to the iterator must check if it's at the end of the current page and move to the next. A protocol like HTTP can fit a lot of elements into a single 4KiB page, so it doesn't make sense to have this cost. This is where the second iterator comes in: the page iterator. A page can be thought of as a Range representing a subset of the data in the full buffer. Overall the buffer class looks something like this:

struct page
{
  const char *first;    // the first byte of the page.
  const char *last;     // one past the last byte of the page.
  const char *readpos;  // the first readable byte of the page.
  const char *writepos; // the first writable byte of the page,
                        // one past the last readable byte.
};

class buffer
{
public:
  typedef ... size_type;
  typedef ... iterator;
  typedef ... page_iterator;

  void push(page *p); // pushes a page into the buffer.  might
                      // be empty, semi-full, or full.

  page* pop(); // pops the first fully read page from from the buffer.

  void commit_write(size_type numbytes); // merely moves writepos
                                         // by some number of bytes.

  void commit_read(size_type numbytes); // moves readpos by
                                        // some number of bytes.

  iterator begin() const;
  iterator end() const;

  page_iterator pages_begin() const;
  page_iterator pages_end() const;
};

One thing you may notice is it expects you to push() and pop() pages directly onto it, instead of allocating its own. I really hate classes that allocate memory – in terms of scalability the fewer places that allocate memory, the easier it will be to optimize. Because of this I always try to design my code to – if it makes sense – have the next layer up do allocations. When it doesn't make sense, I document it. Hidden allocations are the root of evil.

The buffer parser

Unlike the core parser, the buffer parser isn't a template class. The buffer parser exposes the same functionality as a core parser, but using a buffer instead of iterator ranges.

This is where C++ gives me a big advantage. The buffer parser is actually implemented with two core parsers. The first is a very fast http::parser<const char*>. It uses this to parse as much of a single page as possible, stopping when it encounters need_more and no more data can be added to the page. The second is a http::parser<buffer::iterator>. This gets used when the first parser stops, which will happen very infrequently – only when a HTTP element spans multiple pages.

This is fairly easy to implement, but required a small change to my core parser concept. Because each has separate internal state, I needed to make it so I could move the state between two parsers that use different iterators. The amount of state is actually very small, making this a fast operation.

The buffer parser works with two different iterator types internally, so I chose to always return a buffer::iterator range. The choice was either that or silently copy elements spanning multiple pages, and this way lets the user of the code decide how they want to handle it.

Using the buffer parser is just as easy as before:

http::buffer buffer;
http::buffer_parser p;
http::buffer_parser::node_type n;
http::buffer_parser::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  while((r = p.parse(buffer, n)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
    }
  }

  pop_used(buffer); // remove all the used
                    // data from the buffer.
} while(r == http::result_types::need_more);

The I/O layer

I'm leaving out an I/O layer for now. I will probably write several small I/O systems for it once I'm satisfied with the parser. Perhaps one using asio, one using I/O completion ports, and one using epoll. I've designed this from the start to be I/O agnostic but with optimizations that facilitate efficient forms of all I/O, so I think it could be an good benchmark of the various I/O subsystems that different platforms provide.

One idea I've got is to use Winsock Kernel to implement a kernel-mode HTTPd. Not a very good idea from a security standpoint, but would still be interesting to see the effects on performance. Because the parser performs no allocation, no I/O calls, and doesn't force the use of exceptions, it should actually be very simple to use in kernel-mode.

Posted on September 01, 2009 in C++, Coding, Parsing, Scalability

Related Posts