Writing a good parser
From a simple binary protocol used over sockets to a complex XML document, many applications depend on parsing. Why, then, do the great majority of parsers out there just plain suck?
I’ve come to believe a good parser should have two properties:
- It should be completely decoupled from any I/O. The same parser should work on a file, a pipe, a socket, or from straight memory. This not only gets you flexibility but the ability to make test cases simpler: protocol handler acting up? Just write a test case that reads it from a file instead.
- Other than perhaps memory allocation, a parser should never block. To accomplish the above task, I’ve seen several parsers use read/write callbacks. This causes the parser to block on I/O. Should you ever need to use the parser in a scalable environment, this just won’t do—a good server only has one or two threads per CPU and blocking on I/O doesn’t allow this.
Several parsers meet the first requirement, but almost none meet the second: they expect their input to come in without a break in execution. So how do you accomplish this? Lately I’ve been employing this fairly simple design:
struct buffer { struct buffer *next; char *buf; size_t len; }; struct parser { struct buffer *input; struct buffer *lastinput; struct buffer *output; int (*func)(struct parser*, struct *state); }; enum { CONTINUE, NEEDMORE, GOTFOO, GOTBAR }; int parse(struct parser *p) { int ret; while((ret = p->func(p)) == CONTINUE); return ret; }
The idea should be easy to understand:
- Add buffer(s) to input queue.
- If
parse()
returnsNEEDMORE
, add more input to the queue and call it again. - If
parse()
returnsGOTFOO
orGOTBAR
, state is filled with data. - The function pointer is continually updated with a parser specialized for the current data: in this case, a
FOO
or aBAR
, or even just bits and pieces of aFOO
or aBAR
. It returnsCONTINUE
ifparse()
should just call a new function pointer. - As the parser function pointers eat at the input queue, put used buffers into the output stack.
Other than meeting my two requirements above, the best thing about this design? It doesn’t sacrifice cleanliness, and it won’t cause your code size to increase.
Related Posts
- Efficient stream parsing in C++ on September 01, 2009 in Coding, Parsing
- Is C# the Boost of C-family languages? on October 28, 2010 in C, Coding
- C++0x work progressing on November 28, 2007 in Coding
- strncpy is not your friend on January 20, 2008 in C, Coding
- Visual Studio 2008 released, TR1 support coming on November 24, 2007 in Coding