Writing a good parser
From a simple binary protocol used over sockets to a complex XML document, many applications depend on parsing. Why, then, do the great majority of parsers out there just plain suck?
I’ve come to believe a good parser should have two properties:
- It should be completely decoupled from any I/O. The same parser should work on a file, a pipe, a socket, or from straight memory. This not only gets you flexibility but the ability to make test cases simpler: protocol handler acting up? Just write a test case that reads it from a file instead.
- Other than perhaps memory allocation, a parser should never block. To accomplish the above task, I’ve seen several parsers use read/write callbacks. This causes the parser to block on I/O. Should you ever need to use the parser in a scalable environment, this just won’t do—a good server only has one or two threads per CPU and blocking on I/O doesn’t allow this.
Several parsers meet the first requirement, but almost none meet the second: they expect their input to come in without a break in execution. So how do you accomplish this? Lately I’ve been employing this fairly simple design:
struct buffer {
struct buffer *next;
char *buf;
size_t len;
};
struct parser {
struct buffer *input;
struct buffer *lastinput;
struct buffer *output;
int (*func)(struct parser*, struct *state);
};
enum {
CONTINUE,
NEEDMORE,
GOTFOO,
GOTBAR
};
int parse(struct parser *p) {
int ret;
while((ret = p->func(p)) == CONTINUE);
return ret;
}
The idea should be easy to understand:
- Add buffer(s) to input queue.
- If parse() returns NEEDMORE, add more input to the queue and call it again.
- If parse() returns GOTFOO or GOTBAR, state is filled with data.
- The function pointer is continually updated with a parser specialized for the current data: in this case, a FOO or a BAR, or even just bits and pieces of a FOO or a BAR. It returns CONTINUE if parse() should just call a new function pointer.
- As the parser function pointers eat at the input queue, put used buffers into the output stack.
Other than meeting my two requirements above, the best thing about this design? It doesn’t sacrifice cleanliness, and it won’t cause your code size to increase.


