Efficient Documentation Using SQL Grammar Diagrams

(Also published on the Cockroach Labs Blog.)

As CockroachDB approaches beta, user documentation has become increasingly important, and one of the meatiest requirements is documentation of our SQL implementation. For inspiration, I researched how other databases have documented SQL. The most effective example I found was SQLite’s grammar diagrams.

These diagrams feature easy-to-understand railroad diagrams showing the possible options for a SQL statement. Compared to a text representation, these visual diagrams give users an intuitive way to explore the grammar and discover features.

Converting Grammar into Images with Yacc and EBNF

There are various programs that can take a well-specified grammar file and convert it into images. Of the ones I saw, I was most impressed with the Railroad Diagram Generator. It produces linked SVG images that can easily be embedded into a web page and manipulated. However, this generator requires input in EBNF form. The CockroachDB grammar is defined in a yacc file, from which the yaac program generates source code that parses SQL. As yacc has a specified format, it is straightforward to parse and convert to EBNF. One program that does this is yyextract from the cutils package on many Linux distributions. yyextract produces just BNF files. But with some short regexes, it was possible to convert our sql.y into a valid EBNF file that the generator could understand.

Inlining and Simplification AKA Documentation for Humans

With the proof-of-concept complete, I had much more work left to make these diagrams useful to humans. We now had one huge HTML page with every possible option, but what we really needed was something similar to what SQLite provides: a single image that displays top-level, useful information with options to go deeper. Taking ALTER TABLE as an example, it was clear where this would get tricky. ALTER TABLE contains a reference to alter_table_cmds, which allows any number of alter_table_cmd references separated by commas. That’s at least three different statements just to figure out what ALTER TABLE can do. Instead of clicking through to each of those, the useful ones should be inlined into the top ALTER TABLE statement. That is, instead of a referencing other statements, they should be included directly. I accomplished this by writing my own parser for EBNF, parsing the output of yyextract, modifying it, and then feeding it into the diagram generator. This reduced the depth of the statements and made them much more usable. I worked in other helpful simplifications as well. For example, I used a simplification rule to convert awkwardly-defined lists into a nice form with a feedback loop.

However, there are other simplifications I would still like to implement. For example, many statements have IF EXISTS expressions. Currently, these statements have two expressions: one with and one without the IF EXISTS clause. A simplification that combines these two expressions into one would further reduce the complexity of some diagrams.

How to Diagram Unimplemented SQL Statements

As CockroachDB is a new project, many esoteric or difficult parts of the full SQL grammar are not yet implemented. We allow for them in our parser, but they will always produce an error describing them as unimplemented. We want our documentation to be accurate and concise, not cluttered with notes about whether something displayed works or not, so we want to filter unimplemented expressions out of our generated diagrams. The yyextract tool used in the initial proof-of-concept outputs all of the parsing rules listed in the sql.y file, but not their implementations (or lack thereof). Thus, we needed a yacc parser that allows us to fully inspect the grammar. I was not able to find a Go package that could successfully parse our sql.y file. The Go tool itself has a yacc parser and generator, but it is translated from a C program and was not built for this kind of inspection. Yacc is not a complicated language, so it made some sense to build a custom parser ourselves. I used the Go text/template/parse package as a boilerplate, and modified it to produce a yacc AST. With the parsed yacc file in memory, it was possible to remove any expressions that were marked unimplemented as well as statements containing only unimplemented expressions.

Summary

These tools allow us to automatically generate all of the SQL diagrams in our documentation. We have a document describing the full grammar, as well as smaller pages listing single statements. All diagrams link references to the full grammar, making it simple to explore. The code for this is in our repository. Now, anytime we modify the SQL grammar to add a new feature, all the diagrams can be regenerated with a single command.