-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exhaustive MySQL Parser #157
Conversation
logic. Refactor the grammar.
…le for a more useful parse tree, 15x smaller grammar file, and faster parsing time
@bgrgicak tested all plugins in the WordPress plugin directory for installation errors. The top 1000 results are published at https://github.com/bgrgicak/playground-tester/blob/main/logs/2024-09-13-09-22-17/wordpress-seo. A lot of these are about SQL queries. Just migrating to new parser would solve many of these errors and give us a proper foundation to add support for more MySQL features. CC @JanJakes |
d3e623d
to
135f29f
Compare
6618be1
to
9971062
Compare
9971062
to
afc70bb
Compare
@adamziel Thanks for the great and detailed review! Great comments and ideas! I think I now went through all of them Some of the improvements are:
There is still more to be done, better documented, and polished, but I think now it would make sense to do that in another pass. I added TODOs for all issues I'm aware of, including naming, missing documentation, etc.
If we run a development version of MySQL, we can use the SHOW PARSE_TREE statement to inspect what parse tree MySQL produces. I haven't investigated yet to which extent we could really use it in scale (e.g., if we can somehow map it to our ASTs), but it may be something worth exploring. Otherwise, I think we may also need to create a test suite of expected ASTs manually, and then gradually expand it. |
@JanJakes marvelous work ❤️ This looks really good and I don't see anything blocking. Minor nitpicks aside, we're good to merge – thank you so much for your hard work and perseverance here! |
I can't approve my own PR, but I can merge it. Just say when and I'll click the button 🎉 |
@adamziel Thanks! Addressed all but #157 (comment). When I'm sure I know what we want here, I can fix that too. |
This PR is in a great place, monumental work here @JanJakes! |
Exhaustive MySQL Parser (WordPress#157)
Context
This PR ships an exhaustive MySQL lexer and parser that produce a MySQL query AST. This is the first step to significantly improve MySQL compatibility and expand WordPress plugin support on SQLite. It's an easier, more stable, and an easier to maintain method than the current token processing. It will also dramatically improve WordPress Playground experience – database integration is the single largest source of issues.
This PR is part of the Advanced MySQL support project.
See the MySQL parser proposal for additional context.
This PR ships
At the moment, all the new files are omitted from the plugin build, so they have no effect on production whatsoever.
Running tests
The lexer & parser tests suite is not yet integrated into the CI and existing test commands. To run the tests, use:
This will lex / lex & parse all the ~70k queries.
Implementation
Parser
A simple recursive parser to transform
(token stream, grammar) => parse tree
. In this PR, we use MySQL tokens and MySQL grammar, but the same parser could also support XML, IMAP, many other grammars (as long as they have some specific properties).The
parse_recursive()
method is just 100 lines of code (excluding comments). All of the parsing rules are provided by the grammar.run-mysql-driver.php
A quick and dirty implementation of what a
MySQL parse tree ➔ SQLite
database driver could look like. It easily supportsWITH
andUNION
queries that would be really difficult to implement the current SQLite integration plugin.The tree transformation is an order of magnitude easier to read, expand, and maintain than the current implementation. I stand by this, even though the temporary
ParseTreeTools
/SQLiteTokenFactory
API included in this PR seems annoying, and I'd like to ship something better than that. Here's a glimpse:Technical details
MySQL Grammar
We use the MySQL workbench grammar, manually adapted, modified, and fixed, and converted from ANTLR4 format to a PHP array.
The grammar conversion pipeline is done by
convert-grammar.php
and goes like this:query ::= SELECT (ALL | DISTINCT)
becomesquery ::= select %select_fragment0
and%select_fragment0 ::= ALL | DISTINCT
.*
,+
,?
modifiers into separate, right-recursive rules. For example,columns ::= column (',' column)*
becomescolumns ::= column columns_rr
andcolumns_rr ::= ',' column | ε
.The
mysql-grammar.php
file size is ~70kb in size, which is small enough. The parser can handle about 1000 complex SELECT queries per second on a MacBook Pro. It only took a few easy optimizations to go from 50/seconds to 1000/second. There's a lot of further optimization opportunities once we need more speed. We could factor the grammar in different ways, explore other types of lookahead tables, or memoize the matching results per token. However, I don't think we need to do that in the short term. If we spend enough time factoring the grammar, we could potentially switch to a LALR(1) parser and cut most time spent on dealing with ambiguities.Known issues
There are some small issues and incomplete edge cases. Here are the ones I'm currently aware of:
.
, this is possible (e.g.,1ea10.1
is a table name & column name). This is not handled yet, and it may be worth checking if all cases in the identifier part after a.
are handled correctly./*!80038 ... /
and nested comments within them, a nested comment within a non-matched version is not supported (e.g.,SELECT 1 /*!99999 /* */ */
). Additionally, we currently support only 5-digit version specifiers (80038
), but 6 digits should probably work as well (080038
)._utf8
underscore charset should be version-dependent (only on MySQL 5), and maybe some others are too. We can check this bySHOW CHARACTER SET
on different MySQL versions.This list is mainly for me, in order not to forget these. I will later port it into a tracking issue with a checklist.
Updates
Since the thread here is pretty long, here are quick links to the work-in-progress updates:
Next steps
These could be implemented either in follow-up PRs or as updates to this PR – whichever is more convenient:
MySQLOnSQLite
database driver to enable running MySQL queries on SQLite. Read this comment for more context. Use any method that's convenient for generating SQLite queries. Feel free to restructure and redo any APIs proposed in this PR. Be inspired by the idea we may build aMySQLOnPostgres
driver one day, but don't actually build any abstractions upfront. Make the driver generic so it can be used without WordPress. Perhaps it could implement a PDO driver interface?MySQLOnSQLite
driver. For example,SQL_CALC_FOUND_ROWS
option or theINTERVAL
syntax.MySQLOnSQLite
driver and ensure they pass.MySQLOnSQLite
driver instead of the current plumbing.