CS2 Discussion: Project: Revisiting implementation of a new parser #77

bd82 · 2017-03-20T18:46:01Z

Hello. I'm a long time lurker here and would like to (re?)-raise a proposal.
Sorry for the long post 😄

The What:

I'd like to re-raise the issue of implementing a new parser for CoffeeScript.

Some previous related discussions:

Scope:

The scope is intentionally limited to only creating a new parser.
No intent to touch the lexer & re-writer nor to modify the code generation parts.

The why:

As previously discussed the existing CS compiler infrastructure is a limiting
factor in the long term for CoffeeScript.

Strings based code generation
Incompatibility with Babel AST structures
Difficulty in expanding Jison's capabilities
- See pulse of Jison.
- See commit graph of Jison.
- Particularly in the context of language services related capabilities such as error recovery / partial parsing.

Replacing the whole pipeline at once requires more resources than available to this project.
And even if those resources were available it is still a very risky approach.

Therefore an incremental approach is needed.

Architecture:

I propose to create a separation between the syntactic analysis and the AST creation.
This means that logic that creates the AST must not be embedded inside the parser.
Instead the parser should create a more low level structure, a Parse Tree / Concrete Syntax Tree.
which could be transformed afterwards to serve different needs, for example:

Transformation to create the existing CS AST to support the existing compiler backend.
Transformation to create a Babel AST to support a new experimental compiler backend.
Transformation to an enriched AST structure that represents the entire syntactic information to support
language services tool such a formatting & refactoring.

This proposed separation of concerns will help to future proof the CoffeeScript compiler
by enabling future incremental changes such as replacing the compiler backend without
modifying (or diverging from) the compiler frontend (parser).

The How:

Warning Sales pitch incoming

Normally the standard approach to writing a parser for a compiler is to write one "by hand".

See quote from Terence Parr (the creator of Antlr):

In my experience, almost no one uses parser generators to build commercial compilers.

The problem with this approach is that it can be a bit repetitive and error prone work.
And that implementing more advanced capabilities such as fault tolerance capabilities can be complex.
fortunately the last time I needed to write an hand built parser I was too lazy 😸 and instead
created a library that makes it easier to hand build parsers in JavaScript: Chevrotain
without any code generation.

Relevant Highlights:

JSON Grammar example in CoffeeScript.
Performance Benchmark (Jison included).
- About one order of magnitude faster than Jison on Chrome 57.
Automatic CST output creation.
- This in an online playground, the first example also creates a CST output.
Automatic fault tolerance capabilities.

The proposal is to write the new CoffeeScript parser in CoffeeScript (no code generation).
Using the Chevrotain Parsing library.

The who:

I can contribute enough time to try implementing this.
I obviously can't make any promises, but this won't be the first parser I've written so I've got a decent
chance of success.

Risks & Issues:

Factoring away left recursion (for LL(k) parser) may result in uglier parse trees.
Do the CoffeeScript's Token contain full position information?
- A worst case may require changes to the re-writer or even replacing the whole
  lexer -> re-writer -> parser flow, but that is a less incremental approach.
My CoffeeScript skills are lacking, may require assistance in getting the code to decent quality.
Error messages contents and structure for invalid inputs will change.
Testing that the AST output is the same requires a large amount of valid CS source code.
Additional abstraction and separation will have an overhead performance wise.
- Should be mitigated by the higher base performance of Chevrotain vs Jison.

Questions:

Any feedback / suggestions?
Am I missing some blocker or potential show stopper here?
Is this approach acceptable/approved by the project leaders?
If a POC succeeds will there be assistance in integrating this into the CoffeeScript code base?
What percentage of the CoffeeScript running time is spent parsing?
- I'm trying to figure out the potential performance benefits for an E2E compilation flow.

rattrayalex · 2017-03-20T19:13:18Z

As previously discussed the existing CS compiler infrastructure is a limiting factor in the long term for CoffeeScript.

Is it? My impression is that backwards-incompatibility is the biggest limiting factor, coupled with a lack of general appetite for new language features. Perhaps I haven't been paying close enough attention.

I've personally been super-impressed with what the CS2 team has accomplished, and I'm not personally aware of issues that are blocked on "the parser is too hard to work with". Perhaps it would help to list a few?

bd82 · 2017-03-20T19:41:02Z

Thanks for the quick feedback @rattrayalex.

I'm not talking about CS2. but the longer term: CS3+.
The approach taken to create CS2 appears to be the right approach to
quickly achieve the desired results from the end user perspective
while maintaining backwards compatibility.

But what about the future? Will CoffeeScript still rely on a Parsing library that was last updated in April 2016, Will this still be true in 2019 too?
Does CoffeeScript ever want a new backend based on Babel? If so how to achieve this gradually?
Does CoffeeScript ever want to implement language services as part of the compiler(Like TypeScript)? Is so, is this achievable with the current architecture ?

My suggestion tries to find a gradual process and architecture to attack these longer
term strategic problems, While mitigating concerns regarding backward compatibility.
as a full compiler rewrite is probably not feasible or desired.

References: (more in #25)

@GeoffreyBooth said:

Don’t get me wrong: I love the idea of a new compiler, especially one that passes an AST to Babel to generate JavaScript; but I wonder if it’s the best way to get to 2.0.0. We could release 2.0.0 first, then refactor the compiler afterward if people have the motivation. A better compiler will certainly make it easier to implement whatever ES2016 and ES2017 and so on throws our way going forward.
@JimPanic said

really am not sure tbh. I looked a bit deeper into the current code base last week and I think in the long run, we'd benefit a lot from a different approach lexing, parsing and transforming the AST in multiple phases even. The structure right now is quite coupled.
I'd still like to see a gradual transition towards a new approach, though, whatever that might be in the future. Mostly because (I suppose) nobody knows all the quirks and undefined implementation-defined behavior, but also to give users a chance to catch up on these very probable, small behavioral changes. Otherwise the upgrade path seems too steep tbh.
@GeoffreyBooth said

If none of these experiments bear fruit, the backup plan is to keep the original compiler and gradually work to improve it. We know that that’s an option, though not as satisfying as replacing its hairiest parts with cleaner and more-extensible code.

GeoffreyBooth · 2017-04-04T06:34:17Z

@bd82 Thanks for suggesting this, and offering your time. I for one would be interested, though I shudder to think of how ambitious this effort could be. But if you want to take it on, be my guest.

When I was working on adding support for modules, @lydell and I discussed someday mapping the CoffeeScript grammar nodes to Babel’s AST, and we named the nodes (like ImportSpecifier) to match. It would be tremendously helpful if CoffeeScript could parse and tokenize and then pass an AST to Babel for Babel to generate the actual JavaScript. nodes.coffee is by far the largest and most complicated part of the CoffeeScript codebase, and if it could be replaced by Babel that would be a huge savings. And with Babel part of the compilation chain, people would be able to choose their output target: ESNext, ES5, etc. There wouldn’t be a need to further transpile CS2’s output.

The other benefit is that adopting new ECMAScript features would happen much faster and with less effort. If Babel already knew how to generate the JavaScript for a certain feature, all we would need to do is decide what CoffeeScript’s syntax for it should be and figure out how to parse it into an AST. For many features, that’s the easy part.

bd82 · 2017-04-04T07:45:07Z

@bd82 Thanks for suggesting this, and offering your time. I for one would be interested, though I shudder to think of how ambitious this effort could be. But if you want to take it on, be my guest.

I'll have some time over the passover holidays to play around with some POCs.
The project is not that big in terms of code, rather in terms of compatibility and sneaky edge cases...

On Modules and Babel.

Just to be clear, I'm not trying to change the compiler backend (code generation).
That would indeed be a far too ambitious project while also modifying the compiler frontend (similar to a full rewrite).

I'm limiting my scope to modifying parts of compiler frontend (parser).
To make implementing future capabilities such as code generation using Babel be more feasible in the future, both from a technical complexity perspective and from a backward complexity perspective.

GeoffreyBooth · 2017-04-04T17:52:03Z

Sure. The parser could certainly be more organized. As long as all the tests continue to pass once you're done 😄

auvipy · 2017-04-14T11:27:07Z

how about incorporating changes from https://github.com/michaelficarra/CoffeeScriptRedux ?

lydell · 2017-04-14T11:46:50Z

@auvipy That has already been discussed in other issues.

bd82 · 2017-08-03T07:40:52Z

Oops forgot to update this issue 😢

I've played around with this a few months ago with implementing this and reached the conclusion that CoffeeScript is better off with a bottom up (LR) parser instead of a top down parser (LL).

I Initially thought that a conversion would only require handling different patterns for handling
the binary expressions due to lack of left recursion in top down recursive decent parsers.

However as I went deeper and deeper into the grammar I realized that many rules actually
rely on the power of LR grammars to distinguish betweens productions.
For example: In one of the most basic rules (Line) to distinguish between a statement and an Expression may require infinite lookahead, for example:

   // statement
   return 1
   
   // expression, 2 tokens lookahead to distinguish
   return 1 if (true)

   // expression, infinite tokens lookahead to distinguish.
   return 1  + 1 + 1 + ... + 1 if (true)

This works with an LR parser because it can "delay" the decision and reduce the possible
valid matches as it parses along, but an LL parser needs to be able to decide in advance (lookahead).

This does not mean that the grammar cannot be converted to a top down (LL) form (was already done in other coffeeScript related projects...)
Just that that conversion would result in an "uglier" grammar and is more likely to
not comply 100% to the original due to many refactoring.

So because it appears the kind of tool I wanted to use is not the best tool for the job
and additionally these extra required refactoring will increase the already large scope beyond what I would consider attacking as a side project I will unfortunately have to admit defeat here 😭

coffeescriptbot · 2018-02-19T11:30:05Z

Migrated to jashkenas/coffeescript#4970

GeoffreyBooth added project meta labels Apr 10, 2017

bd82 closed this as completed Aug 3, 2017

coffeescriptbot changed the title ~~Revisiting implementation of a new Parser.~~ CS2 Discussion: Project: Revisiting implementation of a new parser Feb 19, 2018

coffeescriptbot mentioned this issue Feb 19, 2018

CS2 Discussion: Project: Revisiting implementation of a new parser jashkenas/coffeescript#4970

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CS2 Discussion: Project: Revisiting implementation of a new parser #77

CS2 Discussion: Project: Revisiting implementation of a new parser #77

bd82 commented Mar 20, 2017

rattrayalex commented Mar 20, 2017

bd82 commented Mar 20, 2017

GeoffreyBooth commented Apr 4, 2017

bd82 commented Apr 4, 2017 •

edited

Loading

GeoffreyBooth commented Apr 4, 2017

auvipy commented Apr 14, 2017

lydell commented Apr 14, 2017

bd82 commented Aug 3, 2017

coffeescriptbot commented Feb 19, 2018

CS2 Discussion: Project: Revisiting implementation of a new parser #77

CS2 Discussion: Project: Revisiting implementation of a new parser #77

Comments

bd82 commented Mar 20, 2017

The What:

Scope:

The why:

Architecture:

The How:

The who:

Risks & Issues:

Questions:

rattrayalex commented Mar 20, 2017

bd82 commented Mar 20, 2017

GeoffreyBooth commented Apr 4, 2017

bd82 commented Apr 4, 2017 • edited Loading

On Modules and Babel.

GeoffreyBooth commented Apr 4, 2017

auvipy commented Apr 14, 2017

lydell commented Apr 14, 2017

bd82 commented Aug 3, 2017

coffeescriptbot commented Feb 19, 2018

bd82 commented Apr 4, 2017 •

edited

Loading