Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CS2] Literate CoffeeScript without dependencies #4509

Merged
merged 4 commits into from
Apr 18, 2017

Conversation

GeoffreyBooth
Copy link
Collaborator

I went back to the drawing board and took a crack at reimplementing invertLiterate without Marked or Markdown-It or any other dependencies—but still passing the tricky additional tests added in #4345. I think I’ve succeeded, but @lydell and @billymoon you should please review this and see if there are additional test cases we should add to cover edge cases I’m not thinking of.

There are still a few breaking changes from 1.x:

  • As has been the case in all the v2 versions of Literate CoffeeScript, simply wrapping a code block in HTML tags is insufficient to get it to be treated as a comment. The code block must also have at least one line unindented.
  • A Literate CoffeeScript file needs to be consistently indented either always with tabs or always with spaces.
  • Putting a code block inside a list item means indenting with spaces to the column of the first non-whitespace character of that list item, and then with your usual indentation beyond that point.

That last point is a little confusing, I concede. It could lead to mixing tabs and spaces, which otherwise would be verboten:

 *  A list item with a code block:

    	test "basic literate CoffeeScript parsing", ->
    		ok yes

The A here is at column 5. This file is otherwise indented with tabs, so the test line is indented with four spaces then one tab. Basically, being within a list resets the left column to be the same as the column of the first non-whitespace character in the list item body (i.e. the A in this example).

I’m not sure Markdown-It handles this any better, so I’m not sure there’s anything to be gained at this point from adding the dependency. (We still have it as a devDependency for building the documentation.) What do you all think?

…tion levels (including inside lists); update literate test files to also check that no tests are skipped
@GeoffreyBooth GeoffreyBooth added this to the 2.0.0 milestone Apr 16, 2017
@GeoffreyBooth
Copy link
Collaborator Author

@akfish This passes your Case 1 but not your Case 2. I’m not sure how we could possibly pass Case 2 without detecting all other types of Markdown blocks (headings, thematic breaks, etc.). I think for the purposes of Literate CoffeeScript we can just specify that code blocks need to start after a blank line?

How does this PR otherwise do against the CommonMark test suite?

@lydell
Copy link
Collaborator

lydell commented Apr 16, 2017

I know too little about the gnarly details of markdown parsing to be able to give any constructive feedback.

@GeoffreyBooth
Copy link
Collaborator Author

@lydell but you’re a regex expert, so you can at least tell us how these new regexes can be improved 😄 The only Markdown-specific part is identifying the first line of a list item (starts with *, -, + or a number like 1.). The rest is tracking whether we’re still inside a list item—i.e. is a successive line indented because it’s within a list. If it’s indented further, it’s a code block within a list.

@billymoon
Copy link
Contributor

@GeoffreyBooth my view is that parsing markdown is tricky, especially as there are small gaps in the original definition of markdown, and it's never been updated. I think it's best to leave that complexity with a dedicated markdown parser. I don't mind the dependency too much. I think the main reason for avoiding the dependency is to reduce the size of the browser version of coffeescript. I think it's a super edge case for handling literate coffeescript in the browser, but if that's what is required, I think it should be handled by a markdown parser, and preferably one that could be swapped out with another in the future, or even better, one the user chooses. If I am writing literate coffee, I want to use my default markdown style (perhaps like stack overflow, or github) and use my markdown tools (which already have their own definition of code blocks etc...) and for coffee to just work.

I think marked is a good library, but looks like markdown-it has evolved into a better project, and I think that if there are issues with it's code block handling, it should be fixed in that project, and imported into coffeescript.

Copy link
Collaborator

@lydell lydell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any tests with nested lists, or list items containing several paragraphs and code blocks?

1. Some text

       code block

    more list item text

    1. Sub list

          code block

      code block?

(?:\t?|\ {0,3}) # Up to one tab, or up to three spaces, or neither;
(?:
[\*\-\+] | # followed by `*`, `-` or `+`;
[0-9]{1,9}\. # or by an integer up to 9 digits long, followed by a period;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious where you got this "up to 9 digits long" rule from? :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

)
[\ \t] # followed by a space or a tab.
///
listItemFirstCharacter = new RegExp "#{listItemStart.source}+(\\S)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be written as ///#{listItemStart.source}+(\S)///

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why is listItemStart matched one or more times (+)?

insideList = yes
# Get the first non-whitespace character after the bullet or period, to
# determine our new base indentation level.
indentation = line.indexOf line.match(listItemFirstCharacter)[1]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this crash if the listItemFirstCharacter doesn't match? I'm thinking that a line only containing a "listItemStart" but nothing after it could cause such a crash?

- 

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are looking for something more like:

match = listItemStart.exec line
if match
  indentation = line[match[0].length..].search /\S/
  # TODO: `.search` returns `-1` if nothing could be found

# Are we in a list, and indented relative to the indentation of the list?
# If so, treat as code.
if indented.test(line.substring(indentation))
out.push line.substring(indentation)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In CoffeeScript I'd say it's more common to use slicing syntax. line[indentation..]. Saves a trip to MDN.

@akfish
Copy link
Contributor

akfish commented Apr 16, 2017

@GeoffreyBooth We should notice that CommonMark is being widely adapted by the community and could become the de facto standard for Markdown one day. GitHub based its formal spec for GFM on it. It would be a good idea to base literate CoffeeScript upon it or use a compatible parser. Or at least document the difference if literate CoffeeScript choose not to implement some corner cases. I can help with adapting relevant CommonMark spec to test cases.

At the end of CommonMark Spec describes a reference parsing strategy. Maybe that would help.

Also maybe we could introduce a 'preprocess' stage in the compiler workflow and thus make it extendable. Then separate literate CoffeeScript to another package, like babel compiler did with its varies language features. This way we could keep the core compiler clean while being able to use external dependencies as we migrate to an in-house implementation.

@akfish
Copy link
Contributor

akfish commented Apr 16, 2017

I forked the CommonMark spec (https://github.com/akfish/CommonMark) and completed the first pass of adapting it to test literate CoffeeScript.

See the spec file or its diff for details. The format is an extended version of the original one that CommonMark uses.

The test plan is to automatically extract test cases from the spec file, compile the source of each test case with coffee and compare the result.

I noticed that Coffee V1 does not support fence code block. But I think it's more versatile than indented code blocks and is widely used. So I modified that part of the spec as well but noted it as a proposal.

As you can see the spec is very extensive and it could be a huge endeavor to fully support the spec with a custom parser.

@GeoffreyBooth
Copy link
Collaborator Author

So I’ve been thinking about this, in particular cases like this:

    if false
      a()

 1. Some list item
        b()

Without going to see how our various Literate CoffeeScript implementations would parse this, what would you expect to happen? Would b() be executed, or would it fall inside if false?

I think for Literate CoffeeScript (or any literate variant of a significant-whitespace language) to make logical sense, the indentation level needs to be consistent throughout all the code blocks. In this example, a() starts at column 6; so if there’s a b() in the next code block, it should be at column 6 to behave the same as a(), i.e. to be inside the same if block. To shift around the starting point for where we begin tracking indentation from, which is really what putting a code block inside a list item does, is a recipe for confusion.

And this gets to a larger point: Not all Markdown code blocks should be interpreted as executable code blocks. Blockquoted code blocks, in particular, make no sense to treat as executable code. Markdown is meant for displaying text, a more convenient HTML; it doesn’t care about the relative indentation levels of successive code blocks, because in Markdown’s final output those code blocks are meant to be read as text, not executed as code.

So if we accept the premise that not all Markdown code blocks should be treated as executable code blocks, that frees us from having to fully follow the Markdown spec. We need not catch all edge cases of where Markdown allows placement of code blocks. We can make our own definition of what an executable code block is, and everything outside of that is just Markdown, however the Markdown processor decides to parse it.

Here’s one possible definition of a literate executable code block:

  • it follows a blank line (or starts the file)
  • is indented by at least one tab or one space
  • doesn’t match the regex for a Markdown list item, i.e. * A or - A or + A or 1. A etc. with or without leading spaces

This is pretty much the 1.x definition, improved with the test for list item first line. Besides preventing code blocks in list items, it would also mean that list items couldn’t span multiple paragraphs. But I think this is better than trying to adjust the indentation level of code blocks inside list items. @jashkenas?

@billymoon
Copy link
Contributor

@GeoffreyBooth when I was thinking about illiterate, I wanted to be able to make a markdown document that I could paste into github, and stack overflow, and the parts that are displayed as code in those interfaces would make up the executable script so that describing a feature (github documentation) or describing a bug (stack overflow) or just writing code become all the same thing. I see great value in interoperability in that regard. I think that if you paste it into github, and it looks like code, it should be executed is a pretty good starting point to defined what code blocks are.

@akfish
Copy link
Contributor

akfish commented Apr 16, 2017

@GeoffreyBooth I was having having some similar reservations too when I saw cases of nested code blocks in the spec. There are many possible ways to write code with uneven indentations.

Unless it's intentional, no sane developer would write unreadable code like that. The intention could likely be self-torture, that we wouldn't really care. Or he/she/it is trying to construct a piece of potentially harmful code that looks innocent so that unsuspecting victims would execute it willingly.

So indeed we should come up with some conventions to restrict uneven indentations. We could:

  1. Use source indentation - force developers to write evenly indented code throughout an script at the cost of breaking some specs
  2. Use rendered indentation - parse the code as the spec suggests and use the generated indentation. The code will run the same as how it looks after rendered (like when posted on GitHub). But if the user doesn't have the habit of render and review the code before running it, he is still vunlerable. And the problem with render is, the way it looks can be manipulated with CSS.
  3. Do both but add a 'strict' option. Strict on == convention 1. Strict off == convention 2.

On second thought, conforming to the spec doesn't mean we have to accept all possible input the spec allows. It's part of compiler's job to reject ill-formed code. We only have to not to encourage developers to write Markdown that breaks the spec. And literate CoffeeScript clearly has smaller scope than that of CommonMark (computer code generation vs human-readable document generation). It makes sense to only support a subset of the spec.

@GeoffreyBooth
Copy link
Collaborator Author

@billymoon I think “if you paste it into github, and it looks like code, it should be executed” is a good rule to apply to a whitespace-insignificant language like JavaScript, where indentation doesn’t matter. But when indentation does matter, and code blocks aren’t separate modules but are just separate sections of one cohesive file, there needs to be a way to coordinate how separate code blocks relate to each other. I think in the case of CoffeeScript, the simplest solution is to just say that the only code blocks that are executed are top-level ones: they may be indented from the leftmost column, but they’re not further indented inside a list item or blockquote.

@akfish I think about it this way: Literate CoffeeScript (or literate anything) isn’t “breaking” the Markdown spec. It’s a superset of the Markdown spec, that says that certain code blocks are treated as executable code. That’s it. Other code blocks—in list items, in blockquotes, in triple-backtick fences—are left to be parsed by Markdown as text, like they would be in a normal Markdown file.

For the purposes of Literate CoffeeScript via the CoffeeScript compiler, I think we need to go with the simple, zero-configuration option. Perhaps Illiterate can offer configuration options where the user can decide whether the trickier code blocks should also be executable, and how to handle indentation between various code blocks. @billymoon I leave that challenge to you 😄 For people who want that level of control, they can use Illiterate and pipe its output through the CoffeeScript compiler.

@GeoffreyBooth
Copy link
Collaborator Author

@jashkenas do you still have the blog post you linked to from the CoffeeScript docs? I think it would be useful; perhaps we can add it to the wiki. How do you feel about the various suggestions on this thread?

@carlsmith
Copy link
Contributor

carlsmith commented Apr 17, 2017

Without a proper Markdown parser, how will indented code between HTML tags be handled? For example:

# Title

Some paragraph.

<div>

    <p>Will this be treated as executable CoffeeScript?</p>

</div>

If I remember correctly, this was broken before, but it was a long time ago, so I may be wrong, or it may have been fixed since. Note: It's the Markdown parser (or its configuration) that would have been to blame for this.

Anything between HTML tags should be ignored by the CoffeeScript compiler, and a compliant Markdown parser should pass it along verbatim. Otherwise, Literate CoffeeScript would have to disallow indentation (or empty lines) inside blocks of HTML.

EDIT: Just reread the original post, and saw this:

As has been the case in all the v2 versions of Literate CoffeeScript, simply wrapping a code block in HTML tags is insufficient to get it to be treated as a comment. The code block must also have at least one line unindented. -- @GeoffreyBooth

So, are we just accepting that this will not work?

@carlsmith
Copy link
Contributor

carlsmith commented Apr 17, 2017

if you paste it into github, and it looks like code, it should be executed -- @billymoon

There needs to be a way to distinguish between example code, and executable code. It makes no sense to allow users to document their code in Markdown, but not allow them to include example code there. Imagine documenting an API; you may want to include an example showing how to use that API. Example code showing how to use something from the command line is another common case where it looks like code on GitHub, but shouldn't be executed.

As we'll always want indented CoffeeScript to execute, it seems the only option is to never execute fenced code blocks.

@carlsmith
Copy link
Contributor

@GeoffreyBooth - It doesn't solve every issue, but for what it's worth, I don't personally think that requiring a blank line between Markdown and executable CoffeeScript is a problem at all.

@GeoffreyBooth
Copy link
Collaborator Author

@carlsmith HTML tags are no different than any other comment. If you want to make a “comment code block,” as opposed to an executable code block, start its first line at column 0. I think when we first started playing with Marked for CS2 it was thought that it might process HTML tags and therefore use them to decide when to not make a code block executable, and a test was written to that effect, but on closer inspection it worked as desired because of the column 0 thing and not because of the HTML tags. Your example above fails in both CS1 and CS2. See https://github.com/jashkenas/coffeescript/blob/2/test/literate.litcoffee.

@carlsmith
Copy link
Contributor

carlsmith commented Apr 17, 2017

Sorry dude. I'm completely confused.

If you want to make a “comment code block,” as opposed to an executable code block, start its first line at column 0

If a code example starts onside, it wouldn't be executed, but it also wouldn't be example code. It'll be parsed as a paragraph or something.

# A Markdown File

This is a regular paragraph.

square = (x) -> x * x # this is a paragraph (not example code) though it starts onside??

HTML tags are no different than any other comment.

HTML is different to other comments, as the CoffeeScript parser should ignore everything inside of HTML tags.

# A Markdown File

<div>

    <p>This paragraph, and the entire surrounding div, should be ignored by CoffeeScript,
    and any Markdown compiler should pass the div into the compiled HTML as is. The
    current proposal would cause this paragraph to be executed as CoffeeScript.</p>

</div>

Sorry if I'm just being dumb, and don't waste your time explaining this to me if I am. It's not actually important for me to get it, so long as you guys know what you're doing.

@carlsmith
Copy link
Contributor

Your example above fails in both CS1 and CS2.

In what way does it fail? I'm assuming the indented <p> element gets executed as CoffeeScript, which is wrong. It should be ignored by CoffeeScript. A standards compliant Markdown parser should pass the entire div through to the compiled HTML exactly as it's written.

@GeoffreyBooth
Copy link
Collaborator Author

In what way does it fail?

The compiler treats it as code and tries to execute it. Anything that’s indented is treated as code, period. The CoffeeScript 1.x compiler doesn’t currently give HTML tags special treatment.

I don’t know what the best solution is. Yes if we integrate Markdown-It or some other Markdown parser, presumably it might see the HTML tags and trigger treating them as a comment code block. But then we have to deal with it identifying code blocks inside list items and blockquotes, which will have skewed indentation (which is why I filtered those out in my version). Just using a Markdown processor is no panacea; they’re designed for formatting code blocks as HTML, not for preserving them for parsing. Marked, in particular, converts all tabs to four spaces, which is fine for display but terrible for parsing (since it screws up block strings). I’m not sure there’s an easy way in Markdown-It to detect whether a code block is within a list item or blockquote or not; also there would be no difference between regular indented code blocks (i.e. just indented four spaces) and fenced code blocks (between triple backticks). How would we deal with indentation between successive code blocks where one is indented four spaces and the other is fenced? Should the “zero column” start four spaces in for the indented-four-spaces block? This is the problem with using a Markdown processor.

Hence I’m leaning toward reverting more or less to the 1.x behavior, with as many minor improvements as we can give it. I thought about creating a new rule where two blank lines are necessary to end a list, which would allow multi-paragraph list items; but I feel like that’s too arcane for people to remember. I understand there are shortcomings to the 1.x approach and this PR, but I feel like all the alternatives we’ve tried so far introduce problems that are worse.

@carlsmith
Copy link
Contributor

I see what you mean now. Sorry for needing it spelt out. I was assuming a good Markdown parser would do everything we need, but you're right, it wouldn't. Short of writing a new parser from scratch, your approach seems to be the best option.

It's an interesting little problem, but I don't want to promise something and not deliver. It's best you do what you're planning for now. We can always improve the parser later. If it's only making more things possible, then it'll still be BC.

@akfish
Copy link
Contributor

akfish commented Apr 17, 2017

@GeoffreyBooth

I’m not sure there’s an easy way in Markdown-It to detect whether a code block is within a list item or blockquote or not; also there would be no difference between regular indented code blocks (i.e. just indented four spaces) and fenced code blocks (between triple backticks).

Actually you can. My current project is somewhat similar to literate coffee. It involves parsing Markdown and feed code blocks to compilers according to the code language (e.g. Babel and coffee). I'm also using the same technique in the upcoming automatic CommonMark test suit for coffee script. Take a look at MarkdownIt.prototype.parse method. It returns a token array.

@akfish
Copy link
Contributor

akfish commented Apr 17, 2017

Doc: https://markdown-it.github.io/markdown-it/#MarkdownIt.parse

Also https://markdown-it.github.io, at right column choose debug to see the token stream.

@jashkenas
Copy link
Owner

I think that @GeoffreyBooth has the right plan for this one figured out. Returning to "as sensible and simple as we can make it" seems like a fine way to go. Parsing the tricky corner cases in Markdown is about complicated corners of HTML -- not just detecting all of the top-level code we want to run.

But it seems regrettable that we went back and forth on this in the first place by introducing a markdown library, breaking some things, trying to upgrade the library, and now returning to an "as good and simple as we can make it" implementation.

Perhaps there is a lesson to be learned there.

@GeoffreyBooth
Copy link
Collaborator Author

@akfish is there a way to use Markdown-It and preserve consistent indentation across different code blocks?

I suspect there isn't, which is why I think we should revert. I think consistent indentation is more important than these markdown edge cases. The current Markdown-It implementation which has inconsistent indentation is a breaking change compared to CS1, and a devilish one to accommodate in any code it affects.

@GeoffreyBooth
Copy link
Collaborator Author

@lydell how does this PR look now?

@akfish
Copy link
Contributor

akfish commented Apr 17, 2017

@GeoffreyBooth For some situations it does.

Tab indented code block seems to be very consistent, like this simple example.

    biubiu (4 spaces)

Returns token:

  {
    "type": "code_block",
    "tag": "code",
    // ...
    "content": "biubiu (4 spaces)\n",
    // ...
  }
        biubiubiu (8 spaces)

Returns token:

  {
    "type": "code_block",
    "tag": "code",
    // ...
    "content": "    biubiubiu (8 spaces)",
    // ...
  }

It's always minus 4 spaces for each line of the block content, as the spec requires.

When nested with quote block, it's minus (4 spaces + 1 space) for the each line, for blockquote requires >[space] at the beginning.

>     4 + 1 spaces
>      5 + 1 biubiu
[
  {
    "type": "blockquote_open",
    "tag": "blockquote",
    //...
  },
  {
    "type": "code_block",
    "tag": "code",
    "attrs": null,
    "map": [
      0,
      2
    ],
    "nesting": 0,
    "level": 1,
    "children": null,
    "content": "4 + 1 spaces\n 5 + 1 biubiu\n",
    "markup": "",
    "info": "",
    "meta": null,
    "block": true,
    "hidden": false
  },
  {
    "type": "blockquote_close",
    "tag": "blockquote",
    //...
  }
]

I haven't played with list yet. But I think the situation should be similar, since the spec has strict rules on indentation and spaces.

It's when the code block's parent has leading spaces, things get messier. Those leading spaces are simply lost. But there's a workaround for that. The token.map is an array of start and end line number of each token (unfortunately no exact line-column information). It is easy to count the leading spaces by ourselves.

@GeoffreyBooth
Copy link
Collaborator Author

@akfish What I mean by consistent is that the code blocks’ indentation is consistent relative to each other and to the source. So if there’s a 4-space indented code block, the first “code” character is at column 4 (assuming the first column is column 0). If there’s a fenced code block, the first code character is at column 0.

Say a Literate CoffeeScript file looks like this:

Some commentary

    if false
      a() # Never runs

More commentary

# imagine triple backticks here (not sure how to type in GitHub)
      b()
# triple backticks

As a naive user, I would expect both a() and b() to fall under if false and neither to be executed. This is how I feel Literate CoffeeScript should behave, if it were to allow fenced code blocks. But Markdown-It will interpret a() as starting at column 2 and b() as starting at column 6, so this would get sent to the compiler like this:

if false
  a()
      b()

Which is clearly not what I intended, and which throws a compiler error. This is what I mean by inconsistent indentation. Things need to line up to the user, in the original source (in my opinion). I don’t see how to achieve that if we allow multiple types of code blocks (fenced, in list items, blockquoted, etc.).

So either we keep the “simple,” no-dependency implementation (this PR) or we update the Markdown-It implementation (what’s currently in 2.0.0-beta1) to somehow handle indentation the way I’m describing. I suppose the latter might be preferable, in that then we might get the benefit of handling HTML tags properly and other niceties; if it’s even possible to track indentation this way in Markdown-It or some other Markdown parser. Personally I think we should merge in this PR now, and if someone can achieve the same using a Markdown parser, that could be a subsequent PR and we can use that PR to debate the merits of what we gain from that approach.

@akfish
Copy link
Contributor

akfish commented Apr 17, 2017

Markdown-it parsed your example correctly I think.

The two code block tokens are:

  {
    "type": "code_block",
    "tag": "code",
    "attrs": null,
    "map": [
      2,
      4
    ],
    "nesting": 0,
    "level": 0,
    "children": null,
    "content": "if false\n  a() # Never runs\n",
    "markup": "",
    "info": "",
    "meta": null,
    "block": true,
    "hidden": false
  },
  //...
  {
    "type": "code_block",
    "tag": "code",
    "attrs": null,
    "map": [
      8,
      9
    ],
    "nesting": 0,
    "level": 0,
    "children": null,
    "content": "  b()\n",
    "markup": "",
    "info": "",
    "meta": null,
    "block": true,
    "hidden": false
  },

Put the two token's content is:

if false
  a() # Never runs
  b()

The top level code blocks should have consistent indentations, as the spec required (source space count - 4). Nested code blocks should have consistent indentations too, but requires some effort to figure it out.

@akfish
Copy link
Contributor

akfish commented Apr 17, 2017

Never mind my last comment. I just saw the imagine triple backticks thing (aka the fence block in spec).

Fence block handles indentation differently by design as the spec requires. That's expected behaivor.

@akfish
Copy link
Contributor

akfish commented Apr 17, 2017

FYI, fence block has different token type. It's easy to handle differently.

  {
    "type": "fence",
    "tag": "code",
    "attrs": null,
    "map": [
      7,
      10
    ],
    "nesting": 0,
    "level": 0,
    "children": null,
    "content": "      b()\n",
    "markup": "```",
    "info": "",
    "meta": null,
    "block": true,
    "hidden": false
  }

I agree that we should drop fence block support, but for different reason. Take a look at Example 100 of the spec:

  ```
aaa
  aaa
aaa
  ```

Yeilds:

<pre><code>aaa
aaa
aaa
</code></pre>

The indentation information is totally lost here.

@akfish
Copy link
Contributor

akfish commented Apr 17, 2017

image

image

Anyway, I completed the automatic test suite as a separate package. I made it flexible enough so that we can specify what spec should be adapted. Just edit the spec.txt you saw fit, or write a new one to add more tests.

@GeoffreyBooth
Copy link
Collaborator Author

@akfish take a look at the tests/literate.litcoffee and tests/literate_tabbed.litcoffee in this PR. As part of this PR I adjusted these tests to have consistent indentation, undoing the adjusted indentation that I had given them when adapting them for Markdown-It. These would be good sample files to use for testing.

Copy link
Contributor

@akfish akfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accroding to the spec of list items, we could never write a nested code block with zero indentation relative to column 0 (a.k.a 4 spaces) in it. It is at least 2 spaces too short.

...and M is a list marker of width W followed by 1 ≤ N ≤ 4 spaces, then the result of prepending M and the following spaces to the first line of Ls, and indenting subsequent lines of Ls by W + N spaces,

Since there's no way to write consistence code blocks within a list item (as well as in a blockquote), we could just ban nested code block (don't execute it). With top level code blocks being consistent, list item and blockquote as the only container blocks that are possible to nest a code block (list can only nests fence block), all our troubles would go away.

I'm no expert on RegExp, but shouldn't it be relatively easy to detect whether an indented block is within a list item?


test "basic literate CoffeeScript parsing", ->
ok yes
test "basic literate CoffeeScript parsing", ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This renders as:


  • A list item followed by a code block:

    test "basic literate CoffeeScript parsing", ->
    ok yes
    testsCount++


It should not be parsed as a code block.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be an executable code block, because it’s indented and it follows a blank line. That’s the rule I came up with 😉 That’s also why list items cannot be longer than one paragraph, because there’s no way to distinguish the second paragraph of a list item from a code block.

Copy link
Contributor

@akfish akfish Apr 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that breaks the spec. Wouldn't it be easier to just ban nested code block? Since there's no way to write one that doesn't break the spec, while keeping the indentation consistent.

Edit: grammar

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s not nested. Imagine there was a --- or another regular unindented paragraph between the list item line and the code block.

Copy link
Contributor

@akfish akfish Apr 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Custom rules are fine. As long as they are rendered correctly by most popular tools.

Imagine the security risk behind this. A user see something on GitHub like this:


fetch_me_cute_cats()
  • Don't worry

    because_this_is_not_code_and_will_totally_not_destroy_your_computer_or_anything()


But literate Coffee sees it like this:


fetch_me_cute_cats()
  • Don't worry

    because_this_is_not_code_and_will_totally_not_destroy_your_computer_or_anything()
    

And the user happily executes it because why wouldn't he trust GitHub's render?
The next day the user buys a new computer just so that he could submit an issue filled with anger here.

A little too dramatic maybe, but you get the idea.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you’re relying on Markdown syntax to protect you from not executing code, you deserve what you get.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The technical term is "smaller attack surface". And reducing attack surface is called being responsible, especially for a high profile project like this. With tons of noob developers and mis-configured servers running node process with root-level access out there, you can never be too careful.

This's only my opinion and it's up to you guys to decide. I won't push for it further. :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, but if someone is executing a .litcoffee file, well, they’re executing code. They shouldn’t be surprised if code blocks execute, even if they mistakenly presume a particular code block is a commentary code block. People who have code-execution rights need to have some responsibility for what they’re doing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that whoever would fall for this most probably deserves what he gets one way or another. But we human invented technologies and machines to reduce human-error. Why give people a footgun if one can help it?

# Conflicts:
#	docs/v2/index.html
#	documentation/sections/literate.md
@GeoffreyBooth
Copy link
Collaborator Author

I’d like to get to a resolution here. Leaving aside the merits of the various alternatives, is this PR okay to merge in? @lydell?

@akfish or @billymoon, I think the way forward for an alternative that uses a Markdown parsing library would be to start a new PR (ideally) or issue where we can discuss that approach, and how it would deal with the issues discussed on this thread; and what is to be gained by that approach over this one. If we can find a solution that can handle consistent indentation and doesn’t confuse matters between executable and commentary code blocks, and supports Markdown features better than this PR does, then such a solution would be worth switching to. In the meantime I think this PR is better than what’s in 2.0.0-beta1.

@lydell
Copy link
Collaborator

lydell commented Apr 18, 2017

I don't have any further notes on this PR.

@billymoon
Copy link
Contributor

@GeoffreyBooth do the tests in this PR cover all scenarios discussed... as in, if I were to have a go at using an external parser, can I consider the job done if the tests pass?

@jashkenas
Copy link
Owner

Lovely progress. Bravo, @GeoffreyBooth.

@akfish
Copy link
Contributor

akfish commented Apr 18, 2017

Fine with me. It would be better to discus issues mentioned above in separate PR/issues.
Two thoughts:

  • We should at least keep a branch with external parser implementation as a reference. I could send a PR in the future if you like, since I implemented something similar to literate Coffee with Markdown-it in my project.
  • This just came to my mind. With current implementation, it seems that literate Coffee has no valid source map support, since no mapping information is preserved.

@GeoffreyBooth
Copy link
Collaborator Author

@billymoon Yes, but only if you aren’t expanding the spec of what code blocks we support. This PR is explicit about limiting the definition of code block to be anything indented after a blank line. I’m assuming you would want to add support for fenced code blocks and possibly code blocks in list items, etc. If you do that, I would consider your version a “pass” if those new code block types are parsed with what I’ve been calling consistent indentation (see my example above) and you add tests for them. And, of course, the current tests still pass, or you can make a case for why they shouldn’t (i.e. why we should make a breaking change here). Originally we were going to introduce a breaking change with Literate CoffeeScript regarding indentation so that these other code block types could be supported, but I came to the conclusion that that was too confusing and not worth the bugs that would be caused.

If you don’t expand the types of code blocks supported, then you need to filter out these other code block types so that they’re treated as comments; and you need to make a case for how your version improves upon this PR’s implementation. Maybe that improvement is simply that HTML tags are supported, or that list item paragraphs are supported; I’m sure there probably are a few arguments to be made. Then we’ll need to debate the tradeoffs, if there are any.

@akfish The current implementation, and the implementation in 2.0.0-beta1, do have source map support. Both versions preserve the same number of lines as the original when stripping out the comments before sending things into the compiler, and the compiler should generate source maps as usual. If you want to make a new branch for a Markdown parser-based implementation, you might want to branch off of 2.0.0-beta1. The litcoffee-bug-illiterate branch on my repo is another possible place to start (you should fork that into your repo soon, in case I delete it at some point).

@akfish
Copy link
Contributor

akfish commented Apr 18, 2017

@GeoffreyBooth I see. Thanks for the tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants