More forgiving tag names #33

Closed
stevenvachon opened this issue Jan 5, 2015 · 20 comments

Comments

@stevenvachon
Contributor

<{{tag}}>asdf</{{tag}}>

is currently parsed as text.

I realize that this is not standard HTML5, but it'd be nice to benefit from many of this lib's features when parsing HTML variants such as Handlebars templates.
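For context, a minimal sketch of why the opening `<{{tag}}>` ends up as text: in the HTML tokenizer, a `<` only starts a tag when the next character is an ASCII letter. The function below is a simplified, illustrative model of that decision; `classifyAfterLessThan` and its return labels are hypothetical names, not part of parse5's API.

```javascript
// Simplified model of the HTML tokenizer's decision after seeing '<'.
// All names here are illustrative and not part of parse5.
const isAsciiAlpha = (ch) => /^[A-Za-z]$/.test(ch);

function classifyAfterLessThan(input, pos) {
  // `pos` is assumed to point at a '<' character in `input`.
  const next = input[pos + 1];
  if (next === undefined) return 'text';           // '<' at end of input
  if (next === '!') return 'markup-declaration';   // comment, doctype, CDATA
  if (next === '?') return 'bogus-comment';
  if (next === '/') {
    const afterSlash = input[pos + 2];
    if (afterSlash === undefined) return 'text';   // '</' at end of input
    if (isAsciiAlpha(afterSlash)) return 'end-tag';
    if (afterSlash === '>') return 'ignored';      // '</>' produces no token
    return 'bogus-comment';
  }
  // Only an ASCII letter may begin a tag name; anything else leaves '<' as text.
  return isAsciiAlpha(next) ? 'start-tag' : 'text';
}

const source = '<{{tag}}>asdf</{{tag}}>';
console.log(classifyAfterLessThan(source, 0));   // prints: text
console.log(classifyAfterLessThan('<div>', 0));  // prints: start-tag
```

In this model the closing `</{{tag}}>` takes the bogus-comment path rather than the text path, which is one reason a templating placeholder in tag-name position cannot round-trip through a spec-compliant tokenizer.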

@inikulin
Owner

I'm not sure this can be implemented, because with non-standard tag names you get a completely different grammar.

@stevenvachon
Contributor Author

I'm not sure what you mean by "different grammar".

How about adding some flag to allow it, or perhaps going as far as making the parser customizable like skunks?

@inikulin
Owner

I mean formal grammar.

@stevenvachon
Contributor Author

Ok, but what if that grammar were customizable? This brings us back to skunks.

@inikulin
Owner

Then you will end up with a parser generator. It looks like Skunks is a so-called 'forgiving' tokenizer, while parse5 was designed to be a precise, spec-compatible parser.

@stevenvachon
Contributor Author

Would it be silly to make parse5 both forgiving and unforgiving?

@inikulin
Owner

A spec-compatible parser is a parser that can parse HTML; a "forgiving" parser is a parser that can "somehow parse some subset of HTML". Having both in one package doesn't make sense to me.

@stevenvachon
Contributor Author

Ok, thank you.

My reasoning would be for parsing spec-compatible HTML (nested comments, checked="checked", etc) but with non-spec additions. htmlparser2 is a good "forgiving" parser, but it has trouble in some areas. Handlebars is a good example of the need for both forgiving and spec-compatible, without having to venture into a completely custom parser.

@nylen

nylen commented Apr 11, 2017

Hi - I'm also quite interested in this, in this case for a new iteration of the WordPress editor that we're working on. We plan to use HTML comments as a "pseudo-block-tag" to store post content, and these "pseudo-tags" will contain HTML content inside of them. Here's an example.

I'm curious about your thoughts on how to parse a structure like this. It's not too different from Handlebars templates, but I agree that the robustness of parse5 would be a big benefit. @stevenvachon what did you end up doing here?

@stevenvachon
Contributor Author

@nylen I'd switched projects and haven't yet gotten around to solving this one.

@inikulin
Owner

@nylen We don't have any plans to support grammars other than HTML. I recommend sticking with another project or creating a fork of parse5 that supports your custom grammar.

@inikulin
Owner

@nylen

> We plan to use HTML comments as a "pseudo-block-tag" to store post content, and these "pseudo-tags" will contain HTML content inside of them.

Why not use custom HTML elements for this purpose? E.g. <wordpress-post-content>

@nylen

nylen commented Apr 11, 2017

> Why not use custom HTML elements for this purpose?

We want to preserve the structure of existing markup as much as possible, so that browsers will render it correctly without modification. We also want to avoid adding extra container tags because this will break things like CSS rules that apply to specific sections of post content.

A bit more context about the parsing specifically and what I would like to achieve there: WordPress/gutenberg#391

@inikulin
Owner

@nylen It would be nice to have some context before giving any advice. What's the lifecycle of these "pseudo-block-tags": who creates them, how are they processed, how is their content displayed, is any sanity check required for the content, etc.?

@nylen

nylen commented Apr 11, 2017

Probably the easiest way to explain that is to point you to one of our prototypes: https://wordpress.github.io/gutenberg/tinymce-per-block/

There's a lot of needed functionality/UX missing from the prototype, but the basic idea is there: to re-work editing a WordPress post into editing a series of "blocks". These "blocks" will be delimited by HTML comments. You can see how this is serialized by clicking the "Html" button. However, block delimiters have changed since then, to be more robust and look as follows:

<!-- wp:core/text -->
Welcome to WordPress. This is your first post. Edit or delete it, then start writing!
<!-- /wp:core/text -->
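
As a rough illustration of how such delimiters could be located, here is a minimal sketch that scans a string for top-level `<!-- wp:name -->` / `<!-- /wp:name -->` pairs using plain string matching. `findBlocks` is a hypothetical helper, not Gutenberg or parse5 code, and it assumes the exact spacing shown above and no nested blocks of the same name.

```javascript
// Hypothetical helper: finds `<!-- wp:name --> ... <!-- /wp:name -->` pairs.
// Assumes the delimiter spacing shown above and no same-name nesting.
function findBlocks(html) {
  // Matches only openers; closers have a '/' before 'wp:' and do not match.
  const opener = /<!--\s+wp:([a-z][a-z0-9\/-]*)\s+-->/g;
  const blocks = [];
  let match;
  while ((match = opener.exec(html)) !== null) {
    const name = match[1];
    const closer = `<!-- /wp:${name} -->`;
    const contentStart = match.index + match[0].length;
    const contentEnd = html.indexOf(closer, contentStart);
    if (contentEnd === -1) continue; // unmatched opener: skip it
    blocks.push({
      name,
      contentStart,
      contentEnd,
      content: html.slice(contentStart, contentEnd),
    });
    // Resume scanning after the closing delimiter.
    opener.lastIndex = contentEnd + closer.length;
  }
  return blocks;
}

const post = '<!-- wp:core/text -->\nWelcome to WordPress.\n<!-- /wp:core/text -->';
const blocks = findBlocks(post);
console.log(blocks[0].name); // prints: core/text
```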

If you're interested in reading further, I'd recommend taking a look at the links in the Overview section of our project readme.

@inikulin
Owner

> There's a lot of needed functionality/UX missing from the prototype, but the basic idea is there: to re-work editing a WordPress post into editing a series of "blocks". These "blocks" will be delimited by HTML comments. You can see how this is serialized by clicking the "Html" button

Seems like I've got it. As I recall, there was something similar on Tumblr.

@inikulin
Owner

Well, if those parts are always edited separately, then the workflow is quite simple: parse the document, get all child nodes between the matching comment nodes, serialize them, and dump them to the editor. On save, parse the given fragment with parseFragment (this will automatically strip unwanted elements like <head>), perform sanity checks if necessary, then insert those nodes into the parsed document and serialize it.
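
The "get all child nodes between matching comment nodes" step might look roughly like the sketch below. It works on a plain array of sibling nodes and assumes they have the shape of parse5's default tree (comment nodes with `nodeName` of `'#comment'` and their text in `data`); `nodesBetweenComments` is a hypothetical helper, and the sample nodes are hand-built rather than produced by a real parse.

```javascript
// Hypothetical helper: returns the siblings between `<!-- wp:name -->`
// and `<!-- /wp:name -->`, or null if either delimiter is missing.
// Assumes comment nodes look like { nodeName: '#comment', data: '...' }.
function nodesBetweenComments(childNodes, name) {
  const isComment = (node, text) =>
    node.nodeName === '#comment' && node.data.trim() === text;
  const start = childNodes.findIndex((n) => isComment(n, `wp:${name}`));
  if (start === -1) return null;
  const end = childNodes.findIndex(
    (n, i) => i > start && isComment(n, `/wp:${name}`)
  );
  if (end === -1) return null;
  return childNodes.slice(start + 1, end);
}

// Hand-built siblings standing in for a parsed document's child nodes.
const siblings = [
  { nodeName: '#comment', data: ' wp:core/text ' },
  { nodeName: 'p', childNodes: [] },
  { nodeName: '#text', value: '\n' },
  { nodeName: '#comment', data: ' /wp:core/text ' },
];
const inner = nodesBetweenComments(siblings, 'core/text');
console.log(inner.map((n) => n.nodeName)); // prints: [ 'p', '#text' ]
```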

@inikulin
Owner

Or, even better:

  • Use SAXParser with location info enabled to get the location of the content between the two matching comments for the required section. (You can stop parsing once you have found what you need.)

  • Dump the found substring to the editor.

  • On save, parse the given fragment with parseFragment (this will automatically strip unwanted elements like <head>) and perform sanity checks if necessary.

  • Insert the new content in place of the substring that was obtained earlier.
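
The substring-based steps above can be sketched as follows. This is only an approximation: it finds the offsets with plain `indexOf` on the exact delimiter strings instead of SAXParser location info, and it omits the parseFragment sanity-check step. `locateBlockContent` and `replaceBlockContent` are hypothetical names.

```javascript
// Hypothetical helpers. A plain string search stands in for the
// SAX parser's location info; delimiter spacing is assumed exact.
function locateBlockContent(html, name) {
  const open = `<!-- wp:${name} -->`;
  const close = `<!-- /wp:${name} -->`;
  const openAt = html.indexOf(open);
  if (openAt === -1) return null;
  const start = openAt + open.length;
  const end = html.indexOf(close, start);
  return end === -1 ? null : { start, end };
}

function replaceBlockContent(html, name, newContent) {
  const range = locateBlockContent(html, name);
  if (range === null) return html; // block not found: leave input unchanged
  return html.slice(0, range.start) + newContent + html.slice(range.end);
}

const post = '<!-- wp:core/text -->\nOld text\n<!-- /wp:core/text -->';
const range = locateBlockContent(post, 'core/text');
const forEditor = post.slice(range.start, range.end); // substring sent to the editor
const saved = replaceBlockContent(post, 'core/text', '\nNew text\n');
console.log(saved);
```

Working on offsets keeps everything outside the edited block byte-for-byte identical, which fits the goal of preserving existing markup.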

@nylen

nylen commented Apr 11, 2017

Rather than just extracting substrings, I think parsing the HTML inside of the block delimiters is part of the task. We expect to have many different types of blocks, including implementations by third-party code via plugins. It seems much better to me to provide a "recommended" way to handle parsing and verify that the markup inside of a block is actually valid for that block type, providing a fallback otherwise.

I had hoped to achieve this in a single parsing step by extending a library like parse5, but that may not be possible. Another reason for this is that there are also other considerations - WordPress post content can contain shortcodes, yet another type of tag which needs another grammar extension. Eventually we'd like to detect these and transparently upgrade them to the equivalent "block" representation.

@inikulin
Owner

@nylen Let me know if you'll need any assistance.
