parse5 and streaming #26

jeromew · 2014-10-08T15:44:27Z

Is there a way to pipe a file into parse5 (this feature is available in htmlparser2).

similarly, is there a way to pause / resume the sax parser ? I believe that the getNextToken method could be used to decide when the parsing should pause/resume.

I have been using the html-tokenize streaming parser lately and its suite (html-select, trumpet) but reaching html5 conformity on html-tokenize is still a long way to go so I am trying to see how parse5 could fit in and be used with html-select

The text was updated successfully, but these errors were encountered:

jeromew · 2014-10-08T15:51:33Z

the Tokenizer and Preprocessor take the html content as one big chunk. Have you considered the possibility of working on partial chunks that would be written one after the other ?

parser.write('<div>')
parser.write('hello')
parser.write('</div>')
parser.end();

inikulin · 2014-10-08T17:01:27Z

Hi,
currently parse5 doesn't support streaming. However we've discussed possibility of it's implementation with jsdom maintainers in context of this thread.

I would like to know which goal you would like to achieve using streaming? If it's primary focus is performance then I would like to warn you that I find it a little bit questionable. For streaming we need to teach tokenizer to invalidate non-emitted tokens if end of chunk was encountered. This behavior requires introduction of the tokenizer state snapshots mechanism: if we have invalidated token we need to rollback tokenizer to the last valid state and retreat preprocessor to the point there last valid token was emitted. Since tokenizer and preprocessor are the most performance-sensitive parts of parse5 this may end up with significant performance degradation, so you can lose more than you win.

Long story short. I definetly would like to see streaming API in parse5 too. But it requires quite complex research and I'm afraid I will not be able to get my hands on this soon, since we need to land more important features (like <template> support) first and I'm already suffering from the lack of the spare time. Therefore, any PR on this task will be highly appreciated.

jeromew · 2014-10-08T17:13:13Z

Ok I see thanks for your feedback & for the link to the other thread. I agree that there is a big question mark on the result of all this performance wise.

I'll start my research with understanding the caveats of the html5 spec and try to understand how the approach taken by html-tokenize is compatible or not with all this.

stevenvachon · 2014-12-29T21:55:59Z

~~Streaming would be useful in parsing HTML that contains template logic such as handlebars/etc.~~

inikulin · 2015-01-03T17:13:32Z

@stevenvachon Sounds interesting. Can you describe this scenario in details, please? How streaming can help here?

stevenvachon · 2015-01-05T21:40:06Z

@inikulin Having worked on my idea further, I realize that I was mistaken. ~~Instead, #33 makes more sense.~~

stevenvachon · 2015-01-10T18:06:35Z

The biggest benefit to streaming is in memory usage and garbage collection, not parse performance. Not having to parse an entire html file in a list of thousands has great benefits.

aredridel · 2015-01-16T16:13:22Z

Also having a DOM to start working on before all the data has arrived.

stevenvachon · 2015-06-11T15:41:09Z

Any progress on this at all?

angelozerr · 2015-06-12T07:19:49Z

In my case I'm using https://github.com/isaacs/sax-js and it work's great. The feature with sax-js that I needed where:

speed
parse an HTML even if it's not valid (a XML element which is not closed for instance)
retrieve location of each elements, attributes
works with web browser (without using browserify)

And sax-js provides me those features (for speed, I have not benchmark).

I tell me if it's a good idea for parse5 to spend time to implement a new sax parser although sax-js works great.

But perhaps I have missed something?

inikulin · 2015-06-15T10:25:10Z

@angelozerr sax-js is not the HTML parser.
@stevenvachon I will be able to get my hands on it not earlier than end of July/August. Any PR is still welcome. If someone would like to start implementing it, then you can count on my assistance.

angelozerr · 2015-06-15T10:29:23Z

@angelozerr sax-js is not the HTML parser.

You could implement sax-js callback to create DOM node and having an HTML Document, no?

inikulin · 2015-06-15T11:02:55Z

@angelozerr No, HTML parser requires html preprocessing/tokenization/parsing algorithms.

inikulin · 2015-06-18T20:27:43Z

Ok, good news, everyone. I finally figured out how it should be done and parse5 will receive streaming support in the near future. One thing that I would like to set for the discussion agenda is the API. Should ParserStream and SerializerStream extend node's WritableStream and ReadableStream respectively, or it should just support stream-like API like htmlparser2? How you will obtain resulting AST, should ParserStream.end() return it? Or we should give user access to the unfinished AST via the property? The unsophisticated modification of the unfinished AST can brake parser (since we don't have DOM which guarantees that AST modifications will not lead to the malformed tree). This bothers me a little bit. Any ideas or suggestions?

aredridel · 2015-06-18T21:38:10Z

Actually extending streams is awesome. Then you can pipe into it and it Just Works. stream-like but not actually streams give you such fun edge cases.

aredridel · 2015-06-18T21:38:32Z

unfinished AST is definitely useful, but could be given with a stern warning.

stevenvachon · 2015-06-18T21:42:36Z

~~Unfinished AST is ideal for me, but what kind of malformed tree could we end up with? Why not delay certain elements until more information is available? An example stream could be:~~

<p>asdfasdfasdf

followed by

asdfasasdf<p>asdfasdfasdf</p>

~~The parser wouldn't yet know if the first <p> is a void element or not. Instead of emitting it, the parser could hang onto it until it has access to future closing tags (or some kind of end()).~~

Actually... that creates a problem, doesn't it, because <html> wouldn't be emitted until the whole thing is done.

jeromew · 2015-06-18T21:47:33Z

+1 for extending streams instead of home-cooked streams

I don't know if it is possible but having a streaming solution inspired on html-tokenize that would push data as soon as is it guaranteed to be "correctly" understood would be nice.

I already tried to do such a thing but was stopped in my progress. The code is just horrible and I was blocked because it needs a rewrite of html-tokenize to be more html5 conformant - https://github.com/jeromew/html-nest

there definitely is a problem with the html attributes because the spec says that additional html elements should push their attributes on the first one...this is not stream friendly at all ! Other than that I felt that in most scenario, we should not have to withhold tokens for too long in memory.

this is what I found during this experiment

There will always be an html element in the tree. The specification states that if an html opening tags is found in the 'in body' insertion mode, its attributes should extend the attributes of the first html element. Doing this would basically buffer the whole document in memory. A way to mitigate this could be to send provisional html tags
The table 'foster parenting' algorithm states that if we find elements inside a table that have nothing to do in the table, they should be reparented just before the table. In order to do this, we have to buffer the tables
The misnested tags are rectified by the 'adjacency adoption algorithm'. This algorithm tracks some formating elements (b, i, ..) and re-organizes locally the elements when a misnesting is detected. Sometimes the re-organization is triggered after tokens have already been processed. In order to follow this algorithm, we have to buffer tokens during formatting sections.

stevenvachon · 2015-06-18T21:50:00Z

Regarding @aredridel's mention of a "stern warning" -- also include technical reasons as to why there is a warning. It would do two things: 1, thwart incorrect use; 2, enlighten us on what issues to avoid.

sideshowbarker · 2015-07-14T02:19:23Z

I seriously question whether it’s actually possible in practice to implement a fully spec-conformant streaming parser without it needing to also buffer the entire document. It seems that no matter what partial-buffering strategy you try, you can often end up needing to buffer such a large part of the document anyway that you don’t gain much versus just buffering the whole document to begin with.

The table 'foster parenting' algorithm states that if we find elements inside a table that have nothing to do in the table, they should be reparented just before the table. In order to do this, we have to buffer the tables.

Yeah, you always must buffer all tables entirely. And if consider how many documents out there are largely made of tables (e.g., pages that use tables for layout), in a lot of cases you’re going to need to end up buffering the majority of the document anyway.

The specification states that if an html opening tags is found in the 'in body' insertion mode, its attributes should extend the attributes of the first html element. Doing this would basically buffer the whole document in memory.

Yeah. I don’t understand how it’s possible to get around that without requiring all the consumers of your streaming parser to implement some kind of special handling for html elements.

A way to mitigate this could be to send provisional html tags

I don’t think I understand clearly what you mean by that.

sideshowbarker · 2015-07-14T02:24:01Z

Pinging @gsnedders (one of the html5lib devs) who might have some additional thoughts on this (though I doubt he can make time right now to respond here, and anyway my comments in #26 (comment) reflect a discussion I just now had with him on #whatwg IRC ).

inikulin · 2015-07-14T08:09:17Z

@sideshowbarker

It's possible. Because:

In order to do this, we have to buffer the tables.

We shouldn't. We just walk up the stack of open elements and search for the table's parent:
https://github.com/inikulin/parse5/blob/master/lib/tree_construction/parser.js#L823

Doing this would basically buffer the whole document in memory.

Nope. Since we are in body we already have html element at the bottom of the stack of open elements.

Parser don't use token lookahead. So, the the bulk of the changes goes to tokenizer. We will make snapshots of the tokenizer state after each token emission. Then if we meet end-of-chunk, we invalidate last token, rollback to the last snapshot and suspend the parser. The next call to write() resumes the parser. This is it.

sideshowbarker · 2015-07-14T08:24:04Z

@inikulin thanks—looking back at the comments in this issue I see now I misunderstood what it’s about…

inikulin · 2015-07-14T12:41:56Z

@sideshowbarker Ah, I see there this discussion started: servo/html5ever#149

My conclusion is that you can't have full spec compilant parsing without buffering already produced DOM-tree. In our case it's not an issue. parse5 has SAX-style parser, but it behaves more like tokenizer with the simulated parser feedback (CDATA parsing flag, switch text parsing modes).

sideshowbarker · 2015-07-14T13:07:23Z

@sideshowbarker Ah, I see there this discussion started: servo/html5ever#149

Yup

My conclusion is that you can't have full spec compliant parsing without buffering already produced DOM-tree.

Yeah, if you're going to conform to, e.g., the adoption-agency requirements and foster-parenting requirements and the special case of the html start tag, I guess you pretty much can't do that without building some kind of full tree. It may not actually be a full "DOM" as such, but I guess it's going to end up being effectively the same regardless of what you call it.

E.g., the code in https://github.com/validator/htmlparser/tree/master/src/nu/validator/saxtree provides an event-based SAX API by building something that code actually calls a "SAX tree"...

inikulin · 2015-07-14T13:12:26Z

But I still can't figure out use cases for such approach. Using SAX parser most likely you will not need information about position of the element in the DOM-tree (It becomes even more absurd if you don't have DOM-tree).

sideshowbarker · 2015-07-14T13:24:06Z

But I still can't figure out use cases for such approach. Using SAX parser most likely you will not need information about position of the element in the DOM-tree (It becomes even more absurd if you don't have DOM-tree).

Yeah, I’ve never used that API. Instead I have used the fully-streaming SAX API that code also provides. That streaming API builds no tree and produces a fatal error for any markup cases that would require the adoption-agency algorithm or foster-parenting algorithm. That streaming API is actually what validator.nu and the W3C Nu Html Checker use. Those also report all the parse errors to the end user, so doing the streaming-but-stop-with-error-message-for-fatal-parse-errors thing makes sense in that context.

inikulin · 2015-07-31T19:16:30Z

Ok, here is the fundamental question: should we abandon non-streaming API and release 2.0 or keep non-streaming API as well? If we will keep non-streaming at will be messed up in my taste:
Parser
SAXParser
Serializer
ParserStream
SAXParserStream
SerializerStream

I would like to expose Tokenizer, TokenSerializer and co-called ForgivenParser (like htmlparser2 but with the proper spec-compatible tokenization, integration points, etc.). With streaming alternatives for each of them it will be a real mess.
Please, let me know what do you think.

stevenvachon · 2015-07-31T19:23:28Z

I don't see why we would need non-streaming anymore. htmlparser2 is stream-only with pseudo-non-streaming via a single write() call.

Totally awesome on the ForgivenParser! Though, perhaps rename it to ForgivingParser?

inikulin · 2015-07-31T19:24:50Z

@stevenvachon Yep, it's should be ForgivingParser, just my typo

stevenvachon · 2015-07-31T19:25:26Z

Totally looking forward to this. Will <{{tag}}> and <tag {{attr}}> be possible?

inikulin · 2015-07-31T19:29:50Z

@stevenvachon nope, we will still use current tokenizer and current HTML5 lexical grammar doesn't allow such constructs.

domenic · 2015-07-31T19:34:11Z

@inikulin I'll defer to @Sebmaster but I think we just want JsDomParser to stay around, i.e. for streaming to primarily be via document.write calls.

inikulin · 2015-07-31T19:38:34Z

@domenic Ok, I'll keep it. However, handling both document.write and streaming will be an interesting task. So, i think there will be a huge rewrite of the curent JSDOMParser code anyway.

stevenvachon · 2015-07-31T20:38:00Z

@inikulin what is forgiving like htmlparser2, then?

inikulin · 2015-08-04T15:03:13Z

@inikulin what is forgiving like htmlparser2, then?

Huh, seems like I'm starting to remember why I initially was against 'forgiving' parsing in parse5. Because there is no exact definition of the 'forgiving parsing'. Therefore people will always be somehow unhappy with this thing.
Conclusion: cancelled 🎉

stevenvachon · 2015-08-04T15:10:47Z

lol, ok 😢

Remove unnecessary buffer flush in amb amp state

stevenvachon mentioned this issue Apr 4, 2015

well-known self-closing tags not parsed as such #39

Closed

inikulin mentioned this issue Jun 1, 2015

Brainstorming: what would a modernized jsdom API look like? jsdom/jsdom#1139

Closed

sideshowbarker mentioned this issue Jul 14, 2015

Add a method in TreeSink trait which will be called when an element ends servo/html5ever#149

Open

RReverser mentioned this issue Aug 4, 2015

stop parse5.SimpleApiParser #47

Closed

inikulin added this to the 2.0 milestone Aug 5, 2015

inikulin added enhancement proposal breaking hard labels Aug 11, 2015

inikulin closed this as completed in b71df26 Aug 25, 2015

inikulin added a commit that referenced this issue Apr 16, 2018

Merge pull request #26 from HTMLParseErrorWG/fix-amb-amp

4c1fd18

Remove unnecessary buffer flush in amb amp state

43081j pushed a commit to 43081j/parse5 that referenced this issue Dec 28, 2021

refactor: Use unions to describe DOM (inikulin#26)

21e5fa6

parse5 and streaming #26

parse5 and streaming #26

Comments

jeromew commented Oct 8, 2014

jeromew commented Oct 8, 2014

inikulin commented Oct 8, 2014

jeromew commented Oct 8, 2014

stevenvachon commented Dec 29, 2014

inikulin commented Jan 3, 2015

stevenvachon commented Jan 5, 2015

stevenvachon commented Jan 10, 2015

aredridel commented Jan 16, 2015

stevenvachon commented Jun 11, 2015

angelozerr commented Jun 12, 2015

inikulin commented Jun 15, 2015

angelozerr commented Jun 15, 2015

inikulin commented Jun 15, 2015

inikulin commented Jun 18, 2015

aredridel commented Jun 18, 2015

aredridel commented Jun 18, 2015

stevenvachon commented Jun 18, 2015

jeromew commented Jun 18, 2015

stevenvachon commented Jun 18, 2015

sideshowbarker commented Jul 14, 2015

sideshowbarker commented Jul 14, 2015

inikulin commented Jul 14, 2015

sideshowbarker commented Jul 14, 2015

inikulin commented Jul 14, 2015

sideshowbarker commented Jul 14, 2015

inikulin commented Jul 14, 2015

sideshowbarker commented Jul 14, 2015

inikulin commented Jul 31, 2015

stevenvachon commented Jul 31, 2015

inikulin commented Jul 31, 2015

stevenvachon commented Jul 31, 2015

inikulin commented Jul 31, 2015

domenic commented Jul 31, 2015

inikulin commented Jul 31, 2015

stevenvachon commented Jul 31, 2015

inikulin commented Aug 4, 2015

stevenvachon commented Aug 4, 2015