parse5 and streaming #26
Comments
Hi, I would like to know which goal you want to achieve using streaming. If its primary focus is performance, then I would like to warn you that I find it a little bit questionable. For streaming we need to teach the tokenizer to invalidate non-emitted tokens if the end of a chunk was encountered. This behavior requires introducing a tokenizer state snapshot mechanism: if we have an invalidated token, we need to roll back the tokenizer to the last valid state and retreat the preprocessor to the point where the last valid token was emitted. Since the tokenizer and preprocessor are the most performance-sensitive parts of parse5, this may end up with significant performance degradation, so you can lose more than you win. Long story short: I definitely would like to see a streaming API in parse5 too. But it requires quite complex research and I'm afraid I will not be able to get my hands on this soon, since we need to land more important features (like |
Ok I see thanks for your feedback & for the link to the other thread. I agree that there is a big question mark on the result of all this performance wise. I'll start my research with understanding the caveats of the html5 spec and try to understand how the approach taken by |
|
@stevenvachon Sounds interesting. Can you describe this scenario in detail, please? How can streaming help here? |
The biggest benefit to streaming is in memory usage and garbage collection, not parse performance. Not having to parse an entire html file in a list of thousands has great benefits. |
Also having a DOM to start working on before all the data has arrived. |
Any progress on this at all? |
In my case I'm using https://github.com/isaacs/sax-js and it works great. The features of sax-js that I needed were:
And sax-js provides me those features (as for speed, I have not benchmarked it). I ask myself whether it's a good idea for parse5 to spend time implementing a new SAX parser when sax-js works great. But perhaps I have missed something? |
@angelozerr sax-js is not an HTML parser. |
You could implement sax-js callbacks that create DOM nodes and end up with an HTML document, no? |
@angelozerr No, an HTML parser requires the HTML preprocessing/tokenization/parsing algorithms. |
Ok, good news, everyone. I finally figured out how it should be done, and parse5 will receive streaming support in the near future. One thing that I would like to put on the discussion agenda is the API. Should ParserStream and SerializerStream extend node's WritableStream and ReadableStream respectively, or should they just support a stream-like API like htmlparser2? How will you obtain the resulting AST: should ParserStream.end() return it? Or should we give the user access to the unfinished AST via a property? An unsophisticated modification of the unfinished AST can break the parser (since we don't have a DOM which guarantees that AST modifications will not lead to a malformed tree). This bothers me a little bit. Any ideas or suggestions? |
Actually extending streams is awesome. Then you can pipe into it and it Just Works. stream-like but not actually streams give you such fun edge cases. |
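To illustrate the "pipe into it and it Just Works" point, here is a minimal sketch of a class that extends node's Writable. ToyParserStream and parseFromChunks are hypothetical names made up for this example, not parse5's API; the class only collects text chunks, with an array standing in for the unfinished AST.

```javascript
const { Readable, Writable } = require('stream');

// Hypothetical stand-in for a parser stream; NOT parse5's actual API.
class ToyParserStream extends Writable {
  constructor() {
    // Keep incoming strings as strings instead of converting them to Buffers.
    super({ decodeStrings: false });
    this.chunks = [];     // stands in for the unfinished AST
    this.document = null; // set once all input has been written
    this.once('finish', () => {
      this.document = this.chunks.join('');
    });
  }

  // Called by the stream machinery for every chunk written or piped in.
  _write(chunk, encoding, callback) {
    this.chunks.push(chunk.toString());
    callback();
  }
}

// Pipes an array of string chunks into the toy stream and resolves
// with the collected document once the stream has finished.
function parseFromChunks(chunks) {
  return new Promise((resolve, reject) => {
    const parser = new ToyParserStream();
    parser.once('finish', () => resolve(parser.document));
    parser.once('error', reject);
    Readable.from(chunks).pipe(parser);
  });
}
```

Because the class is a real Writable, any readable source (a file stream, an HTTP response) could be piped into it the same way, and backpressure and end-of-input handling come from node rather than from hand-written code.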
unfinished AST is definitely useful, but could be given with a stern warning. |
<p>asdfasdfasdf followed by asdfasasdf<p>asdfasdfasdf</p>
Actually... that creates a problem, doesn't it, because |
+1 for extending streams instead of home-cooked streams I don't know if it is possible but having a streaming solution inspired on I already tried to do such a thing but was stopped in my progress. The code is just horrible and I was blocked because it needs a rewrite of there definitely is a problem with the this is what I found during this experiment
|
Regarding @aredridel's mention of a "stern warning" -- also include technical reasons as to why there is a warning. It would do two things: 1, thwart incorrect use; 2, enlighten us on what issues to avoid. |
I seriously question whether it’s actually possible in practice to implement a fully spec-conformant streaming parser without it needing to also buffer the entire document. It seems that no matter what partial-buffering strategy you try, you can often end up needing to buffer such a large part of the document anyway that you don’t gain much versus just buffering the whole document to begin with.
Yeah, you always must buffer all tables entirely. And if you consider how many documents out there are largely made of tables (e.g., pages that use tables for layout), in a lot of cases you’re going to need to end up buffering the majority of the document anyway.
Yeah. I don’t understand how it’s possible to get around that without requiring all the consumers of your streaming parser to implement some kind of special handling for
I don’t think I understand clearly what you mean by that. |
Pinging @gsnedders (one of the html5lib devs) who might have some additional thoughts on this (though I doubt he can make time right now to respond here, and anyway my comments in #26 (comment) reflect a discussion I just now had with him on #whatwg IRC). |
It's possible. Because:
We shouldn't. We just walk up the stack of open elements and search for the
Nope, since our parser doesn't use token lookahead. So the bulk of the changes goes into the tokenizer. We will take snapshots of the tokenizer state after each token emission. Then, if we meet end-of-chunk, we invalidate the last token, roll back to the last snapshot and suspend the parser. The next call to |
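The snapshot-and-rollback idea described above can be sketched with a toy tokenizer. This is an illustration under loud assumptions, not parse5's tokenizer: ToyTokenizer is a hypothetical class, its grammar only knows "<...>" tags and text, and its "snapshot" is just a buffer position, whereas a real HTML tokenizer would also have to save its state-machine state.

```javascript
// Toy tokenizer, NOT parse5's implementation.
// Grammar assumed for the sketch: "<...>" is a tag, everything else is text.
class ToyTokenizer {
  constructor() {
    this.buffer = '';   // input not yet covered by emitted tokens
    this.pos = 0;       // current read position in the buffer
    this.snapshot = 0;  // position right after the last emitted token
    this.ended = false; // true once no more chunks will arrive
    this.tokens = [];
  }

  write(chunk) {
    // Drop input already covered by emitted tokens, then append the new chunk.
    this.buffer = this.buffer.slice(this.snapshot) + chunk;
    this.pos = 0;
    this.snapshot = 0;

    while (this.pos < this.buffer.length) {
      const token = this.readToken();
      if (token === null) {
        // End of chunk reached mid-token: invalidate it, roll back to the
        // last snapshot and suspend until the next write().
        this.pos = this.snapshot;
        return;
      }
      this.tokens.push(token);
      this.snapshot = this.pos; // take a snapshot after each token emission
    }
  }

  end() {
    this.ended = true;
    this.write(''); // flush whatever is still pending as the final token
    return this.tokens;
  }

  // Returns the next token, or null if it cannot be completed yet.
  readToken() {
    const start = this.pos;
    const isTag = this.buffer[start] === '<';
    const stop = this.buffer.indexOf(isTag ? '>' : '<', start);

    if (stop === -1) {
      if (!this.ended) return null; // token may continue in the next chunk
      this.pos = this.buffer.length;
      return { type: 'text', value: this.buffer.slice(start) };
    }
    if (isTag) {
      this.pos = stop + 1;
      return { type: 'tag', value: this.buffer.slice(start + 1, stop) };
    }
    this.pos = stop;
    return { type: 'text', value: this.buffer.slice(start, stop) };
  }
}
```

With this design the emitted tokens do not depend on where the chunk boundaries fall: writing '<p>he', 'llo</' and 'p>' yields the same token list as writing '<p>hello</p>' in one call. The cost is the extra bookkeeping on every token, which is the performance concern raised earlier in the thread.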
@inikulin thanks—looking back at the comments in this issue I see now I misunderstood what it’s about… |
@sideshowbarker Ah, I see where this discussion started: servo/html5ever#149 My conclusion is that you can't have fully spec-compliant parsing without buffering the already produced DOM tree. In our case it's not an issue. parse5 has a SAX-style parser, but it behaves more like a tokenizer with simulated parser feedback (CDATA parsing flag, switching text parsing modes). |
Yup
Yeah, if you're going to conform to, e.g., the adoption-agency requirements and foster-parenting requirements and the special case of the E.g., the code in https://github.com/validator/htmlparser/tree/master/src/nu/validator/saxtree provides an event-based SAX API by building something that code actually calls a "SAX tree"... |
But I still can't figure out use cases for such an approach. Using a SAX parser, you most likely will not need information about the position of the element in the DOM tree (it becomes even more absurd if you don't have a DOM tree). |
Yeah, I’ve never used that API. Instead I have used the fully-streaming SAX API that code also provides. That streaming API builds no tree and produces a fatal error for any markup cases that would require the adoption-agency algorithm or foster-parenting algorithm. That streaming API is actually what validator.nu and the W3C Nu Html Checker use. Those also report all the parse errors to the end user, so doing the streaming-but-stop-with-error-message-for-fatal-parse-errors thing makes sense in that context. |
Ok, here is the fundamental question: should we abandon the non-streaming API and release 2.0, or keep the non-streaming API as well? If we keep non-streaming, the API will be messy for my taste: I would like to expose |
I don't see why we would need non-streaming anymore. htmlparser2 is stream-only with pseudo-non-streaming via a single Totally awesome on the |
@stevenvachon Yep, it should be |
Totally looking forward to this. Will |
@stevenvachon Nope, we will still use the current tokenizer, and the current HTML5 lexical grammar doesn't allow such constructs. |
@inikulin I'll defer to @Sebmaster but I think we just want JsDomParser to stay around, i.e. for streaming to primarily be via |
@domenic Ok, I'll keep it. However, handling both |
@inikulin what is forgiving like htmlparser2, then? |
Huh, seems like I'm starting to remember why I was initially against 'forgiving' parsing in parse5: there is no exact definition of 'forgiving parsing'. Therefore people will always be somehow unhappy with it. |
lol, ok 😢 |
Remove unnecessary buffer flush in amb amp state
Is there a way to pipe a file into parse5 (this feature is available in htmlparser2).
Similarly, is there a way to pause / resume the sax parser? I believe that the getNextToken method could be used to decide when the parsing should pause/resume.
I have been using the html-tokenize streaming parser lately and its suite (html-select, trumpet) but reaching html5 conformity on html-tokenize is still a long way to go so I am trying to see how parse5 could fit in and be used with html-select