Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle CDATA with UTF-8 characters when partial parsing #133

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tomtaylor
Copy link

@tomtaylor tomtaylor commented Aug 5, 2024

Follows on from #122.

In partial mode, UTF-8 encoded characters might be split across multiple chunks. When this happens for a character such as £, which is encoded as <<0xC2, 0xA3>>, the 0xC2 is neither an ASCII character (<= 127), nor does it match the <<codepoint::utf-8>> clause, and Saxy throws a parser error.

This fixes that by just parsing all the bytes inside a CDATA element regardless of their code point. It drops the UTF-8 character optimisation, but I suspect that's probably a minor performance improvement for most documents.

@qcam is this a more prevalent issue than my use case? I can see why matching on UTF-8 codepoint and swallowing the whole character is a nice optimisation, but I wonder if it might cause issues in other places when partial parsing.

Don't assume that we're always seeing a full UTF-8 character. In partial mode,
UTF-8 encoded characters might be split across multiple chunks.
@tomtaylor tomtaylor changed the title Handle CDATA containing partial UTF-8 characters Handle CDATA with UTF-8 characters when partial parsing Aug 5, 2024
@tomtaylor
Copy link
Author

@qcam any thoughts on this?

element_cdata(rest, more?, original, pos, state, len + 1)

<<codepoint::utf8>> <> rest ->
element_cdata(rest, more?, original, pos, state, len + Utils.compute_char_len(codepoint))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can the same way how dangling UTF-8 fragments is handled

For example https://github.com/qcam/saxy/blob/master/lib/saxy/parser/builder.ex#L540-L541

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants