Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FoLiA-txt fails on control character #46

Closed
proycon opened this issue Sep 10, 2020 · 5 comments
Closed

FoLiA-txt fails on control character #46

proycon opened this issue Sep 10, 2020 · 5 comments
Assignees
Labels
bug ready Ready but not yet in a published release

Comments

@proycon
Copy link
Member

proycon commented Sep 10, 2020

It seems FoLiA-txt manages to retain some control characters from the input and propagates it to the output text, leading to invalid XML. (reported by @martinreynaert in #45)

@proycon proycon added the bug label Sep 10, 2020
@proycon proycon self-assigned this Sep 10, 2020
@proycon proycon changed the title FoLiA-txt fails on control character on single line FoLiA-txt fails on control character Sep 10, 2020
proycon added a commit to LanguageMachines/libfolia that referenced this issue Sep 10, 2020
… from the input. They were often propagated to the XML output still producing invalid XML in the end (LanguageMachines/foliautils#46)
proycon added a commit to LanguageMachines/libfolia that referenced this issue Sep 10, 2020
@proycon
Copy link
Member Author

proycon commented Sep 10, 2020

I'm adjusting this in libfolia itself, it will do stripping of all non-allowed control characters. This does technically violate the principle of not manipulating the incoming data, but I think it's justifiable in this case. Ending up with invalid XML as was the current situation would be worse. (the alternative would a hard failure, which is what foliapy does upon serialisation).

@proycon proycon added the ready Ready but not yet in a published release label Sep 10, 2020
@kosloot
Copy link
Contributor

kosloot commented Sep 12, 2020

IMHO this libfolia patch to exterminate control characters is using a sledgehammer to crack a nut.
A less intrusive solution would have been to adapt the 'normalize_spaces()' function to see a FORM_FEED as a space too.
for instance by using ICU::u_isWhitespace() instead of u_isspace()

NOTE: as such, Control-Characters are not forbidden in XML, provided they are encoded correctly. So isn't this a signal that something else is wrong?

@proycon
Copy link
Member Author

proycon commented Sep 13, 2020

I did adapt normalize_spaces(), that was my first try... but that function is only invoked as part of text validation, we don't force space normalization on all text input (and that's a good thing), only this control-char normalization is something we want to force.

NOTE: as such, Control-Characters are not forbidden in XML, provided they are encoded correctly. So isn't this a signal that something else is wrong?

They are? That wouldn't make much sense, I tried: <a>&#x0c;</a> and xmllint rightly complains:

/tmp/test.xml:1: parser error : xmlParseCharRef: invalid xmlChar value 12
<a>&#x0c;</a>

@kosloot
Copy link
Contributor

kosloot commented Sep 13, 2020

Of course , i stand corrected. this only accepted for xml version 1.1 and upward.

You state that removing control characters is a forced thing, so I assume that NOT to be the default?
That is NOT what you have implemented atm? Which breaks the API, and in theory might break existing programs.

@proycon
Copy link
Member Author

proycon commented Sep 13, 2020

it IS the default now (because not removing them would lead to invalid xml). settext() calls strip_control_chars(). It shouldn't really effect the API much, other than that the String.empty() check prior to settext() should now ideally use is_norm_empty(). If not, there could be situations where things now break with a NoSuchText error if the original input consists of only control characters. But even then raising such an error is preferable to generating invalid XML anyway.

@proycon proycon closed this as completed Mar 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug ready Ready but not yet in a published release
Projects
None yet
Development

No branches or pull requests

2 participants