add impl UnicodeSegmentation for [u8] #46
Comments
In terms of implementation, this doesn’t look easy as the code seems to make pervasive use of In terms of API, I mildly oppose silently doing replacement without users having to type the name of some item that contains the word But then we have trait methods that return iterators of So instead, maybe the iterator types could be made generic to support both By the way, in case you’re also tempted to do incremental processing where not all of the input bytes are in memory at the same time (or simply not in contiguous memory), note that segmentation does not "distribute" over concatenation:
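A minimal sketch of why segmentation does not distribute over concatenation. This is not the crate's algorithm: `toy_clusters`, `is_combining_mark`, and `concat_differs` are hypothetical names, and the toy rules only keep CR LF together and attach combining diacritical marks (U+0300 to U+036F) to the preceding character, which is a small simplification of the UAX #29 grapheme rules.

```rust
/// Toy stand-in for a grapheme "Extend" check: only the combining
/// diacritical marks block. An assumption for illustration, not UAX #29.
fn is_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Toy segmenter: breaks between every pair of characters except
/// CR followed by LF, and anything followed by a combining mark.
fn toy_clusters(s: &str) -> Vec<&str> {
    let mut clusters = Vec::new();
    let mut start = 0;
    let mut prev: Option<char> = None;
    for (i, c) in s.char_indices() {
        if let Some(p) = prev {
            let joins = (p == '\r' && c == '\n') || is_combining_mark(c);
            if !joins {
                clusters.push(&s[start..i]);
                start = i;
            }
        }
        prev = Some(c);
    }
    if start < s.len() {
        clusters.push(&s[start..]);
    }
    clusters
}

/// True when segmenting `a + b` gives a different result than
/// segmenting `a` and `b` separately and concatenating the pieces.
fn concat_differs(a: &str, b: &str) -> bool {
    let whole = format!("{a}{b}");
    let joined = toy_clusters(&whole);
    let mut separate = toy_clusters(a);
    separate.extend(toy_clusters(b));
    joined != separate
}
```

With these toy rules, `concat_differs("e", "\u{0301}")` and `concat_differs("\r", "\n")` are both true, so a chunked implementation cannot simply segment each chunk on its own and append the results; it has to carry state (or bytes) across the chunk boundary.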
Yeah. It's been a while since I looked at the implementation in this crate, but I suspected as much. I think that's definitely part of my question here. Namely, if I were to implement this, and it resulted in a fairly pervasive change, is that something the maintainers would be willing to accept?
Hmm. OK. I don't think I strongly disagree with this.
Interesting. Hmm. OK. I think I'm still playing catch up here, and I think there are parts that I don't quite grok yet. In particular, I have not yet deeply familiarized myself with the segmentation algorithm, so I might be missing some critical context here. What I had in mind shouldn't require any allocations. In particular, if an invalid UTF-8 sequence is seen, then that sequence is skipped and
This is interesting! So in this sense, the API for I'm not sure which API is better. It's conceivable both could be desirable.
Right, yeah, that makes sense. I imagine if I were doing this, then I'd need a buffer until I knew I had a single complete grapheme. To pop up a level to give you an idea of what I want to do: basically, in my work, I frequently deal with However, it's not feasible to do that, AFAIK, with the current API of this crate. I could do a lossy conversion of the entire line, but the line could be quite large (potentially arbitrarily large), so it would be much nicer to just decode what I need and be done with it. More generally, I've been toying with the idea of creating a new string type that is "conventionally UTF-8, but not required to be," similar to the semantics of I'd very much welcome thoughts you have on my overall plan here. :-) Apologies if it's a little sidetracked from the specific issue here, but I figured giving some context on what I'm trying to do would help.
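To make "just decode what I need" concrete, here is a rough sketch using only the standard library. `decode_lossy` is a hypothetical helper, not part of this crate or of any API mentioned in the thread. It inspects at most the first four bytes of the input, so the cost of decoding one code point does not depend on the length of the line.

```rust
use std::str;

/// Hypothetical helper: decodes one code point from the front of `bytes`.
/// Returns the character and the number of bytes consumed, or `None` for
/// empty input. An invalid or truncated sequence yields U+FFFD.
fn decode_lossy(bytes: &[u8]) -> Option<(char, usize)> {
    if bytes.is_empty() {
        return None;
    }
    // A UTF-8 encoded code point is at most 4 bytes long, so a 4-byte
    // window is enough to decide what the first code point is.
    let window = &bytes[..bytes.len().min(4)];
    let valid = match str::from_utf8(window) {
        Ok(s) => s,
        // Some valid prefix exists: the first code point is inside it.
        Err(e) if e.valid_up_to() > 0 => {
            str::from_utf8(&window[..e.valid_up_to()]).expect("prefix reported valid")
        }
        // The window starts with an invalid sequence. `error_len` is `None`
        // when the input ends in the middle of a sequence.
        Err(e) => {
            let skip = e.error_len().unwrap_or(window.len());
            return Some(('\u{FFFD}', skip));
        }
    };
    let c = valid.chars().next().expect("non-empty valid prefix");
    Some((c, c.len_utf8()))
}
```

A caller would loop, advancing by the returned length each time, and stop as soon as it has seen enough of the line.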
Speaking as a sometime-contributor to this crate, I wouldn't object to the proposed implementation/API changes. But I'd also note that this crate is not very actively maintained at all, and the fork in https://crates.io/crates/unic-segment might be a better place for new development. It hasn't been very actively developed either, but the "rust-unic" project as a whole has been more active recently than the "unicode-rs" project.
@mbrubeck Yeah I noticed that. It's tough keeping up with maintenance on these types of crates; there's a high context switch overhead for diving into it (in my experience, anyway). Thanks for pointing out If I do decide to move forward with this, I might just go ahead and prototype this in my own space, with the intent of hopefully merging efforts at some point in the future.
Just out of interest, did @BurntSushi or anyone else end up prototyping / implementing this somewhere? I have come across a use case where this might be useful. Update: it looks like @BurntSushi went on and created bstr, which does segmentation too 😄
@aochagavia Yeah, indeed. I also took a different implementation technique than I generally think my technique was mostly a failure. It works, but not as well as I'd hoped. The actual binary size is still big and I'm not sure it's meaningfully faster. ICU4X is probably a good project to look into if you want the best, and I believe it supports As for APIs, I just exposed, e.g.,
Also worth noting that otoh it is probably not as fast as hardcoded segmenters. |
Would the maintainers be willing to accept an implementation of the trait for [u8]? Specifically, for possibly invalid UTF-8. We could specify that if invalid UTF-8 bytes are found, then we lossily emit a Unicode replacement codepoint. cc @SimonSapin
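As a rough illustration of the "lossily emit a replacement codepoint" behavior, here is one way it could look with only the standard library. `LossyChunks` is a hypothetical type, not something this crate provides. It walks a byte slice and yields borrowed runs of valid UTF-8, substituting "\u{FFFD}" for each invalid sequence, without allocating.

```rust
use std::str;

/// Hypothetical iterator over possibly invalid UTF-8: yields valid runs as
/// `&str` and the replacement character for each invalid sequence.
struct LossyChunks<'a> {
    bytes: &'a [u8],
}

impl<'a> LossyChunks<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        LossyChunks { bytes }
    }
}

impl<'a> Iterator for LossyChunks<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        let bytes = self.bytes;
        if bytes.is_empty() {
            return None;
        }
        match str::from_utf8(bytes) {
            // Everything that remains is valid: emit it and finish.
            Ok(s) => {
                self.bytes = &[];
                Some(s)
            }
            Err(e) => {
                let valid = e.valid_up_to();
                if valid > 0 {
                    // Emit the valid run that precedes the bad sequence.
                    let (head, tail) = bytes.split_at(valid);
                    self.bytes = tail;
                    Some(str::from_utf8(head).expect("prefix reported valid"))
                } else {
                    // Skip the bad sequence. `error_len` is `None` when the
                    // input ends in the middle of a sequence.
                    let skip = e.error_len().unwrap_or(bytes.len());
                    self.bytes = &bytes[skip..];
                    Some("\u{FFFD}")
                }
            }
        }
    }
}
```

A segmenter for [u8] could, in principle, run over chunks like these; whether that fits the crate's internals is exactly the open question in this issue. Note that each valid run is validated twice in this sketch, which keeps the code short at some cost in speed.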