Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix #6540: utf16(s) for binary data, etc. #6546

Merged
merged 2 commits into from
Apr 22, 2014

Conversation

stevengj
Copy link
Member

This patch adds utf16(s) for binary data, an is_valid_utf16(s) function analogous to is_valid_utf8, and restricts IOBuffer(s::String) to ByteStrings as discussed in #6540.

@stevengj stevengj mentioned this pull request Apr 16, 2014
@stevengj
Copy link
Member Author

@vtjnash, @nolta, does this look good to you?

@JeffBezanson
Copy link
Member

lgtm

@vtjnash
Copy link
Member

vtjnash commented Apr 17, 2014

i'm not that well versed in UTF16 strings.

should the convert(::Type{UTF16String}, data::Array{Uint16}) function also check for the BOM?

elseif data[1] == 0xfffe # byte-swapped
convert(T, Uint16[bswap(data[i]) for i=2:length(data)])
else
convert(T, copy(data)) # assume native byte order
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this copy necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly necessary, but my thinking was that whether or not this function makes a copy of the data should not depend on the contents of the data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a converting copy?

eg.

function copy(T::Type, x)
    a = convert(T, x)
    if a === x
        a = copy(a)
    end
    a
end

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could always do convert(T, copy(x)) if you need to force a copy in a particular case. In any case, === won't work here since I don't think the UTF16String will compare equal to an Array{Uint16} even if they share the same data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that if the user has an array containing a UTF-16 string in the native byte order, with no BOM, and it is important to convert it to a string without making a copy, she can always force that behavior by doing utf16(reinterpret(Uint16, array)). So, I don't think the copy here could pose a big problem in practice.

@stevengj
Copy link
Member Author

@vtjnash, my feeling was that if the user supplies an Array{Uint16} directly, then we should assume that they are really Uint16s, i.e. that they are in the native byte order, and that they want a string from the raw data. (A BOM, if any, would then be treated as a zero-width non-breaking space.)

JeffBezanson added a commit that referenced this pull request Apr 22, 2014
fix #6540: utf16(s) for binary data, etc.
@JeffBezanson JeffBezanson merged commit a26f03e into JuliaLang:master Apr 22, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants