-
-
Notifications
You must be signed in to change notification settings - Fork 5.7k
safer vector<->string conversions, fixing #24388 #25241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
base/strings/io.jl
Outdated
| IOBuffer(str::String) = IOBuffer(Vector{UInt8}(str)) | ||
| IOBuffer(s::SubString{String}) = IOBuffer(view(Vector{UInt8}(s.string), s.offset + 1 : s.offset + sizeof(s))) | ||
| IOBuffer(str::String) = IOBuffer(unsafe_wrap(Vector{UInt8},str)) | ||
| IOBuffer(s::SubString{String}) = IOBuffer(view(unsafe_wrap(Vector{UInt8},s.string), s.offset + 1 : s.offset + sizeof(s))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like it would make more sense to use a StringBytes here
base/strings/io.jl
Outdated
|
|
||
| macro b_str(s); :(Vector{UInt8}($(unescape_string(s)))); end | ||
| macro b_str(s) | ||
| v = unsafe_wrap(Vector{UInt8}, unescape_string(s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't performance sensitive to run the macro, but this would probably generate better runtime performance if it was making an actual copy here.
base/strings/string.jl
Outdated
| unsafe_wrap(::Type{Vector{UInt8}}, s::String) = ccall(:jl_string_to_array, Ref{Vector{UInt8}}, (Any,), s) | ||
|
|
||
| struct StringBytes <: AbstractVector{UInt8} | ||
| s::String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this were parameterized as S <: AbstractString, this would likely be helpful for implementing readuntil (I know, somewhat random connection, but I had been strongly considering creating this type – whereas right now we typically opt for using a special-case to access bytes of String and convert everything else to a Char array)
base/strings/string.jl
Outdated
| next(s::StringBytes, i) = (@_propagate_inbounds_meta; (s[i], i+1)) | ||
| done(s::StringBytes, i) = (@_inline_meta; i == length(s)+1) | ||
|
|
||
| copy(s::StringBytes) = copyto!(Vector{UInt8}(uninitialized, length(s)), s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we copying immutable data? Also, seems to return the wrong type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this should probably be a method of Vector{UInt8}. Though there's some precedent for returning a different type e.g. when copying views.
base/strings/string.jl
Outdated
| copy(s::StringBytes) = copyto!(Vector{UInt8}(uninitialized, length(s)), s) | ||
|
|
||
| unsafe_convert(::Type{Ptr{UInt8}}, s::StringBytes) = convert(Ptr{UInt8}, pointer(s.s)) | ||
| unsafe_convert(::Type{Ptr{Int8}}, s::StringBytes) = convert(Ptr{Int8}, pointer(s.s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
= unsafe_convert(Ptr{Int8}, s.s) and drop the definitions of pointer (why does this function still exist anyways?)
base/strings/util.jl
Outdated
| hex2bytes(s::Union{String,AbstractVector{UInt8}}) = hex2bytes!(Vector{UInt8}(uninitialized, length(s) >> 1), s) | ||
|
|
||
| _nbytes(s::String) = sizeof(s) | ||
| _nbytes(s::AbstractVector{UInt8}) = length(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this also sizeof? I thought we define that sizeof means nbytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was being super paranoid, but you're probably right.
base/c.jl
Outdated
| transcode(::Type{String}, src) = String(transcode(UInt8, src)) | ||
|
|
||
| function transcode(::Type{UInt16}, src::Vector{UInt8}) | ||
| function transcode(::Type{UInt16}, src::AbstractVector{UInt8}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function assumes that axes(src) === 1:length(src)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely it's not the only function like that... should I use Union{Vector{UInt8},StringBytes}?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, let's keep this as specific as possible – we can always generalize as needed.
|
+1 for StringBytes being exported! For the deprecation, you could deprecate to |
ebb7d86 to
494705e
Compare
494705e to
76b1330
Compare
| """ | ||
| IOBuffer(str::String) = IOBuffer(Vector{UInt8}(str)) | ||
| IOBuffer(s::SubString{String}) = IOBuffer(view(Vector{UInt8}(s.string), s.offset + 1 : s.offset + sizeof(s))) | ||
| IOBuffer(str::String) = IOBuffer(unsafe_wrap(Vector{UInt8}, str)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably also IOBuffer(s::StringBytes) = IOBuffer(s.s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a different question but related to the same line. Why unsafe_wrap is used here - for efficiency? I am asking, because it seems that this should work with StringBytes equally well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using StringBytes and a few things broke that I didn't feel like dealing with. The type IOBuffer actually implies that a Vector{UInt8} will be used:
julia> IOBuffer
Base.GenericIOBuffer{Array{UInt8,1}}
so returning a GenericIOBuffer{StringBytes} instead can cause surprises.
base/strings/string.jl
Outdated
| Wrap a `String` (without copying) in an immutable vector-like object that accesses the bytes | ||
| of the string's representation. | ||
| """ | ||
| struct StringBytes <: AbstractVector{UInt8} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<: DenseVector{UInt8}, and include strides methods?
76b1330 to
1cae074
Compare
|
@StefanKarpinski @bkamins Approve? |
|
Thanks! I understand the target functionality is:
I think it is very good and |
|
We might want to hold off on exporting |
|
If you want me to, I could make a PR where |
|
|
|
We're pretty good at eliding allocations these days, so I suspect it'd be fine, but I'd recommend verifying in the cases you're interested in. One thing in particular is that we can only elide allocations if both the allocation and all its uses are inlined into one function. |
If so, it seems like the name is wrong. Either it would be specifically a |
|
Yes, clearly the name |
|
I think we don't want string types to have to define I can rename this |
|
I think if the type could be called |
add `CodeUnits` and `codeunits` fixes #24388
1cae074 to
7bce3b1
Compare
|
OK, see how this grabs you. |
| let v = unsafe_wrap(Vector{UInt8}, "abc") | ||
| s = String(v) | ||
| @test_throws BoundsError v[1] | ||
| push!(v, UInt8('x')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused. Why doesn't this line throw an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't need to. When push! is called new storage is allocated for the vector (also as a String in case you want to make more strings from it).
| String(s::Symbol) = unsafe_string(unsafe_convert(Ptr{UInt8}, s)) | ||
|
|
||
| (::Type{Vector{UInt8}})(s::String) = ccall(:jl_string_to_array, Ref{Vector{UInt8}}, (Any,), s) | ||
| unsafe_wrap(::Type{Vector{UInt8}}, s::String) = ccall(:jl_string_to_array, Ref{Vector{UInt8}}, (Any,), s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have unsafe_wrap(::Type{Array}, s) = unsafe_wrap(Vector{UInt8}, s)?
|
This looks good to me. I guess the wrapping of string object itself is fairly essential here, which inherently couples the |
What's the difference? This should work for anything that implements |
Just that |
I tried a couple things here. First, I decided that we will still want some way to wrap a String as a Vector in-place, so I made that a method of
unsafe_wrapas suggested by @vtjnash. But then, there are many uses that just want to access the bytes of a string and don't do anything unsafe. To handle that I added aStringBytesCodeUnitsAbstractVector type that exposes the bytes of aStringas an immutable UInt8 vector.My questions are (1) do we want
CodeUnitsand should it be exported, and (2) what should the existing functions be deprecated to? I feel the majority of uses will wantCodeUnits, but that isn't an accurate deprecation since it doesn't return aVector{UInt8}.