Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add int_able() and float_able()? #1775

Closed
johnmyleswhite opened this issue Dec 17, 2012 · 10 comments
Closed

Add int_able() and float_able()? #1775

johnmyleswhite opened this issue Dec 17, 2012 · 10 comments

Comments

@johnmyleswhite
Copy link
Member

In trying to resolve some performance issues for DataFrames I/O, @vtjnash very kindly wrote code to determine whether a string could be parsed as an integer or float. I would suggest that functions like these go into Base because they've come up before in other context, such as the JSON parser. Having fast functions for doing this would be very helpful, especially given that Jameson showed that his functions are orders of magnitude faster than using regular expressions. His code is below:

function int_able{T <: String}(s::T)
    d = s.data
    if length(d) == 0
        return false
    end
    p = 1
    if d[p] == '-'
        p += 1
        if length(d) == 1
            return false
        end
    end
    while p <= length(d)
        if !contains("1234567890", d[p])
            return false
        end
        p += 1
    end        
    return true
end

function float_able{T <: String}(s::T)
    d = s.data
    if length(d) == 0
        return false
    end
    p = 1
    if d[p] == '-'
        p += 1
        if length(d) < p
            return false
        end
    end
    dot = false
    exp = false
    while p <= length(d)
        if !contains("1234567890", d[p])
            if d[p] == 'e'
                if p == 1 || d[p-1] == '-'
                    return false
                end
                if exp
                    return false
                end
                exp = true
                p += 1
                if length(d) < p
                    return false
                end
                if d[p] == '-'
                    p += 1
                    if length(d) < p
                        return false
                    end
                end
            elseif d[p] == '.' && exp == false
                if dot
                    return false
                end
                dot = true
            else
                return false
            end
        end
        p += 1
    end
    return true
end
@binarybana
Copy link
Contributor

It seems like the Julia parser should already have something like this? Or is it too general to be fast enough for a special case like this?

@JeffBezanson
Copy link
Member

Check out float64_isvalid. Not sure if that's named optimally.

@johnmyleswhite
Copy link
Member Author

As mentioned on julia-users, float64_isvalid seems to be losing to the proposed definition:

julia> @elapsed for i in 1:1_000_000 DataFrames.float_able("123") end
0.023856163024902344

julia> @elapsed for i in 1:1_000_000 float64_isvalid("123", Array(Float64, 1)) end
0.1689610481262207

That name is definitely a little obscure. Why not isvalidfloat64 instead?

@JeffBezanson
Copy link
Member

In real use you will want to factor out the Array(Float64,1) and only do it once.

@johnmyleswhite
Copy link
Member Author

That makes it only 3x slower:

julia> @elapsed for i in 1:1_000_000 DataFrames.float_able("123") end
0.02039504051208496

julia> a = Array(Float64, 1)
1-element Float64 Array:
 0.0

julia> @elapsed for i in 1:1_000_000 float64_isvalid("123", a) end
0.07102084159851074

@JeffBezanson
Copy link
Member

Oh, it is important to remember that the function also gives you the converted value if it is valid, so it would probably end up being faster.

@johnmyleswhite
Copy link
Member Author

Ok. I'll have to see if we can exploit that to combine our type inference and conversion steps. Right now the data conversion step is killing us performance-wise.

Is there a similar int64_isvalid? We'd need that if we want to combine inference and conversion.

@vtjnash
Copy link
Member

vtjnash commented Dec 17, 2012

Modifying int_able to also return the value makes it about 2x slower. Note that this is now just an inferior implementation of the function parse_int() that is already in base. (although this function is faster by a factor of 3, probably because it doesn't handle arbitrary bases).

But Jeff might be correct that combining the type inference and conversion steps is a win (the difference in times is expected at least). But you might still end up with many cases where the first set of rows was a bad predictor, so then you need to bookkeep which rows need to be reconverted and reallocate storage for the new type.

julia> function int_able2{T <: String}(s::T)
           d = s.data
           int = 0
           if length(d) == 0
               error()
           end
           p = 1
           if d[p] == '-'
               p += 1
               neg = -1
               if length(d) == 1
                   error()
               end
           else
               neg = 1
           end
           while p <= length(d)
               dp = d[p]
               if !('0' <= dp <= '9')
                   error()
               end
               int = int*10 + dp - '0'
               p += 1
           end    
           return neg*int
       end

@johnmyleswhite
Copy link
Member Author

I'll try to see if we can take advantage of the fact that validation and conversion are completely intertwined. Our difficulty is that we need to make a first pass to remove any missing value indicator like NA or NULL before it would even make sense to call the conversion returns. That's why I think it could be helpful to have the ability to validate separately from conversion, beyond the fact that we already offer things like isfile in addition to open for deciding whether an operation will succeed without trying it.

@JeffBezanson
Copy link
Member

subsumed by #3631

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants