Add int_able() and float_able()? #1775

johnmyleswhite · 2012-12-17T00:48:43Z

In trying to resolve some performance issues for DataFrames I/O, @vtjnash very kindly wrote code to determine whether a string could be parsed as an integer or float. I would suggest that functions like these go into Base because they've come up before in other context, such as the JSON parser. Having fast functions for doing this would be very helpful, especially given that Jameson showed that his functions are orders of magnitude faster than using regular expressions. His code is below:

function int_able{T <: String}(s::T)
    d = s.data
    if length(d) == 0
        return false
    end
    p = 1
    if d[p] == '-'
        p += 1
        if length(d) == 1
            return false
        end
    end
    while p <= length(d)
        if !contains("1234567890", d[p])
            return false
        end
        p += 1
    end        
    return true
end

function float_able{T <: String}(s::T)
    d = s.data
    if length(d) == 0
        return false
    end
    p = 1
    if d[p] == '-'
        p += 1
        if length(d) < p
            return false
        end
    end
    dot = false
    exp = false
    while p <= length(d)
        if !contains("1234567890", d[p])
            if d[p] == 'e'
                if p == 1 || d[p-1] == '-'
                    return false
                end
                if exp
                    return false
                end
                exp = true
                p += 1
                if length(d) < p
                    return false
                end
                if d[p] == '-'
                    p += 1
                    if length(d) < p
                        return false
                    end
                end
            elseif d[p] == '.' && exp == false
                if dot
                    return false
                end
                dot = true
            else
                return false
            end
        end
        p += 1
    end
    return true
end

The text was updated successfully, but these errors were encountered:

binarybana · 2012-12-17T01:06:10Z

It seems like the Julia parser should already have something like this? Or is it too general to be fast enough for a special case like this?

JeffBezanson · 2012-12-17T04:29:34Z

Check out float64_isvalid. Not sure if that's named optimally.

johnmyleswhite · 2012-12-17T04:37:11Z

As mentioned on julia-users, float64_isvalid seems to be losing to the proposed definition:

julia> @elapsed for i in 1:1_000_000 DataFrames.float_able("123") end
0.023856163024902344

julia> @elapsed for i in 1:1_000_000 float64_isvalid("123", Array(Float64, 1)) end
0.1689610481262207

That name is definitely a little obscure. Why not isvalidfloat64 instead?

JeffBezanson · 2012-12-17T04:55:02Z

In real use you will want to factor out the Array(Float64,1) and only do it once.

johnmyleswhite · 2012-12-17T04:57:46Z

That makes it only 3x slower:

julia> @elapsed for i in 1:1_000_000 DataFrames.float_able("123") end
0.02039504051208496

julia> a = Array(Float64, 1)
1-element Float64 Array:
 0.0

julia> @elapsed for i in 1:1_000_000 float64_isvalid("123", a) end
0.07102084159851074

JeffBezanson · 2012-12-17T05:05:32Z

Oh, it is important to remember that the function also gives you the converted value if it is valid, so it would probably end up being faster.

johnmyleswhite · 2012-12-17T05:08:58Z

Ok. I'll have to see if we can exploit that to combine our type inference and conversion steps. Right now the data conversion step is killing us performance-wise.

Is there a similar int64_isvalid? We'd need that if we want to combine inference and conversion.

vtjnash · 2012-12-17T08:54:28Z

Modifying int_able to also return the value makes it about 2x slower. Note that this is now just an inferior implementation of the function parse_int() that is already in base. (although this function is faster by a factor of 3, probably because it doesn't handle arbitrary bases).

But Jeff might be correct that combining the type inference and conversion steps is a win (the difference in times is expected at least). But you might still end up with many cases where the first set of rows was a bad predictor, so then you need to bookkeep which rows need to be reconverted and reallocate storage for the new type.

julia> function int_able2{T <: String}(s::T)
           d = s.data
           int = 0
           if length(d) == 0
               error()
           end
           p = 1
           if d[p] == '-'
               p += 1
               neg = -1
               if length(d) == 1
                   error()
               end
           else
               neg = 1
           end
           while p <= length(d)
               dp = d[p]
               if !('0' <= dp <= '9')
                   error()
               end
               int = int*10 + dp - '0'
               p += 1
           end    
           return neg*int
       end

johnmyleswhite · 2012-12-17T12:18:00Z

I'll try to see if we can take advantage of the fact that validation and conversion are completely intertwined. Our difficulty is that we need to make a first pass to remove any missing value indicator like NA or NULL before it would even make sense to call the conversion returns. That's why I think it could be helpful to have the ability to validate separately from conversion, beyond the fact that we already offer things like isfile in addition to open for deciding whether an operation will succeed without trying it.

JeffBezanson · 2013-10-27T07:49:17Z

subsumed by #3631

JeffBezanson closed this as completed Oct 27, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add int_able() and float_able()? #1775

Add int_able() and float_able()? #1775

johnmyleswhite commented Dec 17, 2012

binarybana commented Dec 17, 2012

JeffBezanson commented Dec 17, 2012

johnmyleswhite commented Dec 17, 2012

JeffBezanson commented Dec 17, 2012

johnmyleswhite commented Dec 17, 2012

JeffBezanson commented Dec 17, 2012

johnmyleswhite commented Dec 17, 2012

vtjnash commented Dec 17, 2012

johnmyleswhite commented Dec 17, 2012

JeffBezanson commented Oct 27, 2013

Add int_able() and float_able()? #1775

Add int_able() and float_able()? #1775

Comments

johnmyleswhite commented Dec 17, 2012

binarybana commented Dec 17, 2012

JeffBezanson commented Dec 17, 2012

johnmyleswhite commented Dec 17, 2012

JeffBezanson commented Dec 17, 2012

johnmyleswhite commented Dec 17, 2012

JeffBezanson commented Dec 17, 2012

johnmyleswhite commented Dec 17, 2012

vtjnash commented Dec 17, 2012

johnmyleswhite commented Dec 17, 2012

JeffBezanson commented Oct 27, 2013