Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use generalized dot product in Mahalanobis distance. #242

Open
bvdmitri opened this issue Aug 8, 2022 · 6 comments
Open

Use generalized dot product in Mahalanobis distance. #242

bvdmitri opened this issue Aug 8, 2022 · 6 comments

Comments

@bvdmitri
Copy link

bvdmitri commented Aug 8, 2022

It would be better to replace dot(z, Q * z) (here https://github.com/JuliaStats/Distances.jl/blob/master/src/mahalanobis.jl#L83 and in other places) with dot(z, Q, z) without storing the intermediate result of Q*z. The problem here is that Q * z allocates and sometimes this allocation might slowdown the overall execution by a lot.

This is the small benchmark:

julia> @btime dot($x, $Q*$x)
  87.190 ns (1 allocation: 80 bytes)
0.850287710619438

julia> @btime dot($x, $Q, $x)
  13.774 ns (0 allocations: 0 bytes)
0.8502877106194378

Another nice addition would be to provide a tmp storage for inplace z = a - b operation (which also allocates). That would help if some executes the mahalanobis distance in a tight for loop.

In our application, simply by changing the dot and providing tmp storage we reduced allocations from 1.107791 seconds (20.21 M allocations: 1.141 GiB, 18.80% gc time) to 0.532219 seconds (10.12 M allocations: 393.585 MiB, 17.05% gc time).

This is the sketch of our implementation (without checks):

struct FastSqMahalanobis{ M <: AbstractMatrix, V <: AbstractVector }
    Q :: M
    z :: V
end

function FastSqMahalanobis(Q::AbstractMatrix) 
    return FastSqMahalanobis(Q, zeros(eltype(Q), size(Q, 1)))
end

function (distance::FastSqMahalanobis)(a::AbstractVector, b::AbstractVector)
    if length(a) != length(b)
        throw(DimensionMismatch("first array has length $(length(a)) which does not match the length of the second, $(length(b))."))
    end

    Q = distance.Q
    z = distance.z

    # inplace z = a - b
    map!(-, z, a, b)

    return dot(z, Q, z)
end
@dkarrasch
Copy link
Member

This has been proposed earlier, but it is somewhat controversial. The 3-arg dot has specializations for structured matrices and outperform the 2-arg dot with product in small cases. Unfortunately, for large dense Q, the latter outperforms the former, so it's not clear what to do generically. But having a FastSqMahalanobis type may be a way out, or replace the call to dot by _dot, which generically computes dot(z, Q*z), and in structured cases dot(z, Q, z)?

@dkarrasch
Copy link
Member

x-ref #173

@bvdmitri
Copy link
Author

I see you reasoning, but in the link you've provided I see only a small isolated benchmark. It can be true, that a 2 argument dot function might be faster on large inputs (which I suppose can be filed as a performance bug in the Julia itself). The problem, however, is the fact that 2-argument function allocates and creates a lot of unnecessary pressure on the garbage collector in real applications. While it can be faster in an isolated single test, I doubt it will be faster in a real application where you need to perform this operation hundreds or thousands of times and allocate gigabytes of memory (that is what happened to us). Ineffective RAM usage slows down this kind of operation a lot as it requires to run GC more often.

I don't have a computer right now to test how it performs in case when you need to perform dot many many times, but I can benchmark it later,

@dkarrasch
Copy link
Member

The issue with temporal storage in a generic context is that you never know whether the storage has the right eltype. For instance, you could use a real Q, but multiply with complex vectors. Nothing stops you from doing that. However, in a specific use case, where you have control over the input/intermediate variables, you can define your own metric struct as you have done. But look, you may wish to also compute the colwise distance, then you'll need an intermediate matrix. In a generic context like this package, it's just not clear what storage may be needed for specific computations.

(which I suppose can be filed as a performance bug in the Julia itself)

It's not a performance bug. It is because the 2-arg dot can use BLAS for appropriate eltypes, which performs the intermediate multiplication much faster.

@bvdmitri
Copy link
Author

It is because the 2-arg dot can use BLAS for appropriate eltypes, which performs the intermediate multiplication much faster.

But nothing really stops the 3-arg dot to also use BLAS given appropriate eltypes or even call 2-arg version in some specific situations (e.g. as you suggested for _dot). Maybe the real winner here is the simd/vectorization of the 2-arg version, but I'm not very experienced in it to comment on this.

@dkarrasch
Copy link
Member

The promise of 3-arg dot in Base is not to allocate an intermediate vector, so you can't compute first z = Q*y and then dot(x,z). If users don't mind the allocation, they write dot(x, Q*y), if they do, they use dot(x, Q, y). It was the desire to remove intermediate allocations that triggered the 3-arg implementation. dot(x, Q*y) has been available ever since.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants