This repository has been archived by the owner on Aug 15, 2020. It is now read-only.
Output layer question #178
Comments
We have some ideas here based on approximate kNN methods. Stay tuned.
Interesting -- are you thinking just for inference or for both inference and training?
It's a no-brainer for inference; it's a science project for training.
Relevant code and paper, if you haven't seen it: https://github.com/rdspring1/LSH_DeepLearning
I don't think they did it on GPUs or big models, but maybe interesting.
Yep, I came up with roughly the same idea last summer and then read the paper that fleshed it out even further last fall. My sparse input kernels end up ~20x faster than SGEMM, so their seeing more or less the same speedup with LSH makes intuitive sense: one ends up memory-limited. Older GPUs would be a bear for this, but newer GPUs support arbitrary CUDA streams and have much larger L2 caches, so it shouldn't be all that hard to write.
Any updates on this? I'm trying to find an example of a library that uses approximate kNN methods to speed up the output layer.
The output layer in these networks is often a bottleneck, because you have to do a (batch_size, hidden_dim) by (hidden_dim, num_classes) dense matrix multiplication. It doesn't seem like you'd get a speedup just by avoiding storing/multiplying by zeros -- are you doing any kind of tricks here to reduce the cost of that operation?

Thanks
~ Ben
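For concreteness, a minimal sketch of the multiplication being asked about; the shapes are made up, and the point is that the output projection dominates per-step cost when num_classes is large.

```python
import numpy as np

# Illustrative shapes only; the actual dimensions are not stated in the thread.
batch_size, hidden_dim, num_classes = 64, 1024, 200_000

h = np.random.randn(batch_size, hidden_dim).astype(np.float32)   # hidden activations
W = np.random.randn(hidden_dim, num_classes).astype(np.float32)  # dense output projection

logits = h @ W   # (batch_size, num_classes)

# Multiply-adds for this one layer; with a large num_classes this projection
# is the bulk of the work, which is why candidate selection (LSH / approximate
# kNN over the weight rows) can pay off.
flops = 2 * batch_size * hidden_dim * num_classes
print(f"output projection: {flops / 1e9:.1f} GFLOPs per batch")
```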