
Minimum requirements of VRAM #16

Open

keishihara opened this issue May 25, 2022 · 4 comments


@keishihara

keishihara commented May 25, 2022

Hi Chen, thank you so much for sharing your amazing work!

I was able to run your pretrained agent with the weight checkpoints in the weights folder.
Now I am trying to reproduce your work from the training steps described in TRAINING.md.
At the moment I am at the step of training the privileged motion planner, and I am getting a CUDA out of memory error with your dev_test version of the dataset, which I think is small enough to start with.
Apparently, when I run python -m lav.train_bev, this line of code consumes a lot of GPU memory and causes the error above.
By the way, my Ubuntu machine has two somewhat older Titan X cards with 12 GB of VRAM each.

I am wondering what the graphics card requirements are to reproduce this work from scratch.
Is my PC not enough for this, or could you tell me about your machine's specifications?

@dotchen
Owner

dotchen commented May 25, 2022

This might depend on your PyTorch/CUDA/cuDNN version. With the versions I use, this line is necessary to prevent a CUDA error. If it works fine without it in your setup, you can comment that line out; I don't think it will affect the predictions. See: https://pytorch.org/docs/1.7.1/_modules/torch/nn/modules/rnn.html#RNNBase.flatten_parameters
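
For context, here is a minimal generic sketch (plain PyTorch, not the actual LAV model; the module sizes are made up) of where flatten_parameters() typically goes:

```python
import torch
import torch.nn as nn

# flatten_parameters() compacts the RNN weights into one contiguous block so
# cuDNN can use its fused kernels. After moving the module to the GPU (or
# wrapping it in DataParallel), the weights can end up non-contiguous, which
# on some PyTorch/cuDNN combinations raises warnings or CUDA errors.
device = "cuda" if torch.cuda.is_available() else "cpu"
gru = nn.GRU(input_size=64, hidden_size=128, num_layers=2, batch_first=True).to(device)

def encode(x):
    # Calling flatten_parameters() right before the forward pass is the usual fix;
    # if your setup runs fine without it, removing it should not change the outputs.
    gru.flatten_parameters()
    out, _ = gru(x)
    return out

x = torch.randn(8, 10, 64, device=device)
features = encode(x)  # shape (8, 10, 128)
```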

I use 4 Titan Pascal GPUs.

@keishihara
Author

Thank you for the quick reply!
OK, I was able to run that script just now by setting the batch size to 128 instead of the default 512.
So it appears the problem really was a hardware limitation, since your 4 Titan Pascal GPUs have over 40 GB of VRAM in total.
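
In case it helps anyone hitting the same OOM, here is a small generic snippet (plain PyTorch, not LAV-specific; the toy model and shapes are made up) for checking how peak GPU memory scales with batch size before committing to a full run:

```python
import torch
import torch.nn as nn

# Toy stand-in model, only to illustrate measuring peak memory per batch size
# so you can pick one that fits in 12 GB per card.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
criterion = nn.CrossEntropyLoss()

for batch_size in (512, 256, 128, 64):
    model.zero_grad(set_to_none=True)
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    try:
        x = torch.randn(batch_size, 4096, device="cuda")
        y = torch.randint(0, 10, (batch_size,), device="cuda")
        criterion(model(x), y).backward()
        peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
        print(f"batch_size={batch_size}: peak {peak_gib:.2f} GiB")
    except RuntimeError as e:  # CUDA OOM surfaces as a RuntimeError on older PyTorch
        print(f"batch_size={batch_size}: {e}")
```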

I have another question regarding your Dockerfile.
As I understand it, this Docker image is only for running evaluations of trained agents, not for training them, right?
If you have one for training, could you share it too? Or did you train directly on the host machine rather than in a container?

@dotchen
Owner

dotchen commented Jun 16, 2022

Hello,

We did not use Docker for training but used a conda env to manage the dependencies. Let me know if you have any issues with the dependencies and I will be happy to take a look.

@keishihara
Author

Hi, thank you for your message.

I managed to create a Docker image for training and to manage the dependencies, and all of the modules provided in the lav folder appear to be working fine. However, while the individual training logs in wandb looked fine, when I use a segmentation model that I trained myself to perform point painting or to train the full model, the segmentation performance drops quite a lot, probably due to the seg_model.eval() call, like here:

self.seg_model.eval()

I wonder if you experienced the same issue when you trained the models provided in the weights folder.

This seems to be related to the switching behavior of BatchNorm / Dropout between training and evaluation modes, but I haven't been able to figure it out yet. Do you have any idea about this?
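
To make the question concrete, here is a minimal generic illustration (not the LAV code; the toy network is made up): .eval() makes BatchNorm normalize with its accumulated running statistics instead of the current batch statistics and disables Dropout, so if those running estimates are poor, the eval-mode outputs can look much worse than the training-time logs suggested.

```python
import torch
import torch.nn as nn

# Toy segmentation-style network, made up for this example.
seg_model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Dropout2d(0.1),
    nn.Conv2d(16, 4, 1),  # 4 classes
)

x = torch.randn(2, 3, 64, 64)

seg_model.train()
out_train = seg_model(x)        # BatchNorm uses this batch's mean/var, Dropout active

seg_model.eval()
with torch.no_grad():
    out_eval = seg_model(x)     # BatchNorm uses running mean/var, Dropout off

# The two modes generally give different outputs; the gap grows when the running
# statistics do not match the data the model actually sees downstream.
print((out_train - out_eval).abs().mean().item())
```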
