Minimum requirements of VRAM #16
This might depend on your PyTorch/CUDA/cuDNN version. With the versions I use, this line is necessary to prevent a CUDA error. If it works fine without it in your setup, you could comment that line out; I don't think it will affect the predictions. See: https://pytorch.org/docs/1.7.1/_modules/torch/nn/modules/rnn.html#RNNBase.flatten_parameters I use 4 Titan (Pascal) GPUs.
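For context, a minimal sketch of what flatten_parameters() does on a generic LSTM; the module and the shapes here are illustrative, not the ones LAV actually uses:

```python
import torch
import torch.nn as nn

# Illustrative LSTM; LAV's actual module and sizes may differ.
rnn = nn.LSTM(input_size=64, hidden_size=128, num_layers=2, batch_first=True).cuda()

# Compacts the RNN weights into a single contiguous chunk of GPU memory.
# On some PyTorch/cuDNN combinations, skipping this triggers a warning or a
# CUDA error once the weights become non-contiguous (e.g. after loading a
# checkpoint or moving the module between devices).
rnn.flatten_parameters()

x = torch.randn(8, 20, 64, device="cuda")  # (batch, seq_len, features)
out, _ = rnn(x)
```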
Thank you for the quick reply! I have another question regarding your Dockerfile.
Hello, we did not use Docker for training but used a conda env to manage the dependencies. Let me know if you have any issues with the dependencies and I am happy to take a look.
Hi, thank you for your message. I managed to create a Docker image for training and to maintain the dependencies, and all of the modules provided in the lav folder appear to work fine. However, while the individual training runs looked fine in the wandb logs, when I use the segmentation model that I trained myself to perform point painting or to train the full model, the segmentation performance drops considerably, probably due to the seg_model.eval() call (Line 57 in dc9b4cf).
I wonder whether you saw the same issue when you trained the models provided in the weights folder. It seems related to the switching behavior of BatchNorm / Dropout between training and testing, but I have not figured it out yet. Do you have any idea about this?
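For what it's worth, here is a minimal sketch (generic PyTorch, not the LAV code) of why switching to eval mode can change BatchNorm outputs: in train mode BatchNorm normalizes with the current batch's statistics, while in eval mode it uses the accumulated running_mean/running_var, so mismatched running statistics only show up after model.eval():

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(8)
x = torch.randn(4, 8, 32, 32)

bn.train()
y_train = bn(x)   # normalizes with this batch's mean/var and updates running stats

bn.eval()
y_eval = bn(x)    # normalizes with the accumulated running_mean / running_var

# If the running statistics do not match the data distribution (e.g. the
# checkpoint's stats were collected under different preprocessing), y_eval can
# differ substantially from y_train, which appears as a drop in segmentation
# quality once the model is put into eval mode.
print((y_train - y_eval).abs().max())
```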
Hi Chen, thank you so much for sharing your amazing work!
I was able to run your pretrained agent with the weight checkpoints in the weights folder.
Now I have decided to reproduce your work by following the training steps described in TRAINING.md.
At the moment, I am at the step of training the privileged motion planner and I am getting a CUDA out-of-memory error with your dev_test version of the dataset, which I think is small enough to start with.
Apparently, when I run python -m lav.train_bev, this line of code consumes a lot of GPU memory and causes the above error. By the way, my Ubuntu machine has two somewhat older Titan X cards with 12 GB of VRAM each.
I am wondering what graphics card specifications are required to reproduce this work from scratch.
Is my PC not enough for this, or could you tell me your machine's specifications?
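For anyone debugging the same out-of-memory error, a small generic PyTorch snippet (not part of LAV) for checking how much memory each card has and how much is actually in use while train_bev runs:

```python
import torch

# Generic memory check; the two 12 GB Titan X cards mentioned above are just
# the setup this was written for, any CUDA machine works the same way.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    total = torch.cuda.get_device_properties(i).total_memory / 1024**3
    print(f"GPU {i}: {allocated:.2f} GiB allocated, "
          f"{reserved:.2f} GiB reserved, {total:.2f} GiB total")
```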