
mnist tutorial failing #15

Open
sp7412 opened this issue Jul 19, 2020 · 12 comments
Labels
bug (Something isn't working) · good first issue (Good for newcomers)

Comments

@sp7412

sp7412 commented Jul 19, 2020

In Step 1: pip install -r requirements.txt fails to run.

@shcheklein
Member

It probably means the tutorial is out of date as well. Needs some care.

@shcheklein shcheklein added good first issue Good for newcomers bug Something isn't working labels Jul 19, 2020
@jorgeorpinel
Contributor

pip fails to install pandas 0.23.4, which is pretty old, yes.

@iesahin
Contributor

iesahin commented Mar 14, 2021

I don't think it's the age of the packages; the container is limited in CPU and memory, so compiling pandas takes effectively forever.

[Screenshot: Screen Shot 2021-03-14 at 16 37 58]

I don't think a newer version will solve the problem. A precompiled version from apt may run. (And even in that case, I wonder how long it would take to train a model.)

I think we can remove this scenario completely.

I installed python3-pandas and python3-sklearn from apt packages; it looks like the SVM part of the tutorial can be run. I'm not sure about the torch/CNN part. (At some point it asks to install torch==1.0.0, but that version cannot be found. torch-1.4.0 can be installed, but then the training script cannot find it. I need to look into that.)
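For reference, the apt route was just this (assuming a Debian/Ubuntu-based image; exact commands from memory):

```bash
sudo apt-get update
sudo apt-get install -y python3-pandas python3-sklearn
```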

We could add some SVM parameterization and use this as an experiments tutorial. The downloaded data is not images, though; it's a CSV file extracted from the images. I could add featurization as well.

All the content and commands would need to change, though: the dvc run commands have -f parameters, and there are many parts like this:

[Screenshot: Screen Shot 2021-03-14 at 17 08 36]

Actually, removing it may be OK; it needs a total rewrite anyway. I can create a new one using https://github.com/iterative/dvc-checkpoints-mnist

WDYT? @jorgeorpinel @shcheklein @dberenbaum

@shcheklein
Member

I'm fine with removing it and starting from Dave's one.

@iesahin
Contributor

iesahin commented Mar 15, 2021

I tested the basic branch of dvc-checkpoints-mnist, and it gets killed in the training step due to the memory limits in Katacoda. Do you have a preference for which deep learning architecture/library/technique is used in the examples? It may be possible to use TFLite and download models directly in the examples. I'm asking because it may take more time to adapt to low-memory environments.

Katacoda has 1.5 GB of RAM. I can create a Docker environment to simulate this, e.g. as sketched below.
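Something along these lines should approximate the constraint (image name and swap limit are illustrative; --memory-swap is the total of RAM plus swap):

```bash
docker run -it --rm --memory=1.5g --memory-swap=2.5g python:3.8-slim bash
```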

BTW I read Docker discussions in iterative/dvc.org#811 and iterative/dvc#2844

Instead of a general-purpose Docker image, the docs could provide a container that downloads the data and sets up the example project. We can use it for the tests, and people can build on top of it or create their own Docker environments if they like.

@shcheklein @dberenbaum @jorgeorpinel

@shcheklein
Member

@iesahin do we know what takes all the memory? It's a bit unexpected that MNIST requires that much RAM.

@iesahin
Contributor

iesahin commented Mar 15, 2021

@shcheklein I didn't profile it thoroughly, but the line in training that builds the prediction, y_pred = model(x) or something like that, causes the kill. (I'm writing from my phone.) The data itself is downloaded and loaded into memory fine, but the model may take that much RAM.

There may be some engineering, like increasing the swap space or manual gc, to reduce the required memory. But Torch itself is a rather expensive library to run with 1-1.5 GB RAM + 1 GB swap.
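The manual-gc route would look something like this (a toy sketch with a stand-in model and fake data, not the tutorial's actual script):

```python
import gc
import torch
import torch.nn as nn

# Toy stand-in for the tutorial's model.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

x = torch.randn(64, 1, 28, 28)   # one MNIST-sized batch of fake data
with torch.no_grad():            # skip autograd bookkeeping entirely
    y_pred = model(x)            # the call that reportedly gets killed
del x, y_pred                    # drop references as soon as possible
gc.collect()                     # prompt Python to release freed objects
```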

There could be different versions of the classifiers, like random forest, SVM, NB, CNN, MLP, etc., to test and experiment with, selected via parameters in DVC. We can use the modest ones in Katacoda, and users can try all of them in their own environments.
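Roughly what I have in mind for the parameter-driven selection (the params.yaml layout and names are hypothetical, and data loading is elided):

```python
# Hypothetical params.yaml:
#   model:
#     type: svm        # one of: svm, rf, nb, mlp
import yaml
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

CLASSIFIERS = {
    "svm": SVC,
    "rf": RandomForestClassifier,
    "nb": GaussianNB,
    "mlp": MLPClassifier,
}

with open("params.yaml") as f:
    params = yaml.safe_load(f)

# Instantiate the classifier named in the params file.
clf = CLASSIFIERS[params["model"]["type"]]()
# ... load the MNIST CSV into X, y, then clf.fit(X, y) ...
```

An experiment could then switch classifiers with something like dvc exp run --set-param model.type=rf, assuming we move this to DVC experiments.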

@dberenbaum

I can (and probably should) load mini-batches of data in the example, which could help, but maybe not if PyTorch itself already uses almost all the available memory. We could also try a more lightweight deep learning framework. Also curious which branch you are using, @iesahin?
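The mini-batch change I have in mind is roughly this (a sketch, assuming torchvision is available; the model and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Stream MNIST in small batches instead of holding one big tensor in RAM.
train_ds = datasets.MNIST("data", train=True, download=True,
                          transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for x, y in train_loader:  # only one batch is resident at a time
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```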

@iesahin
Contributor

iesahin commented Mar 16, 2021 via email

@iesahin
Contributor

iesahin commented Mar 16, 2021

If libtorch.so really takes up 1.2 GB of RAM as discussed here, there's not much we can do about it.

@iesahin
Contributor

iesahin commented Mar 18, 2021

I tested the dogs and cats data and model versioning tutorial on Katacoda in a Docker container: https://dvc.org/doc/use-cases/versioning-data-and-model-files/tutorial

TensorFlow runs, but creating the model takes a long time: python train.py takes around 30 minutes, and most of that time is spent before the epoch progress bars appear. It may be possible to load the model all at once and reduce this considerably.

But I'm not sure it can be made near-instant. We may still need smaller datasets/models for Katacoda.

I'll also test the MNIST dataset with TF on Katacoda. TF seems more suitable for low-memory environments, MNIST is better known, and it's like the Hello World of ML tutorials.

@shcheklein @dberenbaum

@iesahin
Contributor

iesahin commented Mar 18, 2021

I tested the MNIST example from the TF site.

https://gist.github.com/iesahin/f3a22ebca5b52579748dc7d724047c8d

The whole script takes less than 1 minute to finish on Katacoda. The model is a bit simple: no CNN, just a single Dense layer of 128 units (97% validation accuracy). But at least now we know it's possible to use MNIST on Katacoda.
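For reference, the model in question looks roughly like this (assuming the gist matches the upstream TF beginner quickstart it is based on):

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Single Dense(128) hidden layer, no convolutions.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```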
