
mnist tutorial failing #15

Open
sp7412 opened this issue Jul 19, 2020 · 12 comments
Labels
bug (Something isn't working) · good first issue (Good for newcomers)

Comments

@sp7412

sp7412 commented Jul 19, 2020

In Step 1: pip install -r requirements.txt fails to run.

@shcheklein
Member

It probably means the tutorial is out of date as well. Needs some care.

@shcheklein shcheklein added good first issue Good for newcomers bug Something isn't working labels Jul 19, 2020
@jorgeorpinel
Contributor

pip fails to install pandas 0.23.4, which is pretty old, yes.

@iesahin
Contributor

iesahin commented Mar 14, 2021

I don't think it's the age of the packages; the container is limited in CPU and memory, so compiling pandas takes effectively forever.

[Screenshot: Screen Shot 2021-03-14 at 16 37 58]

I don't think a newer version will solve the problem. A precompiled version from apt may run. (And even in that case, I wonder how long it would take to train a model.)

I think we can remove this scenario completely.

I installed python3-pandas and python3-sklearn from apt packages; it looks like the SVM part of the tutorial can be run. I'm not sure about the torch/CNN part. (At some point it asks to install torch==1.0.0, but that version cannot be found. torch-1.4.0 can be installed, but then the training script cannot find it. I need to look into that.)
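For reference, the apt route was just this (assuming a Debian/Ubuntu-based image; exact commands from memory):

```bash
sudo apt-get update
sudo apt-get install -y python3-pandas python3-sklearn
```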

We could add some SVM parameterization and use this as an experiments tutorial. The downloaded data is not images, though; it's a CSV file extracted from the images. I could add featurization as well.

All the content and commands would need to change, though: the dvc run commands have -f parameters, and there are many parts like this:

[Screenshot: Screen Shot 2021-03-14 at 17 08 36]

Actually, removing it may be OK; it needs a total rewrite anyway. I can create a new one using https://github.com/iterative/dvc-checkpoints-mnist

WDYT? @jorgeorpinel @shcheklein @dberenbaum

@shcheklein
Member

I'm fine with removing it and starting from Dave's one.

@iesahin
Contributor

iesahin commented Mar 15, 2021

I tested the basic branch of dvc-checkpoints-mnist, and it gets killed in the training step due to the memory limits in Katacoda. Do you have a preference for which deep learning architecture/library/technique is used in the examples? It may be possible to use TFLite and download models directly in the examples. I'm asking because it may take more time to adapt to low-memory environments.

Katacoda has 1.5 GB of RAM. I can create a Docker environment to simulate this, e.g. as sketched below.
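Something along these lines should approximate the constraint (image name and swap limit are illustrative; --memory-swap is the total of RAM plus swap):

```bash
docker run -it --rm --memory=1.5g --memory-swap=2.5g python:3.8-slim bash
```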

BTW I read Docker discussions in iterative/dvc.org#811 and iterative/dvc#2844

Instead of a general-purpose Docker image, the docs could provide a container that downloads the data and sets up the example project. We can use it for the tests, and people can build on top of it or create their own Docker environments if they like.

@shcheklein @dberenbaum @jorgeorpinel

@shcheklein
Member

@iesahin do we know what takes all the memory? It's a bit unexpected that MNIST requires that much RAM.

@iesahin
Contributor

iesahin commented Mar 15, 2021

@shcheklein I didn't profile it thoroughly, but the line in training that builds the prediction, y_pred = model(x) or something like that, causes the kill. (I'm writing from my phone.) The data itself is downloaded and loaded into memory fine, but the model may take that much RAM.

There may be some engineering, like increasing the swap space or manual gc, to reduce the required memory. But Torch itself is a rather expensive library to run with 1-1.5 GB RAM + 1 GB swap.
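The manual-gc route would look something like this (a toy sketch with a stand-in model and fake data, not the tutorial's actual script):

```python
import gc
import torch
import torch.nn as nn

# Toy stand-in for the tutorial's model.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

x = torch.randn(64, 1, 28, 28)   # one MNIST-sized batch of fake data
with torch.no_grad():            # skip autograd bookkeeping entirely
    y_pred = model(x)            # the call that reportedly gets killed
del x, y_pred                    # drop references as soon as possible
gc.collect()                     # prompt Python to release freed objects
```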

There could be different versions of the classifiers, like random forest, SVM, NB, CNN, MLP, etc., to test and experiment with, selected via parameters in DVC. We can use the modest ones in Katacoda, and users can try all of them in their own environments.
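Roughly what I have in mind for the parameter-driven selection (the params.yaml layout and names are hypothetical, and data loading is elided):

```python
# Hypothetical params.yaml:
#   model:
#     type: svm        # one of: svm, rf, nb, mlp
import yaml
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

CLASSIFIERS = {
    "svm": SVC,
    "rf": RandomForestClassifier,
    "nb": GaussianNB,
    "mlp": MLPClassifier,
}

with open("params.yaml") as f:
    params = yaml.safe_load(f)

# Instantiate the classifier named in the params file.
clf = CLASSIFIERS[params["model"]["type"]]()
# ... load the MNIST CSV into X, y, then clf.fit(X, y) ...
```

An experiment could then switch classifiers with something like dvc exp run --set-param model.type=rf, assuming we move this to DVC experiments.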

@dberenbaum

I can (and probably should) load mini-batches of data in the example, which could help, but maybe not if PyTorch itself already uses almost all the available memory. We could also try a more lightweight deep learning framework. Also curious which branch you are using, @iesahin?
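The mini-batch change I have in mind is roughly this (a sketch, assuming torchvision is available; the model and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Stream MNIST in small batches instead of holding one big tensor in RAM.
train_ds = datasets.MNIST("data", train=True, download=True,
                          transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for x, y in train_loader:  # only one batch is resident at a time
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```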

@iesahin
Contributor

iesahin commented Mar 16, 2021 via email

@iesahin
Contributor

iesahin commented Mar 16, 2021

If libtorch.so really takes up 1.2 GB of RAM as discussed here, there's not much we can do about it.

@iesahin
Contributor

iesahin commented Mar 18, 2021

I tested the dogs and cats data and model versioning tutorial on Katacoda in a Docker container: https://dvc.org/doc/use-cases/versioning-data-and-model-files/tutorial

TensorFlow runs, but creating the model takes a long time: python train.py takes around 30 minutes, and most of that time is spent before the epoch progress bars appear. It may be possible to load the model all at once and reduce this considerably.

But I'm not sure it can be made near-instant. We may still need smaller datasets/models for Katacoda.

I'll also test the MNIST dataset with TF on Katacoda. TF seems more suitable for low-memory environments, MNIST is better known, and it's like the Hello World of ML tutorials.

@shcheklein @dberenbaum

@iesahin
Contributor

iesahin commented Mar 18, 2021

I tested the MNIST example from the TF site.

https://gist.github.com/iesahin/f3a22ebca5b52579748dc7d724047c8d

The whole script takes less than 1 minute to finish on Katacoda. The model is a bit simple: no CNN, just a single Dense layer of 128 units (97% validation accuracy). But at least now we know it's possible to use MNIST on Katacoda.
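For reference, the model in question looks roughly like this (assuming the gist matches the upstream TF beginner quickstart it is based on):

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Single Dense(128) hidden layer, no convolutions.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```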
