MNIST dataset and Docker containers for Getting Started / Use Case documents and Katacoda scenarios #2318
Labels: A: docs (user documentation), type: enhancement
In our discussions with @shcheklein, he emphasized the importance of a stable and standard dataset for the whole documentation.
The current example project uses a subset of the Stack Overflow question-tagging dataset with a Random Forest classifier. The data is still being updated and is difficult to present as a downloadable asset. In the Katacoda scenarios the dataset is trimmed to the first 12,000 records due to RAM limitations, and that amount of data is not adequate for a meaningful demonstration of pipeline parameters: increasing the feature size, n_grams, or predictors may or may not move the accuracy beyond the 0.41-0.46 range in the Katacoda environment.
Also, reproducing the whole (non-trimmed) version requires at least 8 GB of RAM. Although this is a modest requirement for Deep Learning workflows, aiming for 4 GB seems sensible, considering that the example may be run in a virtual environment for a quick assessment.
We also have a use case based on Chollet's cats-and-dogs tutorial. It uses an older version of Keras. Although it works on Katacoda, a single `python train.py` run takes around 30 minutes, probably due to the feature-generation stage and having a separate file for each image. This could probably be engineered to run faster.

For the experimentation features in DVC 2.0, @dberenbaum has created several showcases in https://github.com/iterative/dvc-checkpoints-mnist. These use MNIST with PyTorch. I tested them on Katacoda without success, most probably due to PyTorch's memory requirements.
Yesterday I tested TensorFlow's MNIST example in Katacoda: it runs quickly and reaches 0.97 accuracy. It's not a very advanced model (a single hidden Dense/128 layer), but it can be extended with two more CNN layers and a few parameters to tune them in order to improve performance.
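As a rough illustration, this is what such a parameterized model might look like. The layer sizes and parameter names below are my own placeholders, not code from the existing TF example or from dvc-checkpoints-mnist:

```python
# Sketch: the TF MNIST example (single Dense(128) hidden layer) extended
# with two Conv2D layers. conv1_filters, conv2_filters and dense_units are
# illustrative parameters, e.g. to be read from params.yaml by DVC.
import tensorflow as tf

def build_model(conv1_filters=32, conv2_filters=64, dense_units=128):
    model = tf.keras.Sequential([
        tf.keras.layers.Reshape((28, 28, 1), input_shape=(28, 28)),
        tf.keras.layers.Conv2D(conv1_filters, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(conv2_filters, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(dense_units, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0
    model = build_model()
    model.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))
```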
What I propose is something like this:
- A standard dataset based on MNIST to replace the `data.xml` files. This can be a copy of the TF MNIST dataset and can have multiple versions to simulate changing data (see the sketch after this list). A corresponding update to the models, training, evaluation, parameters, etc. is necessary as well. These models can stay modest and open to improvement, with a sensible initial performance, since our goal is to show how DVC can be used for this kind of problem.
- For each of the GS/UC documents, we can create a Docker container. These can be run on the user's machine with a simple `docker run -it dvc/get-started-versioning` and contain all the code, data, requirements, and artifacts needed to run them identically to the document's version on the site.
- These containers can be run in Katacoda as well. Currently, each Katacoda environment has its own custom startup script, which is a maintenance burden. (Most of them weren't even starting up until a few weeks ago.) These startup scripts could be replaced with `docker run` commands too.
- These containers can be used to replay the commands in the docs (with a tool like rundoc) and check for changes in the output or data.
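To make the first bullet more concrete, here is a rough sketch of how multiple dataset versions could be generated from the TF copy of MNIST; the file names and split sizes are made up for illustration:

```python
# Sketch: turn the TF copy of MNIST into a few downloadable "versions"
# to simulate changing data (v1 = subset, v2 = more records, v3 = all).
# Output file names and split sizes are illustrative only.
import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

versions = {
    "mnist-v1.npz": 20_000,        # small initial dataset
    "mnist-v2.npz": 40_000,        # "new data arrives"
    "mnist-v3.npz": len(x_train),  # full training set
}

for name, n in versions.items():
    np.savez_compressed(
        name,
        x_train=x_train[:n], y_train=y_train[:n],
        x_test=x_test, y_test=y_test,
    )
    print(f"wrote {name} with {n} training records")
```

Each of these archives could then be tracked with `dvc add` (or pulled via `dvc get`/`dvc import` from a dataset registry), much like `data.xml` is handled today.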
This issue is mainly for discussing these bullet points. Thanks.
@shcheklein @dberenbaum @jorgeorpinel @dmpetrov