How to push in-memory object directly to remote store? #5068

Closed
BikashShaw opened this issue Dec 9, 2020 · 4 comments
Labels
awaiting response: we are waiting for your reply, please respond! :)
question: I have a question?

Comments

@BikashShaw

Please consider this as a naive question rather than any bug or improvement report.

Is there an equivalent "write" function to dvc.api.read()? We want to push an in-memory object directly to the remote store via the Python API, without saving it to local storage and running bash commands.
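
To illustrate what we mean (the path and repo URL below are placeholders, and dvc.api.write() is purely hypothetical, it does not exist):

import pickle
import dvc.api

# The read direction already works without a local copy of the data:
# dvc.api.read() returns the file contents straight from the remote.
raw = dvc.api.read(
    "models/model.pkl",                     # placeholder path tracked in the repo
    repo="https://github.com/org/project",  # placeholder repo URL
    mode="rb",
)
model = pickle.loads(raw)

# What we are asking about is the reverse direction, e.g. a hypothetical
#   dvc.api.write("models/model.pkl", raw, repo=..., remote=...)
# that would upload straight to the remote and leave only .dvc metadata behind.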

Thanks!

@shcheklein shcheklein transferred this issue from iterative/dvc.org Dec 9, 2020
@jorgeorpinel jorgeorpinel added the question label Dec 10, 2020
@efiop
Contributor

efiop commented Dec 11, 2020

Hi @BikashShaw !

There is no native interface for that in dvc itself right now :( It seems like this could become an API counterpart of the straight-to-remote CLI feature #4520, but that would only cover pushing the data and generating the local *.dvc file; you would still need to push the git changes somehow (see the rough sketch below). Could you elaborate on your scenario, please?
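
For the git part, something along the lines of this rough GitPython sketch would still be needed on top of a straight-to-remote push (the file paths, commit message, and "origin" remote name are just assumptions):

import git  # GitPython

# Rough sketch: after dvc generates the .dvc file and updates .gitignore,
# those changes still have to be committed and pushed to git separately.
repo = git.Repo(".", search_parent_directories=True)
repo.index.add(["models/model.pkl.dvc", "models/.gitignore"])  # assumed paths
repo.index.commit("Track model via dvc")                       # assumed message
repo.remote(name="origin").push()                              # assumed remote name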

@jayant91089

Hi @efiop, thanks for getting back on this. I am working with @BikashShaw on this, so let me further elaborate on the use case.
The snippet below is the code we use to track our ML models with dvc/s3: it writes a model to disk, dvc-adds it, git-adds the *.dvc file and .gitignore, pushes the model to s3, and removes the model from disk. It lets us do all of this without ever leaving the Jupyter notebook.

While we are making do with this (somewhat hacky) workflow for models, we don't really want to do the same with large data files, since it uses the disk as an intermediary. So we are looking for something more elegant along the lines of dvc.api.read() -- say, a dvc.api.write() that would push a data/model object in RAM to s3, leaving the dvc metadata behind in git, without using the disk as an intermediary.

import git
import joblib
import os
import subprocess

def get_git_root(path):
    git_repo = git.Repo(path, search_parent_directories=True)
    git_root = git_repo.git.rev_parse("--show-toplevel")
    return git_root

def commit_model_with_msg(model_info,
                       path = "path/to/somewhere/in/models/dir/of/project/repo",
                       name = "model_expt_x",
                       commit_msg = "Adding model to vc"
                      ):
    """Start tracking a model using dvc/s3 and git 
    
    Parameters
    ----------
    model_info (dict)
        A `dict` containing keys `'pipeline'`, `'features'`, and `'explainer'`.
    path (str)
        A path under the `models/` dir of the project where the model's `.dvc` metadata will live.
    name (str)
        A unique identifier for the model.
    commit_msg (str)
        The git commit message to use.
    
    Returns
    -------
    None
    
    """
    # create directory if needed
    directory = os.path.join(get_git_root(os.getcwd()),path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    fname = os.path.join(get_git_root(os.getcwd()),path,name)+".joblib"
    
    # dump model to disk
    joblib.dump(model_info, fname)
    
    # dvc add
    process = subprocess.Popen(["dvc", "add", fname],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("dvc add... \n", stdout.decode(),stderr.decode())
    
    # git add (parse the "git add <files>" hint printed by `dvc add`
    # to stage the generated .dvc file and updated .gitignore)
    try:
        process = subprocess.Popen(["git", "add"]+stdout.decode().split("git add")[1].split(),
                         stdout=subprocess.PIPE, 
                         stderr=subprocess.PIPE)
        stdout, stderr = process.communicate()
        print("git add... \n", stdout.decode(),stderr.decode())
        
    except IndexError:
        print("Model name already under vc...no changes to the repo")
        return 
    
    # git commit
    process = subprocess.Popen(["git", "commit", "-m", commit_msg],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("git commit... \n",stdout.decode(),stderr.decode())
    
    # dvc push 
    process = subprocess.Popen(["dvc", "push"],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("dvc push... \n",stdout.decode(),stderr.decode())
    
    process = subprocess.Popen(["rm", fname],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("rm... \n",stdout.decode(),stderr.decode()) 

Thanks for helping us!

@efiop
Contributor

efiop commented Dec 15, 2020

@jayant91089 Thanks for the example code! Makes sense! OK, I definitely see this as #4520, but for the API. Most likely, all the internals needed for your feature request will be implemented in that ticket. Does the current workflow work fine for you for now? If it does, I would advise sticking with it until #4520 is implemented.

@efiop efiop added the awaiting response label Dec 15, 2020
@BikashShaw BikashShaw reopened this Dec 15, 2020
@efiop
Contributor

efiop commented Jan 4, 2021

Closing as stale.

@efiop efiop closed this as completed Jan 4, 2021