How to push in-memory object directly to remote store? #5068

Closed
BikashShaw opened this issue Dec 9, 2020 · 4 comments
Labels
awaiting response: we are waiting for your reply, please respond! :)
question: I have a question?

Comments

@BikashShaw

Please consider this as a naive question rather than any bug or improvement report.

Is there an equivalent "write" function to dvc.api.read()? We want to push an in-memory object directly to the remote store via the Python API, without saving it to local storage and running bash commands.
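
To illustrate what we mean (the path and repo URL below are placeholders, and dvc.api.write() is purely hypothetical, it does not exist):

import pickle
import dvc.api

# The read direction already works without a local copy of the data:
# dvc.api.read() returns the file contents straight from the remote.
raw = dvc.api.read(
    "models/model.pkl",                     # placeholder path tracked in the repo
    repo="https://github.com/org/project",  # placeholder repo URL
    mode="rb",
)
model = pickle.loads(raw)

# What we are asking about is the reverse direction, e.g. a hypothetical
#   dvc.api.write("models/model.pkl", raw, repo=..., remote=...)
# that would upload straight to the remote and leave only .dvc metadata behind.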

Thanks!

@shcheklein shcheklein transferred this issue from iterative/dvc.org Dec 9, 2020
@jorgeorpinel jorgeorpinel added the question label Dec 10, 2020
@efiop
Contributor

efiop commented Dec 11, 2020

Hi @BikashShaw !

There is no native interface for that in dvc itself right now :( It seems like this could become an API counterpart of the straight-to-remote CLI feature #4520, but that would only cover pushing the data and generating the local *.dvc file; you would still need to push the git changes somehow (see the rough sketch below). Could you elaborate on your scenario, please?
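
For the git part, something along the lines of this rough GitPython sketch would still be needed on top of a straight-to-remote push (the file paths, commit message, and "origin" remote name are just assumptions):

import git  # GitPython

# Rough sketch: after dvc generates the .dvc file and updates .gitignore,
# those changes still have to be committed and pushed to git separately.
repo = git.Repo(".", search_parent_directories=True)
repo.index.add(["models/model.pkl.dvc", "models/.gitignore"])  # assumed paths
repo.index.commit("Track model via dvc")                       # assumed message
repo.remote(name="origin").push()                              # assumed remote name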

@jayant91089

Hi @efiop, thanks for getting back on this. I am working with @BikashShaw on this, so let me further elaborate on the use case.
The snippet below is the code we use to track our ML models with dvc/s3: it writes a model to disk, dvc-adds it, git-adds the *.dvc file and .gitignore, pushes the model to s3, and removes the model from disk. It lets us do all of this without ever leaving the Jupyter notebook.

While we are making do with this (somewhat hacky) workflow for models, we don't really want to do the same with large data files, since it uses the disk as an intermediary. So we are looking for something more elegant along the lines of dvc.api.read() -- say, a dvc.api.write() that would push a data/model object in RAM to s3, leaving the dvc metadata behind in git, without using the disk as an intermediary.

import git
import joblib
import os
import subprocess

def get_git_root(path):
    git_repo = git.Repo(path, search_parent_directories=True)
    git_root = git_repo.git.rev_parse("--show-toplevel")
    return git_root

def commit_model_with_msg(model_info,
                       path = "path/to/somewhere/in/models/dir/of/project/repo",
                       name = "model_expt_x",
                       commit_msg = "Adding model to vc"
                      ):
    """Start tracking a model using dvc/s3 and git 
    
    Parameters
    ----------
    model_info (dict)
        A `dict` containing keys `'pipeline'`, `'features'`, and `'explainer'`.
    path (str)
        A path under the `models/` dir of the project where the model's `.dvc` metadata will live.
    name (str)
        A unique identifier for the model.
    commit_msg (str)
        The git commit message to use.
    
    Returns
    -------
    None
    
    """
    # create directory if needed
    directory = os.path.join(get_git_root(os.getcwd()),path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    fname = os.path.join(get_git_root(os.getcwd()),path,name)+".joblib"
    
    # dump model to disk
    joblib.dump(model_info, fname)
    
    # dvc add
    process = subprocess.Popen(["dvc", "add", fname],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("dvc add... \n", stdout.decode(),stderr.decode())
    
    # git add (parse the "git add <files>" hint printed by `dvc add`
    # to stage the generated .dvc file and updated .gitignore)
    try:
        process = subprocess.Popen(["git", "add"]+stdout.decode().split("git add")[1].split(),
                         stdout=subprocess.PIPE, 
                         stderr=subprocess.PIPE)
        stdout, stderr = process.communicate()
        print("git add... \n", stdout.decode(),stderr.decode())
        
    except IndexError:
        print("Model name already under vc...no changes to the repo")
        return 
    
    # git commit
    process = subprocess.Popen(["git", "commit", "-m", commit_msg],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("git commit... \n",stdout.decode(),stderr.decode())
    
    # dvc push 
    process = subprocess.Popen(["dvc", "push"],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("dvc push... \n",stdout.decode(),stderr.decode())
    
    process = subprocess.Popen(["rm", fname],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("rm... \n",stdout.decode(),stderr.decode()) 

Thanks for helping us!

@efiop
Contributor

efiop commented Dec 15, 2020

@jayant91089 Thanks for the example code! Makes sense! OK, I definitely see this as #4520, but for the API. Most likely, all the internals needed for your feature request will be implemented in that ticket. Does the current workflow work fine for you for now? If it does, I would advise sticking with it until #4520 is implemented.

@efiop efiop added the awaiting response label Dec 15, 2020
@BikashShaw BikashShaw reopened this Dec 15, 2020
@efiop
Contributor

efiop commented Jan 4, 2021

Closing as stale.

@efiop efiop closed this as completed Jan 4, 2021