GitHub - krandiash/clever-meerkat: Flexible data structures for complex machine learning datasets.

Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

⚡️ Quickstart

pip install meerkat-ml

Optional: some parts of Meerkat rely on optional dependencies. If you know which optional dependencies you'd like to install, you can do so using something like pip install meerkat-ml[dev,text] instead. See setup.py for a full list of optional dependencies.

Installing from dev: pip install "meerkat-ml[text] @ git+https://github.com/robustness-gym/meerkat@dev"

Load a dataset into a DataPanel and get going!

import meerkat as mk
from meerkat.contrib.imagenette import download_imagenette

download_imagenette(".")
dp = mk.DataPanel.from_csv("imagenette2-160/imagenette.csv")
dp["img"] = mk.ImageColumn.from_filepaths(dp["img_path"])

dp[["label", "split", "img"]].lz[:3]

To learn more, continue following along in our tutorial:

💡 What is Meerkat?

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by efficient and robust IO under the hood.

Meerkat's core contribution is the DataPanel, a simple columnar data abstraction. The Meerkat DataPanel can house columns of arbitrary type – from integers and strings to complex, high-dimensional objects like videos, images, medical volumes and graphs.

DataPanel loads high-dimensional data lazily. A full high-dimensional dataset won't typically fit in memory. Behind the scenes, DataPanel handles this by only materializing these objects when they are needed.

import meerkat as mk

# Images are NOT read from disk at DataPanel creation...
dp = mk.DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'image': mk.ImageColumn.from_filepaths(['fox.png', 'jump.png', 'dog.png']),
    'label': [0, 1, 0]
}) 

# ...only at this point is "fox.png" read from disk
dp["image"][0]

DataPanel supports advanced indexing. Using indexing patterns similar to those of Pandas and NumPy, we can access a subset of a DataPanel's rows and columns.

import meerkat as mk
dp = ... # create DataPanel

# Pull a column out of the DataPanel
new_col: mk.ImageColumn = dp["image"]

# Create a new DataPanel from a subset of the columns in an existing one
new_dp: mk.DataPanel = dp[["image", "label"]] 

# Create a new DataPanel from a subset of the rows in an existing one
new_dp: mk.DataPanel = dp[10:20] 
new_dp: mk.DataPanel = dp[np.array([0,2,4,8])]

# Pull a column out of the DataPanel and get a subset of its rows 
new_col: mk.ImageColumn = dp["image"][10:20]

DataPanel supports map, update and filter operations. When training and evaluating our models, we often perform operations on each example in our dataset (e.g. compute a model's prediction on each example, tokenize each sentence, compute a model's embedding for each example) and store them . The DataPanel makes it easy to perform these operations and produce new columns (via DataPanel.map), store the columns alongside the original data (via DataPanel.update), and extract an important subset of the datset (via DataPanel.filter). Under the hood, dataloading is multiprocessed so that costly I/O doesn't bottleneck our computation. Consider the example below where we use update a DataPanel with two new columns holding model predictions and probabilities.

# A simple evaluation loop using Meerkat 
dp: DataPanel = ... # get DataPanel
model: nn.Module = ... # get the model
model.to(0).eval() # prepare the model for evaluation

@torch.no_grad()
def predict(batch: dict):
    probs = torch.softmax(model(batch["input"].to(0)), dim=-1)
    return {"probs": probs.cpu(), "pred": probs.cpu().argmax(dim=-1)}

# updated_dp has two new `TensorColumn`s: 1 for probabilities and one
# for predictions
updated_dp: mk.DataPanel = dp.update(function=predict, batch_size=128, is_batched_fn=True)

✉️ About

Meerkat is being developed at Stanford's Hazy Research Lab. Please reach out to kgoel [at] cs [dot] stanford [dot] edu if you would like to use or contribute to Meerkat.

Name		Name	Last commit message	Last commit date
Latest commit History 319 Commits
.github		.github
docs		docs
examples		examples
meerkat		meerkat
tests		tests
.coveragerc		.coveragerc
.flake8		.flake8
.gitignore		.gitignore
.isort.cfg		.isort.cfg
.pre-commit-config.yaml		.pre-commit-config.yaml
.readthedocs.yml		.readthedocs.yml
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE.md		LICENSE.md
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
make.bat		make.bat
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

⚡️ Quickstart

💡 What is Meerkat?

✉️ About

About

Releases

Packages

Languages

License

krandiash/clever-meerkat

Folders and files

Latest commit

History

Repository files navigation

⚡️ Quickstart

💡 What is Meerkat?

✉️ About

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages