At ZML, we are creating exciting AI products on top of our high-performance AI inference stack. Our stack is built for production, using the amazing Zig language, MLIR, and the power of Bazel.
We're very happy to share our inference stack with the world and hope it allows you, too, to build cool and exciting AI projects.
To give you a glimpse of what you can do with ZML, here is an early demo:
It shows a prototype running a LLaMA2 model sharded on 1 NVIDIA RTX 4090, 1 AMD 6800XT, and 1 Google Cloud TPU v2. All accelerators were hosted in different locations, with activations being passed over a VPN.
All processes used the same model code, cross-compiled on a Mac, and copied onto the servers.
For more inspiration, see also the examples below or check out the examples folder.
We use bazel to build ZML and its dependencies. The only prerequisite is bazel, which we recommend downloading through bazelisk, a version manager for bazel.
Please note: If you do not wish to install bazel system-wide, we provide examples/bazel.sh, which downloads it to your home folder and runs it.
Install Bazel (recommended):
curl -L -o /usr/local/bin/bazel 'https://github.com/bazelbuild/bazelisk/releases/download/v1.20.0/bazelisk-linux-amd64'
chmod +x /usr/local/bin/bazel
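The command above fetches the Linux/amd64 build. As a minimal sketch for other machines, the helper below builds the download URL from the OS and CPU architecture. The function name `bazelisk_url` is hypothetical, and the `bazelisk-<os>-<arch>` asset naming is assumed to follow the pattern of the URL above; check the bazelisk release page before relying on it.

```shell
#!/bin/sh
# Sketch: build a bazelisk download URL for a given OS and CPU architecture.
# BAZELISK_VERSION mirrors the version used above.
# bazelisk_url is a hypothetical helper, not part of ZML or bazelisk.
BAZELISK_VERSION="v1.20.0"

bazelisk_url() {
    os="$1"    # as reported by `uname -s`
    arch="$2"  # as reported by `uname -m`

    case "$os" in
        Linux)  os_part="linux" ;;
        Darwin) os_part="darwin" ;;
        *) echo "unsupported OS: $os" >&2; return 1 ;;
    esac

    case "$arch" in
        x86_64|amd64)  arch_part="amd64" ;;
        aarch64|arm64) arch_part="arm64" ;;
        *) echo "unsupported architecture: $arch" >&2; return 1 ;;
    esac

    echo "https://github.com/bazelbuild/bazelisk/releases/download/${BAZELISK_VERSION}/bazelisk-${os_part}-${arch_part}"
}

# Example for the current machine:
#   curl -L -o /usr/local/bin/bazel "$(bazelisk_url "$(uname -s)" "$(uname -m)")"
```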
We have implemented a variety of example models in ZML. See our reference implementations in the examples folder.
The classic handwritten digits recognition task. The model is tasked to recognize a handwritten digit, which has been converted to a 28x28 pixel monochrome image. Bazel will download a pre-trained model, and the test dataset. The program will load the model, compile it, and classify a randomly picked example from the test dataset.
On the command line:
cd examples
bazel run -c opt //mnist
# or
./bazel.sh run -c opt //mnist
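If you switch between machines with and without a system-wide bazel, the choice between the two commands above can be scripted. This is a minimal sketch; `pick_bazel` is a hypothetical helper name, and it assumes you are inside the examples folder so that ./bazel.sh resolves.

```shell
#!/bin/sh
# Sketch: prefer a system-wide bazel, otherwise fall back to the bundled ./bazel.sh.
# pick_bazel is a hypothetical helper, not part of ZML.
pick_bazel() {
    if command -v bazel >/dev/null 2>&1; then
        echo "bazel"
    else
        echo "./bazel.sh"
    fi
}

# Example (run from the examples folder):
#   "$(pick_bazel)" run -c opt //mnist
```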
Our LLM examples start with a small model trained specifically on children's stories. This model was trained by Andrej Karpathy; you can read more about it on his GitHub.
cd examples
bazel run -c opt //llama:TinyLlama-Stories-15M
bazel run -c opt //llama:TinyLlama-Stories-15M -- --prompt="Once upon a time, there was a cute little dragon"
cd examples
bazel run -c opt //llama:OpenLLaMA-3B
bazel run -c opt //llama:OpenLLaMA-3B -- --prompt="Once upon a time,"
This model has restrictions, see here. It requires approval from Meta on Hugging Face, which can take a few hours to be granted.
While waiting, you can already generate an access token to log into Hugging Face from bazel; see here.
Once you've been granted access, you're ready to download a gated model like Meta-Llama-3.1-8B-Instruct!
# requires token in $HOME/.cache/huggingface/token, as created by the
# `huggingface-cli login` command, or the `HUGGINGFACE_TOKEN` environment variable.
cd examples
bazel run -c opt //llama:Llama-3.1-8B-Instruct
bazel run -c opt //llama:Llama-3.1-8B-Instruct -- --prompt="Once upon a time,"
You can also try Llama-3.1-70B-Instruct if you have enough memory.
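Before starting a long download of a gated model, it can help to confirm that a token is visible in one of the two places mentioned above. The following is a minimal sketch: `hf_token_source` is a hypothetical helper, and checking the environment variable before the file is an assumption for illustration, not a statement about which one ZML prefers.

```shell
#!/bin/sh
# Sketch: report where a Hugging Face token would be found.
# Checks the HUGGINGFACE_TOKEN environment variable, then the token file
# written by `huggingface-cli login`. The lookup order is an assumption.
hf_token_source() {
    if [ -n "${HUGGINGFACE_TOKEN:-}" ]; then
        echo "env"
    elif [ -s "${HOME}/.cache/huggingface/token" ]; then
        echo "file"
    else
        echo "missing"
        return 1
    fi
}

# Example:
#   hf_token_source >/dev/null || echo "No token found; run huggingface-cli login first."
```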
Like the 8B model above, Llama-3.2-1B-Instruct also requires approval. See here for access requirements.
cd examples
bazel run -c opt //llama:Llama-3.2-1B-Instruct
bazel run -c opt //llama:Llama-3.2-1B-Instruct -- --prompt="Once upon a time,"
For a larger 3.2 model, you can also try Llama-3.2-3B-Instruct.
You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling / running a model:
- NVIDIA CUDA:
--@zml//runtimes:cuda=true
- AMD ROCm:
--@zml//runtimes:rocm=true
- Google TPU:
--@zml//runtimes:tpu=true
- AWS Trainium/Inferentia 2:
--@zml//runtimes:neuron=true
- AVOID CPU:
--@zml//runtimes:cpu=false
The latter, avoiding compilation for CPU, cuts down compilation time.
So, to run the OpenLLaMA model from above on a host with an NVIDIA GPU, run the following:
cd examples
bazel run -c opt //llama:OpenLLaMA-3B \
--@zml//runtimes:cuda=true \
-- --prompt="Once upon a time,"
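When several runtimes are combined, assembling the flags by hand gets repetitive. The sketch below turns short runtime names into the flags listed above. The function name `zml_runtime_flags` and the `no-cpu` shorthand are hypothetical; only the flag strings themselves come from the list above.

```shell
#!/bin/sh
# Sketch: map runtime names to the --@zml//runtimes:... flags listed above.
# zml_runtime_flags and the "no-cpu" shorthand are hypothetical conveniences.
zml_runtime_flags() {
    flags=""
    for rt in "$@"; do
        case "$rt" in
            cuda|rocm|tpu|neuron)
                # Enable an accelerator runtime.
                flags="${flags:+$flags }--@zml//runtimes:${rt}=true" ;;
            no-cpu)
                # Skip CPU compilation to cut down compilation time.
                flags="${flags:+$flags }--@zml//runtimes:cpu=false" ;;
            *)
                echo "unknown runtime: $rt" >&2
                return 1 ;;
        esac
    done
    echo "$flags"
}

# Example (the unquoted substitution is intentional so the flags split into words):
#   bazel run -c opt //llama:OpenLLaMA-3B $(zml_runtime_flags cuda no-cpu) -- --prompt="Once upon a time,"
```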
bazel test //zml:test
const std = @import("std");
const zml = @import("zml");

/// Model definition
const Mnist = struct {
    fc1: Layer,
    fc2: Layer,

    const Layer = struct {
        weight: zml.Tensor,
        bias: zml.Tensor,

        pub fn forward(self: Layer, input: zml.Tensor) zml.Tensor {
            return self.weight.matmul(input).add(self.bias).relu();
        }
    };

    /// just two linear layers + relu activation
    pub fn forward(self: Mnist, input: zml.Tensor) zml.Tensor {
        std.log.info("Compiling for target: {s}", .{@tagName(input.getContext().target())});
        var x = input.flattenAll().convert(.f32);
        const layers: []const Layer = &.{ self.fc1, self.fc2 };
        for (layers) |layer| {
            x = zml.call(layer, .forward, .{x});
        }
        return x.argMax(0, .u8).indices;
    }
};
const Sdpa = struct {
    pub fn forward(_: Sdpa, ctx: *zml.Context, q_: zml.Tensor, k_: zml.Tensor, v_: zml.Tensor) zml.Tensor {
        const q = q_.withTags(.{ .b, .h, .q, .hd });
        const k = k_.withTags(.{ .b, .h, .k, .hd });
        const v = v_.withTags(.{ .b, .h, .k, .hd });
        const attn_mask = zml.nn.causalAttnMask(ctx, .{ .q = q.dim(.q), .k = k.dim(.k) }, q.dtype(), null);
        return zml.nn.sdpa(ctx, q, k, v, .{ .attn_mask = attn_mask });
    }
};
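For readers less familiar with the operation, the Sdpa example above can be read against the textbook definition of scaled dot-product attention with a causal mask. This is the standard formula, given here as an aid; that zml.nn.sdpa uses exactly this scaling by default is an assumption, not something stated above. Here $d_{hd}$ is the size of the .hd axis, $i$ indexes the .q axis, and $j$ indexes the .k axis.

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_{hd}}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
  0       & \text{if } j \le i, \\
  -\infty & \text{otherwise.}
\end{cases}
```

The softmax is taken over the .k axis, independently for every batch (.b) and head (.h) entry.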
You might want to check out more examples, read through the documentation directly on GitHub, or, for the full rendering experience, browse the online documentation with included API reference.
See here.
ZML is licensed under the Apache 2.0 license.