
Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.14.1] - 2024-11-16

This release adds Serde support for rten tensors and several optimizations which allow the Whisper example to run significantly faster.

rten-tensor

  • Support (de-)serializing tensors using Serde (#402)
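
    A rough sketch of the new (de-)serialization support, assuming the relevant serde feature of rten-tensor is enabled and serde_json is available; any serde-compatible format should work the same way:

    ```rust
    use rten_tensor::Tensor;

    fn main() -> Result<(), serde_json::Error> {
        let t = Tensor::from([[1., 2.], [3., 4.]]);
        // Round-trip the tensor through JSON.
        let json = serde_json::to_string(&t)?;
        let restored: Tensor<f64> = serde_json::from_str(&json)?;
        assert_eq!(t, restored);
        Ok(())
    }
    ```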

rten

Examples

  • Output transcription speed as a multiple of real-time in Whisper example (#403)

  • Support longer audio inputs and normalize inputs in wav2vec2 speech recognition example (#400)

Bug fixes

  • Fixed an issue where metadata associated with output value nodes was lost after a graph fusion. In the Whisper example this prevented several Transpose-MatMul fusions from being used (#401).

Performance improvements

  • Added fast path for ArgMin / ArgMax for case when axis has unit stride (#411)

  • Optimized GatherND by avoiding redundant zeroing of output and adding fast path for contiguous inputs (#410)

  • Optimized copying of tensors with 5+ dimensions (#409)

  • Operators in subgraphs which capture their first input from a parent graph can now run in-place (#407)

  • After the initial execution plan is created, it is now re-ordered to enable more operations to run in-place (#405)

rten-generate

  • The strategy for reserving capacity for KV-cache growth has been modified to work with models that don't append to KV-cache inputs on the first run. This benefits Hugging Face "merged" transformer models with "past" and "no-past" branches (#408)

[0.14.0] - 2024-10-27

Breaking changes

  • The NodeId type used to identify model inputs and outputs is now an opaque u32-sized type instead of a usize (#381)

  • The tensor slicing APIs (TensorBase::slice etc.) now infer the rank of the output automatically, instead of requiring the caller to specify. See #367.

rten

New features

  • Added Whisper speech recognition example (#397)

  • Added background removal example using RMBG (#344)

  • Support i8 and u8 tensors in operator inputs, outputs and model weights (#345).

  • Support 8-bit int tensors in Cast, Gather, GatherElements, GatherND, ScatterElements, ScatterND, Expand, Flatten, Reshape, Squeeze, Transpose, Pad, Unsqueeze ops (#387)

  • Implement QuantizeLinear, DequantizeLinear and DynamicQuantizeLinear ops (#346)

  • Added reference implementation of MatMulInteger. Quantized models using this operator will now run, but very slowly. Optimized execution for quantized models will come in future releases (#356).

  • Support f16 models in model converter by widening to f32 (#372). This is an interim measure until f16 tensors are properly supported in RTen.

  • Added YOLOv11 support to YOLO example (#374)

Bug fixes

  • Fixed AVX-512 build (#376)

  • Fixed graph optimizations not being applied correctly when a fused operation feeds directly into a subsequent fused operation (#369)

  • Fixed errors when running WebAssembly builds compiled without SIMD support (#348)

Performance improvements

  • Made NodeId a u32-sized type with a niche, reducing the size of various internal data structures (#381)

  • Optimized Cast op when source and dest types are the same (#388)

  • Avoid unnecessary copying in Squeeze and Unsqueeze ops (#339, #340)

rten-cli

  • Added --no-optimize flag to enable testing impact of graph optimizations (#368)

rten-generate

  • Added more context to token generation errors (#396)

  • Support cache_position input in models exported from Optimum (#395)

  • Added API for modifying model outputs ("logits") before sampling (#393, #394)

  • Support the new merges format in tokenizer.json files exported by current versions of HuggingFace Transformers (#392)

rten-imageproc

  • Added normalize_image utility (#343)

rten-tensor

  • Improved debug formatting of tensors (#377)

  • Changed TensorBase::slice to infer the rank of the output based on the rank of the input and the number of index entries in the slice arguments (#367).
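
    A sketch of how the inferred output rank behaves; the exact set of supported index types is elided here:

    ```rust
    use rten_tensor::prelude::*;
    use rten_tensor::NdTensor;

    fn main() {
        let t = NdTensor::<f32, 3>::zeros([2, 3, 4]);
        // One integer index removes one dimension: rank 3 -> rank 2.
        let m = t.slice(1);
        assert_eq!(m.shape(), [3, 4]);
        // Ranges keep their dimension, so this is also rank 2.
        let v = t.slice((.., 0));
        assert_eq!(v.shape(), [2, 4]);
    }
    ```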

[0.13.1] - 2024-08-30

rten

New features

  • Added speech detection example using Silero VAD (#338)

  • Support int tensors in ArgMin and ArgMax ops (#329)

  • Support "reflect" padding mode (#326)

Bug fixes

  • Fixed panic with certain combinations of input, kernel size and padding in depthwise convolution (#336)

  • Fixed attempted out-of-bounds slice in depthwise convolution when input tensor has a row stride that exceeds the row length (#335)

  • Fixed conversion of auto_pad attribute for Conv operator (#333)

  • Round timings to microseconds in verbose log (#331)

  • Fixed panic when slicing empty tensors (#325)

  • Fixed 1D convolution failing with non-contiguous inputs (#324)

  • Fixed conversion of shape information for scalar tensors (#323)

  • Fixed panic in softmax if the size of the normalized axis is zero (#322)

rten-cli

  • Added --mmap flag to load model using memory mapping instead of reading whole file into a buffer (#330)

[0.13.0] - 2024-08-24

This release adds the infrastructure to support subgraphs, which are used in control flow operators like If, plus an implementation of the If operator and a TrOCR example which uses it.

rten-cli

  • Added --quiet flag (#313)

  • Inputs named use_cache_branch now get a default value of 0 (ddf4109)

rten-generate

  • Support models with cross-attention KV caches that are computed on the first run of the decoder (#318). This is used by Hugging Face models for encoder-decoder systems.

  • Support models without a KV cache (#305)

rten-tensor

  • Added Tensor::remove_axis (b823d46)
  • Added Tensor::from_storage_and_layout (54d2941)

rten-text

  • The BPE tokenizer no longer complains if a tokenizer contains tokens in the vocabulary which are never generated by merges and are not added as special tokens (18e9b2a)

[0.12.0] - 2024-07-30

rten

Breaking changes

  • The rten-convert tool now generates models in the V2 format by default (#272). These models can only be loaded by RTen version 0.11.0 or later. The V1 format can be generated by specifying the --v1 flag. The rten crate can load both V1 and V2 format models.

    See the .rten file format documentation for more details.

  • The reduce_{max, min, sum} tensor methods have moved from the FloatOperators trait to the Operators trait (#274).

Examples and documentation

  • Added Segment Anything example (#295). This supports the original SAM models plus several derivatives with lighter-weight image encoders.

  • Added chatbot example using Qwen2 (#282). This also works with SmolLM.

  • Model::load_mmap docs now have a better explanation of the memory and performance impact (ce0b717)

New features

  • Added partial support for Einsum operator (#295).

Performance improvements

  • Avoid allocations in most cases when broadcasting tensor shapes (c4b5f26).

  • Strides of size-1 dimensions are ignored when determining whether a tensor is contiguous (#292). This allows more operations to use fast paths for contiguous tensors.

  • Optimized LayerNormalization and ReduceMean (#291)

  • Added fast-path for Resize operator when input scale is 1 (#290)

  • Return input buffer to pool in Cast operator if input needs to be copied (#289).

  • Implemented LayerNormalization fusion (#280)

  • Implemented GELU fusion (#277)

rten-cli

  • Inputs with names matching the pattern *_ids now use zero as the auto-generated input value (78cd621)

rten-generate

  • TopKSampler now supports specifying a temperature (65b837b)

  • Added Generator::append_prompt to append to the prompt after initial generation. This is useful for chat-like applications (5ef3cb2); see the sketch after this list

  • Fixed an issue where attention_mask input had the wrong size (cae6134)
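
A hedged sketch of the chat-style flow that append_prompt enables; the module path and exact signatures here are assumptions, so treat the chatbot example as the authoritative reference:

```rust
use rten::Model;
use rten_generate::Generator;

fn chat(model: &Model, prompt: &[u32], reply: &[u32]) -> Result<(), Box<dyn std::error::Error>> {
    let mut generator = Generator::from_model(model)?.with_prompt(prompt);
    // Generate the first response. Real code would stop at an
    // end-of-turn token rather than after a fixed number of tokens.
    for token in generator.by_ref().take(64) {
        let _token_id = token?;
    }
    // Append the user's next message and continue generating.
    generator.append_prompt(reply);
    for token in generator.take(64) {
        let _token_id = token?;
    }
    Ok(())
}
```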

rten-tensor

Breaking changes

  • The tensor and ndtensor macros have been deprecated in favor of Tensor::from and NdTensor::from (#286).
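
    The migration is mechanical, as this sketch shows:

    ```rust
    use rten_tensor::prelude::*;
    use rten_tensor::{NdTensor, Tensor};

    fn main() {
        // Previously: tensor!([1, 2, 3])
        let t = Tensor::from([1, 2, 3]);
        assert_eq!(t.shape(), &[3]);
        // Previously: ndtensor!([[1., 2.], [3., 4.]])
        let m = NdTensor::from([[1., 2.], [3., 4.]]);
        assert_eq!(m.shape(), [2, 2]);
    }
    ```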

Other changes

  • Tensor::from now supports creating tensors from scalar values (d2ca876)

  • Tensor::lanes iterator performance was improved by making the iterators exact-sized and fused (9e31556)

rten-text

  • Token IDs are now represented as u32 rather than usize, for consistency with rten-generate (#288).

  • The vocab mapping in tokenizer.json files is now used to determine token IDs when decoding (#287).

[0.11.1] - 2024-07-17

rten

  • Fixed a crash in WebAssembly due to unsupported use of Instant::now (#283).

[0.11.0] - 2024-07-05

rten

Breaking changes

  • The inputs argument to Model::run now accepts a Vec<(NodeId, InputOrOutput)> instead of &[(NodeId, Input)], where InputOrOutput is an enum that is either an owned Tensor or a TensorView. This enables passing ownership of an input to Model::run, which in turn enables efficient in-place updates to cache-like inputs.

    The InputOrOutput type implements From for tensors and tensor views, so code such as:

    model.run(&[(input_id, tensor_view.into())], output_ids, None)

    becomes:

    model.run(vec![(input_id, tensor_view.into())], output_ids, None)

New features

  • Add a new version of the .rten file format which supports models over 2GB in size. The rten-convert tool still generates V1 models by default but will generate the V2 format if the --v2 flag is provided (#260).

  • Support Gelu operator (#248)

Bug fixes

  • Prevent Model::partial_run from propagating values through randomized operators (#240).

  • Improved accuracy of timing metrics and eliminated unaccounted-for ("[Other]") time (#254).

Performance improvements

This release adds a new graph optimization step as part of loading models. This performs fusions and other optimizations to speed up inference. These optimizations are enabled by default, but can be disabled via options in ModelOptions.

  • Improved parallelism in the Softmax operator (#258)

  • Made Tensor::inner_iter faster (#259)

  • Made Gather, Concat and Unsqueeze operators faster for small inputs. These operations are common in subgraphs that operate on tensor shapes (#255, #256, #257).

  • Optimized vector-matrix multiplication (#250, #253). This benefits transformer decoder inference when the batch size is 1.

  • Fuse Mul(X, Sigmoid(X)) subgraphs into a Silu operation. This speeds up YOLOv8 by 8%. See #246.

  • Further reduce small allocations during graph execution (#243, #245).

  • Fuse MatMul(Transpose(X), Y) subgraphs to avoid materializing the transposed matrix (#242).

  • Perform constant propagation when loading models (#241).

  • Enabled Concat operator to run in-place if the caller has specifically reserved space in the first input's buffer (#239).

  • Cache the last-used execution plan. This avoids recomputing the sequence of execution steps when a model is run in a loop (#234).

  • Improved performance of unary operators for non-contiguous inputs (#223)

  • Optimized Where operator for non-contiguous inputs (#213)

  • Optimized variadic operators (#212)

  • Optimized Pow operator (#219)

rten-examples

  • Added GPT-2 text generation example (#228)
  • Added DistilViT image captioning example (#230)

rten-generate

This is a new crate which provides a convenient Iterator-based interface for running auto-regressive decoder models. See the gpt2 and distilvit examples in the rten-examples crate for code samples.
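
In outline, generation looks roughly like the following sketch; the module path and signatures are assumptions here, so treat the examples as the authoritative reference:

```rust
use rten::Model;
use rten_generate::Generator;

fn run(model: &Model, prompt_ids: &[u32]) -> Result<Vec<u32>, Box<dyn std::error::Error>> {
    let generator = Generator::from_model(model)?.with_prompt(prompt_ids);
    // Each iteration runs one decoder step and yields the sampled token ID.
    let mut output = Vec::new();
    for token in generator.take(40) {
        output.push(token?);
    }
    Ok(output)
}
```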

rten-tensor

  • Support more primitive element types in NdTensor::from (#226).

rten-text

  • Added Byte Pair Encoding (BPE) tokenizer (#227)

[0.10.0] - 2024-05-25

rten

Breaking changes

  • Instead of using the global Rayon thread pool, RTen now creates its own pool, with the number of threads matching the physical rather than logical core count. This improves performance on systems with Simultaneous Multi-Threading (aka. SMT or Hyper-Threading, found on most x86_64 CPUs), but can lead to contention if the calling application has its own multi-threaded parallelism. Applications may need to adjust their own use of threading to avoid this. RTen provides functions for applications to run their own tasks within its thread pool.

    See #183.

Bug fixes

  • Fixed conversion of Transpose operators without a perm attribute (#201)

  • The RunError type returned by Model::run is now exported (#206)

Performance improvements

  • Made Resize operator parallel over rows. This benefits resize operations on images with large spatial dimensions and few channels (#208).

  • Improved performance of Conv operator on Intel CPUs with a mitigation for the Gather Data Sampling / "Downfall" vulnerability applied. This affects most 6th-11th generation Intel CPUs (#204).

  • Optimized Concat operator when input is not contiguous (eg. following a Slice op) (#204)

  • Improved performance of GRU operator by combining operations on separate gates (#188)

  • Improved performance of binary operators on non-contiguous tensors (#190)

rten-cli

  • Added --n_iters flag to control how many times the model is run (#202)

  • Optimize model by performing constant propagation before running the model (#202)

  • Made it easier to specify sizes for dynamic inputs. The new syntax is --size dim_name=size. Additionally the size for dynamic dimensions defaults to 1. See #182.

  • Added --version flag (#181)

rten-imageproc

  • Added serde_traits feature which implements serde Serialize and Deserialize traits for geometry types (Thanks @luketpeterson, #198)

rten-tensor

  • Added Tensor::split_at and Tensor::split_at_mut (#205, #207)

  • Tensor::{axis_chunks, axis_chunks_mut} iterators now preserve the layout in their output type (#207).

rten-vecmath, rten-simd

  • The internal crate providing portable SIMD and vectorized math functions was split into two. rten-simd now contains the portable SIMD code. rten-vecmath contains the vectorized math functions.

[0.9.0] - 2024-05-16

Breaking Changes

This release contains breaking changes to the model loading APIs and code using the TensorBase type directly (as opposed to aliases like Tensor). See the notes for the rten and rten-tensor crates respectively.

rten

Breaking changes

  • The Model::load API now takes a Vec<u8> rather than &[u8] as an argument. This enables it to avoid copying data internally. For the most common use case of loading a model from disk, use the new Model::load_file API.

  • The Model::load_with_ops API has been replaced by ModelOptions::with_ops.
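
    For illustration, the loading paths after this change look roughly like this sketch (the file name is a placeholder):

    ```rust
    use rten::Model;

    fn load(path: &str) -> Result<(), Box<dyn std::error::Error>> {
        // Most convenient: read and load the file in one step.
        let _a = Model::load_file(path)?;

        // If the bytes are already in memory, pass ownership to avoid a copy.
        let bytes: Vec<u8> = std::fs::read(path)?;
        let _b = Model::load(bytes)?;

        // Model::load_mmap (below) additionally offers zero-copy loading
        // via memory maps; see its documentation for the trade-offs.
        Ok(())
    }
    ```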

New features

  • Added Model::load_file API for more convenient loading of a model from a file (#174)

  • Added Model::load_mmap API for zero-copy loading of models by using memory maps. This can be faster than Model::load for very large models (#174).

  • Added Piper text-to-speech example (#161)

  • Support 1D inputs and padding in ConvTranspose (#156)

  • Support GatherND operator (#155)

  • Support Softplus operator (#146)

  • Support converting ONNX models containing unnamed operator nodes (#143)

  • Support RandomNormal, RandomNormalLike, RandomUniformLike operators (#144)

Bug fixes

  • Fixed incorrect calculation of update slice size in ScatterND operator (#157)

  • Fixed incorrect conversion of axis attribute for ArgMin and ArgMax operators (#142)

  • Fixed uninitialized read in Gemm operator when alpha != 1 and beta == 0 (#150)

  • Fixed NonMaxSuppression operator missing overlap of boxes due to confusion of X/Y coordinates (#177)

Optimizations

  • Optimize Gather and NonZero operators by allocating from the memory pool (#168)

  • Optimize Slice operator when slice ranges contain negative steps (#167)

  • Optimize Pad operator by making copying of non-contiguous views more efficient (#166)

  • Optimize Conv operator by avoiding redundant zeroing of packing buffers, optimizing im2col setup (#165)

  • Optimize ConvTranspose by fusing bias addition into col2im transform (#159)

  • Parallelize AveragePool operator (#138)

  • Improved model loading performance by avoiding copying weights in Model::load (#174)

rten-imageproc

  • The mask matrix argument to find_contours now uses bool instead of i32 for elements. This improves performance / reduces memory usage for large masks.

rten-tensor

Breaking changes

This release changes the signature of the TensorBase struct from TensorBase<T, S: AsRef<[T]>, L: MutLayout> to TensorBase<S: Storage, L: MutLayout>. The element type is now available via S::Elem. The type of S used by views has changed from slices to new custom types. The TensorBase::from_data method still accepts both Vec<T> and slices as the data argument, and will convert to the appropriate storage struct.

Code using the type aliases (Tensor, TensorView, TensorViewMut etc.) does not need to change.
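
Concretely, alias-based code like this sketch compiles unchanged, since from_data still accepts a Vec or slice:

```rust
use rten_tensor::prelude::*;
use rten_tensor::Tensor;

fn main() {
    let t = Tensor::from_data(&[2, 2], vec![1., 2., 3., 4.]);
    let v = t.view();
    assert_eq!(v.shape(), &[2, 2]);
}
```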

New features

  • Added TensorBase::{as_cow, into_cow} (named after std::borrow::Cow) to convert tensor storage to a type which is Cow-like. This is useful for writing code which works with either borrowed or owned tensors (#153).

Bug fixes

  • Added missing checks for equality between old/new layout lengths in reshape operations (#170, #171)

  • Improved internal checks that storage slicing does not lead to out-of-bounds accesses (#163)

  • Refactored tensor storage types to fix a violation of Rust's unique ownership rules for mutable slices. This enables tests for rten-tensor and code using this crate to be run under Miri (#148).

rten-vecmath

  • Revised SIMD traits to make working with masks more ergonomic and efficient (#152). Integer and floating point types with the same number of lanes will now use the same mask type.

[0.8.0] - 2024-04-29

rten-tensor

  • Added Alloc trait which provides a simple allocator interface, and *_in-suffixed variants of several TensorBase methods, which allows specifying an allocator for the returned tensor's data buffer (#123).

rten-vecmath

  • Fixed crashes in several functions when running on pre-AVX2 x64 CPUs (see rten changes)

rten

New features

  • Support Elu operator (#132)

  • Support Reduce* operators that take axes as a dynamic input rather than a static attribute (#132)

Bug fixes

  • Fixed crash in several operators when running on x64 CPUs that do not support AVX-2 instructions (#131, #134)

Performance improvements

  • Added a buffer pool that enables reuse of operator output and temporary buffers, avoiding the overhead of allocating and freeing large buffers using the system allocator (#108).

    Statistics about buffer pool usage are printed as part of RTEN_TIMING output.

  • Fixed a MatMul performance regression introduced in v0.7.0 due to virtual calls to get kernel tile size (#101)

  • Optimize convolutions by using SIMD operations for im2col transform (#104)

  • Parallelize depthwise convolution (#102)

  • Avoid redundant zeroing of buffers in Conv, OneHot, and various unary operations (#97, #99, #101, #106)

  • Optimize Unsqueeze by running in-place where possible (#96)

  • Optimize vector-matrix products where matrix is transposed (#94)

  • Reduced graph execution overhead by using faster hashing (#92)

  • Optimize ScatterND (#91)

  • Support AVX-512 acceleration for Exp, Sigmoid, Tanh, Softmax and Erf operators (#131). This requires nightly Rust and the avx512 feature enabled.

[0.7.0] - 2024-04-12

rten-tensor

  • Add Tensor::merge_axes method to simplify layouts (#78)

  • Add Tensor::{uninit, assume_init} methods for working with uninitialized buffers (#82)

rten

  • Reduced Graph::run overhead by reducing allocations (#89)

  • Added Model::partial_run API to speed up autoregressive / recurrent models by precomputing parts of the graph that depend only on inputs that are unchanging across loop iterations (#86)

  • Optimize MatMul and binary operators by avoiding unnecessary zeroing of output buffers (#82, #88)

  • Fixed incorrect output from Gemm operator when the bias is zero and the "C" input contained infinities / NaNs (#81)

  • Optimize matrix packing operations on Intel CPUs using AVX-2 instructions (#80)

  • Optimize Transpose operations where input dimensions are powers of 2 by using blocking and tiling (#78)

  • Exclude test files and tools from published crate (#77)

  • Optimize RNN operators for the case where the input sequence is short, by avoiding prepacking of weights in this case (#74)

[0.6.0] - 2024-03-31

rten

  • Updated AVX-512 support to work with latest Rust nightly releases (#58)

  • Improved performance of vector-matrix product operations (#61)

  • Slightly improved WASM matrix multiplication performance with a dedicated kernel (#64)

  • Fixed conversion of RNN operators (LSTM, GRU) that explicitly declare the direction as forward (#67)

  • Support tensors with 3 or 5+ dimensions in BatchNormalization operator (#68)

  • Support RandomUniform operator (#69)

  • Improve matrix prepacking performance by eliminating unnecessary zero-initialization of buffers (#70)

[0.5.0] - 2024-02-29

rten

  • Changed OperatorType enum in .rten schema from byte to ubyte, to allow for more operator types in future (#56)

  • Made Model instances Send, enabling use with PyO3 (#55)

  • The ONNX => rten model conversion tool is now an installable Python package called rten-convert (#53)

  • Implemented ReduceSumSquare operator (36bbf89f)

[0.4.0] - 2024-02-08

rten

  • Support count_include_pad attr in AveragePool operator (09ecb729)

  • Support license/version/provenance metadata in RTen models (#48)

  • Fix error when a negative index was used with Gather operator (573ded4c)

  • Improve performance of MatMul operator when row count of LHS is small and batch size is large (#51)

rten-imageproc

  • Optimized find_contours for large images (c471a6c, 7a14f43)

rten-tensor

  • Optimize TensorBase::map for contiguous tensors (5562fd23)
  • Add TensorBase::{from_fn, from_simple_fn} (5e654ea0); see the sketch after this list
  • Add TensorBase::try_from_data (18817907)
  • Support get_unchecked on owned/mutable tensors (06b02eaf)
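
A sketch of from_fn, assuming the element index is passed to the callback as an array for static-rank tensors:

```rust
use rten_tensor::prelude::*;
use rten_tensor::NdTensor;

fn main() {
    // Each element is computed from its [row, col] index.
    let t = NdTensor::from_fn([2, 3], |[y, x]| (y * 3 + x) as i32);
    assert_eq!(t.shape(), [2, 3]);
    assert_eq!(t[[1, 2]], 5);
}
```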

[0.3.1] - 2024-01-23

  • Updated rten-vecmath dependency to latest version

[0.3.0] - 2024-01-23

Breaking changes

The static and dynamic tensor types (NdTensorBase, TensorBase) have been unified into a single implementation. Most code uses these via type aliases (NdTensor, Tensor etc.), which remain the same. However, there have been some API changes as a result:

  • The View and NdView traits were combined into AsView. The recommended way to import this trait is via the prelude (use rten_tensor::prelude::*)

  • Some inherent methods of TensorBase moved to the AsView trait. You may need to add additional imports of this trait or the prelude.

  • NdTensor::from_data now has the same API signature as Tensor::from_data. This means the order of arguments is reversed compared to before. It is now from_data(shape, data). Creating tensors with custom strides is now done via from_data_with_strides or from_slice_with_strides.

  • Tensor methods for broadcasting and reshaping tensors now determine the rank of the result from the type of the shape argument. If passed an array, they return a static-rank view. If passed a slice, they return a dynamic-rank view.

  • Methods that insert, remove or swap axes now have an _axis suffix (eg. move_axis). Previously some of these methods had a _dim suffix.

  • The slice method now always returns a static rank view. Usage is tensor.slice::<M, _>(range) where M is the rank of the result. To create a view with a dynamic dimension count, use tensor.slice_dyn(range) instead.
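
A sketch of the slicing calls as of this release (later versions changed slice to infer the rank automatically; see the 0.14.0 notes above):

```rust
use rten_tensor::prelude::*;
use rten_tensor::Tensor;

fn main() {
    let t = Tensor::<f32>::zeros(&[2, 3, 4]);
    // Static rank: the caller states that the result has 2 dims.
    let v = t.slice::<2, _>(0);
    assert_eq!(v.shape(), [3, 4]);
    // Dynamic rank.
    let d = t.slice_dyn(0);
    assert_eq!(d.shape(), &[3, 4]);
}
```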

New features

  • Implemented LayerNormalization operator (#44)
  • Added "Depth Anything" monocular depth estimation example (#44)
  • Added support for align_corners value for coordinate_transformation_mode attr in Resize operator (#44).

Performance improvements

  • Optimized index iteration for tensors (d3fd3c9)
  • Optimized col2im transform used by ConvTranspose (fbc541b)
  • Optimized depthwise convolution (20e83e8)
  • Improved performance on Arm via a better optimized GEMM kernel (#32) and vectorized kernels for other functions (#31).

[0.2.0] - 2024-01-03

  • Improved inference performance on Arm (#30)

[0.1.1] - 2024-01-01

  • Fix softmax operator on non-x64 / wasm32 platforms (59f4815)

[0.1.0] - 2023-12-31

Initial release.