CPU Support
Aphrodite supports CPU-only inference at relatively fast speeds. Currently, only CPUs with AVX512 support are supported. To check whether your CPU qualifies, run the following in a terminal:
cat /proc/cpuinfo | grep avx512
If your CPU does not support AVX512 instructions, the command will not output anything.
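If you want to script this check (for example, at the top of a setup script), a minimal sketch could look like the following. The helper name `avx512_flags` is hypothetical and not part of Aphrodite; it lists the distinct AVX512 feature flags found in a cpuinfo-style file and defaults to /proc/cpuinfo.

```shell
# Hypothetical helper: print the distinct AVX512 feature flags, one per line.
# Usage: avx512_flags [path]   (defaults to /proc/cpuinfo)
avx512_flags() {
    grep -o 'avx512[a-z0-9_]*' "${1:-/proc/cpuinfo}" 2>/dev/null | sort -u
}

# Report the result; an empty flag list means no AVX512 support was found
# (or the file could not be read, e.g. on a non-Linux system).
if [ -n "$(avx512_flags)" ]; then
    echo "AVX512 supported: $(avx512_flags | tr '\n' ' ')"
else
    echo "AVX512 not supported"
fi
```

Passing a path as the first argument lets you run the same check against a saved copy of another machine's cpuinfo.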
- Install system-wide dependencies
$ sudo apt-get update -y
$ sudo apt-get install -y gcc-12 g++-12 # skip this if gcc/g++ >= 12.3.0 is already installed
$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
- Install the python dependencies
$ pip install -U pip
$ pip install wheel packaging ninja "setuptools>=49.4.0" numpy
$ pip install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
- Build Aphrodite Engine
$ APHRODITE_TARGET_DEVICE=cpu python setup.py install
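The first step above can be skipped when a recent enough compiler is already installed. As a rough sketch of how to check that, the snippet below compares the installed gcc version against 12.3.0. The helper name `version_ge` is hypothetical, and the comparison assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Sorts the two versions and checks that the smaller one is $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# -dumpfullversion prints the full x.y.z version on GCC 7 and later;
# adding -dumpversion keeps the command working on older releases.
if command -v gcc >/dev/null 2>&1 \
    && version_ge "$(gcc -dumpfullversion -dumpversion)" 12.3.0; then
    echo "gcc is new enough; the gcc-12 install step can be skipped"
else
    echo "gcc is missing or older than 12.3.0; install gcc-12 and g++-12"
fi
```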
You can now run the engine as usual, but there are a few points to note:
- Use the environment variable APHRODITE_CPU_KVCACHE_SPACE to specify the amount of memory (in GiB) allocated for the KV cache. Higher values allow a higher degree of parallelism.
- The CPU backend uses OpenMP for thread-parallel computation. For the best CPU performance, it is critical to isolate the CPU cores used by OpenMP threads from other thread pools (such as a web-service event loop) to avoid CPU oversubscription.
- If running on bare metal, you should probably disable hyper-threading.
- If you're on a multi-socket machine with NUMA, make sure the process uses only a single socket to avoid remote memory access. You can use numactl to do this.
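The points above can be combined into a single launch wrapper. The following is only a sketch under several assumptions: `one_cpu_per_core` is a hypothetical helper, 8 GiB and NUMA node 0 are arbitrary example values, `OMP_NUM_THREADS` is the standard OpenMP variable and is assumed (not confirmed here) to be honored by the CPU backend, and the launch command itself is left as a placeholder. The script prints the resulting numactl command instead of running it.

```shell
# Hypothetical helper: reads `lscpu -p=CPU,CORE,NODE` output on stdin and prints
# a comma-separated list with one logical CPU per physical core on NUMA node $1.
# Picking one CPU per core keeps hyper-thread siblings out of the set.
one_cpu_per_core() {
    awk -F, -v node="$1" '
        /^#/ { next }                       # skip lscpu header comments
        $3 == node && !seen[$2]++ {         # first CPU seen for each core
            cpus = cpus sep $1; sep = ","
        }
        END { print cpus }'
}

# Example value only: reserve 8 GiB for the KV cache.
export APHRODITE_CPU_KVCACHE_SPACE=8

if command -v lscpu >/dev/null 2>&1; then
    cpus=$(lscpu -p=CPU,CORE,NODE | one_cpu_per_core 0)
    # One OpenMP thread per selected core (assumes OMP_NUM_THREADS is honored).
    threads=$(printf '%s\n' "$cpus" | awk -F, '{ print NF }')
    # Printed rather than executed; append your usual launch command.
    echo "OMP_NUM_THREADS=$threads numactl --physcpubind=$cpus --membind=0 <launch command>"
fi
```

Binding both CPUs (`--physcpubind`) and memory (`--membind`) to the same node is what avoids remote memory access; selecting one logical CPU per core is a software-side alternative when hyper-threading cannot be disabled in firmware.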