feat: Apply different LoRA dynamically #103
Comments
@snowyu Can you please provide an example of the proposed usage? I want to keep the API of this library relatively high-level while still offering advanced capabilities, so I wouldn't necessarily want to expose the underlying low-level function directly.
Usually the base LLM is more than 4 GB in size, while the corresponding LoRA adapters are relatively small, about a few hundred megabytes. With dynamic LoRA loading, several LoRAs fine-tuned on the same base model can be switched quickly in memory, so there is no need to keep multiple full-size LLM models loaded.

```c
// pseudocode
llama_model * model = llama_load_model_from_file("ggml-base-model-f16.bin", mparams);
...

// switch to the Animal Domain LoRA adapter
int err = llama_model_apply_lora_from_file(model,
                                           "animal-lora-adapter.bin",
                                           lora_scale,
                                           NULL, // <-- optional lora_base model
                                           params.n_threads);

// switch to the Astronomy Domain LoRA adapter
err = llama_model_apply_lora_from_file(model,
                                       "astronomy-lora-adapter.bin",
                                       lora_scale,
                                       NULL, // <-- optional lora_base model
                                       params.n_threads);
```
Just curious if there has been any progress on this. I think it would be nice to be able to specify a LoRA adapter in the model loading options. If that makes sense, I'd be willing to start looking into it.
@vlamanna The beta of version 3 is now mature enough, so I've added support for loading a LoRA adapter as part of loading a model (#217); set the corresponding option when loading the model.

@snowyu Changing a LoRA on a model at runtime is not possible at the moment, as there's no way to unload an adapter after it has been applied to a model; every call to the low-level apply function applies an adapter on top of the model's current weights.

This feature will be available in the next beta version that I'll release soon.
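As a rough sketch, loading a model together with a LoRA adapter using the v3 high-level API could look something like this; the exact name and shape of the LoRA option below (`lora.adapters` with `filePath` and `scale`) is an assumption on my part, see #217 for the actual API:

```ts
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();

// Load the base model with a LoRA adapter applied at load time.
// The `lora` option shape below is assumed; see #217 for the real option.
const model = await llama.loadModel({
    modelPath: "path/to/base-model.gguf",                 // placeholder path
    lora: {
        adapters: [{
            filePath: "path/to/animal-lora-adapter.gguf", // placeholder path
            scale: 1
        }]
    }
});

const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

console.log(await session.prompt("Tell me about llamas"));
```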
🎉 This issue has been resolved in version 3.0.0-beta.20 🎉 The release is available on:
Your semantic-release bot 📦🚀
@giladgd It can't be done with the low-level API, but it could work in the high-level API, like this:

```js
// pseudocode
class LlamaModel {
    loadLoRAs(loraFiles, scale, threads, baseModelPath) {
        // does the model currently have a LoRA applied that is no longer wanted?
        const needDeinit = this.loraModels.some(
            (loraModel) => !loraFiles.includes(loraModel.file)
        );

        if (needDeinit) {
            // free the model and load the base model again, dropping all applied LoRAs
            this.reloadModel();
        } else {
            // keep the already-applied LoRAs and only load the missing ones
            loraFiles = loraFiles.filter(
                (file) => !this.loraModels.some((loraModel) => loraModel.file === file)
            );
        }

        // apply the LoRAs that aren't loaded yet
        for (const loraFile of loraFiles) {
            const model = _loadLoRA(loraFile, scale, threads, baseModelPath);
            if (model) this.loraModels.push(model);
        }
    }
}
```
@snowyu It can be done with the high-level API that I've added.
@giladgd If you only consider API design from the perspective of performance, that's true, but from the perspective of ease of use it's worth exploring. Let me describe my usage scenario: a simple intelligent-agent script engine in which agents can call each other, and each agent may use a different LLM, so LLM reloading is commonplace. My current pain is that I have to manage these LLMs myself in the agent script engine (a rough sketch is below), although this should be the responsibility of the LLM engine and not the agent script engine.
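To illustrate the kind of bookkeeping I mean, here's a hypothetical sketch (not code from my actual engine; `loadLlm` and `LoadedLlm` are placeholders, not node-llama-cpp APIs):

```ts
// Hypothetical sketch of the manual model management an agent engine ends up doing.
type LoadedLlm = {modelPath: string, dispose(): Promise<void>};
declare function loadLlm(modelPath: string): Promise<LoadedLlm>; // placeholder loader

class AgentLlmManager {
    private current?: LoadedLlm;

    // Each agent asks for the model it needs; the engine has to unload the
    // previous one by hand before loading the next one, or it runs out of memory.
    public async use(modelPath: string): Promise<LoadedLlm> {
        if (this.current?.modelPath === modelPath)
            return this.current;

        await this.current?.dispose();           // manual unload
        this.current = await loadLlm(modelPath); // manual (re)load
        return this.current;
    }
}
```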
@snowyu We have plans to make the memory management transparent, so you can focus on what you'd like to do with models and let the library take care of the rest. Over the past few months, I've laid the infrastructure for such a mechanism, but there's still work to do to achieve this. Perhaps you've noticed, for example, that you no longer have to specify resource-related options by hand when loading a model. Allowing a model's state to be modified at runtime at the library level would make using this library more complicated (due to all the hassle it incurs to keep things working and the performance tradeoffs it embodies), and I think it's a lacking solution to the memory management hassle that I'm working on solving at its root.
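As a minimal sketch of what that already looks like with the v3 beta (paths are placeholders, and the claim about auto-tuned defaults is taken from the comment above rather than the docs):

```ts
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

// No resource-related options are specified here; per the comment above,
// the library picks sensible values based on what's available.
const model = await llama.loadModel({modelPath: "path/to/model.gguf"}); // placeholder path
const context = await model.createContext();

// ... use the context ...

// Free the memory explicitly when switching to another model.
await context.dispose();
await model.dispose();
```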
That's great, looking forward to it.

Yes, I have. Have you thought about adding a memory estimate for the mmprojector model?

Totally agree.
I don't know which model you are referring to. I reverse-engineered how llama.cpp allocates memory in order to build the estimation. To find out how accurate the estimation is for a given model, you can run this command:

```bash
npx node-llama-cpp@beta inspect measure <model path>
```

If you notice that the estimation is way off for some model and want to fix it, you can look at the relevant estimation code.
@giladgd Sorry, I've been busy with my project lately. The mmprojector comes from multimodal LLMs; maybe you haven't used the llava part of llama.cpp yet.

You may be interested in the Programmable Prompt Engine project I'm working on. I hope to add node-llama-cpp as the default provider in the near future, but for now I don't see a good API entry point to start from. I need a simple API:
```ts
// comes from https://github.com/isdk/ai-tool.js/blob/main/src/utils/chat.ts
export const AITextGenerationFinishReasons = [
  'stop',           // model generated stop sequence
  'length',         // model generated maximum number of tokens
  'content-filter', // content filter violation stopped the model
  'tool-calls',     // model triggered tool calls
  'abort',          // aborted by user or timeout for stream
  'error',          // model stopped because of an error
  'other', null,    // model stopped for other reasons
] as const
export type AITextGenerationFinishReason = typeof AITextGenerationFinishReasons[number]

export interface AIResult<TValue = any, TOptions = any> {
  /** The generated value. */
  content?: TValue;
  /** The reason why the generation stopped. */
  finishReason?: AITextGenerationFinishReason;
  options?: TOptions;
  /** for stream mode */
  stop?: boolean;
  taskId?: AsyncTaskId; // for stream chunk
}

// https://github.com/isdk/ai-tool-llm.js/blob/main/src/llm-settings.ts
export enum AIModelType {
  chat,      // text to text
  vision,    // image to text
  stt,       // audio to text
  drawing,   // text to image
  tts,       // text to audio
  embedding,
  infill,
}

// fake API
class AIModel {
  llamaLoadModelOptions: LlamaLoadModelOptions
  supports: AIModelType | AIModelType[]
  options: LlamaModelOptions // default options

  static async loadModel(
    filename: string,
    options?: {aborter?: AbortController, onLoadProgress?: (progress: number) => void /* ... */} & LlamaLoadModelOptions
  ): Promise<AIModel>;

  async completion(
    prompt: string,
    options?: {stream?: boolean, aborter?: AbortController /* ... */} & LlamaModelOptions
  ): Promise<AIResult | ReadStream<AIResult>>

  // plus: fillInMiddle, tokenize, detokenize, ...
}
```
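For context, this is roughly how I'd expect such a wrapper to be used (hypothetical usage of the fake API above, not an existing node-llama-cpp API):

```ts
// Hypothetical usage of the fake AIModel API above; names and options are illustrative only.
const model = await AIModel.loadModel('ggml-base-model-f16.gguf', {
  onLoadProgress: (progress) => console.log(`loading: ${Math.round(progress * 100)}%`),
})

const result = await model.completion('Write a haiku about llamas.', {stream: false}) as AIResult
console.log(result.content, result.finishReason)
```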
🎉 This PR is included in version 3.0.0 🎉 The release is available on:
Your semantic-release bot 📦🚀
Feature Description
Be able to change the LoRA dynamically after loading a LLaMA model.
The Solution
See the `llama_model_apply_lora_from_file()` function in `llama.cpp`: https://github.com/ggerganov/llama.cpp/blob/e9c13ff78114af6fc6a4f27cc8dcdda0f3d389fb/llama.h#L353C1-L359C1
Considered Alternatives
None.
Additional Context
No response
Related Features to This Feature Request
Are you willing to resolve this issue by submitting a Pull Request?
No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.