
[Feature] Local LLM Support #53

Open
rapidarchitect opened this issue Dec 16, 2023 · 10 comments
@rapidarchitect

I would like to be able to run this with local LLM stacks like LiteLLM or Ollama.

Could you provide a parameter to specify the LLM and base URL?

@kagevazquez

I personally use LM Studio for my local LLM server and would love to use it with this as well.

Hopefully the devs can use this example Python code for future development.

# Example: reuse your existing OpenAI setup
from openai import OpenAI

# Point to the local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="local-model",  # this field is currently unused
    messages=[
        {"role": "system", "content": "Always answer in rhymes."},
        {"role": "user", "content": "Introduce yourself."},
    ],
    temperature=0.7,
)

print(completion.choices[0].message)
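
Roughly the same thing in TypeScript, as a minimal sketch assuming the openai v4 Node client (the base URL and dummy key just mirror the LM Studio defaults from the snippet above):

import OpenAI from "openai";

// Point to the local server; the key is a placeholder, as in the Python example.
const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "not-needed",
});

async function main() {
  const completion = await client.chat.completions.create({
    model: "local-model", // this field is currently unused
    messages: [
      { role: "system", content: "Always answer in rhymes." },
      { role: "user", content: "Introduce yourself." },
    ],
    temperature: 0.7,
  });

  console.log(completion.choices[0].message);
}

main().catch(console.error);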

@marcusschiesser
Collaborator

We're about to add Ollama support to LlamaIndexTS first, see run-llama/LlamaIndexTS#305 - then it could be used in chat-llamaindex
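
Once that lands, the swap in chat-llamaindex could presumably look something like the sketch below; the constructor options shown (model, baseURL) are assumptions, not the final API:

import { Ollama, serviceContextFromDefaults } from "llamaindex";

// Sketch: drop an Ollama LLM in where the route currently constructs an OpenAI one.
const llm = new Ollama({
  model: "mistral",                  // any model pulled into the local Ollama
  baseURL: "http://localhost:11434", // Ollama's default endpoint
});

// ...then pass this serviceContext to the index/chat engine as today.
const serviceContext = serviceContextFromDefaults({ llm });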

@m0wer

m0wer commented Jan 11, 2024

> We're about to add Ollama support to LlamaIndexTS first, see run-llama/LlamaIndexTS#305 - then it could be used in chat-llamaindex

Some things are missing for the Ollama class to fit into the current implementation: at least the maxTokens metadata entry and the tokens() method.
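
Concretely, chat-llamaindex seems to need roughly the following shape from the LLM; the interface names below are my own paraphrase, not the library's real declarations:

import type { ChatMessage } from "llamaindex";

// Paraphrase of what chat-llamaindex needs from an LLM beyond chat():
interface NeededMetadata {
  model: string;
  temperature: number;
  topP: number;
  maxTokens?: number;  // missing from Ollama's metadata today
  contextWindow: number;
  tokenizer?: unknown; // undefined is accepted (see the diff further down)
}

interface NeededLLMParts {
  metadata: NeededMetadata;
  // Not implemented by Ollama; SummaryChatHistory calls this to decide
  // when the conversation needs to be summarized.
  tokens(messages: ChatMessage[]): number;
}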

@m0wer

m0wer commented Jan 11, 2024

I've managed to make it work with LiteLLM (and Ollama behind it) by setting export OPENAI_BASE_URL=https://litellm.mydomain.tld/v1 and applying this change:

diff --git app/client/platforms/llm.ts app/client/platforms/llm.ts
index ddc316f..a907ee7 100644
--- app/client/platforms/llm.ts
+++ app/client/platforms/llm.ts
@@ -36,6 +36,9 @@ export const ALL_MODELS = [
   "gpt-4-vision-preview",
   "gpt-3.5-turbo",
   "gpt-3.5-turbo-16k",
+  "mixtral_default",
+  "mistral",
+  "phi",
 ] as const;

 export type ModelType = (typeof ALL_MODELS)[number];

Then I created a new bot using one of those models and adjusted its params.

@marcusschiesser
Collaborator

@m0wer thanks, cool hack! Yes, Ollama currently doesn't have tokens() implemented; that's why SummaryChatHistory is not working with it, but SimpleChatHistory should.
You can try setting sendMemory to false for your bot, see:
https://github.com/run-llama/chat-llamaindex/blob/aeee808134a9b267d22d3d48900ba7393e37cdbc/app/api/llm/route.ts#L167C1-L169
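
For context, those lines boil down to roughly this (paraphrased, not the exact source):

import { SimpleChatHistory, SummaryChatHistory } from "llamaindex";

// llm, config and messages are already in scope in route.ts.
// With sendMemory off, SimpleChatHistory is used, which never calls llm.tokens().
const chatHistory = config.sendMemory
  ? new SummaryChatHistory({ llm, messages })
  : new SimpleChatHistory({ messages });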

@m0wer

m0wer commented Jan 12, 2024

Thanks @marcusschiesser! But I found additional problems when uploading files and building for production: in LlamaIndexTS there is a list of valid OpenAI model names, so the type check fails. Maybe renaming mixtral to gpt-4 in LiteLLM would do the trick.
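
That check is presumably a const union of model names, much like ALL_MODELS in chat-llamaindex itself; a made-up illustration of why a custom name breaks the build:

// Made-up names, only to illustrate the kind of check that fails:
const KNOWN_OPENAI_MODELS = ["gpt-4", "gpt-4-1106-preview", "gpt-3.5-turbo"] as const;
type KnownOpenAIModel = (typeof KNOWN_OPENAI_MODELS)[number];

const ok: KnownOpenAIModel = "gpt-4";
// const bad: KnownOpenAIModel = "mixtral_default"; // rejected at compile time
console.log(ok);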

I needed to set sendMemory to false as you said to get it to work, but that was all for the development version.

Now I don't know whether I should go in the direction of extending the Ollama class or do the rename in LiteLLM so that I can reuse most of the current chat-llamaindex code.

Any advice or ideas are welcome ;-)

@marcusschiesser
Collaborator

> In LlamaIndexTS there is a list of valid OpenAI model names, so the type check fails.

Doesn't this example work with mixtral? https://github.com/run-llama/LlamaIndexTS/blob/main/examples/ollama.ts
It should work, as it's using model: string.

@m0wer

m0wer commented Jan 12, 2024

> > In LlamaIndexTS there is a list of valid OpenAI model names, so the type check fails.
>
> Doesn't this example work with mixtral? https://github.com/run-llama/LlamaIndexTS/blob/main/examples/ollama.ts
> It should work, as it's using model: string.

Yes, that works. But what I'm trying to do is get the complete https://chat.llamaindex.ai/ to work with local LLMs. So either I fake the OpenAI API with LiteLLM, or I extend the Ollama class from LlamaIndexTS to support the missing methods.

@m0wer

m0wer commented Jan 13, 2024

To make it work with Ollama, I had to adapt my reverse proxy settings and make the following changes:

diff --git app/api/llm/route.ts app/api/llm/route.ts
index aa3066c..de9a806 100644
--- app/api/llm/route.ts
+++ app/api/llm/route.ts
@@ -4,7 +4,8 @@ import {
   DefaultContextGenerator,
   HistoryChatEngine,
   IndexDict,
-  OpenAI,
+  LLMMetadata,
+  Ollama,
   ServiceContext,
   SimpleChatHistory,
   SummaryChatHistory,
@@ -120,6 +121,33 @@ function createReadableStream(
   return responseStream.readable;
 }
 
+class OllamaCustom extends Ollama {
+  maxTokens: number;
+
+  constructor(init: Partial<OllamaCustom> & {
+    model: string;
+  }) {
+    super(init);
+    this.maxTokens = init.maxTokens || 2048;
+  }
+
+  get metadata(): LLMMetadata {
+    return {
+      model: this.model,
+      temperature: this.temperature,
+      topP: this.topP,
+      maxTokens: this.maxTokens,
+      contextWindow: this.contextWindow,
+      tokenizer: undefined,
+    };
+  }
+
+  tokens(messages: ChatMessage[]): number {
+    let tokens = 10;
+    return tokens;
+  }
+}
+
 export async function POST(request: NextRequest) {
   try {
     const body = await request.json();
@@ -146,11 +174,14 @@ export async function POST(request: NextRequest) {
       );
     }
 
-    const llm = new OpenAI({
+    const llm = new OllamaCustom({
+      baseURL: "https://ollama.mydomain.tld",
       model: config.model,
       temperature: config.temperature,
       topP: config.topP,
+      contextWindow: config.maxTokens,
       maxTokens: config.maxTokens,
+      requestTimeout: 5 * 60 * 1000,
     });
 
     const serviceContext = serviceContextFromDefaults({
diff --git app/client/platforms/llm.ts app/client/platforms/llm.ts
index ddc316f..33273c9 100644
--- app/client/platforms/llm.ts
+++ app/client/platforms/llm.ts
@@ -31,11 +31,11 @@ export interface ResponseMessage {
 }
 
 export const ALL_MODELS = [
-  "gpt-4",
-  "gpt-4-1106-preview",
-  "gpt-4-vision-preview",
-  "gpt-3.5-turbo",
-  "gpt-3.5-turbo-16k",
+  "mistral",
+  "mixtral_default",
+  "dolphin-mixtral:8x7b-v2.7-q4_K_M",
+  "llava",
+  "phi",
 ] as const;
 
 export type ModelType = (typeof ALL_MODELS)[number];
diff --git package.json package.json
index 0ba2c1b..b6dfdfb 100644
--- package.json
+++ package.json
@@ -38,7 +38,7 @@
     "dotenv": "^16.3.1",
     "emoji-picker-react": "^4.4.12",
     "encoding": "^0.1.13",
-    "llamaindex": "0.0.0-20231110031459",
+    "llamaindex": "0.0.44",
     "lucide-react": "^0.277.0",
     "mermaid": "^10.3.1",
     "nanoid": "^5.0.2",

@marcusschiesser
Collaborator

See #77 for how to use Ollama.
