Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/new embedding global variable #18809

Draft
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

charleschile
Copy link
Contributor

@charleschile charleschile commented Sep 14, 2024

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #

What this PR does / why we need it:


PR Type

Enhancement, Tests


Description

  • Implemented LLM chunking, embedding, and text extraction functions in the SQL plan.
  • Added support for LLM secondary index definitions and handling in DDL operations.
  • Integrated Ollama service for embeddings and defined system variables for LLM configurations.
  • Added unit and integration tests for LLM functions, covering various scenarios.

Changes walkthrough 📝

Relevant files
Enhancement
14 files
func_llm.go
Implement LLM chunking and embedding functions                     

pkg/sql/plan/function/func_llm.go

  • Implemented various chunking strategies for text processing.
  • Added functions for LLM chunking, embedding, and text extraction.
  • Integrated file reading and error handling for LLM operations.
  • +706/-0 
    build_ddl.go
    Add LLM secondary index support in DDL                                     

    pkg/sql/plan/build_ddl.go

  • Added support for building LLM secondary index definitions.
  • Created functions to handle LLM index table creation.
  • +164/-0 
    ddl.go
    Integrate LLM index handling in DDL operations                     

    pkg/sql/compile/ddl.go

  • Integrated LLM index handling in table alteration processes.
  • Added logic for managing LLM index definitions.
  • +88/-0   
    ddl_index_algo.go
    Implement LLM index table handling                                             

    pkg/sql/compile/ddl_index_algo.go

  • Implemented handling for LLM index basic and chunk embedding tables.
  • Added functions to manage LLM index entries.
  • +168/-0 
    ollama_service.go
    Add Ollama service interaction for embeddings                       

    pkg/sql/plan/function/ollama_service.go

  • Added functions to interact with the Ollama service for embeddings.
  • Implemented HTTP request handling for Ollama API.
  • +154/-0 
    secondary_index_utils.go
    Add LLM index type and utilities                                                 

    pkg/catalog/secondary_index_utils.go

  • Added LLM index type and utility functions.
  • Defined default options for LLM index algorithms.
  • +42/-0   
    list_builtIn.go
    Register LLM functions in built-in list                                   

    pkg/sql/plan/function/list_builtIn.go

  • Registered new LLM functions in the built-in function list.
  • Defined overloads for LLM chunking and embedding functions.
  • +101/-0 
    datalink.go
    Implement Datalink URL parsing and validation                       

    pkg/container/types/datalink.go

  • Implemented parsing for Datalink type URLs.
  • Added validation for file paths and offsets.
  • +81/-0   
    embedding_serviece.go
    Introduce embedding service interface                                       

    pkg/sql/plan/function/embedding_serviece.go

  • Introduced an interface for embedding services.
  • Implemented Ollama embedding service.
  • +63/-0   
    function_id.go
    Add function IDs for LLM operations                                           

    pkg/sql/plan/function/function_id.go

  • Added new function IDs for LLM operations.
  • Updated function ID register with LLM functions.
  • +11/-0   
    types.go
    Define constants for LLM secondary index                                 

    pkg/catalog/types.go

  • Defined constants for LLM secondary index table types and columns.
  • Added LLM-related catalog entries.
  • +14/-0   
    create.go
    Add LLM index type to parser                                                         

    pkg/sql/parsers/tree/create.go

  • Added LLM index type to parser.
  • Updated index type enumeration with LLM.
  • +3/-0     
    keywords.go
    Add LLM keyword to MySQL parser                                                   

    pkg/sql/parsers/dialect/mysql/keywords.go

  • Added LLM keyword to MySQL dialect parser.
  • Updated keyword map with LLM support.
  • +1/-0     
    mysql_sql.y
    Integrate LLM index in SQL grammar                                             

    pkg/sql/parsers/dialect/mysql/mysql_sql.y

  • Integrated LLM index type in SQL grammar.
  • Added parsing rules for LLM index options.
  • +23/-1   
    Tests
    6 files
    func_llm_test.go
    Add unit tests for ChunkString function                                   

    pkg/sql/plan/function/func_llm_test.go

  • Added unit tests for the ChunkString function.
  • Covered various chunking strategies and error cases.
  • +100/-0 
    func_llm_chunk.sql
    Add test cases for LLM chunk function                                       

    test/distributed/cases/function/func_llm_chunk.sql

  • Added test cases for LLM chunk function.
  • Covered various chunking modes and inputs.
  • +14/-0   
    1.txt
    Add test resource file for LLM chunking                                   

    test/distributed/resources/llm_test/chunk/1.txt

    • Added test resource file for LLM chunking.
    +1/-0     
    2.txt
    Add test resource file for LLM chunking                                   

    test/distributed/resources/llm_test/chunk/2.txt

    • Added test resource file for LLM chunking.
    +1/-0     
    3.txt
    Add test resource file for LLM chunking                                   

    test/distributed/resources/llm_test/chunk/3.txt

    • Added test resource file for LLM chunking.
    +3/-0     
    4.txt
    Add test resource file for LLM chunking                                   

    test/distributed/resources/llm_test/chunk/4.txt

    • Added test resource file for LLM chunking.
    +3/-0     
    Configuration changes
    1 files
    variables.go
    Add system variables for LLM embedding                                     

    pkg/frontend/variables.go

  • Added system variables for LLM embedding configurations.
  • Defined default values for LLM-related settings.
  • +32/-0   
    Additional files (token-limit)
    1 files
    mysql_sql.go
    ...                                                                                                           

    pkg/sql/parsers/dialect/mysql/mysql_sql.go

    ...

    +9115/-9077

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 Security concerns

    Sensitive information exposure:
    The LLMChunk function in pkg/sql/plan/function/func_llm.go reads files based on user input without proper validation. This could potentially lead to unauthorized file access if the input is not properly sanitized or if the file permissions are not correctly set. Proper input validation and access controls should be implemented to mitigate this risk.

    ⚡ Recommended focus areas for review

    Error Handling
    The error handling in the LLM functions could be improved. Many errors are returned directly without additional context, which may make debugging difficult.

    Security Concern
    The LLMChunk function reads files based on user input without proper validation, which could lead to unauthorized file access if not properly secured.

    Error Handling
    The error handling in the Ollama service functions could be improved. Some errors are wrapped with additional context, but this approach is not consistent throughout the file.

    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Add error handling for the LLM response generation to prevent silent failures and improve robustness

    Consider adding error handling for the getOllamaGeneratedResponse function call.
    Currently, if an error occurs, it's not being checked or handled, which could lead
    to unexpected behavior or silent failures.

    pkg/sql/plan/function/func_llm.go [698-700]

     prompt := fmt.Sprintf("Please answer the question: [%s], based on the context: [%s]", question, context)
     response, err := getOllamaGeneratedResponse(prompt, llmModelStr, proxyStr)
    +if err != nil {
    +    return err
    +}
     
     rs.AppendMustBytesValue(util.UnsafeStringToBytes(response))
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: Adding error handling for the getOllamaGeneratedResponse function is crucial for preventing silent failures and ensuring robustness. This suggestion addresses a potential issue and enhances the code's reliability.

    8
    Enhancement
    Refactor type handling into a separate function for improved code organization and maintainability

    Consider using a switch statement for the type handling to improve readability and
    maintainability. This would make it easier to add new types in the future.

    pkg/sql/compile/util.go [180-214]

     if planCol.Name == catalog.CPrimaryKeyColName {
         continue
     }
     if i >= 1 {
         sql += ","
     }
     sql += "`" + planCol.Name + "`" + " "
     typeId := types.T(planCol.Typ.Id)
    -switch typeId {
    -case types.T_char:
    -    sql += fmt.Sprintf("CHAR(%d)", planCol.Typ.Width)
    -case types.T_varchar:
    -    sql += fmt.Sprintf("VARCHAR(%d)", planCol.Typ.Width)
    -case types.T_binary:
    -    sql += fmt.Sprintf("BINARY(%d)", planCol.Typ.Width)
    -case types.T_varbinary:
    -    sql += fmt.Sprintf("VARBINARY(%d)", planCol.Typ.Width)
    -case types.T_decimal64:
    -    sql += fmt.Sprintf("DECIMAL(%d,%d)", planCol.Typ.Width, planCol.Typ.Scale)
    -case types.T_decimal128:
    -    sql += fmt.Sprintf("DECIMAL(%d,%d)", planCol.Typ.Width, planCol.Typ.Scale)
    -case types.T_array_float32:
    -    sql += fmt.Sprintf("VECF32(%d)", planCol.Typ.Width)
    -case types.T_array_float64:
    -    sql += fmt.Sprintf("VECF64(%d)", planCol.Typ.Width)
    -case types.T_datalink:
    -    sql += fmt.Sprintf("Datalink")
    -default:
    -    sql += typeId.String()
    +sql += getTypeString(typeId, planCol.Typ.Width, planCol.Typ.Scale)
    +
    +// Define getTypeString function
    +func getTypeString(typeId types.T, width, scale int32) string {
    +    switch typeId {
    +    case types.T_char:
    +        return fmt.Sprintf("CHAR(%d)", width)
    +    case types.T_varchar:
    +        return fmt.Sprintf("VARCHAR(%d)", width)
    +    case types.T_binary:
    +        return fmt.Sprintf("BINARY(%d)", width)
    +    case types.T_varbinary:
    +        return fmt.Sprintf("VARBINARY(%d)", width)
    +    case types.T_decimal64, types.T_decimal128:
    +        return fmt.Sprintf("DECIMAL(%d,%d)", width, scale)
    +    case types.T_array_float32:
    +        return fmt.Sprintf("VECF32(%d)", width)
    +    case types.T_array_float64:
    +        return fmt.Sprintf("VECF64(%d)", width)
    +    case types.T_datalink:
    +        return "Datalink"
    +    default:
    +        return typeId.String()
    +    }
     }
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: The suggestion to refactor type handling into a separate function improves code organization and maintainability. It makes the code cleaner and easier to extend with new types in the future, which is beneficial for long-term maintenance.

    7
    Add error handling for hardcoded LLM global variables to improve robustness

    Consider implementing error handling for the hardcoded values in
    getLLMGlobalVariable function. This will make the code more robust when
    transitioning from hardcoded values to actual variable resolution.

    pkg/sql/plan/function/embedding_serviece.go [56-57]

     func getLLMGlobalVariable(proc *process.Process) (string, string, string, error) {
    -    return "ollama", "http://localhost:11434/api/embed", "llama3", nil
    +    platform := "ollama"
    +    proxy := "http://localhost:11434/api/embed"
    +    model := "llama3"
    +    
    +    if platform == "" || proxy == "" || model == "" {
    +        return "", "", "", moerr.NewInternalErrorNoCtx("LLM global variables are not properly set")
    +    }
    +    
    +    return platform, proxy, model, nil
     }
    • Apply this suggestion
    Suggestion importance[1-10]: 6

    Why: Adding error handling for hardcoded LLM global variables enhances the robustness of the code. This is a useful improvement, especially when transitioning from hardcoded values to dynamic variable resolution, although the current implementation is not incorrect.

    6
    Group related constants and add comments to improve code organization and readability

    Consider grouping related constants together and adding comments to explain their
    purpose or usage. This can improve code readability and maintainability.

    pkg/sql/parsers/dialect/mysql/mysql_sql.go [347-356]

    +// LLM-related constants
     const LLM = 57671
    +
    +// Data structure constants
     const ZONEMAP = 57672
    +const LISTS = 57677
    +
    +// String manipulation constants
     const LEADING = 57673
     const BOTH = 57674
     const TRAILING = 57675
    +
    +// Miscellaneous constants
     const UNKNOWN = 57676
    -const LISTS = 57677
     const OP_TYPE = 57678
     const REINDEX = 57679
     const EXPIRE = 57680
    • Apply this suggestion
    Suggestion importance[1-10]: 5

    Why: Grouping related constants and adding comments can enhance code readability and maintainability. However, this suggestion does not address any critical issues or bugs, so its impact is moderate.

    5
    Best practice
    Use iota for sequential constant declarations to improve maintainability

    Consider using an enum or iota for these constants to ensure they are sequential and
    easier to maintain.

    pkg/sql/parsers/dialect/mysql/mysql_sql.go [347-356]

    -const LLM = 57671
    -const ZONEMAP = 57672
    -const LEADING = 57673
    -const BOTH = 57674
    -const TRAILING = 57675
    -const UNKNOWN = 57676
    -const LISTS = 57677
    -const OP_TYPE = 57678
    -const REINDEX = 57679
    -const EXPIRE = 57680
    +const (
    +    LLM = iota + 57671
    +    ZONEMAP
    +    LEADING
    +    BOTH
    +    TRAILING
    +    UNKNOWN
    +    LISTS
    +    OP_TYPE
    +    REINDEX
    +    EXPIRE
    +)
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: Using iota for sequential constants is a good practice that simplifies maintenance and reduces the risk of errors in constant values. This suggestion improves the code's maintainability, making it more impactful.

    7
    Improve error handling and logging for the LLM embedding process to facilitate debugging and enhance system robustness

    Consider implementing proper error handling and logging for the LLM embedding
    process. Currently, if an error occurs during embedding, it's not clear how it's
    being handled or reported.

    pkg/sql/plan/function/func_llm.go [196-202]

     embedding, err := embeddingService.GetEmbedding(input, llmModelStr, proxyStr)
     if err != nil {
    -    return err
    +    // Log the error for debugging purposes
    +    log.Printf("Error getting embedding: %v", err)
    +    // Consider how to handle this error - maybe append a null value or a placeholder?
    +    return fmt.Errorf("failed to get embedding: %w", err)
     }
     embeddingBytes = types.ArrayToBytes[float32](embedding)
     
     if err := rs.AppendBytes(embeddingBytes, false); err != nil {
    -    return err
    +    return fmt.Errorf("failed to append embedding bytes: %w", err)
     }
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: The suggestion to improve error handling and add logging for the embedding process is beneficial for debugging and robustness. It provides a clear and actionable improvement to the existing error handling strategy.

    7
    Introduce a constant for the default size value to enhance code clarity

    Consider using a constant for the default size value (-1) to improve code
    readability and maintainability.

    pkg/container/types/datalink.go [60-70]

    -offsetSize := []int{0, -1}
    -if _, ok := urlParams["offset"]; ok {
    -    if offsetSize[0], err = strconv.Atoi(urlParams["offset"]); err != nil {
    +const defaultSize = -1
    +offsetSize := []int{0, defaultSize}
    +if offsetStr, ok := urlParams["offset"]; ok {
    +    if offsetSize[0], err = strconv.Atoi(offsetStr); err != nil {
             return "", nil, "", err
         }
     }
    -if _, ok := urlParams["size"]; ok {
    -    if offsetSize[1], err = strconv.Atoi(urlParams["size"]); err != nil {
    +if sizeStr, ok := urlParams["size"]; ok {
    +    if offsetSize[1], err = strconv.Atoi(sizeStr); err != nil {
             return "", nil, "", err
         }
     }
    • Apply this suggestion
    Suggestion importance[1-10]: 5

    Why: Using a constant for the default size value improves code readability and maintainability. While the change is minor, it helps in understanding the code better and makes future modifications easier.

    5
    Performance
    Use a more efficient string concatenation method for better performance when building large strings

    Consider using a more efficient string concatenation method, such as
    strings.Builder, instead of += for building the 'context' string in the loop. This
    can significantly improve performance, especially for large amounts of text.

    pkg/sql/plan/function/func_llm.go [680-694]

    +var contextBuilder strings.Builder
     for _, filePath := range results {
         fs := proc.GetFileService()
         moUrl, offsetSize, err := ParseDatalink(filePath, proc)
         if err != nil {
             return err
         }
         r, err := ReadFromFileOffsetSize(moUrl, fs, int64(offsetSize[0]), int64(offsetSize[1]))
         if err != nil {
             return err
         }
     
         defer r.Close()
     
         fileBytes, err := io.ReadAll(r)
         if err != nil {
             return err
         }
    -    context += util.UnsafeBytesToString(fileBytes)
    +    contextBuilder.WriteString(util.UnsafeBytesToString(fileBytes))
     }
    +context := contextBuilder.String()
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: The suggestion to use strings.Builder for string concatenation is valid and can improve performance, especially when dealing with large strings. This change is relevant and beneficial for the code's efficiency.

    7

    💡 Need additional feedback ? start a PR chat

    @charleschile charleschile marked this pull request as draft October 16, 2024 15:21
    @m-schen
    Copy link
    Contributor

    m-schen commented Oct 24, 2024

    新增文件缺少license信息。 (function目录下)

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    2 participants