Error occurs in structured data extraction using deeply nested Pydantic models #1871

Open
rohitgarud opened this issue Nov 28, 2024 · 9 comments

Comments

@rohitgarud

A Pydantic type hint on a Signature output field generates a deeply nested JSON schema (example below). The schema is huge, and smaller models such as Llama 3.1 8B, served through an OpenAI-compatible API, fail with the error:

litellm.exceptions.BadRequestError: litellm.BadRequestError: OpenAIException - Error code: 400 - {'error': "'response_format.type' must be 'json_schema'"}

While experimenting, I came across BAML. I tried calling the LLM directly after rewriting the JSON schema into something similar to BAML's representation, and it worked really well.
Example of the JSON schema from Pydantic, which is injected directly into the DSPy prompt:

{
    "$defs": {
        "Address": {
            "properties": {
                "street": {
                    "title": "Street",
                    "type": "string"
                },
                "city": {
                    "title": "City",
                    "type": "string"
                },
                "state": {
                    "pattern": "^[A-Z]{2}$",
                    "title": "State",
                    "type": "string"
                },
                "postal_code": {
                    "pattern": "^\\d{5}$",
                    "title": "Postal Code",
                    "type": "string"
                },
                "country": {
                    "default": "USA",
                    "title": "Country",
                    "type": "string"
                }
            },
            "required": [
                "street",
                "city",
                "state",
                "postal_code"
            ],
            "title": "Address",
            "type": "object"
        },
        "ContactInfo": {
            "properties": {
                "email": {
                    "format": "email",
                    "title": "Email",
                    "type": "string"
                },
                "phone": {
                    "anyOf": [
                        {
                            "pattern": "^\\+1\\d{10}$",
                            "type": "string"
                        },
                        {
                            "type": "null"
                        }
                    ],
                    "default": null,
                    "title": "Phone"
                },
                "website": {
                    "anyOf": [
                        {
                            "format": "uri",
                            "maxLength": 2083,
                            "minLength": 1,
                            "type": "string"
                        },
                        {
                            "type": "null"
                        }
                    ],
                    "title": "Website"
                }
            },
            "required": [
                "email",
                "website"
            ],
            "title": "ContactInfo",
            "type": "object"
        },
        "Product": {
            "properties": {
                "productId": {
                    "description": "Product code",
                    "pattern": "^[A-Z]{3}-\\d{4}$",
                    "title": "Productid",
                    "type": "string"
                },
                "name": {
                    "title": "Name",
                    "type": "string"
                },
                "price": {
                    "description": "Price must be non-negative",
                    "minimum": 0.0,
                    "title": "Price",
                    "type": "number"
                },
                "available": {
                    "default": true,
                    "description": "Is the product available?",
                    "title": "Available",
                    "type": "boolean"
                },
                "created_at": {
                    "format": "date-time",
                    "title": "Created At",
                    "type": "string"
                }
            },
            "required": [
                "productId",
                "name",
                "price"
            ],
            "title": "Product",
            "type": "object"
        },
        "Role": {
            "enum": [
                "admin",
                "user",
                "guest"
            ],
            "title": "Role",
            "type": "string"
        },
        "User": {
            "properties": {
                "id": {
                    "title": "Id",
                    "type": "integer"
                },
                "name": {
                    "description": "The user's full name",
                    "maxLength": 100,
                    "title": "Full Name",
                    "type": "string"
                },
                "role": {
                    "$ref": "#/$defs/Role",
                    "default": "user",
                    "description": "User role (admin, user, or guest)"
                },
                "signup_date": {
                    "description": "Signup timestamp",
                    "format": "date-time",
                    "title": "Signup Date",
                    "type": "string"
                },
                "active": {
                    "description": "Is the user currently active?",
                    "title": "Active",
                    "type": "boolean"
                },
                "addresses": {
                    "items": {
                        "$ref": "#/$defs/Address"
                    },
                    "title": "Addresses",
                    "type": "array"
                },
                "contact_info": {
                    "$ref": "#/$defs/ContactInfo"
                }
            },
            "required": [
                "id",
                "name",
                "active",
                "addresses",
                "contact_info"
            ],
            "title": "User",
            "type": "object"
        }
    },
    "properties": {
        "order_id": {
            "title": "Order Id",
            "type": "integer"
        },
        "user": {
            "$ref": "#/$defs/User"
        },
        "products": {
            "items": {
                "$ref": "#/$defs/Product"
            },
            "title": "Products",
            "type": "array"
        },
        "total_price": {
            "minimum": 0.0,
            "title": "Total Price",
            "type": "number"
        }
    },
    "required": [
        "order_id",
        "user",
        "products",
        "total_price"
    ],
    "title": "Order",
    "type": "object"
}

I have written a script to convert it into something like the following:

{
    "order_id": "integer",
    "user": {
        "id": "integer",
        "name": "string  //The user's full name",
        "role": "'admin' or 'user' or 'guest'",
        "signup_date": "string or null  //Signup timestamp",
        "active": "boolean  //Is the user currently active?",
        "addresses": {
            "street": "string",
            "city": "string",
            "state": "string  //pattern: ^[A-Z]{2}$",
            "postal_code": "string  //pattern: ^\\d{5}$",
            "country": "string (defaults to USA)"
        },
        "contact_info": {
            "email": "string",
            "phone": "string or null",
            "website": "string or null"
        }
    },
    "products": {
        "productId": "string  //Product code  //pattern: ^[A-Z]{3}-\\d{4}$",
        "name": "string",
        "price": "number  //Price must be non-negative",
        "available": "boolean (defaults to True)  //Is the product available?",
        "created_at": "string or null"
    },
    "total_price": "number"
}

This is more concise, uses fewer tokens, and gives good results after some postprocessing. Something similar to BAML could be implemented to parse the output into the desired format.
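
The conversion described above could look roughly like the following. This is a minimal sketch, not the script from the gist: the names `simplify` and `simplify_schema` are made up for illustration, only local `#/$defs/...` references are handled, recursive models are not supported, and arrays are rendered as a one-element list rather than collapsed into the item shape as in the example output above.

```python
from typing import Any


def simplify(node: dict[str, Any], defs: dict[str, Any]) -> Any:
    """Turn one JSON-schema node into a short type hint, a dict, or a list."""
    if "$ref" in node:
        # Resolve a local reference and keep sibling keys such as "description".
        target = defs[node["$ref"].rsplit("/", 1)[-1]]
        siblings = {k: v for k, v in node.items() if k != "$ref"}
        return simplify({**target, **siblings}, defs)
    if "enum" in node:
        text = " or ".join(repr(value) for value in node["enum"])
    elif "anyOf" in node:
        # Assumption: every option is a plain typed node, e.g. string or null.
        text = " or ".join(opt.get("type", "any") for opt in node["anyOf"])
    elif node.get("type") == "object":
        props = node.get("properties", {})
        return {name: simplify(prop, defs) for name, prop in props.items()}
    elif node.get("type") == "array":
        return [simplify(node.get("items", {}), defs)]
    else:
        text = node.get("type", "any")
    # Append the hints a model is most likely to need, in a fixed order.
    if node.get("default") is not None:
        text += f" (defaults to {node['default']})"
    if "description" in node:
        text += f"  //{node['description']}"
    if "pattern" in node:
        text += f"  //pattern: {node['pattern']}"
    return text


def simplify_schema(schema: dict[str, Any]) -> Any:
    """Simplify a full schema as produced by Pydantic's model_json_schema()."""
    return simplify(schema, schema.get("$defs", {}))


# A small hand-written schema in the same shape as the Pydantic output above.
example = simplify_schema({
    "$defs": {
        "Role": {"enum": ["admin", "user"], "title": "Role", "type": "string"},
        "Address": {
            "properties": {
                "state": {"pattern": "^[A-Z]{2}$", "type": "string"},
                "country": {"default": "USA", "type": "string"},
            },
            "type": "object",
        },
    },
    "properties": {
        "order_id": {"title": "Order Id", "type": "integer"},
        "role": {"$ref": "#/$defs/Role"},
        "addresses": {"items": {"$ref": "#/$defs/Address"}, "type": "array"},
        "phone": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": None},
    },
    "type": "object",
})
```

Keeping the result as plain dicts, lists, and strings means it can be serialized with `json.dumps` or rendered in a more compact custom form before being placed in the prompt.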

Please let me know your thoughts about this approach.

@rohitgarud changed the title from "Error occurs in structured data extraction using deeply nested Pydantic models as signature types while using with small local models" to "Error occurs in structured data extraction using deeply nested Pydantic models" on Nov 29, 2024
@rohitgarud
Author

I see that the output of model_dump_json() from the Pydantic model is injected directly into the prompt. The function mentioned above could be applied before that step to simplify the prompt.

@okhat
Collaborator

okhat commented Nov 29, 2024

Seems quite interesting!

@rohitgarud
Author

rohitgarud commented Dec 1, 2024

@okhat This is the Script to process the Pydantic JSON schema. I hope the community will help to test and improve it.
Gist for Process Schema

@rohitgarud
Author

rohitgarud commented Dec 1, 2024

litellm.exceptions.BadRequestError: litellm.BadRequestError: OpenAIException - Error code: 400 - {'error': "'response_format.type' must be 'json_schema'"}

The error is due to the following part in the __call__ of JSONAdapter

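try:
    # The provider is the prefix before the first "/" in the model name.
    provider = lm.model.split('/', 1)[0] or "openai"
    # Request JSON mode only if litellm reports that the provider supports response_format.
    if 'response_format' in litellm.get_supported_openai_params(model=lm.model, custom_llm_provider=provider):
        outputs = lm(**inputs, **lm_kwargs, response_format={ "type": "json_object" })
    else:
        outputs = lm(**inputs, **lm_kwargs)
# Only UnsupportedParamsError triggers the retry without response_format.
except litellm.UnsupportedParamsError:
    outputs = lm(**inputs, **lm_kwargs)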

In my case, the above fails for an OpenAI-compatible API hosted with LM Studio.
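
One possible way to tolerate servers that reject `json_object` is to retry without `response_format` when the request is refused. The sketch below is an illustration only, not DSPy's code: `call_with_json_fallback` is a hypothetical helper, `lm` is any callable, and matching on the text of the error message is an assumption (in the traceback above, the real exception is `litellm.exceptions.BadRequestError`).

```python
from typing import Any, Callable


def call_with_json_fallback(lm: Callable[..., Any], **kwargs: Any) -> Any:
    """Call lm in JSON mode first, then retry plainly if the server rejects it."""
    try:
        return lm(**kwargs, response_format={"type": "json_object"})
    except Exception as exc:
        # Assumption: a rejected response_format is identifiable from the message.
        if "response_format" not in str(exc):
            raise
        return lm(**kwargs)


# Stand-in for a server that only accepts 'json_schema', as in the error above.
calls: list[dict[str, Any]] = []


def fake_lm(**kwargs: Any) -> str:
    calls.append(kwargs)
    if "response_format" in kwargs:
        raise ValueError("'response_format.type' must be 'json_schema'")
    return '{"order_id": 1}'


result = call_with_json_fallback(fake_lm, prompt="Extract the order.")
```

The retry costs one extra request per call on such servers, so caching the outcome per model would be a natural refinement.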

@okhat
Collaborator

okhat commented Dec 1, 2024

Thanks so much @rohitgarud !! This does seem elegant to me, but we'll need to test out whether it's more portable/reliable than the current approach or not.

Tagging @dbczumar, who is thinking about how best to handle types in adapters, for future reference.

@rohitgarud
Author

Thank you, @okhat. I am working on a custom JSONAdapter for data extraction with smaller models and APIs for which response_format={ "type": "json_object" } is not available. I will share it soon.

@rohitgarud
Author

rohitgarud commented Dec 6, 2024

@okhat This is the Script to process the Pydantic JSON schema. I hope the community will help to test and improve it. Gist for Process Schema

I have updated the ProcessSchema script to output the JSON-like schema directly as a string without quotes, which further reduces the number of tokens. It works really well even with smaller local models. Defining the Pydantic models well is crucial for avoiding Pydantic validation errors on the extracted JSON.

I am also working on having the model return JSON without quotes or formatting newlines. json-repair can handle such output and produce valid JSON.
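
As a rough illustration of a schema rendered as a string without quotes, the sketch below prints a simplified schema dict in that style. It is not the updated ProcessSchema script: `render` is a hypothetical name, and leaving out the commas between fields is a choice made here so that trailing `//` hints stay unambiguous.

```python
from typing import Any


def render(node: Any, indent: int = 0) -> str:
    """Render nested dicts, lists, and type hints as an unquoted JSON-like string."""
    pad = "  " * indent
    if isinstance(node, dict):
        # One field per line, indented one level deeper than the braces.
        lines = [f"{pad}  {key}: {render(value, indent + 1)}" for key, value in node.items()]
        return "{\n" + "\n".join(lines) + f"\n{pad}}}"
    if isinstance(node, list):
        return "[" + ", ".join(render(item, indent) for item in node) + "]"
    return str(node)


rendered = render({"order_id": "integer", "user": {"id": "integer"}})
print(rendered)
```

Because the result is only ever read by the model, it does not need to be valid JSON; the extracted output is what gets parsed and validated.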

@rohitgarud
Author

rohitgarud commented Dec 6, 2024

This is the CustomJSONAdapter, which uses ProcessSchema. @okhat, I believe the basic Chat and JSON adapters could benefit from some restructuring to make their individual components easier to customize.

@rohitgarud
Author

@okhat This is the Script to process the Pydantic JSON schema. I hope the community will help to test and improve it. Gist for Process Schema

Can anybody help me write tests for this? There are a lot of scenarios.
