Error occurs in structured data extraction using deeply nested Pydantic models #1871

rohitgarud · 2024-11-28T12:20:57Z

This is an example of a deeply nested JSON schema generated from a Pydantic type hint on a Signature OutputField. The schema is huge and causes smaller models, such as Llama 3.1 8B served through an OpenAI-compatible API, to fail with the error:

litellm.exceptions.BadRequestError: litellm.BadRequestError: OpenAIException - Error code: 400 - {'error': "'response_format.type' must be 'json_schema'"}

While experimenting, I came across BAML and tried calling the LLM directly after rewriting the JSON schema into something similar to BAML's representation, and it worked really well.
Example of the JSON schema from Pydantic, which is injected directly into the DSPy prompt:

{
    "$defs": {
        "Address": {
            "properties": {
                "street": {
                    "title": "Street",
                    "type": "string"
                },
                "city": {
                    "title": "City",
                    "type": "string"
                },
                "state": {
                    "pattern": "^[A-Z]{2}$",
                    "title": "State",
                    "type": "string"
                },
                "postal_code": {
                    "pattern": "^\\d{5}$",
                    "title": "Postal Code",
                    "type": "string"
                },
                "country": {
                    "default": "USA",
                    "title": "Country",
                    "type": "string"
                }
            },
            "required": [
                "street",
                "city",
                "state",
                "postal_code"
            ],
            "title": "Address",
            "type": "object"
        },
        "ContactInfo": {
            "properties": {
                "email": {
                    "format": "email",
                    "title": "Email",
                    "type": "string"
                },
                "phone": {
                    "anyOf": [
                        {
                            "pattern": "^\\+1\\d{10}$",
                            "type": "string"
                        },
                        {
                            "type": "null"
                        }
                    ],
                    "default": null,
                    "title": "Phone"
                },
                "website": {
                    "anyOf": [
                        {
                            "format": "uri",
                            "maxLength": 2083,
                            "minLength": 1,
                            "type": "string"
                        },
                        {
                            "type": "null"
                        }
                    ],
                    "title": "Website"
                }
            },
            "required": [
                "email",
                "website"
            ],
            "title": "ContactInfo",
            "type": "object"
        },
        "Product": {
            "properties": {
                "productId": {
                    "description": "Product code",
                    "pattern": "^[A-Z]{3}-\\d{4}$",
                    "title": "Productid",
                    "type": "string"
                },
                "name": {
                    "title": "Name",
                    "type": "string"
                },
                "price": {
                    "description": "Price must be non-negative",
                    "minimum": 0.0,
                    "title": "Price",
                    "type": "number"
                },
                "available": {
                    "default": true,
                    "description": "Is the product available?",
                    "title": "Available",
                    "type": "boolean"
                },
                "created_at": {
                    "format": "date-time",
                    "title": "Created At",
                    "type": "string"
                }
            },
            "required": [
                "productId",
                "name",
                "price"
            ],
            "title": "Product",
            "type": "object"
        },
        "Role": {
            "enum": [
                "admin",
                "user",
                "guest"
            ],
            "title": "Role",
            "type": "string"
        },
        "User": {
            "properties": {
                "id": {
                    "title": "Id",
                    "type": "integer"
                },
                "name": {
                    "description": "The user's full name",
                    "maxLength": 100,
                    "title": "Full Name",
                    "type": "string"
                },
                "role": {
                    "$ref": "#/$defs/Role",
                    "default": "user",
                    "description": "User role (admin, user, or guest)"
                },
                "signup_date": {
                    "description": "Signup timestamp",
                    "format": "date-time",
                    "title": "Signup Date",
                    "type": "string"
                },
                "active": {
                    "description": "Is the user currently active?",
                    "title": "Active",
                    "type": "boolean"
                },
                "addresses": {
                    "items": {
                        "$ref": "#/$defs/Address"
                    },
                    "title": "Addresses",
                    "type": "array"
                },
                "contact_info": {
                    "$ref": "#/$defs/ContactInfo"
                }
            },
            "required": [
                "id",
                "name",
                "active",
                "addresses",
                "contact_info"
            ],
            "title": "User",
            "type": "object"
        }
    },
    "properties": {
        "order_id": {
            "title": "Order Id",
            "type": "integer"
        },
        "user": {
            "$ref": "#/$defs/User"
        },
        "products": {
            "items": {
                "$ref": "#/$defs/Product"
            },
            "title": "Products",
            "type": "array"
        },
        "total_price": {
            "minimum": 0.0,
            "title": "Total Price",
            "type": "number"
        }
    },
    "required": [
        "order_id",
        "user",
        "products",
        "total_price"
    ],
    "title": "Order",
    "type": "object"
}

I have written a script to convert it into something like the following:

{
    "order_id": "integer",
    "user": {
        "id": "integer",
        "name": "string  //The user's full name",
        "role": "'admin' or 'user' or 'guest'",
        "signup_date": "string or null  //Signup timestamp",
        "active": "boolean  //Is the user currently active?",
        "addresses": {
            "street": "string",
            "city": "string",
            "state": "string  //pattern: ^[A-Z]{2}$",
            "postal_code": "string  //pattern: ^\\d{5}$",
            "country": "string (defaults to USA)"
        },
        "contact_info": {
            "email": "string",
            "phone": "string or null",
            "website": "string or null"
        }
    },
    "products": {
        "productId": "string  //Product code  //pattern: ^[A-Z]{3}-\\d{4}$",
        "name": "string",
        "price": "number  //Price must be non-negative",
        "available": "boolean (defaults to True)  //Is the product available?",
        "created_at": "string or null"
    },
    "total_price": "number"
}
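For readers who want to try the idea, here is a minimal sketch of such a conversion. It is not the script from the gist; the function names (`resolve_ref`, `simplify`, `simplify_schema`) are hypothetical. It assumes that all references are local `#/$defs/...` refs, that `anyOf` options are simple scalar types, and that the models are not recursive. Unlike the output shown above, it keeps arrays as one-element lists and appends defaults and descriptions to enum fields.

```python
from typing import Any


def resolve_ref(node: dict, defs: dict) -> dict:
    """Inline a local "#/$defs/Name" reference, keeping sibling keys such as description."""
    if "$ref" not in node:
        return node
    name = node["$ref"].split("/")[-1]
    merged = dict(defs[name])
    merged.update({k: v for k, v in node.items() if k != "$ref"})
    return merged


def simplify(node: dict, defs: dict) -> Any:
    """Turn one schema node into a nested dict, a one-element list, or a type string."""
    node = resolve_ref(node, defs)
    if node.get("type") == "object" and "properties" in node:
        return {name: simplify(sub, defs) for name, sub in node["properties"].items()}
    if node.get("type") == "array":
        return [simplify(node.get("items", {}), defs)]
    if "enum" in node:
        text = " or ".join(repr(value) for value in node["enum"])
    elif "anyOf" in node:
        # Assumption: every option is a plain scalar such as string or null.
        text = " or ".join(option.get("type", "any") for option in node["anyOf"])
    else:
        text = node.get("type", "any")
    if node.get("default") is not None:
        text += f" (defaults to {node['default']})"
    if "description" in node:
        text += f"  //{node['description']}"
    if "pattern" in node:
        text += f"  //pattern: {node['pattern']}"
    return text


def simplify_schema(schema: dict) -> Any:
    """Entry point: takes a schema dict such as the one from Model.model_json_schema()."""
    return simplify(schema, schema.get("$defs", {}))


# A cut-down version of the Order schema above, used only for illustration.
schema = {
    "$defs": {
        "Role": {"enum": ["admin", "user", "guest"], "type": "string"},
        "Address": {
            "properties": {
                "state": {"pattern": "^[A-Z]{2}$", "type": "string"},
                "country": {"default": "USA", "type": "string"},
            },
            "type": "object",
        },
    },
    "properties": {
        "order_id": {"type": "integer"},
        "role": {"$ref": "#/$defs/Role", "default": "user", "description": "User role"},
        "addresses": {"items": {"$ref": "#/$defs/Address"}, "type": "array"},
        "phone": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": None},
    },
    "type": "object",
}

simplified = simplify_schema(schema)
```

With a real model, the input would presumably be `Order.model_json_schema()`; the hand-written dict here only keeps the example free of dependencies.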

This is more concise, uses fewer tokens, and gives good results after some postprocessing. Something similar to BAML could be implemented to parse the output into the desired format.

Please let me know your thoughts about this approach.


rohitgarud · 2024-11-29T05:26:24Z

I see that the output of model_dump_json() from the Pydantic model is injected directly into the prompt; the function mentioned above could be applied before that step to simplify the prompt.

okhat · 2024-11-29T12:43:49Z

Seems quite interesting!

rohitgarud · 2024-12-01T06:57:16Z

@okhat This is the Script to process the Pydantic JSON schema. I hope the community will help to test and improve it.
Gist for Process Schema

rohitgarud · 2024-12-01T09:12:17Z

litellm.exceptions.BadRequestError: litellm.BadRequestError: OpenAIException - Error code: 400 - {'error': "'response_format.type' must be 'json_schema'"}

The error is due to the following part in the __call__ of JSONAdapter

try:
    provider = lm.model.split('/', 1)[0] or "openai"
    if 'response_format' in litellm.get_supported_openai_params(model=lm.model, custom_llm_provider=provider):
        outputs = lm(**inputs, **lm_kwargs, response_format={ "type": "json_object" })
    else:
        outputs = lm(**inputs, **lm_kwargs)

except litellm.UnsupportedParamsError:
    outputs = lm(**inputs, **lm_kwargs)

In my case, the above fails for an OpenAI-compatible API hosted with LM Studio.
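One possible workaround, sketched here under assumptions, is to retry without response_format when the server rejects the json_object mode. This is not what DSPy does, and `call_with_fallback`, `BadRequest`, and `fake_lm` are hypothetical names; `BadRequest` stands in for litellm.BadRequestError, and `fake_lm` imitates a server that rejects the parameter the way the LM Studio endpoint does in the traceback.

```python
class BadRequest(Exception):
    """Stand-in for the provider error (litellm.BadRequestError in the traceback)."""


def call_with_fallback(lm, inputs, lm_kwargs, bad_request=BadRequest):
    """Try JSON mode first; if the server rejects it, retry as a plain completion."""
    try:
        return lm(**inputs, **lm_kwargs, response_format={"type": "json_object"})
    except bad_request:
        # The endpoint does not accept this response_format, so rely on the
        # prompt instructions and on parsing the raw text afterwards.
        return lm(**inputs, **lm_kwargs)


def fake_lm(prompt, response_format=None, **kwargs):
    """Imitates an endpoint that refuses any response_format value."""
    if response_format is not None:
        raise BadRequest("'response_format.type' must be 'json_schema'")
    return ['{"order_id": 1}']


outputs = call_with_fallback(fake_lm, {"prompt": "Extract the order."}, {"temperature": 0.0})
```

With a real LM object, one would presumably pass `bad_request=litellm.BadRequestError`. The trade-off is one extra failed request per call on servers that do not support the mode.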

okhat · 2024-12-01T16:33:00Z

Thanks so much @rohitgarud !! This does seem elegant to me, but we'll need to test out whether it's more portable/reliable than the current approach or not.

Tagging @dbczumar who is thinking about how to best handle types in adapters for future reference to this.

rohitgarud · 2024-12-02T05:19:01Z

Thank you, @okhat. I am working on a custom JSONAdapter for data extraction with smaller models/APIs for which response_format={ "type": "json_object" } is not available. I will share it soon.

rohitgarud · 2024-12-06T17:55:47Z

@okhat This is the Script to process the Pydantic JSON schema. I hope the community will help to test and improve it. Gist for Process Schema

I have updated the ProcessSchema script to directly output a JSON-like schema as a string without quotes, which further reduces the number of tokens. It works really well even with smaller local models. Defining the Pydantic models well is crucial for avoiding Pydantic validation errors on the extracted JSON.
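As a rough illustration of the quote-free rendering described above (a sketch, not the updated gist; `render_unquoted` is a hypothetical name), a simplified schema held as nested dicts, lists, and type strings can be printed like this:

```python
import json


def render_unquoted(node, indent=0):
    """Render nested dicts/lists of type strings as JSON-like text without quotes."""
    pad = "  " * indent
    if isinstance(node, dict):
        body = ",\n".join(
            f"{pad}  {key}: {render_unquoted(value, indent + 1)}"
            for key, value in node.items()
        )
        return "{\n" + body + "\n" + pad + "}"
    if isinstance(node, list):
        return "[" + ", ".join(render_unquoted(item, indent) for item in node) + "]"
    return str(node)


simplified = {"order_id": "integer", "user": {"id": "integer"}}
rendered = render_unquoted(simplified)
quoted = json.dumps(simplified, indent=2)
```

Here `rendered` has fewer characters than `quoted` because the quotes are dropped; the actual token savings depend on the tokenizer and were not measured for this sketch.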

I am also working on getting the output JSON without quotes and without newline characters used for formatting; json-repair can handle such output and returns valid JSON.

rohitgarud · 2024-12-06T18:25:33Z

This is the CustomJSONAdapter using the ProcessSchema. @okhat I believe that the Basic Chat and JSON adapters could benefit from some restructuring to facilitate easier customization of their individual components

rohitgarud · 2024-12-07T07:18:05Z

@okhat This is the Script to process the Pydantic JSON schema. I hope the community will help to test and improve it. Gist for Process Schema

Can anybody help me write tests for this? There are a lot of scenarios.

rohitgarud changed the title ~~Error occurs in structured data extraction using deeply nested Pydantic models as signature types while using with small local models~~ Error occurs in structured data extraction using deeply nested Pydantic models Nov 29, 2024
