sql_database.with_resources(*) reflects unnecessary table schemas #2121

jeff-skoldberg-gmds · 2024-12-04T19:02:35Z

dlt version

1.4.0

Describe the problem

Summary

This code does schema reflection for every table and column in the database.schema, even if only one table is passed in.

source = sql_database(
    credentials=self.connection_string,
    schema=schema,
    reflection_level="full",
).with_resources(*table_names)

However, this code does exactly what a user would expect, it only does schema reflection for the necessary columns:

source = sql_database(
    table_names=table_names,
    credentials=self.connection_string,
    schema=schema,
    reflection_level="full",
)

bug or feature?

There is a debate in the slack thread that this is not a bug. But I would ask these questions:

Would a user ever want dlt to do schema reflections for tables & columns they are not using?
Can a user leverage the extra column schema reflections? And even if they can, what would they do with it?
Is there value to reflecting the schema for unused columns? If yes, what value?

If the answers are negative on these questions, I feel like we have to categorize this as a bug.

Personally, I can't imagine a single user would want dlt to reflect the schema for random columns they are not using.

impact to my code

using table_names= kwarg limits me with certain features, since now I am not defining specific resources, it is just table names. Therefore I have to do hints after the source is created instead of using config up front with the hints. (Maybe I'm doing something wrong, but that has been my experience as a non-expert here...)

Expected behavior

I would expect that only tables and columns requested by the user undergo schema reflection.

Steps to reproduce

Run this:

source = sql_database(
    credentials=self.connection_string,
    schema=schema,
    reflection_level="full",
).with_resources(*table_names)

I think this is the same behavior for all sql_database types, but I am using DB2 if that matters at all.

Operating system

Windows

Runtime environment

Local

Python version

3.11

dlt data source

sql_database

dlt destination

DuckDB, Snowflake

Other deployment details

No response

Additional information

Work around is to pass a list of tables then do something like this:

            if tbl.get("incremental_column"):
                initial_value = (
                    datetime.strptime(tbl["initial_value"], "%Y-%m-%d")
                    if tbl.get("initial_value")
                    else None
                )

                logger.info(
                    f"Applying hints for table: {tbl['name']}, cursor column: {tbl['incremental_column']}, initial value: {initial_value}, write disposition: {tbl['write_disposition']}, primary key: {tbl['primary_key']}"
                )
                table_resource.apply_hints(
                    incremental=dlt.sources.incremental(
                        cursor_path=tbl["incremental_column"],
                        initial_value=initial_value,
                        lag=172800,  # 2 days to account for late arriving data
                    ),
                    write_disposition="merge",
                    primary_key=tbl["primary_key"],
                )

The text was updated successfully, but these errors were encountered:

rudolfix · 2024-12-10T09:48:40Z

@jeff-skoldberg-gmds I do not understand the kind of problem you have with passing a list of tables upfront. I assume you know them? The problem you describe comes from the way Python executes code: it will first create sql_database instance and then call with_resources on it. That's why the code in sql_database is not aware of what happens later.

How we can make your code simpler? If you know all your table names in advance, you can pass them to the sql_database with the defer_table_reflect set to True. In that case reflection happens only when particular table is selected so you will be able to use with_resources several times on the same instance to get copies and tables that you do not finally extract won't be reflected at all.

the other options is passing your own metadata with all tables already defined. You mention that you do not want to reflect the columns that you do not use so if you know your schema in advance you could just create the metadata yourself.

Another option coming to my mind is that we add an option that fully skips reflection and in that case we just infer everything from the data and you must provide a custom SELECT statement to extract

jeff-skoldberg-gmds · 2024-12-11T02:38:11Z

Hi @rudolfix , thanks for your questions here. Let me see if I can organize my thoughts.

why do I not want to pass a list of tables

I like the idea of defining resources up front in config, including what column is the cursor column, etc, write disposition, etc.
I haven't tried, but I guessed you can't pass a list of dlt resources to the tables argument. Is that wrong?
If I go with a list of tables, I think I have to apply these hints AFTER the source is created, and all of these hints are now applies as code instead of config.
Maybe I'm thinking about this all wrong. I will admit, I'm not really sure on the best approach here.

Passing a list of tables is working for me, but I would like the flexibility to pass resources.

`defer_table_reflection` plus `with_resources`

Sounds very promising. I will have to test to see if this solves it. I trust it does...
This sounds like it is the fix.

skipping reflection, passing a SQL query

This honestly sounds like a dreamy bonus feature. I have posted a lot of challenges in Slack about schema reflection. Pyarrow seems to very much hate my source data.
Passing a SQL query could also provide other hacks for filtering the data (start date, custom look backs, etc.)
Also I could eliminate some views in the upstream database where they did not provide a good cursor column.
Do you want me to create a feature request for this?

closing thoughts

While I see that I have some good work-arounds to solve the unnecessary reflection (defer reflection), and I understand logically why the code does what it does, this unnecessary reflection is behavior that no one would ever use. I guess if it is simple to turn off with "defer reflection", then it is not worth fixing. But man, it just seems like something you don't want in your product.

github-project-automation bot added this to dlt core library Dec 4, 2024

github-project-automation bot moved this to Todo in dlt core library Dec 4, 2024

rudolfix moved this from Todo to In Progress in dlt core library Dec 9, 2024

rudolfix self-assigned this Dec 10, 2024

rudolfix added the question Further information is requested label Dec 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql_database.with_resources(*) reflects unnecessary table schemas #2121

sql_database.with_resources(*) reflects unnecessary table schemas #2121

jeff-skoldberg-gmds commented Dec 4, 2024

rudolfix commented Dec 10, 2024

jeff-skoldberg-gmds commented Dec 11, 2024 •

edited

Loading

sql_database.with_resources(*) reflects unnecessary table schemas #2121

sql_database.with_resources(*) reflects unnecessary table schemas #2121

Comments

jeff-skoldberg-gmds commented Dec 4, 2024

dlt version

Describe the problem

Summary

bug or feature?

impact to my code

Expected behavior

Steps to reproduce

Operating system

Runtime environment

Python version

dlt data source

dlt destination

Other deployment details

Additional information

rudolfix commented Dec 10, 2024

jeff-skoldberg-gmds commented Dec 11, 2024 • edited Loading

why do I not want to pass a list of tables

defer_table_reflection plus with_resources

skipping reflection, passing a SQL query

closing thoughts

jeff-skoldberg-gmds commented Dec 11, 2024 •

edited

Loading

`defer_table_reflection` plus `with_resources`