Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support alternate format for Utf8 unparsing (CHAR) #11494

Merged
merged 3 commits into from
Jul 17, 2024

Conversation

sgrebnov
Copy link
Member

Which issue does this PR close?

PR addresses Utf8 unparser issue producing invalid CAST(col AS TEXT) SQL for MySQL.

Rationale for this change

MySQL cast function does not support TEXT and requires CHAR for CAST (automatically returns VARCHAR, TEXT, LONGTEXT)

CHAR[(N)] [charset_info]
Produces a string with the VARCHAR data type, unless the expression expr is empty (zero length), in which case the result type is CHAR(0). If the optional length N is given, CHAR(N) causes the cast to use no more than N characters of the argument. No padding occurs for values shorter than N characters. If the optional length N is not given, MySQL calculates the maximum length from the expression. If the supplied or calculated length is greater than an internal threshold, the result type is TEXT. If the length is still too long, the result type is LONGTEXT.

Example query:

 select * from customer where c_custkey = 'building

Before this change (fails in MySQL)

SELECT `customer`.`c_custkey`, ... FROM `customer` WHERE (CAST(`customer`.`c_custkey` AS TEXT) = 'building')

After this change (works in MySQL)

SELECT `customer`.`c_custkey`,  ... FROM `customer` WHERE (CAST(`customer`.`c_custkey` AS CHAR) = 'building')

What changes are included in this PR?

PR introduces configurable use_char_for_utf8_cast dialect parameter that controls whether CHAR vs TEXT/VARCHAR is used for Utf8 unparsing.

Are these changes tested?

Yes, added unit tests + manual testing

Are there any user-facing changes?

CustomDialectBuilder now supports use_char_for_utf8_cast that can be used to specify whether CHAR vs TEXT/VARCHAR data type should be used for Utf8 unparsing.

@github-actions github-actions bot added the sql SQL Planner label Jul 16, 2024
@sgrebnov sgrebnov changed the title Support alternate format to use CHAR for Utf8 unparsing Support alternate format for Utf8 unparsing (CHAR) Jul 16, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @sgrebnov . I had a suggestion on the API -- let me know what you think

datafusion/sql/src/unparser/dialect.rs Outdated Show resolved Hide resolved
@sgrebnov
Copy link
Member Author

I had a suggestion on the API -- let me know what you think

@alamb - 👍 I was thinking about this as well, but was unable to come up with other good potential types for string unparsing (asf data types) so decided to proceed with this approach. As we have more people thinking that we should better to return the type here I'll update my PR.

Note - as we have different types for Utf8 and LargeUtf8 there will be two functions/settings.

@sgrebnov sgrebnov force-pushed the sgrebnov/utf8-unparsing branch from e263539 to 6c3b941 Compare July 17, 2024 00:57
@sgrebnov sgrebnov requested a review from alamb July 17, 2024 01:22
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @sgrebnov

@alamb alamb merged commit b0925c8 into apache:main Jul 17, 2024
23 checks passed
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
* Add dialect param to use CHAR instead of TEXT for Utf8 unparsing for MySQL (apache#12)

* Configurable data type instead of flag for Utf8 unparsing

* Fix type in comment
@phillipleblanc phillipleblanc deleted the sgrebnov/utf8-unparsing branch July 17, 2024 22:28
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
* Add dialect param to use CHAR instead of TEXT for Utf8 unparsing for MySQL (apache#12)

* Configurable data type instead of flag for Utf8 unparsing

* Fix type in comment
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
* Add dialect param to use CHAR instead of TEXT for Utf8 unparsing for MySQL (apache#12)

* Configurable data type instead of flag for Utf8 unparsing

* Fix type in comment
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
* Add dialect param to use CHAR instead of TEXT for Utf8 unparsing for MySQL (#12)

* Configurable data type instead of flag for Utf8 unparsing

* Fix type in comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants