Differentiate between one-row and multi-row structs #3

Open
krlmlr opened this issue Feb 19, 2024 · 7 comments

@krlmlr
Contributor

krlmlr commented Feb 19, 2024

Use a different class for each. For one-row structs, $ into list columns would unpack; for multi-row structs, we could offer a generic accessor function that selects rows according to equality with the first column.

@krlmlr
Contributor Author

krlmlr commented Feb 19, 2024

It's only useful if autocomplete actually picks it up.

@moodymudskipper
Collaborator

Is the auto-unnesting $ worth it? It comes with hidden surprises. To be consistent we would also need [[, and that means lapply() and map() would behave differently at a low level (lapply() uses [[, map() iterates on the true elements). df$nested_col[[1]] would return the first column of the nested df, to the surprise of a user expecting regular df behavior, which might occasionally translate into silent bugs.
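To make the concern concrete, here is a minimal base R sketch. It does not use the struct package; df, nested and nested_col are illustrative names, and the "unpacking" is simulated by hand rather than by any real $ method.

```r
# Base R only; all names here are hypothetical, not part of the struct package.
nested <- data.frame(x = 10, y = "a")
df <- data.frame(id = 1)
df$nested_col <- list(nested)        # a regular list column in a one-row data frame

# Regular behavior: `$` returns the list, `[[1]]` returns its first element.
regular <- df$nested_col[[1]]        # the nested data frame itself

# If `$` unpacked automatically, the user would already hold `nested`,
# and the same `[[1]]` would then select a column instead.
unpacked <- nested                   # stand-in for what an unpacking `$` would return
surprise <- unpacked[[1]]            # the `x` column, not the nested data frame

# lapply() on a data frame walks over its columns.
visited <- names(lapply(unpacked, identity))
```

The same expression, df$nested_col[[1]], would therefore mean "first list element" or "first column" depending on whether $ unpacks.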

This accessor according to first column looks a lot like row names.

@krlmlr
Contributor Author

krlmlr commented Feb 20, 2024

Do we really need [[ to be consistent with $? Let's first see whether autocomplete picks it up before continuing the discussion.

Databases don't have row names, tibbles will never have them.

@moodymudskipper
Collaborator

Now we have list_structs, which are lists, so there is no need for unpacking or unnesting.
And we have tibble_structs, whose rows we can extract into list_structs.
A problem is that we no longer have packed columns. To have packed columns in tibble_structs we'd need a new class in list_struct to allow the round trip, similar to how we have a "scalar" class to signal that we don't want to nest the value when going from list_struct to tibble_struct.

@krlmlr
Contributor Author

krlmlr commented Mar 7, 2024

A reprex would be nice ;-)

@moodymudskipper
Collaborator

moodymudskipper commented Mar 7, 2024

Yes, and my last statement about packed columns was misguided too. Here's a summary:

list_structs are classed lists, stricter than plain lists, and they print like one-row dfs with some custom pillar methods.

The printing method might be improved; the best option may be to display an improved tree
(something similar to str(), but tailored for structs; we could also have a glimpse method).

library(struct)
foo <- list_struct(
  a = scalar(1),
  a2 = 1, 
  b = tibble_struct(c = 2, d = 3), 
  e = list_struct(f = 4, g = 5)
)
foo
#> # list_struct object: 4 element(s)
#>   a            a2        b                  e             
#> * <dbl>        <dbl>     <tbbl_str[,2]>     <named list>  
#> 1 <scalar [1]> <dbl [1]> <tbbl_str [1 × 2]> <lst_strc [2]>
print_tree(foo)
#> █─ foo <lst_strc>
#> ├─── a <scalar>
#> ├─── a2 <dbl>
#> ├─█─ b <tbbl_str[,2]>
#>   ├─── c <dbl>
#>   ├─── d <dbl>
#> ├─█─ e <lst_strc>
#>   ├─── f <dbl>
#>   ├─── g <dbl>

We can bind list_structs into tibble_structs; scalars are not nested, the rest is nested.
We need some pillar methods here to differentiate the result from a standard tibble.

bar <- bind_structs(foo, foo)
class(bar)
#> [1] "tibble_struct" "tbl_df"        "tbl"           "data.frame"

# this tibble can be subset normally
bar[1,]
#> # A tibble: 1 × 4
#>       a a2        b                  e             
#>   <dbl> <list>    <list>             <list>        
#> 1     1 <dbl [1]> <tbbl_str [1 × 2]> <lst_strc [2]>
bar[[1]]
#> [1] 1 1

To go back to a struct we need to use extract, where extract is evaluated in bar and must return an integerish or logical vector.

identical(bar[extract = 1], foo)
#> [1] TRUE

Note that we have autocomplete after bar[extract = 1]$

I think bar is not correct at the moment: nested list_structs should be turned into packed columns rather than nested ones. We should have:

#> A tibble_struct: 2 × 4
#>       a a2        b                    e$f    $g
#> * <dbl> <list>    <list>             <dbl> <dbl>
#> 1     1 <dbl [1]> <tbbl_str [1 × 2]>     4     5
#> 2     1 <dbl [1]> <tbbl_str [1 × 2]>     4     5

where e is a packed tibble_struct
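For readers unfamiliar with packed columns, here is a minimal base R sketch of the idea. It uses a plain data.frame rather than a tibble_struct, and bar_like is a hypothetical stand-in for bar, so it only illustrates the shape being asked for.

```r
# Base R only; `bar_like` is a hypothetical stand-in for `bar`.
bar_like <- data.frame(a = c(1, 1))

# Assigning a data frame with matching row count creates a packed column:
# `e` is one column of `bar_like` that itself holds the columns `f` and `g`.
bar_like$e <- data.frame(f = c(4, 4), g = c(5, 5))

top_level <- names(bar_like)   # only "a" and "e" at the top level
inner_f   <- bar_like$e$f      # inner columns are reached with a second `$`

# Row subsetting carries the packed column along, one row at a time.
row1 <- bar_like[1, ]
```

Unlike a list column, the packed column stores f and g as ordinary vectors, so no per-row unnesting is needed to read them.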

@krlmlr
Contributor Author

krlmlr commented Mar 12, 2024

Agree that e should be a packed column.

drop = TRUE instead of extract = 1?
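For reference, base R data frames already give drop = TRUE a similar meaning when a single row is selected. The sketch below shows only that base R behavior, on a hypothetical data frame; whether tibble_struct should follow it is the open question here.

```r
# Base R only; `df` is an illustrative data frame, not a tibble_struct.
df <- data.frame(a = 1, a2 = 2)

# Selecting one row with an explicit drop = TRUE returns a plain named list.
one_row <- df[1, , drop = TRUE]

# Without an explicit `drop`, the one-row result stays a data frame.
still_df <- df[1, ]
```

So drop = TRUE would read as "give me the row as a list", which is close to what extract = 1 does above.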
