fst_table as a serious class #236

Open
hope-data-science opened this issue Mar 19, 2020 · 5 comments

@hope-data-science

I find fst_table a very useful class: you do not have to read the file into memory, yet you get enough information to know how to process it. Perhaps there could be more methods for it, e.g. is.fst.table, path.fst.table, summary.fst.table, etc. I think this is going to be popular for big data analysis in R.
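For readers unfamiliar with the class, here is a minimal sketch of what an fst_table reference already exposes without loading the column data. The helper is_fst_table is hypothetical (it is not part of fst) and only illustrates what a method like the requested is.fst.table might look like; the file contents are made up for the example.

```r
library(fst)

# Write a small example file to a temporary location
path <- tempfile(fileext = ".fst")
write_fst(data.frame(X = 1:5, Y = letters[1:5]), path)

# fst() returns a lightweight reference; the column data stays on disk
ft <- fst(path)

# Hypothetical helper (not part of fst): class test for fst_table objects
is_fst_table <- function(x) inherits(x, "fst_table")

is_fst_table(ft)   # class check
dim(ft)            # dimensions come from the file's metadata
names(ft)          # column names, also from the metadata

# metadata_fst() inspects the file header only
meta <- metadata_fst(path)
meta$columnNames
```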

@MarcusKlik MarcusKlik self-assigned this Apr 3, 2020
@MarcusKlik MarcusKlik added this to the Candidate milestone Apr 3, 2020
@MarcusKlik
Collaborator

Hi @hope-data-science, thanks for the feature request!

Having more options to manipulate and view characteristics of the offline dataset would be very useful indeed. But those can be better served in separate R packages that import fst for the low-level operations (such as the fstplyr, fsttable or your tidyfst packages).

So fst can provide the lower-level operations and access to meta-data, while downstream packages build on those to offer features through their own specific APIs. Does that sound reasonable?

For example, fst can provide the following low-level abilities:

  • read from file using custom (random) row-filters
  • read from file using a custom ordering
  • read from file using group-windows (in the background) and apply custom R operations on each group
  • read from file and sort the result while reading (on background threads)
  • join two fst files using (sorted) keys

Downstream packages could use these features to support their own APIs and provide functionality like offline sorting, partial loading, etc.
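To make the group-window idea concrete, here is a rough sketch of how a downstream package could emulate it today with the existing read_fst arguments (columns, from, to). The helper apply_by_group_fst is hypothetical, runs on the main thread rather than in the background, and assumes the file is already sorted by the key column and has at least one row.

```r
library(fst)

# Hypothetical helper: apply `fun` to each group of a file that is already
# sorted by `key`, loading one group at a time.
apply_by_group_fst <- function(path, key, fun) {
  # Read only the key column to locate the group boundaries
  keys <- read_fst(path, columns = key)[[key]]
  runs <- rle(as.vector(keys))
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  # Read each contiguous row range and apply the user function to it
  results <- lapply(seq_along(starts), function(g) {
    fun(read_fst(path, from = starts[g], to = ends[g]))
  })
  names(results) <- runs$values
  results
}

# Usage on a small made-up file
path <- tempfile(fileext = ".fst")
write_fst(data.frame(G = c("a", "a", "b", "b", "b"), V = 1:5), path)
sums <- apply_by_group_fst(path, "G", function(d) sum(d$V))
```

Only the key column and one group are in memory at any time, which is the property a native implementation inside fst would presumably preserve while adding background threads.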

@hope-data-science
Author

I am not so familiar with the underlying implementation; what you call "low-level abilities" are actually quite "high-level" to me. If fst could do these things faster and more memory-efficiently, I think that would be amazing! To begin with, my expectations are just:

How can data be accessed more efficiently from an fst file? How can data be subset more flexibly (by group? filter? slice? select? [I think I've handled this part in some way])?

I did make a function named filter_fst, but that might not be fast. I think fst could help to facilitate the access part very well. And about the computation part, if that can really be brought to us, that is a brand new revolution! I think that will open a new era to do computation out-of-memory, especially for some tough tasks.
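As an illustration of the access problem, here is a minimal sketch of a chunk-wise filter built on read_fst's from and to arguments. The name filter_fst_chunked and its parameters are hypothetical; this is not the filter_fst implementation mentioned above. It assumes the file has at least one row and that predicate returns a logical vector with one entry per row of the chunk.

```r
library(fst)

# Hypothetical helper: filter an fst file chunk by chunk, so that only
# `chunk_size` rows (plus the accumulated matches) are held in memory.
filter_fst_chunked <- function(path, predicate, chunk_size = 1e5) {
  n <- nrow(fst(path))                 # row count from the file metadata
  starts <- seq(1, n, by = chunk_size)
  pieces <- lapply(starts, function(from) {
    to <- min(from + chunk_size - 1, n)
    chunk <- read_fst(path, from = from, to = to)
    chunk[predicate(chunk), , drop = FALSE]
  })
  res <- do.call(rbind, pieces)
  rownames(res) <- NULL                # reset row names after binding
  res
}

# Usage on a small made-up file
path <- tempfile(fileext = ".fst")
write_fst(data.frame(X = 1:10, Y = (1:10) * 2), path)
res <- filter_fst_chunked(path, function(d) d$X > 7, chunk_size = 4)
```

This still decompresses every column of every chunk; a filter implemented inside fst could avoid that by reading only the columns needed for the predicate first.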

BTW, a small problem: I am trying to get the zero-row version of an fst table but failed. With a data.frame or data.table, you can use DT[0,] to get the column names and classes, which facilitates selection. Maybe an fst table could do that too? Currently I use ft[1,][0,] to get it, which works but is a little verbose. And if there are lots of columns, this may take some time. Is it possible to make ft[0,] work?

Thanks!

@MarcusKlik
Collaborator

Hi @hope-data-science, you're right, ft[0, ] should definitely have an output identical to DT[0, ], using the example code above:

# identical
x[1, ]
#>   X Y
#> 1 1 2
fst_table[1, ]
#>   X Y
#> 1 1 2

# not identical
x[0, ]
#> [1] X Y
#> <0 rows> (or 0-length row.names)
fst_table[0, ]
#> Error in read_fst(meta_info$path, from = min_row, to = max_row): Parameter 'from' should have a numerical value equal or larger than 1.

Thanks for pointing that out, I'll schedule a fix for the next release!
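Until ft[0, ] is supported, the workaround described earlier in the thread can be wrapped in a small function. This is only a sketch: zero_rows_fst is a hypothetical name, not part of fst, and it assumes the file contains at least one row.

```r
library(fst)

# Hypothetical helper: return a zero-row data.frame carrying the file's
# column names and classes, reading only the first row from disk.
zero_rows_fst <- function(path) {
  first <- read_fst(path, from = 1, to = 1)
  first[0, , drop = FALSE]   # drop = FALSE keeps single-column results as data.frames
}

# Usage on a small made-up file
path <- tempfile(fileext = ".fst")
write_fst(data.frame(X = 1L, Y = 2), path)
proto <- zero_rows_fst(path)
sapply(proto, class)
```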

@MarcusKlik
Collaborator

MarcusKlik commented Apr 3, 2020

added as a separate issue

@hope-data-science
Author

I've designed a new tool to work with fst, which is considered to be more memory efficient.
Link: https://hope-data-science.github.io/tidyft/articles/Introduction.html
