fst_table as a serious class #236

hope-data-science · 2020-03-19T07:07:11Z

I find fst_table a very useful class: you do not have to read the file into memory, yet you get enough information to know how to process it. Perhaps there could be more methods for it, e.g. is.fst.table, path.fst.table, summary.fst.table, etc. I think this is going to be popular for big data analysis in R.
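As a rough illustration of the point above, the sketch below inspects an fst file through its metadata without loading the column data. It uses `fst()`, `write_fst()` and `metadata_fst()` from the fst package; the helper `is_fst_table` is hypothetical (it is not part of fst) and only stands in for the kind of method being requested.

```r
library(fst)

# Write a small example file (temporary path, illustrative data only).
path <- tempfile(fileext = ".fst")
write_fst(data.frame(X = 1:10, Y = seq(0.5, 5, by = 0.5)), path)

# fst() returns a reference to the file; column data is not read here.
ft <- fst(path)

# Hypothetical helper, not part of fst: test for the fst_table class.
is_fst_table <- function(x) inherits(x, "fst_table")

is_fst_table(ft)
dim(ft)             # dimensions taken from the file metadata
names(ft)           # column names taken from the file metadata
metadata_fst(path)  # prints a summary of the stored columns
```

Methods such as a path accessor or a summary could be layered on top in the same way, in fst itself or in a downstream package.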


MarcusKlik · 2020-04-03T13:19:43Z

Hi @hope-data-science, thanks for the feature request!

Having more options to manipulate and view characteristics of the offline dataset would be very useful indeed. But those can be better served in separate R packages that import fst for the low-level operations (such as the fstplyr, fsttable or your tidyfst packages).

So fst can provide the lower-level operations and access to metadata, while downstream packages build on those to offer features through their own specific APIs. Does that sound reasonable?

For example, fst can provide the following low-level abilities:

read from file using custom (random) row-filters
read from file using a custom ordering
read from file using group-windows (in the background) and apply custom R operations on each group
read from file and sort the result while reading (on background threads)
join two fst files using (sorted) keys

Downstream packages could use these features to support their own APIs and provide functionality like offline sorting, partial loading, etc.

hope-data-science · 2020-04-03T13:36:12Z

I am not so familiar with the implementation underneath; what you mention as "low-level abilities" are actually quite "high-level" to me. If these abilities could be provided in fst, faster and more memory efficient, I think that would be amazing! To begin with, my expectations are just:

How to access data more efficiently from an fst file? How to subset data more flexibly (by group? filter? slice? select? [I think I've handled this part in some way])?

I did make a function named filter_fst, but that might not be fast. I think fst could help to facilitate the access part very well. And about the computation part, if that can really be brought to us, that is a brand new revolution! I think that will open a new era to do computation out-of-memory, especially for some tough tasks.
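One way such a filter can be built on what fst already offers is sketched below: read only the column used in the condition, find the matching rows, then read just the row range that spans them. This is an assumption-laden sketch, not the actual filter_fst implementation; the helper name `filter_rows_fst` and its arguments are hypothetical. Only `read_fst()` with its `columns`, `from` and `to` arguments comes from fst.

```r
library(fst)

# Hypothetical helper, not part of fst or tidyfst.
# `column` is the name of one column, `keep` a function returning a logical vector.
filter_rows_fst <- function(path, column, keep) {
  key <- read_fst(path, columns = column)[[1]]   # read one column only
  hits <- which(keep(key))
  if (length(hits) == 0) {
    # No match: return a zero-row result (assumes the file has at least one row).
    return(read_fst(path, from = 1, to = 1)[0, , drop = FALSE])
  }
  # Read the smallest contiguous row range that covers all matches.
  block <- read_fst(path, from = min(hits), to = max(hits))
  block[hits - min(hits) + 1, , drop = FALSE]
}

path <- tempfile(fileext = ".fst")
write_fst(data.frame(id = 1:100, value = (1:100)^2), path)

out <- filter_rows_fst(path, "id", function(v) v > 90 & v <= 95)
```

This pays off mainly when matches are clustered; if they are scattered over the whole file, the second read still covers most rows.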

BTW, a small problem: I am trying to get the zero-row version of an fst table but failed. With a data.frame or data.table, you can use DT[0,] to get the column names and classes, which facilitates selection. Maybe fst table could do that too? Currently I use ft[1,][0,] for that. It works, but it is a little verbose, and if there are lots of columns it may take some time. Is it possible to make ft[0,] work?

Thanks!
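Until ft[0,] works, the workaround can be wrapped in a small function. The sketch below does the same thing as ft[1,][0,] via `read_fst()`: it reads a single row and drops it, keeping the column names and classes. The helper name `zero_rows_fst` is hypothetical, and the approach assumes the file holds at least one row.

```r
library(fst)

# Hypothetical helper, not part of fst: return a zero-row data.frame that
# keeps the column names and classes of the file at `path`.
# Assumes the file holds at least one row.
zero_rows_fst <- function(path) {
  first_row <- read_fst(path, from = 1, to = 1)  # reads a single row only
  first_row[0, , drop = FALSE]
}

path <- tempfile(fileext = ".fst")
write_fst(data.frame(X = 1:3, Y = c(0.5, 1.5, 2.5)), path)

template <- zero_rows_fst(path)
names(template)          # column names
sapply(template, class)  # column classes
```

It still touches every column once, so the cost for very wide files remains.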

MarcusKlik · 2020-04-03T14:10:54Z

Hi @hope-data-science, you're right, ft[0, ] should definitely have an output identical to DT[0, ], using the example code above:

# identical
x[1, ]
#>   X Y
#> 1 1 2
fst_table[1, ]
#>   X Y
#> 1 1 2

# not identical
x[0, ]
#> [1] X Y
#> <0 rows> (or 0-length row.names)
fst_table[0, ]
#> Error in read_fst(meta_info$path, from = min_row, to = max_row): Parameter 'from' should have a numerical value equal or larger than 1.

Thanks for pointing that out, I'll schedule a fix for the next release!
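For anyone who wants to reproduce the comparison, a self-contained sketch follows. The data frame is an assumption chosen to match the printed output (one row with X = 1 and Y = 2), and `ft` plays the role of the fst table. The zero-row call is wrapped in `tryCatch()` because its behaviour depends on the installed fst version.

```r
library(fst)

# Assumed setup: a one-row data.frame matching the printed output.
x <- data.frame(X = 1, Y = 2)
path <- tempfile(fileext = ".fst")
write_fst(x, path)
ft <- fst(path)

# First row: both objects should show the same values.
x[1, ]
ft[1, ]

# Zero rows: works on the data.frame; on the fst table this either returns
# a zero-row data.frame or raises the error reported in this thread.
x[0, ]
zero <- tryCatch(ft[0, ], error = function(e) conditionMessage(e))
zero
```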

MarcusKlik · 2020-04-03T14:21:53Z

added as a separate issue

hope-data-science · 2020-04-11T04:30:22Z

I've designed a new tool to work with fst, which is considered to be more memory efficient.
Link: https://hope-data-science.github.io/tidyft/articles/Introduction.html

MarcusKlik self-assigned this Apr 3, 2020

MarcusKlik added the feature request label Apr 3, 2020

MarcusKlik added this to the Candidate milestone Apr 3, 2020
