Request: A quantile function for ordinal data only #27367

Open
pdeffebach opened this issue Jun 1, 2018 · 8 comments
Labels
feature Indicates new feature / enhancement requests

Comments

@pdeffebach
Contributor

There are a variety of types of data for which order is defined, but not other mathematical operations. It seems to be the consensus, for instance, that the Date type should not have + or / defined for it.

However, if you have a vector of dates, you might still want to know the "quantiles" of those dates. If you can sort a vector of dates, because < is defined, you can ask "What is the 25th percentile of dates in my vector?"

You can't do this with the current quantile function because, when the requested quantile falls between two elements, it interpolates between them, which requires arithmetic on the values.

R's quantile function has the argument type, and calling quantile(x, ..., type = 1) returns the lower of the two values in the case of a tie.

I am currently working on a better describe function for returning summary statistics of a DataFrame, and think it would be useful to return a quantile-like value for ordinal data. Unfortunately, such a function is not defined either here or in StatsBase.

quantile is a super well-written function in Base, being clever enough to only sort values between the minimum and maximum percentiles asked for. Writing an ordinal quantile function in StatsBase would essentially mean rewriting the quantile function entirely. Rather, I think it makes sense to add a new function here, call it ordquantile or something, that keeps everything in the current quantile function except for the part that takes the mean of ties, and returns the lower value instead.
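As a rough illustration of the idea, here is a minimal sketch of a lower-value (R type 1 style) quantile that relies only on ordering. The name ordquantile is the hypothetical one suggested above and does not exist in Base or StatsBase. For simplicity the sketch sorts the whole vector, so it does not reproduce the partial sorting that Base's quantile does.

```julia
using Dates

# Hypothetical helper, not part of Base or StatsBase.
# Returns the element at the inverse empirical CDF position (R "type 1"),
# so it only needs an ordering on the elements, never + or /.
function ordquantile(v::AbstractVector, p::Real)
    0 <= p <= 1 || throw(ArgumentError("p must be in [0, 1]"))
    isempty(v) && throw(ArgumentError("input vector is empty"))
    s = sort(v)                        # full sort, for simplicity
    n = length(s)
    k = clamp(ceil(Int, n * p), 1, n)  # smallest k with k/n >= p
    return s[k]
end

dates = [Date(2018, 6, 1), Date(2018, 1, 15), Date(2018, 3, 10), Date(2018, 9, 30)]
q25 = ordquantile(dates, 0.25)  # an element of `dates`, no arithmetic on dates
```

Because the result is always one of the input elements, the return type matches the element type of the input.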

Does this reasoning make sense for it to live here?

@nalimilan
Member

How about adding an argument to quantile?

@pdeffebach
Contributor Author

For reference, here is the relevant code. We would just add a new function, modeled on @inline function _quantile(v::AbstractVector, p::Real), that gets called when the keyword argument ordinal = true is passed.

@andreasnoack
Member

I agree that this would be useful. This was discussed a bit in https://github.com/JuliaLang/julia/issues/19190#issuecomment-257885325, #19359 (comment), and in the very old #1333.

@matthieugomez
Contributor

matthieugomez commented Jun 2, 2018

Would it be so bad to change the default quantile to type 1 (lower value)? It sounds like it would simplify a lot of things (type stability, quantile of ordinal data). Type 7 is also harder to extend with weights. I believe Stata uses type 1 by default, btw.

@nalimilan
Member

Doesn't Stata default to types 6 or 2 depending on the commands (see this)? Anyway, I think we should use the same type for median and quantile by default, and AFAIK the standard definition of the median is the middle of the two central elements.

@matthieugomez
Contributor

matthieugomez commented Jun 2, 2018

Actually yes you are right. Stata uses type 2 which does give the median as the average of the two central elements.
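To make the difference concrete, here is a small sketch comparing the two definitions on an even-length vector. Both function names are hypothetical and exist in neither Base nor StatsBase; the definitions follow the usual type 1 and type 2 descriptions, and the integer check on n * p is subject to floating-point rounding for some values of p.

```julia
# Hypothetical helpers for illustration only.

# Type 1: inverse empirical CDF; always returns an element of the data.
function quantile_type1(v::AbstractVector{<:Real}, p::Real)
    s = sort(v)
    n = length(s)
    return s[clamp(ceil(Int, n * p), 1, n)]
end

# Type 2: same as type 1, except that when n * p is an integer
# the two neighbouring elements are averaged.
function quantile_type2(v::AbstractVector{<:Real}, p::Real)
    s = sort(v)
    n = length(s)
    h = n * p
    k = ceil(Int, h)
    lo = clamp(k, 1, n)
    hi = clamp(isinteger(h) ? k + 1 : k, 1, n)  # neighbour only at a discontinuity
    return (s[lo] + s[hi]) / 2
end

x = [4, 1, 3, 2]
m1 = quantile_type1(x, 0.5)  # lower of the two central elements
m2 = quantile_type2(x, 0.5)  # average of the two central elements
```

With type 2 the median of an even-length vector is the average of the two central elements, which is why it needs arithmetic and cannot be applied to purely ordinal data.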

@pdeffebach
Contributor Author

pdeffebach commented Jun 4, 2018

To be clear, Julia doesn't have implementations for all the different quantile versions described here, right?

Should there be an effort to implement all 9? That seems excessive, but maybe there is more demand.

Maybe I should write a Quantiles package that implements all of them?

@nalimilan
Member

No, we only support one variant currently. I'm not sure there's real demand for supporting all variants; most software only implements a few of them. But having a way to compute them would still be useful to replicate results, either in a package or in Base.

@brenhinkeller brenhinkeller added the feature Indicates new feature / enhancement requests label Nov 21, 2022
5 participants