-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Replace infix ~ for formulas with a model macro #9
Conversation
Hah, so here's the thing. The curse of space-delimited macro arguments: @model y => x1 + x2 - 1 # becomes Formula(:y, :(x1 + x2 - 1))
@model y => x1 + x2 -1 # errors, since -1 is parsed as a macro argument I'm not sure what the best course of action for this is. For this approach to continue to be viable, we may have to require using parentheses, e.g. @model(y => x1 + x2 -1) # parsed identically regardless of spacing That's not ideal, but it's not so bad (IMO). Thoughts? |
Thanks for doing this. Indeed it looks like we're going to have to recommend using the parenthesized call to avoid spacing issues. Regarding the name of the macro, why not Finally, I would still use |
Someone will have to implement the parsing change before we can be sure, but I suspect For all but the simplest one-block annotation macros, I think parenthesizing is better form anyway. |
You could also do something like creating a new expression for the RHS is there's more than 2 arguments. Something like FWIW I'm in favor of using parentheses, but I imagine some people might still try to use the non-parenthesized syntax because it's used all over the place. A reasonable compromise might be to do the transformation (combining all the RHS arguments with |
Thanks for the feedback and help here, guys!
I went with
I'm not sure what you mean about the direction. Can you elaborate? I went with I had played around with using
Should be fine, actually. Once
That seems reasonable to me. Then it's just a vararg macro, right? |
Outside of R and friends, I've seen the tilde used in a kind of descriptive way in statistics/ML papers, to describe in a more informal way the form of how a random variable depends on others. E.g., |
Right, ~ is used for "distributed as" when describing the probability distribution of a random variable. But does that really translate to models? I suppose in the sense of explaining the variance of Y based on some X1, X2, ..., Xp it sort of makes sense, as the Xi have their own probability distributions. I guess I just haven't seen it for specifying models though. In my experience, that's nearly always =. |
Yes, "distributed as" is a much more concise way of saying what I was thinking. I've only ever seen |
We definitely need to encourage parens, since |
Simon makes a good point, and I think that makes a compelling case for requiring parentheses here, which would also make the vararg thing unnecessary. Edit: Well, the requirement can't be enforced, but we can at least tell people that they may get unexpected behavior without parentheses. |
length(ex.args) == 3 || error("malformed expression in formula") | ||
lhs = Base.Meta.quot(ex.args[2]) | ||
rhs = Base.Meta.quot(ex.args[3]) | ||
elseif ex.head === :(~) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be :call
for most infix operators?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think you're right. I was reusing the logic for =>
, which isn't a call
.
Okay, now that I've fixed my dumb Vim find/replace mistakes and the tests are passing, here's what it looks like: @formula(y ~ 1 + x1 + x2 & x3) # bare object
fit(SomeModel, @formula(y ~ 1 + x1 + x2 & x3), ...) # in context Does that seem reasonable? |
Hm, once That's something else I like about |
It's just that the dependent variable is on the LHS, so it feels weird that the
I kind of like |
Going back to the "is distributed as" interpretation of
in mathematical notation and very close to that as a Julia expression. It may be a little too wordy to write out the expression for the linear predictor in place of 𝐗 in that expression but it definitely relates the model to the expression. For generalized linear models it is even more meaningful. The probability model for logistic regression is
using I have seen that model written incorrectly as
or something like that so often that I have lost track. The point is that people want to write the model in a
form and have a simplified expression for the distribution of I think I still would be in favour of an alternative notation for the model formula but I just wanted to note that the |
Great explanation, Doug. Thanks! Makes total sense.
Very good point. I guess in my mind I was thinking of the use of the arrow as pointing to what we're modeling the response as a function of, so "LHS as a function of (=>) RHS." I'm on board with |
AFAICT, as Tony said, |
Precedence is set on an operator by operator basis whether or not there's a meaning in base, I don't think all currently unassigned operators are at the same precedence. Given these packages are the main user so far (even after it gets changed to parse as a normal operator), I suspect it can be given whichever precedence would be most convenient for usage here. But it may change if someone comes up with a new meaning for the operator that should have different precedence. |
Okay, sounds good to me. |
a2ef44d
to
81c5c4e
Compare
Any further comments or are we ready to commit to doing formulas this way? |
I think this is good to go. Can you update the documentation to reflect these changes? |
Ah right, documentation! I had forgotten that some people like to know how to use software. 😜 Will do. Thanks for the reminder, @kleinschmidt! |
To be perfectly honest, I only thought of it because I just wrote this documentation... |
Does anyone think it's worth saying something about why we're using an explicit macro, or would that be too much information for the documentation? I don't have a good sense of whether explaining these kind of design decisions is helpful to users or just noise... One reason why it might be good is that it might preempt griping of the "why can't we have naked formulas like in R" sort. |
I think documenting the design decision could be useful so long as it's separate from the usage documentation--otherwise it's a little noisy IMO. Though following Base Julia's lead, we could leave questions of design decisions to "search the GitHub issues/PRs." 😉 |
I agree with separating from usage. I'd rather be explicit about it (since a lot of what happens on github assumes a lot of context that a curious user might not have), but it's also more work to summarize things in a concise but useful way... |
? Edit: Edited to incorporate comments below |
I might say "while in R..." (instead of "but") |
"R uses nonstandard evaluation" -> maybe "R supports/allows nonstandard evaluation" |
Good suggestions, thanks! I've edited my comment. How does that look now? If it looks alright I'll stick it in a commit and send it up here. |
I might even say something like "Julia, unlike R, uses macros to explicitly indicate when code itself will be manipulated before it's evaluated", just to emphasize why this is a good idea (or at least reasonable). |
Current coverage is 93.56% (diff: 100%)@@ master #9 diff @@
==========================================
Files 5 5
Lines 307 311 +4
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 290 291 +1
- Misses 17 20 +3
Partials 0 0
|
Sorry for the delay. I've incorporated the comments in the docs regarding the reasoning behind macros for formulas. Unless there are further comments, I'll go ahead and merge this later today. I'm trying to get this good to go ASAP in light of the impending feature freeze for Base Julia v0.6, wherein we hope to stop parsing @kleinschmidt After this is merged, is there anything else that needs to be done here before we can tag a release? Thanks so much for your help and input, everyone! |
Not sure what the nightly failure is about but it appears unrelated |
Can this be merged? What deprecation strategy for the parsing of |
I tried to write the macro in such a way that it should work regardless of whether |
Kinda. Are other packages using the old syntax in tests, or does this only ever appear in user code? |
I think many packages that currently depend on DataFrames (which still defines this syntax) do use formulas in tests. Examples off the top of my head include MixedModels and FixedEffectModels. |
And they're getting the definition from here? So I guess this package should give a dep warn for tilde when it gets called as a macro rather than inside of one? |
This package isn't registered yet, so the definition will come from here, but doesn't yet. The next release of DataFrames will not contain any of the formula code; that will all come from here. So for 0.6 compatibility with the current DataFrames we'd have to do a patch release with a deprecation warning. I guess one way to do it would be to have DataFrames define |
Oh! Thanks for the explanation, didn't realize that. Is the old release DataFrames branch the only place that defined a ~ macro implementation? Would this be ready for other packages to use as a replacement soon? Or instead of DataFrames, maybe Base could keep ~ parsing as a macro but always throwing a depwarn? Looks like this may not change in Base for 0.6, but if packages can be made ready then maybe we hard change the parsing during 1.0-dev and packages might not notice if they've transitioned to this? |
Should still be on DataFrames master as well until this is registered, but otherwise yes.
I think so, but @kleinschmidt may know better than I would.
👍 Sounds like the best course of action. |
Fixes #3
I've gone with
=>
here since we need something that parses as an infix operator. Once~
stops being parsed as an infix macro, it's not clear to me whether it will still be parsed as an infix operator at all. Thus I've opted for something we can guarantee is infix.