-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve documentation on TreeNode
#10035
Conversation
FYI @peter-toth |
datafusion/common/src/tree_node.rs
Outdated
/// Defines a visitable and rewriteable tree node. This trait is implemented | ||
/// for plans ([`ExecutionPlan`] and [`LogicalPlan`]) as well as expression | ||
/// trees ([`PhysicalExpr`], [`Expr`]) in DataFusion. | ||
/// API for visiting (read only) and rewriting tree data structures . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just tried to explain how this API works better.
/// * `map_*`: applies a transformation to rewrite owned nodes | ||
/// * `apply_*`: invokes a function on borrowed nodes | ||
/// * `transform_`: applies a transformation to rewrite owned nodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im probably missing it but does it make sense to add more details to distinguish map_*
and transform_
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't understand the difference between map
and transform
here. Is the former creating a new node and the latter mutating an existing node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a shot at consolidating the terminology. It's probably wrong, but hopefully it will help the conversation. 🙈 (edited)
API | action | traversal | ref | actor |
---|---|---|---|---|
apply | traverse, no rewrite | top-down, pre-order | &self | FnMut |
transform | traverse & rewrite | suffix dependent | self | suffix dependent |
- | - | *_up suffix = bottom-up, post-order | - | - |
- | - | *_down suffix = top-down, pre-order | - | - |
- | - | *_up_down suffix = does both, see docs | - | - |
- | - | - | - | Fn |
- | - | - | - | *_mut suffix = FnMut |
visit | visit, no rewrite | depth-first † | &self | dyn TreeNodeVisitor |
rewrite | visit & rewrite | depth-first † | &self | dyn TreeNodeRewrite |
apply_children | inspect children, no rewrite | not traversal per se‡ | &self | Fn |
map_children | inspect children & rewrite in place | not traversal per se‡ | self | FnMut |
† The specific implementations of the visitors (TreeNodeVisitor or TreeNodeRewrite) can impact the traversal patterns.
‡ I think the implementations decide how to handle any travsersal (a.k.a. does the mapping of children inspect it's own children). But I'm not really clear on this one.
Caveat (edited): I didn't attempt to generalize the map_*
terminology (vs transform) as there are several other use cases across intersecting code paths (and traits), which I haven't dug into yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above table is not quite right and misses a few things. But, let me comment on this ticket tomorrow when I'm back home and have better github access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous efforts (especially #8891 but others too: #9876, #9913, #9965, #9999) consolidated the behaviour of existing APIs (consistent TreeNodeRecursion
behaviour and common Transformed
return types of transforming APIs) but didn't try to normalize naming of the APIs in favour of causing the least incompatibility with previous DataFusion versions. So this means that the current API names evolved organically and thus they might not be intuitive in all cases and are a bit inconsistent here and there...
Basically, there are 2 kinds of TreeNode
APIs:
- "Inspecting" APIs to taverse over a tree and inspect
&TreeNode
s. - "Transforming" APIs to traverse over a tree while consuming
TreeNode
s and produce possibly changedTreeNode
s
(i.e. there is no in-place mutation API, that uses&mut TreeNode
, but we are improving the consume and produce APIs to cleate less new objects unnecessarly e.g.: AvoidLogicalPlan::clone()
inLogicalPlan::map_children
when possible #9999)
There are only 2 inspecting TreeNode
APIs:
TreeNode::apply()
:
Top-down (pre-order) traverses the nodes and applies the povidedf
closure on the nodes.- As
TreeNode
trees are hierarchical collections of nodes, something likeforeach()
would better describe the operation. - Because the API name doesn't explicitely defines the traversal direction IMO its docs shouldn't say anything about its implementation just the fact that it should be used for order agnostic inspection of nodes.
- Its transforming counterpart should be
TreeNode::transform()
. ButTreeNode::transform()
is an alias forTreeNode::transform_up()
so it does a bottom-up (post-order) traversal.
- As
TreeNode::visit()
:
Requires aTreeNodeVisitor
that incorporates 2 methods to do a combined traversal.TreeNodeVisitor::f_down()
is called in top-down order,TreeNodeVisitor::f_up()
is called in bottom-up order on the nodes in one combined traversal.- As there is no
apply_up()
exists,visit()
can be used to do a bottom-up inspecting traversal (TreeNodeVisitor::f_down()
can remain its default implementation, that does nothing). - Its transforming counterpart is
TreeNode::rewrite()
.
- As there is no
There are 7 transforming TreeNode
APIs:
TreeNode::transform()
:
An alias forTreeNode::transform_up()
so it does bottom-up traversal on the nodes. Consumes them with the providedf
closure and produces a possibly transformed node.- Should be the transforming counterpart of
apply()
but this runs bottom-up for some reason... - IMO its docs shouldn't mention traversal direction and probably should be used for order agnostic transformation of nodes.
- Should be the transforming counterpart of
TreeNode::transform_down()
andTreeNode::transform_down_mut()
:
These 2 are top-down traverse the nodes, consume them with the providedf
closure and produce a possibly transformed node.- I have no idea why we have the
..._mut()
version, most likely for just convenience, but IMO onetransform_down()
withFnMut
should be fair enough.
- I have no idea why we have the
TreeNode::transform_up()
andTreeNode::transform_up_mut()
:
These 2 are the bottom-up versions of the above.- I have no idea why we have the
..._mut()
version. - Don't have an inspecting counterpart
- I have no idea why we have the
TreeNode::transform_down_up()
:
This API accepts 2 closuresf_down
andf_up
.f_down
is called in top-down order andf_up
is called in bottom-up order on the nodes in one combined traversal.- Doesn't have an inspecting counterpart
TreeNode::rewrite()
:
Requires aTreeNodeRewriter
that incorporates 2 methods to do a combined traversal.TreeNodeRewriter::f_down()
is called in top-down order,TreeNodeRewriter::f_up()
is called in bottom-up order on the nodes in one combined traversal.- This API is the transforming counterpart of
TreeNode::visit()
- If there is no mutable shared data between
TreeNodeRewriter::f_down()
andTreeNodeRewriter::f_up()
then it is easier to useTreeNode::transform_down_up()
that acceptsf_down
andf_up
closures directly and no need to define aTreeNodeRewriter
instance.
- This API is the transforming counterpart of
Additional TreeNode
APIs:
TreeNode::apply_children()
andTreeNode::map_children()
:
Building blocks for the above 2 + 7TreeNode
APIs.apply_children()
inspects,map_children()
transforms the children of aTreeNode
node with the providedf
closure.TreeNode
implementations (e.g.LogicalPlan
,Expr
,Arc<dyn ExecutionPlan>
,Arc<dyn PhysicalExpr>
, ...) have to implementapply_children()
for inspecting andmap_children()
for transforming APIs.- These 2 methods are public, but they should almost never be called explicitely by the
TreeNode
API users. If, for some reason, the above 2 + 7 APIs don't cover the recursion needed then these 2 methods can be used to define a custom algorithms. - The
apply_children()
name is aligned withapply()
, but have no idea why we call the othermap_children()
,transform_children()
would probably be a better fit...
Besides the above general TreeNode
APIs we have some LogicalPlan
APIs:
LogicalPlan::..._with_subqueries()
:
All the above 2 + 7TreeNode
APIs have aLogicalPlan::..._with_subqueries()
version that do the same as the originalTreeNode
API but also recurse into subqueries (same type innerLogicalPlan
trees that are defined in the expressions of the originalLogicalPlan
tree nodes).- We could generalize this concept into
TreeNode
butLogicalPlan
seems to be onlyTreeNode
type that makes use of these APIs.
- We could generalize this concept into
Additional LogicalPlan
APIs:
LogicalPlan::apply_expressions()
,LogicalPlan::map_expressions()
,LogicalPlan::apply_subqueries()
andLogicalPlan::map_subqueries()
:
Building blocks forLogicalPlan::..._with_subqueries()
APIs.apply_expressions()
andapply_subqueries()
inspect,map_expressions()
andmap_subqueries()
transform the expressions/subqueries of aLogicalPlan
node with the providedf
closure.- These 4 are similar to
TreeNode::apply_children()
andTreeNode::map_children()
and so they are rarely called directly but can be used to define a custom algorithms.
- These 4 are similar to
Here is the summary of the current situation:
TreeNode
APIs:
traversal order | inspecting | transforming |
---|---|---|
order agnostic | transform() |
|
top-down | apply() |
transform_down() , transform_down_mut() |
bottom-up | transform_up() , transform_up_mut() |
|
combined with separete f_down and f_up closures |
transform_down_up() |
|
combined with incorporated f_down() and f_up() methods into an object |
visit() + TreeNodeVisitor |
rewrite() + TreeNodeRewriter |
Additional TreeNode
APIs: apply_children()
, map_children()
.
LogicalPlan
APIs:
traversal order | inspecting | transforming |
---|---|---|
order agnostic | transform_with_subqueries() |
|
top-down | apply_with_subqueries() |
transform_down_with_subqueries() , transform_down_mut_with_subqueries() |
bottom-up | transform_up_with_subqueries() , transform_up_mut_with_subqueries() |
|
combined with separete f_down and f_up closures |
transform_down_up_with_subqueries() |
|
combined with incorporated f_down() and f_up() methods into an object |
visit_with_subqueries() + TreeNodeVisitor |
rewrite_with_subqueries() + TreeNodeRewriter |
Additional LogicalPlan
APIs: apply_expressions()
, map_expressions()
, apply_subqueries()
, map_subqueries()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, if we wanted to consolidate the TreeNode
API terminonlogy I would suggest the following:
traversal order | inspecting | transforming |
---|---|---|
order agnostic | for_each() (new API, an alias to for_each_up() , but its docs shouldn't specify any order) |
transform() (remains an alias to transform_up() but its docs shouldn't specify any order) |
top-down | for_each_down() (new API), apply() (deprecated, from now just an alias to for_each_down() ) |
transform_down() (changes to f: &mut F where F: FnMut(Self) -> Result<Transformed<Self>> , which is a breaking change but requires a trivial fix from the users), transform_down_mut() (deprecated, from now just an alias to transform_down() ) |
bottom-up | for_each_up() (new API) |
transform_up() (changes to f: &mut F where F: FnMut(Self) -> Result<Transformed<Self>> , which is a breaking change but requires a trivial fix from the users), transform_up_mut() (deprecated, from now just an alias to transform_up() ) |
combined with separete f_down and f_up closures |
transform_down_up() |
|
combined with incorporated f_down() and f_up() methods into an object |
visit() + TreeNodeVisitor |
rewrite() + TreeNodeRewriter |
+ Maybe rename apply_children()
to for_each_children()
and map_children()
to transform_children()
. Although renaming these would be a breaking change if something that is built on DataFusion defines a new kind of TreeNode
, but fixing them would also be trivial.
+ LogicalPlan::..._with_subqueries()
APIs should be alligned with the above TreeNode
ones, but as they are fairly new (#9913) they haven't landed in any release yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this information Peter. Amazing and mind-blowing
I think there are two threads to pursue here. First what currently exists and second what we can do to improve the situation.
I will attempt to reformat this information about how things currently work into this pr. Then I will see if there is anything we can pull out into follow on tasks (eg remove the _mut variants)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are two threads to pursue here. First what currently exists and second what we can do to improve the situation.
I will attempt to reformat this information about how things currently work into this pr. Then I will see if there is anything we can pull out into follow on tasks (eg remove the _mut variants)
Yes, I totally agree with you. It is more important to document what we already have. I just wanted to give you some details about the remaining inconsistency in API naming. If we pursue to fix it, I'm happy to open a new PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, update here:
- I incorporated the information (and table!) from @wiedld and @peter-toth in this PR's documentation
- Added the suggestion for consolidated terminology to Epic: Unified TreeNode rewrite API #8913 (comment)
- Filed Deprecate TreeNode
transform_xx_mut
methods #10097 to clean up the transform APIs
This PR is ready for another look if anyone has the time and inclination to read more docs 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, I have only minor comments.
Can a committer possibly review and approve this PR? I think it is ready to go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Co-authored-by: Jeffrey Vo <[email protected]>
Thank you for the review @Jefffrey 🙏 |
🚀 📖 |
* Improve documentation on `TreeNode` * Document inspecting, rewriting APIs. Add chart * tweak * Add references to PlanContext and ExprContext * refine TreeNodeRewriter docs * Add note about exists * Update datafusion/common/src/tree_node.rs Co-authored-by: Jeffrey Vo <[email protected]> --------- Co-authored-by: Jeffrey Vo <[email protected]>
Which issue does this PR close?
Part of #8913 and part of #7013 and part of #10121
Rationale for this change
As I was working on various rewrites using the TreeNode APIs (which are awesome) I sometimes get lost in all the similarly named functions (
map
,apply
,visit
,rewrite
, etc) and what the difference is. Now that I have that all loaded into my brain, I wanted to write it down for my future self (and others hopefully too).What changes are included in this PR?
Improve documentation on
TreeNode
,TreeNodeVisitor
, andTreeNodeRewriter
Are these changes tested?
Doc tests
Are there any user-facing changes?
Better docs, no functional changes