feat: Allow join-free alignment of analytic expressions#1168
Conversation
| def replace_child( | ||
| self, new_child: BigFrameNode, validate: bool = False | ||
| ) -> UnaryNode: | ||
| new_self = replace(self, child=new_child) # type: ignore |
There was a problem hiding this comment.
I had to open up the file to figure out this is from dataclasses. Per https://google.github.io/styleguide/pyguide.html#s2.2-imports please import dataclasses, not individual functions / classes from it. dataclasses is not one of the allowed exceptions.
There was a problem hiding this comment.
fixed this import
| self, new_child: BigFrameNode, validate: bool = False | ||
| ) -> UnaryNode: | ||
| new_self = replace(self, child=new_child) # type: ignore | ||
| if validate: |
There was a problem hiding this comment.
I'm curious, in what cases is it necessary to have a validate argument instead of a public validate() method? If it's for method chaining, we could have validate() return self.
There was a problem hiding this comment.
eh, I think I"ll remove that, its not being used, and yeah, could run validations it as a separate invocation
| return ArrayValue(join_node), (l_mapping, r_mapping) | ||
|
|
||
| def try_align_as_projection( | ||
| def try_new_row_join( |
There was a problem hiding this comment.
Maybe just try_row_join since we plan on this being the only one early next year?
| genned_ids.append(attempted_id) | ||
| i = i + 1 | ||
| return genned_ids | ||
| return [ids.ColumnId.unique().name for _ in range(n)] |
There was a problem hiding this comment.
Does this make the column IDs less deterministic than the previous logic?
There was a problem hiding this comment.
Not any less deterministic than we are now. If we want an isomorphism between query structure and output syntax, there is a bit more work that needs to be done to the system, which I think basically amounts to late binding identifiers serially through the tree.
|
|
||
|
|
||
| def try_row_join( | ||
| def try_new_row_join( |
There was a problem hiding this comment.
Ditto, re try_row_join here, but perhaps in a separate PR since I see that it would make it harder to identify the one's that should be replaced with try_legacy_row_join if we did that.
| ) -> Optional[bigframes.core.nodes.BigFrameNode]: | ||
| """Joins the two nodes""" | ||
| if l_node.projection_base != r_node.projection_base: | ||
| return None |
There was a problem hiding this comment.
Since there are return Nones should the be try_join_as_projection?
| return None | ||
| # check join key | ||
| for l_key, r_key in join_keys: | ||
| # Caller is block, so they still work with raw strings rather than ids |
There was a problem hiding this comment.
Should we fix that? e.g. make block use ColumnId?
There was a problem hiding this comment.
Yes, we should do that refactor, but its a bit of an undertaking. For now, block uses string ids, and ArrayValue is responsible for wrapping them up as ColumnIds.
| import bigframes.core.window_spec | ||
| import bigframes.operations.aggregations | ||
|
|
||
| ADDITIVE_NODES = ( |
There was a problem hiding this comment.
Would be helpful to have some comments about what these node types have in common. How would I decide if a node type should be added to this list?
There was a problem hiding this comment.
added comment explaining
|
|
||
|
|
||
| def pull_up_selection( | ||
| node: bigframes.core.nodes.BigFrameNode, rename_vars: bool = False |
There was a problem hiding this comment.
What's the purpose of the rename_vars parameter? A docstring would be helpful.
There was a problem hiding this comment.
added docstring. its intended to make sure that when we combine columns from two sides, we don't get conflicts.
| (bigframes.core.expression.DerefOp(field.id), field.id) | ||
| for field in node.fields | ||
| ) | ||
| assert isinstance(node, (bigframes.core.nodes.SelectionNode, *ADDITIVE_NODES)) |
There was a problem hiding this comment.
Am I remembering correctly that SelectionNode represents a WHERE clause, based on Selection from relational algebra terminology?
Given the high-level that implicit joiner no longer handles these, I'm a little confused why we need to rewrite these? Is it so that they are all consolidated so that the nodes we can combine appear together in the tree? I would appreciate a bit more info.
There was a problem hiding this comment.
Its actually poorly named. SelectionNode just renames/reorders existing columns, without any scalar transforms
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕