feat: add support for pandas series & data frames as inputs for ml models. #1088
feat: add support for pandas series & data frames as inputs for ml models. #1088
Conversation
|
|
||
|
|
||
| def convert_to_dataframe(*input: ArrayType) -> Generator[bpd.DataFrame, None, None]: | ||
| def convert_to_dataframe( |
There was a problem hiding this comment.
Do you think we can merge the logic in this file into the core/convert module logic? Ideally we don't have pandas->bigframes logic in two places.
There was a problem hiding this comment.
Good point! Though I think it's not very straightforward to do so. This is mainly because this function returns a Generator, while the one in the core package returns a single value. We will need some extra effort to make everything consistent (function names, parameter types, return types, etc).
Considering that this RP is already not trivial, I think we can for now only focus on the feature delivery. I will migrate the conversion logic in another PR. Does it sound good to you?
There was a problem hiding this comment.
b/373716095 for reference
There was a problem hiding this comment.
Can split up the work of course, as long as each step stands alone as an improvement. Not sure how much the generator aspect matters - but unifying the two approaches I think will be a good exercise and result in some improvements to both.
| X: Union[bpd.DataFrame, bpd.Series, pd.DataFrame, pd.Series], | ||
| y: Optional[Union[bpd.DataFrame, bpd.Series, pd.DataFrame, pd.Series]] = None, |
There was a problem hiding this comment.
Do you think we can define a single annotation representing this set of types and use it everywhere. This will make it easier to accomodate additional types in the future
There was a problem hiding this comment.
My main concern is that type aliases may not be expanded/resolved when generating docs, and thus confuse our end users.
There was a problem hiding this comment.
It seems that sklearn doesn't try to define a set of types in its docs: https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit ?
There was a problem hiding this comment.
Updated all to use type alias because Sphinx is able to resolve them
|
|
||
|
|
||
| def _convert_to_dataframe(frame: ArrayType) -> bpd.DataFrame: | ||
| def _convert_to_dataframe(frame: InputArrayType) -> bpd.DataFrame: |
There was a problem hiding this comment.
Should we also handle array-like data like numpy arrays or even plain python list/tuples?
There was a problem hiding this comment.
Yeah, the effort should be trivial if we are to consolidate the conversion functions from the ml package to the core package. I can handle this in another CL as what is proposed above. Let me know your thoughts.
| return frame | ||
| if isinstance(frame, pd.DataFrame): | ||
| # Recursively call this method to re-use the length-checking logic | ||
| return _convert_to_series(bpd.read_pandas(frame)) |
There was a problem hiding this comment.
we might not always want the default session, if the other argument is a bigframes object with a non-default session. the core version uses the session from the co-argument
There was a problem hiding this comment.
Good point! I decided to use the sessions provided from the bqml_model whenever possible. The global session acts as a default.
|
|
||
| Args: | ||
| X (bigframes.dataframe.DataFrame or bigframes.series.Series): | ||
| X (bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series): |
There was a problem hiding this comment.
I worry that fully enumerating the accepted types will be too much once we further extend
There was a problem hiding this comment.
I agree with you. The Google style prefers type hints over type documents, and it makes more sense. Here I'm just keeping the style consistent.
No description provided.