This repository was archived by the owner on Apr 1, 2026. It is now read-only.

feat: add support for pandas series & data frames as inputs for ml models. #1088

Merged
sycai merged 15 commits into main from b362723869
Oct 23, 2024

Conversation

@sycai
Contributor

@sycai sycai commented Oct 15, 2024

No description provided.

@sycai sycai requested review from a team and chelsea-lin October 15, 2024 22:06
@product-auto-label product-auto-label bot added the size: l Pull request size is large. label Oct 15, 2024
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. label Oct 15, 2024
Comment thread bigframes/ml/utils.py


- def convert_to_dataframe(*input: ArrayType) -> Generator[bpd.DataFrame, None, None]:
+ def convert_to_dataframe(
Contributor

Do you think we can merge the logic in this file into the core/convert module logic? Ideally we don't have pandas->bigframes logic in two places.

Contributor Author

Good point! I don't think it's very straightforward, though, mainly because this function returns a Generator while the one in the core package returns a single value. We will need some extra effort to make everything consistent (function names, parameter types, return types, etc.).

Considering that this PR is already non-trivial, I think we can focus only on delivering the feature for now. I will migrate the conversion logic in another PR. Does that sound good to you?

Contributor Author

b/373716095 for reference

Contributor

We can split up the work, of course, as long as each step stands alone as an improvement. I'm not sure how much the generator aspect matters, but I think unifying the two approaches will be a good exercise and will result in some improvements to both.
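One way the generator-returning and single-value converters discussed above could share logic is sketched below. This is only an illustration under stated assumptions: `LocalFrame`, `RemoteFrame`, `to_remote_frame`, and `convert_all` are hypothetical stand-ins invented for this sketch, not the actual bigframes or pandas APIs, and the "upload" step is simulated by copying a dict.

```python
from dataclasses import dataclass
from typing import Generator, Union


@dataclass
class LocalFrame:
    """Hypothetical stand-in for a pandas DataFrame."""
    columns: dict


@dataclass
class RemoteFrame:
    """Hypothetical stand-in for a bigframes DataFrame."""
    columns: dict
    session: str = "default"


FrameLike = Union[LocalFrame, RemoteFrame]


def to_remote_frame(frame: FrameLike, session: str = "default") -> RemoteFrame:
    """Single-value converter: one input in, one remote frame out."""
    if isinstance(frame, RemoteFrame):
        # Already remote: pass through unchanged.
        return frame
    if isinstance(frame, LocalFrame):
        # Simulated upload of local data into the given session.
        return RemoteFrame(columns=dict(frame.columns), session=session)
    raise TypeError(f"Unsupported input type: {type(frame).__name__}")


def convert_all(
    *frames: FrameLike, session: str = "default"
) -> Generator[RemoteFrame, None, None]:
    """Generator wrapper that delegates each input to the single-value converter."""
    for frame in frames:
        yield to_remote_frame(frame, session=session)


# The generator can be unpacked directly when the number of inputs is known.
X, y = convert_all(LocalFrame({"a": [1, 2]}), RemoteFrame({"b": [3, 4]}, session="s1"))
```

With this shape, the generator is a thin layer over the single-value function, so the conversion rules live in one place.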

Comment thread bigframes/ml/pipeline.py Outdated
Comment on lines +105 to +106
X: Union[bpd.DataFrame, bpd.Series, pd.DataFrame, pd.Series],
y: Optional[Union[bpd.DataFrame, bpd.Series, pd.DataFrame, pd.Series]] = None,
Contributor

Do you think we can define a single annotation representing this set of types and use it everywhere? This will make it easier to accommodate additional types in the future.

Contributor Author

My main concern is that type aliases may not be expanded/resolved when generating docs, and thus confuse our end users.

Contributor

Contributor Author

Updated all signatures to use a type alias, because Sphinx is able to resolve them.
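A minimal sketch of the single-alias idea from this thread follows. The four classes and the alias name `FrameOrSeries` are hypothetical placeholders for the bigframes and pandas types; the names used in the actual PR may differ.

```python
from typing import Optional, Union, get_args


class BfDataFrame:
    """Hypothetical stand-in for bigframes.dataframe.DataFrame."""


class BfSeries:
    """Hypothetical stand-in for bigframes.series.Series."""


class PdDataFrame:
    """Hypothetical stand-in for pandas.DataFrame."""


class PdSeries:
    """Hypothetical stand-in for pandas.Series."""


# One alias for the whole set of accepted input types (name is an assumption).
FrameOrSeries = Union[BfDataFrame, BfSeries, PdDataFrame, PdSeries]


def fit(X: FrameOrSeries, y: Optional[FrameOrSeries] = None) -> str:
    """Toy method: validates X against the alias and reports its type name."""
    if not isinstance(X, get_args(FrameOrSeries)):
        raise TypeError(f"Unsupported input type: {type(X).__name__}")
    return type(X).__name__
```

Supporting another input type later would then mean editing the alias once rather than every signature.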

Comment thread bigframes/ml/utils.py Outdated


- def _convert_to_dataframe(frame: ArrayType) -> bpd.DataFrame:
+ def _convert_to_dataframe(frame: InputArrayType) -> bpd.DataFrame:
Contributor

Should we also handle array-like data such as numpy arrays, or even plain Python lists/tuples?

Contributor Author

Yeah, the effort should be trivial once we consolidate the conversion functions from the ml package into the core package. I can handle this in another CL, as proposed above. Let me know your thoughts.

@sycai sycai requested a review from TrevorBergeron October 15, 2024 22:48
Comment thread bigframes/ml/utils.py Outdated
        return frame
    if isinstance(frame, pd.DataFrame):
        # Recursively call this method to re-use the length-checking logic
        return _convert_to_series(bpd.read_pandas(frame))
Contributor

We might not always want the default session, for example if the other argument is a bigframes object with a non-default session. The core version uses the session from the co-argument.

Contributor Author

Good point! I decided to use the session provided by the bqml_model whenever possible. The global session acts as the default.
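The fallback described in this reply can be sketched roughly as follows. `Session`, `get_global_session`, and `pick_session` are hypothetical names made up for illustration; they do not reflect how bigframes actually represents or looks up sessions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Session:
    """Hypothetical stand-in for a bigframes session."""
    name: str


_GLOBAL_SESSION = Session("global")


def get_global_session() -> Session:
    """Returns the process-wide default session (assumed to always exist)."""
    return _GLOBAL_SESSION


def pick_session(model_session: Optional[Session]) -> Session:
    """Prefers the model's session; falls back to the global default."""
    if model_session is not None:
        return model_session
    return get_global_session()
```

Local pandas inputs would then be loaded through whichever session `pick_session` returns, so they end up in the same session as the model when one is available.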

@sycai sycai requested a review from TrevorBergeron October 16, 2024 21:36
Comment thread bigframes/ml/llm.py

Args:
- X (bigframes.dataframe.DataFrame or bigframes.series.Series):
+ X (bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series):
Contributor

I worry that fully enumerating the accepted types will become too much once we extend the list further.

Contributor Author

I agree with you. The Google style prefers type hints over types spelled out in docstrings, and that makes more sense. Here I'm just keeping the existing style consistent.

@sycai sycai enabled auto-merge (squash) October 23, 2024 17:30
@sycai sycai merged commit 30c8883 into main Oct 23, 2024
@sycai sycai deleted the b362723869 branch October 23, 2024 18:25
Labels

api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.

3 participants