Unable to apply parquet dataset from s3 #3216

@mzwiessele

Expected Behavior

We point our file source to a parquet dataset:

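from feast import FileSource
from feast.data_format import ParquetFormat

# Point the source at the dataset directory rather than a single file.
file_source = FileSource(
    name="dummy_file_source",
    path="s3://data/dummy/",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
    file_format=ParquetFormat(),
)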

I'm expecting to be able to use the parquet dataset format the same way I'd use a single file.

Current Behavior

Feast errors at the apply stage:

  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/feast/infra/offline_stores/file_source.py", line 164, in get_table_column_names_and_types
    filesystem.open_input_file(path), filesystem=filesystem
  File "pyarrow/_fs.pyx", line 588, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: Path does not exist 'data/dummy'

Full Traceback

Traceback (most recent call last):
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/bin/feast", line 8, in <module>
    sys.exit(cli())
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/feast/cli.py", line 519, in apply_total_command
    apply_total(repo_config, repo, skip_source_validation)
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/feast/usage.py", line 283, in wrapper
    return func(*args, **kwargs)
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/feast/repo_operations.py", line 335, in apply_total
    apply_total_with_repo_instance(
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/feast/repo_operations.py", line 296, in apply_total_with_repo_instance
    registry_diff, infra_diff, new_infra = store.plan(repo)
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/feast/usage.py", line 294, in wrapper
    raise exc.with_traceback(traceback)
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/feast/usage.py", line 283, in wrapper
    return func(*args, **kwargs)
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/feast/feature_store.py", line 723, in plan
    self._make_inferences(
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/feast/feature_store.py", line 601, in _make_inferences
    update_feature_views_with_inferred_features_and_entities(
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/feast/inference.py", line 179, in update_feature_views_with_inferred_features_and_entities
    _infer_features_and_entities(
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/feast/inference.py", line 217, in _infer_features_and_entities
    table_column_names_and_types = fv.batch_source.get_table_column_names_and_types(
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/Users/mzwiessele/feast_s3_dataset_error/.venv/lib/python3.8/site-packages/feast/infra/offline_stores/file_source.py", line 164, in get_table_column_names_and_types
    filesystem.open_input_file(path), filesystem=filesystem
  File "pyarrow/_fs.pyx", line 588, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: Path does not exist 'data/dummy'

Steps to reproduce

  1. Create the default example with feast init.
  2. Upload the example driver_stats.parquet file to the S3 dataset path: s3://data/dummy/driver_stats.parquet
  3. Change the data source to point to the S3 dataset directory:
    @@ -1,28 +1,28 @@
    driver_stats_source = FileSource(
        name="driver_hourly_stats_source",
    -   path="data/driver_stats.parquet",
    +   path="s3://data/dummy/",
        timestamp_field="event_timestamp",
        created_timestamp_column="created",
    )
  4. Run feast apply

Specifications

  • Version: 0.24.0
  • Platform: macOS
  • Subsystem: Python 3.8

Possible Solution

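PR #3217 fixes this issue.

The error comes from this line in feast/infra/offline_stores/file_source.py (line 164, in get_table_column_names_and_types). It opens the path as a single file, so it fails when the path is a dataset directory:

filesystem.open_input_file(path), filesystem=filesystem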
