
[RFC] A self-contained Feldera Python package similar to Spark or chDB #6048

@mdrakiburrahman


Summary

Currently, the only ways to evaluate Feldera are to build the binary from source and run it, or to run it in Docker. Not everyone is comfortable with Docker, and many enterprise laptops do not even allow installing Docker Desktop.

But most people have Python. If pip install pyfeldera did the equivalent of start_manager.sh, it would give people a one-liner to spin up and try the product. It would also allow a batch-processing style of use: spin down the server, store the state in distributed storage, and pick back up the next day, doing only a little IVM work each time.

Motivation

Many engines offer this kind of self-contained, single-node Python install:

https://pypi.org/project/pyspark/
https://pypi.org/project/chdb/
https://pypi.org/project/duckdb/
https://pypi.org/project/pysail/
https://pypi.org/project/daft/

https://clickhouse.com/blog/chdb-embedded-clickhouse-rocket-engine-on-a-bicycle

And so on.

Reference-level explanation

In short, a Feldera Python package (say, pyfeldera) could ship like this:

https://github.com/apache/spark/blob/master/python/packaging/client/setup.py

It takes pre-packaged binaries built with OS-friendly dependencies (build/feature flags etc.), figures out the Python version, OS, and so on, stashes the binaries in a well-known location or temp dir on the OS, and starts the Feldera pipeline manager with or without the UI.
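
The steps above could be sketched roughly as follows. This is a minimal illustration, not how Feldera is actually packaged: the target directory names, the bin/ layout, and the pipeline-manager binary name are all assumptions, and no command-line flags are assumed (the caller passes whatever the real binary accepts).

```python
import platform
from pathlib import Path

# Hypothetical mapping from (OS, CPU architecture) to a bundled binary
# directory. The directory names are illustrative, not real Feldera artifacts.
SUPPORTED_TARGETS = {
    ("linux", "x86_64"): "linux-x86_64",
    ("linux", "aarch64"): "linux-aarch64",
    ("darwin", "x86_64"): "macos-x86_64",
    ("darwin", "arm64"): "macos-arm64",
}


def resolve_target(system=None, machine=None):
    """Pick the bundled binary directory for this (or the given) platform."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    # Some platforms report "AMD64" for x86_64.
    machine = {"amd64": "x86_64"}.get(machine, machine)
    try:
        return SUPPORTED_TARGETS[(system, machine)]
    except KeyError:
        raise RuntimeError(f"no bundled binary for {system}/{machine}") from None


def binary_path(package_root, target):
    """Location of the manager binary inside the installed package.

    The "bin/<target>/pipeline-manager" layout is an assumption.
    """
    return Path(package_root) / "bin" / target / "pipeline-manager"


def launch_command(binary, extra_args=()):
    """Build the argv list for subprocess.Popen; flags are left to the caller."""
    return [str(binary), *extra_args]
```

A wrapper would then call something like subprocess.Popen(launch_command(binary_path(root, resolve_target()))) and fail early with a clear error on unsupported platforms, instead of at first use.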

The REST API can then be accessed via the Python client or, say, via dbt.
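
Before any client talks to the REST API, the wrapper would need to know that the locally started manager is accepting connections. A minimal sketch using only the standard library is below; the wait_for_port helper is hypothetical, and the host and port are whatever the manager was started on (8080 in the usage note is an assumption).

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if the port is still unreachable after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not up yet (refused or timed out); wait and retry.
            time.sleep(interval)
    return False
```

For example, wait_for_port("127.0.0.1", 8080) could gate the first client call. A TCP check only shows that the port is open, so a real package would likely follow it with a request to the API itself.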

Drawbacks

Extra work and maintenance overhead from installation bugs, since many people in the wild probably run Python on unusual setups such as ARM64 or Rosetta.

Rationale and alternatives

  • Why is this design the best in the space of possible designs?
    Other projects do it, and they gain new customers from it thanks to the ease of install.

  • What other designs have been considered and what is the rationale for not choosing them?
    Use Docker, but Docker has pain points, such as host mounting, that not everyone understands.

    Also, you can't run Docker in many cloud VMs, for example: https://learn.microsoft.com/en-us/fabric/data-engineering/using-python-experience-on-notebook

    But you can pip install whatever you want in that environment.

  • What is the impact of not doing this?
    Fewer people would try out Feldera.

Prior art

See the links above; there are many reference implementations.

Unresolved questions

Probably the support matrix: this needs robust CI across the most-used OSes. GitHub Actions offers this:

https://docs.github.com/en/actions/how-tos/write-workflows/choose-what-workflows-do/run-job-variations

Future possibilities

  • Integrate pyfeldera into benchmarks like this: https://github.com/microsoft/LakeBench
  • Promote a one-line install experience on your website (pip install pyfeldera), and also ship a quickstart right in the codebase

Metadata

Labels

python: Pull requests that update python code
