[RFC] A self-contained Feldera Python package similar to Spark or chDB

# Summary
[summary]: #summary

Currently, the only ways to evaluate Feldera are to build the binary from source and run it, or to run it in Docker. Not everyone is comfortable with Docker, and many enterprise laptops do not even allow installing Docker Desktop.

But most people have Python. If `pip install pyfeldera` did the equivalent of `start_manager.sh`, it would give people a one-liner to spin up and try the product. It could also support a batch-processing style of use: spin down the server, store the state in distributed storage, and pick back up the next day, doing only a little IVM work each time.

# Motivation
[motivation]: #motivation

Many engines offer this kind of self-contained, single-node Python install, for example:

- https://pypi.org/project/pyspark/
- https://pypi.org/project/chdb/
- https://pypi.org/project/duckdb/
- https://pypi.org/project/pysail/
- https://pypi.org/project/daft/

The chDB blog post describes the embedded approach: https://clickhouse.com/blog/chdb-embedded-clickhouse-rocket-engine-on-a-bicycle

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

In short, a Feldera Python package (say `pyfeldera`) could ship similarly to Spark's Python package:

https://github.com/apache/spark/blob/master/python/packaging/client/setup.py

The package takes pre-packaged binaries built with OS-appropriate dependencies (build and feature flags, etc.), detects the Python version, OS, and so on, stashes the binaries in a well-known location or temp directory on the OS, and starts the Feldera pipeline manager with or without the UI.
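The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `bin/<platform>/` layout inside the package, the executable name `pipeline-manager`, and the helper names are all hypothetical, and the real command-line flags are deliberately left to the caller rather than guessed.

```python
import platform
import shutil
import stat
from pathlib import Path

# Hypothetical name of the bundled executable; the RFC does not fix one.
BINARY_NAME = "pipeline-manager"


def platform_tag(system=None, machine=None):
    """Map the host OS and CPU to a directory name such as 'linux-x86_64'."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    # Normalize the aliases that different operating systems report.
    machine = {"amd64": "x86_64", "arm64": "aarch64"}.get(machine, machine)
    return f"{system}-{machine}"


def stash_binary(package_dir, cache_dir, tag=None):
    """Copy the binary for this platform into cache_dir and mark it executable."""
    tag = tag or platform_tag()
    name = BINARY_NAME + (".exe" if tag.startswith("windows") else "")
    # Assumed layout: <package_dir>/bin/<platform tag>/<binary name>
    source = Path(package_dir) / "bin" / tag / name
    if not source.is_file():
        raise RuntimeError(f"no bundled binary for platform {tag!r}")
    target = Path(cache_dir) / name
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)
    target.chmod(target.stat().st_mode | stat.S_IXUSR)
    return target


def launch_command(binary, extra_args=()):
    """Build the argv for the manager; real flags are left to the caller."""
    return [str(binary), *extra_args]
```

A wrapper could then pass the result of `launch_command(...)` to `subprocess.Popen` and keep the process handle around so the server can be stopped cleanly.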

The REST API can then be used via the Python client or, say, via dbt.
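Before a client talks to the REST API, the wrapper would probably need to wait until the freshly started server accepts requests. A minimal sketch using only the standard library is below; the function name is hypothetical, and the URL to poll (host, port, and path) is an assumption the caller must supply, since the RFC does not specify one.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, timeout=30.0, interval=0.5):
    """Poll url until it answers with an HTTP 2xx status or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2.0) as response:
                if 200 <= response.status < 300:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server is not accepting connections yet; retry.
        time.sleep(interval)
    return False
```

Returning `False` instead of raising lets the caller decide whether to retry, print the server log, or give up.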

# Drawbacks
[drawbacks]: #drawbacks

Extra work and maintenance overhead from installation bugs, since many people in the wild probably run Python on unusual setups such as ARM64 or Rosetta.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

- Why is this design the best in the space of possible designs?
Other projects do it, and they gain new customers from it because installation is easy.

- What other designs have been considered and what is the rationale for not choosing them?
Using Docker, but Docker has pain points, such as host mounting, that not everyone understands.

Also, you can't run Docker in many cloud VMs, for example: https://learn.microsoft.com/en-us/fabric/data-engineering/using-python-experience-on-notebook

But you can `pip install whatever-you-want` in that environment.

- What is the impact of not doing this?
Fewer people will try out Feldera.

# Prior art
[prior-art]: #prior-art

See the links above; there are many reference implementations.

# Unresolved questions
[unresolved-questions]: #unresolved-questions

Probably the support matrix: this needs robust CI across the most-used operating systems. GitHub Actions offers a matrix strategy for this:

https://docs.github.com/en/actions/how-tos/write-workflows/choose-what-workflows-do/run-job-variations
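To make the size of such a matrix concrete, here is a small sketch that enumerates the combinations a CI job would have to cover. Every value in it (the operating systems, architectures, Python versions, and the excluded pair) is illustrative only; the actual support matrix is exactly what remains unresolved.

```python
from itertools import product

# Illustrative values only; the RFC leaves the actual support matrix open.
PLATFORMS = ["linux", "macos", "windows"]
ARCHS = ["x86_64", "arm64"]
PYTHONS = ["3.10", "3.11", "3.12"]
# Hypothetical exclusion, e.g. a target with no CI runner available.
EXCLUDE = {("windows", "arm64")}


def build_matrix():
    """Return one dict per (os, arch, python) combination that is not excluded."""
    return [
        {"os": os_name, "arch": arch, "python": py}
        for os_name, arch, py in product(PLATFORMS, ARCHS, PYTHONS)
        if (os_name, arch) not in EXCLUDE
    ]
```

Even this small example yields 15 jobs, which gives a sense of the CI cost behind the drawback noted earlier.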

# Future possibilities
[future-possibilities]: #future-possibilities

- Integrate `pyfeldera` into benchmarks like this: https://github.com/microsoft/LakeBench
- Promote a one-line install experience on the website (`pip install pyfeldera`) and ship a quickstart right in the codebase.

