Summary
Currently, the only ways to evaluate Feldera are building the binary from source and running it, or running it in Docker. Not everyone is comfortable with Docker, and many enterprise laptops do not even allow installing Docker Desktop.
But most people have Python. If pip install pyfeldera did the equivalent of start_manager.sh, it would give people a one-liner to spin up and try the product. It could also support a batch-processing style of use: spin down the server, store the state in distributed storage, and pick back up the next day, doing only a small amount of IVM work each time.
Motivation
Many engines offer a single-node, self-contained Python install like this:
https://pypi.org/project/pyspark/
https://pypi.org/project/chdb/
https://pypi.org/project/duckdb/
https://pypi.org/project/pysail/
https://pypi.org/project/daft/
https://clickhouse.com/blog/chdb-embedded-clickhouse-rocket-engine-on-a-bicycle
And so on.
Reference-level explanation
In short, a Feldera Python package (say, pyfeldera) could ship like this:
https://github.com/apache/spark/blob/master/python/packaging/client/setup.py
The package takes pre-built binaries compiled with OS-appropriate dependencies (build and feature flags, etc.), detects the Python version and OS, stashes the binaries in a well-known location or temp dir on the OS, and starts the Feldera pipeline manager with or without the UI.
The REST API can then be used via the Python client or, say, via dbt.
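The steps above (detect the platform, locate the stashed binary, start the manager) could look roughly like the sketch below. This is only an illustration under stated assumptions: the binary name `pipeline-manager`, the cache directory `~/.cache/pyfeldera`, and the function names are all hypothetical and not part of any existing Feldera package.

```python
import platform
import subprocess
from pathlib import Path

# Hypothetical names: neither the binary name nor the cache location
# is defined by Feldera today.
BINARY_NAME = "pipeline-manager"
CACHE_DIR = Path.home() / ".cache" / "pyfeldera"

# Normalize the architecture names reported by different OSes.
_ARCH_ALIASES = {
    "amd64": "x86_64",
    "x86_64": "x86_64",
    "arm64": "aarch64",
    "aarch64": "aarch64",
}


def platform_tag(system=None, machine=None):
    """Return a normalized '<os>-<arch>' tag used to pick a pre-built binary."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    arch = _ARCH_ALIASES.get(machine)
    if system not in ("linux", "darwin", "windows") or arch is None:
        raise RuntimeError(f"unsupported platform: {system}/{machine}")
    return f"{system}-{arch}"


def binary_path(cache_dir=CACHE_DIR, tag=None):
    """Well-known location where the binary for this platform is stashed."""
    tag = tag or platform_tag()
    suffix = ".exe" if tag.startswith("windows") else ""
    return Path(cache_dir) / tag / (BINARY_NAME + suffix)


def start_manager(extra_args=(), cache_dir=CACHE_DIR):
    """Launch the stashed binary as a child process and return its handle."""
    exe = binary_path(cache_dir)
    if not exe.is_file():
        raise FileNotFoundError(f"no packaged binary at {exe}")
    return subprocess.Popen([str(exe), *extra_args])
```

A caller would invoke `start_manager()` once, talk to the REST API while the process is alive, and call `.terminate()` on the returned handle to spin the server down. Keying the cache directory on the platform tag is one way to keep binaries for different architectures from colliding on a shared home directory.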
Drawbacks
Extra work and maintenance overhead for installation bugs, since many people in the wild probably run Python on less common setups such as ARM64 or Rosetta.
Rationale and alternatives
- Why is this design the best in the space of possible designs?
Other engines do it, and they gain new customers from the ease of install.
- What other designs have been considered and what is the rationale for not choosing them?
Use Docker, but there are pain points with host mounting etc. in Docker that not everyone understands.
Also, you can't run Docker in many cloud VMs, for example: https://learn.microsoft.com/en-us/fabric/data-engineering/using-python-experience-on-notebook
But you can pip install whatever you want in the above.
- What is the impact of not doing this?
Fewer people would try out Feldera.
Prior art
See my links above; there are many reference implementations.
Unresolved questions
Probably the support matrix: this needs robust CI across the most-used OSes etc. GitHub Actions offers this:
https://docs.github.com/en/actions/how-tos/write-workflows/choose-what-workflows-do/run-job-variations
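To make the size of such a matrix concrete, the sketch below enumerates OS, architecture, and Python version combinations the way a CI matrix would expand them. The specific values and the excluded combination are assumptions chosen for illustration, not a proposed support policy.

```python
from itertools import product

# Illustrative values only; the real support matrix is an open question.
OSES = ["linux", "darwin", "windows"]
ARCHES = ["x86_64", "aarch64"]
PYTHONS = ["3.10", "3.11", "3.12"]

# (os, arch) pairs assumed to be skipped, purely as an example.
EXCLUDE = {("windows", "aarch64")}


def support_matrix():
    """Expand the cartesian product of the axes, minus excluded pairs."""
    return [
        {"os": os_name, "arch": arch, "python": py}
        for os_name, arch, py in product(OSES, ARCHES, PYTHONS)
        if (os_name, arch) not in EXCLUDE
    ]
```

Each returned entry corresponds to one wheel to build and test, so the list length is a quick estimate of the CI cost of adding another OS or Python version.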
Future possibilities
- Integrate pyfeldera into benchmarks like this: https://github.com/microsoft/LakeBench
- Promote a one-line install experience on your website: pip install pyfeldera, and also ship a quickstart right in the codebase