Betflow is a modern data and analytics engineering platform designed for real-time sports betting analytics, combining streaming and batch processing capabilities to provide comprehensive insights for sports betting decisions.
Note:
This project does not endorse or encourage gambling in any capacity. Its goal is to learn and apply Data Engineering and Analytics Engineering concepts and to understand the open-source tech stack; the end product is simply data-as-a-product for analyzing transient data at large scale.
The UI and dashboards are not publicly available for two main reasons:
- Publishing them could raise legal concerns for me, since part of the analysis points toward arbitrage.
- The current architecture requires an EC2 instance to keep running and hitting APIs at a high rate, which is expensive since I pay for everything myself.
If you are curious about what the dashboards look like, check out the two links below for video walkthroughs. Recruiters: if you would like to see the UI for hiring purposes, please email me; it takes only about five minutes to get the pipeline running, and I will share the link with you.
- Real-time odds movement tracking and analysis
- Live game statistics processing
- Weather impact correlation
- Historical pattern recognition
- Market efficiency metrics
- Multi-source data integration
- Automated data quality checks
- Custom analytics dashboards
The platform processes both streaming data for immediate insights and historical data for pattern analysis, providing a complete solution for sports betting analytics.
```mermaid
%%{init: {
'theme': 'base',
'themeVariables': {
'fontSize': '16px',
'fontWeight': 'bold',
'primaryTextColor': '#000000',
'primaryColor': '#e1f5fe',
'primaryBorderColor': '#01579b',
'fontFamily': 'arial'
}
}}%%
graph LR
classDef ingestion fill:#e6f3ff,stroke:#3498db
classDef processing fill:#e6ffe6,stroke:#2ecc71
classDef storage fill:#fff0e6,stroke:#e67e22
classDef analytics fill:#ffe6e6,stroke:#e74c3c
classDef monitoring fill:#f0e6ff,stroke:#9b59b6
%% Real-time Pipeline
A[Real-time APIs] -->|Stream| B[Kafka]
B -->|Process| C[Spark Streaming]
B -->|Backup| D[raw-s3/topics]
C -->|Real-time Analytics| E[Apache Druid]
E -->|Live Dashboard| F[Grafana RT]
E -->|Archive| G[hist-s3/druid/segments]
class A,B ingestion
class C processing
class D,G storage
class E analytics
class F monitoring
```
- ESPN API for live game statistics
- The Odds API for real-time betting odds
- OpenWeather API for venue weather conditions
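To make the ingestion concrete, here is a minimal sketch of how one of these sources could be polled and published to Kafka. The endpoint parameters, sport key, topic name, and polling interval are illustrative assumptions, not the project's actual configuration:

```python
import json
import os
import time

import requests
from kafka import KafkaProducer  # kafka-python

# Illustrative only: endpoint, sport key, topic name, and poll interval are assumptions.
ODDS_URL = "https://api.the-odds-api.com/v4/sports/americanfootball_nfl/odds"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    resp = requests.get(
        ODDS_URL,
        params={"apiKey": os.environ["ODDS_API_KEY"], "regions": "us", "markets": "h2h"},
    )
    resp.raise_for_status()
    for event in resp.json():
        # Key by game id so every update for a game lands in the same partition.
        producer.send("nfl.odds", key=event["id"].encode("utf-8"), value=event)
    producer.flush()
    time.sleep(60)
```

Keying messages by game id keeps all updates for a game in one partition, which preserves ordering for downstream processors.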
- KRaft mode for better performance and reliability
- Dedicated topics per sport and data type
- S3 sink connector for data persistence
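The S3 sink follows the standard Kafka Connect pattern. Below is a hedged sketch of registering the Confluent S3 sink connector through the Connect REST API; the topic list, bucket name, and flush size are placeholders rather than the project's real settings:

```python
import json

import requests

# Assumes the Confluent S3 sink connector plugin is installed on the Connect worker.
connector = {
    "name": "raw-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "topics": "nfl.odds,nfl.games,venue.weather",   # placeholder topic names
        "s3.bucket.name": "raw-s3",                     # placeholder bucket
        "s3.region": "us-east-1",
        "flush.size": "1000",                           # records per S3 object
        "topics.dir": "topics",                         # mirrors the raw-s3/topics layout above
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```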
- Structured Streaming for real-time analytics
- Window-based aggregations (3-20 minutes)
- Custom processors for games, odds, and weather
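A minimal Structured Streaming sketch of the odds processor is shown below: it reads the Kafka topic, parses the JSON payload, and computes a tumbling-window aggregate. The schema, topic name, and 5-minute window are simplifications; the real processors use 3-20 minute windows and sink to Druid rather than the console:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("odds-stream").getOrCreate()

# Simplified odds schema for illustration; the real processors carry many more fields.
schema = StructType([
    StructField("game_id", StringType()),
    StructField("bookmaker", StringType()),
    StructField("home_price", DoubleType()),
    StructField("event_time", TimestampType()),
])

odds = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "nfl.odds")
        .load()
        .select(F.from_json(F.col("value").cast("string"), schema).alias("o"))
        .select("o.*"))

# 5-minute tumbling window per game and bookmaker, with a watermark for late data.
agg = (odds.withWatermark("event_time", "10 minutes")
       .groupBy(F.window("event_time", "5 minutes"), "game_id", "bookmaker")
       .agg(F.avg("home_price").alias("avg_home_price"),
            F.max("home_price").alias("max_home_price")))

query = (agg.writeStream.outputMode("update")
         .format("console")   # the real pipeline writes to Druid instead
         .start())
query.awaitTermination()
```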
- Sub-second query performance
- Multi-dimensional analytics support
- Native Kafka integration
- Historical data management
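Druid's native Kafka integration works through a supervisor spec submitted to the Overlord (or the Router). The sketch below is illustrative only; the datasource, topic, dimensions, and granularities are assumptions, not the project's actual spec:

```python
import requests

# Illustrative Kafka ingestion supervisor spec for Druid.
spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "odds_rt",                      # placeholder datasource name
            "timestampSpec": {"column": "event_time", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["game_id", "bookmaker"]},
            "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "minute"},
        },
        "ioConfig": {
            "topic": "nfl.odds.enriched",                 # placeholder topic name
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Router URL assumed; a combined Coordinator/Overlord also accepts this endpoint.
resp = requests.post("http://localhost:8888/druid/indexer/v1/supervisor", json=spec)
resp.raise_for_status()
```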
- Real-time dashboards
- Custom metrics and alerts
- Historical trend analysis
- APIs publish data to respective Kafka topics
- Spark Streaming processes and enriches data
- Druid ingests processed data for analytics
- Grafana visualizes metrics in real-time
- Low-latency data processing (<1 second)
- Fault-tolerant architecture
- Scalable data ingestion
- Historical data persistence
- Real-time analytics and visualization
This pipeline enables comprehensive sports analytics with real-time insights for game analysis, odds movement tracking, and weather impact assessment.
```mermaid
%%{init: {
'theme': 'base',
'themeVariables': {
'fontSize': '16px',
'fontWeight': 'bold',
'primaryTextColor': '#000000',
'primaryColor': '#e1f5fe',
'primaryBorderColor': '#01579b',
'fontFamily': 'arial'
}
}}%%
graph LR
classDef source fill:#e6f3ff,stroke:#3498db
classDef processing fill:#e6ffe6,stroke:#2ecc71
classDef storage fill:#fff0e6,stroke:#e67e22
classDef analytics fill:#ffe6e6,stroke:#e74c3c
classDef quality fill:#f5f5f5,stroke:#7f8c8d
%% Batch Pipeline
A[Historical APIs] -->|Ingest| B[Airflow DAGs]
B -->|Raw Data| C[raw-s3/historical]
C -->|Transform| D[AWS Glue]
D -->|Process| E[Iceberg Tables]
E -->|Load| F[Snowflake Gold]
F -->|Model| G[dbt]
G -->|Visualize| H[Grafana Batch]
E -->|Catalog| I[Glue Catalog]
E -->|Store| J[cur-s3/processed]
class A source
class B,D processing
class C,J storage
class F,G,H analytics
class I quality
```
- Raw data collection from multiple sports APIs
- Storage in S3 raw bucket (raw-s3/historical/)
- Organized by sport type and data category (games, odds)
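A simplified Airflow DAG for this step might look like the sketch below: fetch one day of historical data and land the raw JSON in the sport/category layout of the raw bucket. The API endpoint, bucket name, and key layout are assumptions for illustration:

```python
import json
from datetime import datetime

import boto3
import requests
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False, tags=["betflow"])
def nfl_historical_ingestion():

    @task
    def fetch_games(ds=None):
        # ESPN's public scoreboard endpoint, used here purely for illustration.
        url = "https://site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard"
        resp = requests.get(url, params={"dates": ds.replace("-", "")})
        resp.raise_for_status()
        return resp.json()

    @task
    def land_to_s3(payload: dict, ds=None):
        s3 = boto3.client("s3")
        key = f"historical/nfl/games/{ds}.json"   # sport/category/date layout (assumed)
        s3.put_object(Bucket="raw-s3", Key=key, Body=json.dumps(payload))

    land_to_s3(fetch_games())


nfl_historical_ingestion()
```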
- AWS Glue for ETL processing
- Transformation from raw JSON to structured formats
- Implementation of Iceberg tables for data management
- Storage in processed bucket (cur-s3/processed/)
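The Glue transformation can be sketched as a PySpark job that reads the raw JSON, flattens it into a typed schema, and writes an Iceberg table registered in the Glue Catalog. The catalog, database, table, and column names below are placeholders, not the project's real schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Iceberg + Glue Catalog configuration; warehouse path mirrors cur-s3/processed/.
spark = (SparkSession.builder.appName("games-to-iceberg")
         .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
         .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
         .config("spark.sql.catalog.glue_catalog.warehouse", "s3://cur-s3/processed/")
         .getOrCreate())

raw = spark.read.json("s3://raw-s3/historical/nfl/games/")

# Flatten raw JSON into a typed, deduplicated table (column names are illustrative).
games = (raw.select(
            F.col("id").alias("game_id"),
            F.to_date("date").alias("game_date"),
            F.col("home_team"),
            F.col("away_team"),
            F.col("home_score").cast("int"),
            F.col("away_score").cast("int"))
         .dropDuplicates(["game_id"]))

# DataFrameWriterV2: create or refresh the Iceberg table, partitioned by game date.
(games.writeTo("glue_catalog.betflow.nfl_games")
      .using("iceberg")
      .partitionedBy(F.days("game_date"))
      .createOrReplace())
```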
- Snowflake integration for analytics processing
- dbt models for data transformation
- Bronze → Silver → Gold layer progression
- Comprehensive data quality checks
- Historical game data ingestion
- Odds data processing
- Structured storage in S3
- Raw JSON transformation
- Schema validation and enforcement
- Data quality checks
- Iceberg table management
- Snowflake external tables
- dbt transformations
- Business logic implementation
- Analytics-ready views
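Once the gold models exist, downstream consumers query them through Snowflake. The hedged example below uses the Snowflake Python connector against an illustrative gold model; the account, credentials, schema, and model names are placeholders:

```python
import snowflake.connector

# Connection parameters are placeholders for illustration only.
conn = snowflake.connector.connect(
    account="my_account", user="betflow", password="***",
    warehouse="ANALYTICS_WH", database="BETFLOW", schema="GOLD",
)
cur = conn.cursor()
cur.execute("""
    SELECT game_date, home_team, AVG(closing_line - opening_line) AS avg_line_move
    FROM gold_odds_movement          -- illustrative dbt gold model name
    GROUP BY game_date, home_team
    ORDER BY game_date DESC
    LIMIT 20
""")
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()
```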
- Data Organization: Multi-layer storage strategy (Bronze/Silver/Gold)
- Quality Control: Comprehensive data validation
- Scalability: Efficient handling of multiple sports data
- Analytics Ready: Structured for complex analysis
- Maintainable: Clear separation of concerns
- Score progression tracking
- Team performance metrics
- Home vs Away performance analysis
- Season-over-season comparisons
- Quarter-by-quarter analysis
- Line movement tracking
- Market efficiency metrics
- Bookmaker comparison analysis
- Opening vs Closing line analysis
- Volume-weighted average prices
- Odds volatility patterns
- Weather impact correlations
- News sentiment effects
- Performance-weather patterns
- Multifactor analysis
- Team information tracking
- Venue details management
- Player roster changes
- League structure changes
- Odds movements tracking
- Score updates monitoring
- Weather condition changes
- Real-time market shifts
- Performance metrics evolution
- Market efficiency trends
- Betting volume patterns
- Season-by-season comparisons
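Analyses such as line movement tracking and odds volatility can be served straight from Druid's SQL endpoint. The example below is a hedged illustration, with the router URL, datasource, and column names assumed rather than taken from the project:

```python
import requests

# Line-range per game and bookmaker over the last hour (datasource/columns assumed).
sql = """
SELECT game_id, bookmaker,
       MIN("home_price") AS low_price,
       MAX("home_price") AS high_price,
       MAX("home_price") - MIN("home_price") AS line_range
FROM "odds_rt"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY game_id, bookmaker
ORDER BY line_range DESC
LIMIT 10
"""

resp = requests.post("http://localhost:8888/druid/v2/sql", json={"query": sql})
resp.raise_for_status()
for row in resp.json():
    print(row)
```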
- True Real-time Processing: Leverages Apache Kafka and Spark Streaming for sub-second latency in odds movement analysis and game statistics
- Dual Pipeline Architecture: Separate real-time and batch pipelines optimized for their specific use cases
- Advanced Analytics: Combines game statistics, odds movements, weather impacts, and news sentiment for comprehensive betting insights
- Cost-effective Design: Hybrid storage strategy using Apache Druid for real-time analytics and Snowflake for historical analysis
- Scalable Architecture: Cloud-native design supporting multiple sports (NFL, NBA, NHL, NCAAF) and data sources
- dbt models, gold layer, and business logic implementation
- Batch Grafana dashboard
- Unit and integration tests and corresponding workflows
- Ruff linting and type checking and corresponding workflows
- CI/CD for the Betflow package and corresponding docs
- Asynchronous Kafka orchestrator, one producer for all
- Weather component in batch pipeline
- Processing DAGs and analytics for data published by Kafka streams
- News component in real-time and batch pipeline
- Series of Medium articles (a couple of articles won't suffice for this project)
NOTE: (*) means in progress.