Betflow is a modern data and analytics engineering platform designed for real-time sports betting analytics, combining streaming and batch processing capabilities to provide comprehensive insights for sports betting decisions.
Note:
This project does not endorse or encourage gambling in any capacity. Its goal is to learn and apply Data Engineering and Analytics Engineering concepts and to understand the open-source tech stack; the end product is simply data-as-a-product for analyzing transient data at large scale.
The UI and dashboards are not publicly available for two main reasons:
- Publishing them could raise legal concerns for me, since part of the analysis points toward arbitrage.
- The current architecture requires an EC2 instance to keep running and hitting APIs at a high rate, which is expensive since I pay for everything myself.
If you are curious about what the dashboards look like, check out the two links below for video walkthroughs. Recruiters: if you would like to see the UI for hiring purposes, please email me; it takes only about five minutes to get the pipeline running, and I will share the link with you.
- Real-time odds movement tracking and analysis
- Live game statistics processing
- Weather impact correlation
- Historical pattern recognition
- Market efficiency metrics
- Multi-source data integration
- Automated data quality checks
- Custom analytics dashboards
The platform processes both streaming data for immediate insights and historical data for pattern analysis, providing a complete solution for sports betting analytics.
```mermaid
%%{init: {
'theme': 'base',
'themeVariables': {
'fontSize': '16px',
'fontWeight': 'bold',
'primaryTextColor': '#000000',
'primaryColor': '#e1f5fe',
'primaryBorderColor': '#01579b',
'fontFamily': 'arial'
}
}}%%
graph LR
classDef ingestion fill:#e6f3ff,stroke:#3498db
classDef processing fill:#e6ffe6,stroke:#2ecc71
classDef storage fill:#fff0e6,stroke:#e67e22
classDef analytics fill:#ffe6e6,stroke:#e74c3c
classDef monitoring fill:#f0e6ff,stroke:#9b59b6
%% Real-time Pipeline
A[Real-time APIs] -->|Stream| B[Kafka]
B -->|Process| C[Spark Streaming]
B -->|Backup| D[raw-s3/topics]
C -->|Real-time Analytics| E[Apache Druid]
E -->|Live Dashboard| F[Grafana RT]
E -->|Archive| G[hist-s3/druid/segments]
class A,B ingestion
class C processing
class D,G storage
class E analytics
class F monitoring
```
- ESPN API for live game statistics
- The Odds API for real-time betting odds
- OpenWeather API for venue weather conditions
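To make the ingestion concrete, here is a minimal sketch of how one of these sources could be polled and published to Kafka. The endpoint parameters, sport key, topic name, and polling interval are illustrative assumptions, not the project's actual configuration:

```python
import json
import os
import time

import requests
from kafka import KafkaProducer  # kafka-python

# Illustrative only: endpoint, sport key, topic name, and poll interval are assumptions.
ODDS_URL = "https://api.the-odds-api.com/v4/sports/americanfootball_nfl/odds"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    resp = requests.get(
        ODDS_URL,
        params={"apiKey": os.environ["ODDS_API_KEY"], "regions": "us", "markets": "h2h"},
    )
    resp.raise_for_status()
    for event in resp.json():
        # Key by game id so every update for a game lands in the same partition.
        producer.send("nfl.odds", key=event["id"].encode("utf-8"), value=event)
    producer.flush()
    time.sleep(60)
```

Keying messages by game id keeps all updates for a game in one partition, which preserves ordering for downstream processors.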
- KRaft mode for better performance and reliability
- Dedicated topics per sport and data type
- S3 sink connector for data persistence
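The S3 sink follows the standard Kafka Connect pattern. Below is a hedged sketch of registering the Confluent S3 sink connector through the Connect REST API; the topic list, bucket name, and flush size are placeholders rather than the project's real settings:

```python
import json

import requests

# Assumes the Confluent S3 sink connector plugin is installed on the Connect worker.
connector = {
    "name": "raw-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "topics": "nfl.odds,nfl.games,venue.weather",   # placeholder topic names
        "s3.bucket.name": "raw-s3",                     # placeholder bucket
        "s3.region": "us-east-1",
        "flush.size": "1000",                           # records per S3 object
        "topics.dir": "topics",                         # mirrors the raw-s3/topics layout above
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```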
- Structured Streaming for real-time analytics
- Window-based aggregations (3-20 minutes)
- Custom processors for games, odds, and weather
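A minimal Structured Streaming sketch of the odds processor is shown below: it reads the Kafka topic, parses the JSON payload, and computes a tumbling-window aggregate. The schema, topic name, and 5-minute window are simplifications; the real processors use 3-20 minute windows and sink to Druid rather than the console:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("odds-stream").getOrCreate()

# Simplified odds schema for illustration; the real processors carry many more fields.
schema = StructType([
    StructField("game_id", StringType()),
    StructField("bookmaker", StringType()),
    StructField("home_price", DoubleType()),
    StructField("event_time", TimestampType()),
])

odds = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "nfl.odds")
        .load()
        .select(F.from_json(F.col("value").cast("string"), schema).alias("o"))
        .select("o.*"))

# 5-minute tumbling window per game and bookmaker, with a watermark for late data.
agg = (odds.withWatermark("event_time", "10 minutes")
       .groupBy(F.window("event_time", "5 minutes"), "game_id", "bookmaker")
       .agg(F.avg("home_price").alias("avg_home_price"),
            F.max("home_price").alias("max_home_price")))

query = (agg.writeStream.outputMode("update")
         .format("console")   # the real pipeline writes to Druid instead
         .start())
query.awaitTermination()
```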
- Sub-second query performance
- Multi-dimensional analytics support
- Native Kafka integration
- Historical data management
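Druid's native Kafka integration works through a supervisor spec submitted to the Overlord (or the Router). The sketch below is illustrative only; the datasource, topic, dimensions, and granularities are assumptions, not the project's actual spec:

```python
import requests

# Illustrative Kafka ingestion supervisor spec for Druid.
spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "odds_rt",                      # placeholder datasource name
            "timestampSpec": {"column": "event_time", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["game_id", "bookmaker"]},
            "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "minute"},
        },
        "ioConfig": {
            "topic": "nfl.odds.enriched",                 # placeholder topic name
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Router URL assumed; a combined Coordinator/Overlord also accepts this endpoint.
resp = requests.post("http://localhost:8888/druid/indexer/v1/supervisor", json=spec)
resp.raise_for_status()
```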
- Real-time dashboards
- Custom metrics and alerts
- Historical trend analysis
- APIs publish data to respective Kafka topics
- Spark Streaming processes and enriches data
- Druid ingests processed data for analytics
- Grafana visualizes metrics in real-time
- Low-latency data processing (<1 second)
- Fault-tolerant architecture
- Scalable data ingestion
- Historical data persistence
- Real-time analytics and visualization
This pipeline enables comprehensive sports analytics with real-time insights for game analysis, odds movement tracking, and weather impact assessment.
```mermaid
%%{init: {
'theme': 'base',
'themeVariables': {
'fontSize': '16px',
'fontWeight': 'bold',
'primaryTextColor': '#000000',
'primaryColor': '#e1f5fe',
'primaryBorderColor': '#01579b',
'fontFamily': 'arial'
}
}}%%
graph LR
classDef source fill:#e6f3ff,stroke:#3498db
classDef processing fill:#e6ffe6,stroke:#2ecc71
classDef storage fill:#fff0e6,stroke:#e67e22
classDef analytics fill:#ffe6e6,stroke:#e74c3c
classDef quality fill:#f5f5f5,stroke:#7f8c8d
%% Batch Pipeline
A[Historical APIs] -->|Ingest| B[Airflow DAGs]
B -->|Raw Data| C[raw-s3/historical]
C -->|Transform| D[AWS Glue]
D -->|Process| E[Iceberg Tables]
E -->|Load| F[Snowflake Gold]
F -->|Model| G[dbt]
G -->|Visualize| H[Grafana Batch]
E -->|Catalog| I[Glue Catalog]
E -->|Store| J[cur-s3/processed]
class A source
class B,D processing
class C,J storage
class F,G,H analytics
class I quality
```
- Raw data collection from multiple sports APIs
- Storage in S3 raw bucket (raw-s3/historical/)
- Organized by sport type and data category (games, odds)
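A simplified Airflow DAG for this step might look like the sketch below: fetch one day of historical data and land the raw JSON in the sport/category layout of the raw bucket. The API endpoint, bucket name, and key layout are assumptions for illustration:

```python
import json
from datetime import datetime

import boto3
import requests
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False, tags=["betflow"])
def nfl_historical_ingestion():

    @task
    def fetch_games(ds=None):
        # ESPN's public scoreboard endpoint, used here purely for illustration.
        url = "https://site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard"
        resp = requests.get(url, params={"dates": ds.replace("-", "")})
        resp.raise_for_status()
        return resp.json()

    @task
    def land_to_s3(payload: dict, ds=None):
        s3 = boto3.client("s3")
        key = f"historical/nfl/games/{ds}.json"   # sport/category/date layout (assumed)
        s3.put_object(Bucket="raw-s3", Key=key, Body=json.dumps(payload))

    land_to_s3(fetch_games())


nfl_historical_ingestion()
```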
- AWS Glue for ETL processing
- Transformation from raw JSON to structured formats
- Implementation of Iceberg tables for data management
- Storage in processed bucket (cur-s3/processed/)
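The Glue transformation can be sketched as a PySpark job that reads the raw JSON, flattens it into a typed schema, and writes an Iceberg table registered in the Glue Catalog. The catalog, database, table, and column names below are placeholders, not the project's real schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Iceberg + Glue Catalog configuration; warehouse path mirrors cur-s3/processed/.
spark = (SparkSession.builder.appName("games-to-iceberg")
         .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
         .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
         .config("spark.sql.catalog.glue_catalog.warehouse", "s3://cur-s3/processed/")
         .getOrCreate())

raw = spark.read.json("s3://raw-s3/historical/nfl/games/")

# Flatten raw JSON into a typed, deduplicated table (column names are illustrative).
games = (raw.select(
            F.col("id").alias("game_id"),
            F.to_date("date").alias("game_date"),
            F.col("home_team"),
            F.col("away_team"),
            F.col("home_score").cast("int"),
            F.col("away_score").cast("int"))
         .dropDuplicates(["game_id"]))

# DataFrameWriterV2: create or refresh the Iceberg table, partitioned by game date.
(games.writeTo("glue_catalog.betflow.nfl_games")
      .using("iceberg")
      .partitionedBy(F.days("game_date"))
      .createOrReplace())
```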
- Snowflake integration for analytics processing
- dbt models for data transformation
- Bronze → Silver → Gold layer progression
- Comprehensive data quality checks
- Historical game data ingestion
- Odds data processing
- Structured storage in S3
- Raw JSON transformation
- Schema validation and enforcement
- Data quality checks
- Iceberg table management
- Snowflake external tables
- dbt transformations
- Business logic implementation
- Analytics-ready views
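Once the gold models exist, downstream consumers query them through Snowflake. The hedged example below uses the Snowflake Python connector against an illustrative gold model; the account, credentials, schema, and model names are placeholders:

```python
import snowflake.connector

# Connection parameters are placeholders for illustration only.
conn = snowflake.connector.connect(
    account="my_account", user="betflow", password="***",
    warehouse="ANALYTICS_WH", database="BETFLOW", schema="GOLD",
)
cur = conn.cursor()
cur.execute("""
    SELECT game_date, home_team, AVG(closing_line - opening_line) AS avg_line_move
    FROM gold_odds_movement          -- illustrative dbt gold model name
    GROUP BY game_date, home_team
    ORDER BY game_date DESC
    LIMIT 20
""")
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()
```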
- Data Organization: Multi-layer storage strategy (Bronze/Silver/Gold)
- Quality Control: Comprehensive data validation
- Scalability: Efficient handling of multiple sports data
- Analytics Ready: Structured for complex analysis
- Maintainable: Clear separation of concerns
- Score progression tracking
- Team performance metrics
- Home vs Away performance analysis
- Season-over-season comparisons
- Quarter-by-quarter analysis
- Line movement tracking
- Market efficiency metrics
- Bookmaker comparison analysis
- Opening vs Closing line analysis
- Volume-weighted average prices
- Odds volatility patterns
- Weather impact correlations
- News sentiment effects
- Performance-weather patterns
- Multifactor analysis
- Team information tracking
- Venue details management
- Player roster changes
- League structure changes
- Odds movements tracking
- Score updates monitoring
- Weather condition changes
- Real-time market shifts
- Performance metrics evolution
- Market efficiency trends
- Betting volume patterns
- Season-by-season comparisons
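Analyses such as line movement tracking and odds volatility can be served straight from Druid's SQL endpoint. The example below is a hedged illustration, with the router URL, datasource, and column names assumed rather than taken from the project:

```python
import requests

# Line-range per game and bookmaker over the last hour (datasource/columns assumed).
sql = """
SELECT game_id, bookmaker,
       MIN("home_price") AS low_price,
       MAX("home_price") AS high_price,
       MAX("home_price") - MIN("home_price") AS line_range
FROM "odds_rt"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY game_id, bookmaker
ORDER BY line_range DESC
LIMIT 10
"""

resp = requests.post("http://localhost:8888/druid/v2/sql", json={"query": sql})
resp.raise_for_status()
for row in resp.json():
    print(row)
```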
- True Real-time Processing: Leverages Apache Kafka and Spark Streaming for sub-second latency in odds movement analysis and game statistics
- Dual Pipeline Architecture: Separate real-time and batch pipelines optimized for their specific use cases
- Advanced Analytics: Combines game statistics, odds movements, weather impacts, and news sentiment for comprehensive betting insights
- Cost-effective Design: Hybrid storage strategy using Apache Druid for real-time analytics and Snowflake for historical analysis
- Scalable Architecture: Cloud-native design supporting multiple sports (NFL, NBA, NHL, NCAAF) and data sources
- dbt models, gold layer, and business logic implementation
- Batch Grafana dashboard
- Unit and integration tests and corresponding workflows
- Ruff linting and type checking and corresponding workflows
- CI/CD for the Betflow package and corresponding docs
- Asynchronous Kafka orchestrator, one producer for all
- Weather component in batch pipeline
- Processing DAGs and analytics for data published by Kafka streams
- News component in real-time and batch pipeline
- Series of Medium articles (a couple of articles won't suffice for this project)
NOTE: (*) means in progress.