This project demonstrates a complete actuarial data science workflow for insurance claims data using Python and Tableau. It showcases data cleaning, feature engineering, rule-based fraud detection, and stakeholder-ready visualizations.
The goal was to explore the feasibility of using basic feature engineering and rule-based risk scoring to flag potentially fraudulent claims in a simplified dataset, assess customer risk profiles, and deliver insights to stakeholders through a clear visual dashboard.
The focus was on:
- Cleaning and merging claims and customer data
- Creating basic risk flags using thresholds
- Testing with a synthetic fraudulent claim
- Building a proof-of-concept Tableau dashboard for stakeholder presentation
- Loaded raw datasets: `Insurance_Claims_Data.csv` and `Insurance_Customers_Data.csv`
- Inspected for missing values, data types, and duplicates
- Standardized column names and merged datasets on `customer_id`
- Saved the result to `Cleaned_Merged_Insurance_Data.csv` (sketched below)
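A minimal sketch of this cleaning-and-merge step, assuming `pandas` and the folder layout shown later in this README; the column-name standardization rule is an assumption:

```python
import pandas as pd

# Load the raw claims and customer datasets
claims = pd.read_csv("data/raw/Insurance_Claims_Data.csv")
customers = pd.read_csv("data/raw/Insurance_Customers_Data.csv")

# Inspect for missing values, data types, and duplicates
claims.info()
print(claims.isna().sum())
print("Duplicate claim rows:", claims.duplicated().sum())

# Standardize column names: trim whitespace, lowercase, spaces -> underscores
for df in (claims, customers):
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Merge on customer_id and persist the cleaned result
merged = claims.merge(customers, on="customer_id", how="left")
merged.to_csv("data/processed/Cleaned_Merged_Insurance_Data.csv", index=False)
```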
- Engineered new features: `claim_to_premium_ratio`, `is_high_risk` (binary flag), and `processing_efficiency`
- Exported as `Feature_Engineered_Insurance_Data_with_ClaimID.csv` (sketched below)
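A sketch of the feature engineering, assuming the merged data contains columns such as `claim_amount`, `premium_amount`, `claim_date`, and `settlement_date` (all assumed names); the high-risk threshold is illustrative:

```python
import pandas as pd

df = pd.read_csv("data/processed/Cleaned_Merged_Insurance_Data.csv")

# Ratio of claim payout to premium paid; values above 1 mean the claim
# exceeds the premium collected
df["claim_to_premium_ratio"] = df["claim_amount"] / df["premium_amount"]

# Binary flag: mark claims whose ratio exceeds an assumed threshold
df["is_high_risk"] = (df["claim_to_premium_ratio"] > 1.0).astype(int)

# Days from submission to settlement as a processing-efficiency proxy
df["processing_efficiency"] = (
    pd.to_datetime(df["settlement_date"]) - pd.to_datetime(df["claim_date"])
).dt.days

df.to_csv(
    "data/processed/Feature_Engineered_Insurance_Data_with_ClaimID.csv",
    index=False,
)
```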
- Generated synthetic `claim_id` values in the format `CL101`, `CL102`, etc.
- Injected a controlled fake fraudulent record to test the risk scoring logic (sketched below)
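A sketch of the ID generation and fraud injection; the fake record's values are assumptions chosen to trip the scoring rules in the next step:

```python
import pandas as pd

df = pd.read_csv("data/processed/Feature_Engineered_Insurance_Data_with_ClaimID.csv")

# Synthetic claim IDs: CL101, CL102, ... in row order
df["claim_id"] = ["CL" + str(101 + i) for i in range(len(df))]

# Controlled fake fraudulent record: copy an existing row and inflate it
fake = df.iloc[0].copy()
fake["claim_id"] = "CL" + str(101 + len(df))
fake["claim_amount"] = 250_000          # assumed extreme payout
fake["claim_to_premium_ratio"] = 50.0   # assumed extreme ratio
df = pd.concat([df, fake.to_frame().T], ignore_index=True)

# Overwriting the same file is an assumption about the pipeline
df.to_csv(
    "data/processed/Feature_Engineered_Insurance_Data_with_ClaimID.csv",
    index=False,
)
```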
- Created fraud detection logic based on:
- High claim amount thresholds
- Claim frequency
- Flagged high-risk and potentially fraudulent claims
- Final dataset saved as `Final_Insurance_Claims_Data.csv` (sketched below)
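A sketch of the rule-based scoring; both thresholds (the 99th-percentile amount cutoff and the claims-per-customer limit) are assumptions:

```python
import pandas as pd

# Dataset from the previous step, including the injected fake record
df = pd.read_csv("data/processed/Feature_Engineered_Insurance_Data_with_ClaimID.csv")

# Rule 1: claim amount above the 99th percentile (assumed cutoff)
high_amount = df["claim_amount"] > df["claim_amount"].quantile(0.99)

# Rule 2: unusually many claims from the same customer (assumed limit of 3)
claims_per_customer = df.groupby("customer_id")["claim_id"].transform("count")
high_frequency = claims_per_customer > 3

# A claim tripping either rule is flagged as potentially fraudulent
df["fraud_flag"] = (high_amount | high_frequency).astype(int)
df.to_csv("data/processed/Final_Insurance_Claims_Data.csv", index=False)
```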
- Created a stakeholder-facing dashboard using Tableau Desktop & Tableau Public
- Included KPIs and visual insights (a quick Python sanity check of these figures follows this list):
- Total Claims
- Total Fraud Claims
- Percent Fraud
- Top 5 Claim Submitters
- Claims Distribution by Policy Type
- Risk Breakdown by Flag Type
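The dashboard itself is built in Tableau, but the KPIs can be sanity-checked in Python first; a sketch assuming the `fraud_flag` column from the risk scoring step:

```python
import pandas as pd

df = pd.read_csv("data/processed/Final_Insurance_Claims_Data.csv")

total_claims = len(df)
total_fraud = int(df["fraud_flag"].sum())
pct_fraud = 100 * total_fraud / total_claims
top_submitters = df["customer_id"].value_counts().head(5)

print(f"Total Claims: {total_claims}")
print(f"Total Fraud Claims: {total_fraud} ({pct_fraud:.2f}%)")
print("Top 5 Claim Submitters:")
print(top_submitters)
```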
Recommendations for stakeholders:
- Investigate claimants with unusually high claim frequencies
- Review customers with high claim-to-premium ratios
- Use this dashboard as an early detection tool for fraud patterns
| Tool | Purpose |
|---|---|
| Python | Data cleaning, feature engineering, fraud logic |
| Pandas | Data manipulation |
| Tableau | Dashboard & stakeholder insights |
| GitHub | Version control & portfolio hosting |
```
Insurance_Analytics/
├── data/
│   ├── raw/
│   │   ├── Insurance_Claims_Data.csv
│   │   └── Insurance_Customers_Data.csv
│   ├── processed/
│   │   ├── Cleaned_Merged_Insurance_Data.csv
│   │   ├── Feature_Engineered_Insurance_Data_with_ClaimID.csv
│   │   └── Final_Insurance_Claims_Data.csv
├── scripts/
│   ├── insurance_data_cleaning.py
│   ├── insurance_feature_engineering.py
│   ├── inject_fake_fraud_case.py
│   └── risk_scoring_fraud_detection.py
├── Insurance_Claims_Risk_Summary.png
└── README.md
```
This analysis examined 505 insurance claims to assess financial risk and detect potentially fraudulent activity using basic actuarial data science principles. Key findings:
- 5 claims (0.99%) were flagged as potentially fraudulent, primarily due to unusually high claim frequency or payout amounts.
- A synthetic fraud scenario was introduced to test the detection logic and validate the feature engineering and risk scoring methods.
- A Tableau dashboard was created to visualize total claims, frequency patterns, and flagged risks, designed to support underwriting, fraud review, and pricing teams.
- Monitor claimants with high claim-to-premium ratios or abnormal frequency.
- Investigate flagged claims and refine fraud thresholds based on future data volume.
- Integrate this process into monthly claims reporting to proactively manage financial exposure.
This project demonstrates how technical data workflows can be transformed into actionable insights for business and risk stakeholders.
| Script | Purpose |
|---|---|
| `insurance_data_cleaning.py` | Cleaned and merged customer and claims data |
| `insurance_feature_engineering.py` | Created engineered features such as ratios and risk flags |
| `inject_fake_fraud_case.py` | Injected a synthetic fraudulent claim to test detection logic |
| `risk_scoring_fraud_detection.py` | Applied fraud and risk scoring logic to the final dataset |
