This project forecasts daily sales for Rossmann stores using historical data, store metadata, and engineered features. We use an XGBoost regression model to capture complex patterns and improve predictive accuracy.
- Train Data: `train.csv`
- Test Data: `test.csv`
- Store Metadata: `store.csv`
Data Source: Rossmann Kaggle Competition
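The three files join on the shared `Store` key. A minimal loading-and-merging sketch; the `data/` paths are illustrative, not fixed by the project:

```python
import pandas as pd

# Assumed local paths; adjust to wherever the Kaggle files are stored.
train = pd.read_csv("data/train.csv", parse_dates=["Date"], low_memory=False)
test = pd.read_csv("data/test.csv", parse_dates=["Date"])
store = pd.read_csv("data/store.csv")

# Attach store metadata to both splits via the shared Store id.
train = train.merge(store, on="Store", how="left")
test = test.merge(store, on="Store", how="left")
```

Parsing `Date` up front keeps the later date-feature extraction simple.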
- Predict the `Sales` column using historical store data.
- Create a model that generalizes well to unseen data (the test set).
- Submit predictions in `submission.csv`.
- Python
- Pandas, NumPy, Matplotlib, Seaborn
- XGBoost
- Scikit-learn for model selection and metrics
- Base features: `Store`, `DayOfWeek`, `Promo`, `SchoolHoliday`, `StateHoliday`, `StoreType`, `Assortment`, `CompetitionDistance`, `Promo2`
- Date features: `Year`, `Month`, `Day`, `WeekOfYear`
- Engineered features (sketched after this list):
  - `IsPromoMonth`: True if the current month is a promo month for that store
  - `IsHolidayWeek`: True if a holiday (school/state) occurred in that week
  - `CompetitionOpenTimeMonths`: Number of months since a competitor opened
  - `Promo2OpenTimeWeeks`: Weeks since Promo2 started
- Lag features: `Sales_Lag1`, `Sales_Lag7`, `Sales_RollingMean3`, `Sales_RollingMean7`
🚫 These features are excluded from test data since future sales are unknown.
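A condensed sketch of the feature groups listed above, assuming a merged DataFrame that carries the `store.csv` columns (`PromoInterval`, `CompetitionOpenSinceYear`/`Month`, `Promo2SinceYear`/`Week`). The helper name `engineer_features` and the exact fill/clipping choices are illustrative:

```python
import pandas as pd

def engineer_features(df: pd.DataFrame, is_train: bool = True) -> pd.DataFrame:
    """Hypothetical helper covering the feature groups listed above."""
    df = df.copy()

    # Date features
    df["Year"] = df["Date"].dt.year
    df["Month"] = df["Date"].dt.month
    df["Day"] = df["Date"].dt.day
    df["WeekOfYear"] = df["Date"].dt.isocalendar().week.astype(int)

    # IsPromoMonth: the current month's abbreviation appears in the store's PromoInterval.
    month_abbr = df["Date"].dt.strftime("%b")
    df["IsPromoMonth"] = [
        int(isinstance(interval, str) and abbr in interval)
        for abbr, interval in zip(month_abbr, df["PromoInterval"])
    ]

    # IsHolidayWeek: any school/state holiday in that store's calendar week.
    holiday = (df["SchoolHoliday"].astype(str) == "1") | (df["StateHoliday"].astype(str) != "0")
    df["IsHolidayWeek"] = (
        holiday.groupby([df["Store"], df["Year"], df["WeekOfYear"]]).transform("max").astype(int)
    )

    # Elapsed time since a competitor opened / Promo2 started (missing or negative -> 0).
    df["CompetitionOpenTimeMonths"] = (
        12 * (df["Year"] - df["CompetitionOpenSinceYear"])
        + (df["Month"] - df["CompetitionOpenSinceMonth"])
    ).fillna(0).clip(lower=0)
    df["Promo2OpenTimeWeeks"] = (
        52 * (df["Year"] - df["Promo2SinceYear"])
        + (df["WeekOfYear"] - df["Promo2SinceWeek"])
    ).fillna(0).clip(lower=0)

    # Lag features: only computable where past Sales are known, i.e. on train data.
    if is_train:
        df = df.sort_values(["Store", "Date"])
        df["Sales_Lag1"] = df.groupby("Store")["Sales"].shift(1)
        df["Sales_Lag7"] = df.groupby("Store")["Sales"].shift(7)
        for window in (3, 7):
            df[f"Sales_RollingMean{window}"] = df.groupby("Store")["Sales"].transform(
                lambda s, w=window: s.shift(1).rolling(w).mean()
            )
    return df
```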
- Load & Merge: Combine `train.csv` / `test.csv` with `store.csv`
- Feature Engineering: Extract and transform useful features
- Train/Test Split: Split the train data for model evaluation
- Train Model: Use `XGBRegressor`
- Hyperparameter Tuning: `GridSearchCV` (optional)
- Predict: Apply the best model to the processed test data
- Submit: Create `submission.csv` (a combined sketch of these steps follows this list)
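A minimal modelling sketch following the workflow above, assuming the merged, feature-engineered `train` and `test` frames from the earlier sketches. The feature list, hold-out window, and hyperparameter grid are illustrative, not the project's final settings; categorical columns such as `StoreType`, `Assortment`, and `StateHoliday` would need encoding and are omitted here for brevity:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Illustrative feature list (lag features are excluded, since they are unavailable for test data).
FEATURES = [
    "DayOfWeek", "Promo", "SchoolHoliday", "CompetitionDistance", "Promo2",
    "Year", "Month", "Day", "WeekOfYear",
    "IsPromoMonth", "IsHolidayWeek", "CompetitionOpenTimeMonths", "Promo2OpenTimeWeeks",
]

# Time-based split: hold out the most recent weeks instead of shuffling,
# so evaluation mimics forecasting genuinely unseen dates.
cutoff = train["Date"].max() - pd.Timedelta(weeks=6)
train_part = train[train["Date"] <= cutoff]
valid_part = train[train["Date"] > cutoff]

model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=8,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
)

# Optional hyperparameter tuning over a small illustrative grid.
search = GridSearchCV(
    model,
    param_grid={"max_depth": [6, 8], "learning_rate": [0.03, 0.05]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(train_part[FEATURES], train_part["Sales"])
best_model = search.best_estimator_

# Evaluate on the held-out recent weeks.
valid_preds = best_model.predict(valid_part[FEATURES])
rmse = np.sqrt(mean_squared_error(valid_part["Sales"], valid_preds))
print(f"Validation RMSE: {rmse:.2f}")

# Predict on the processed test set and write the Kaggle submission file.
submission = pd.DataFrame({"Id": test["Id"], "Sales": best_model.predict(test[FEATURES])})
submission.to_csv("submission.csv", index=False)
```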
- Actual vs Predicted Sales (scatter plot)
- Predicted Sales Trends (by Date or Store)
- Feature Importance from XGBoost
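A plotting sketch for the scatter plot and the feature-importance chart, assuming `best_model`, `valid_part`, `valid_preds`, and `FEATURES` from the modelling sketch above:

```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Actual vs. predicted sales on the validation split.
axes[0].scatter(valid_part["Sales"], valid_preds, s=4, alpha=0.3)
axes[0].set_xlabel("Actual Sales")
axes[0].set_ylabel("Predicted Sales")
axes[0].set_title("Actual vs Predicted Sales")

# Built-in XGBoost feature importance (gain-based).
plot_importance(best_model, ax=axes[1], importance_type="gain", max_num_features=15)
axes[1].set_title("Feature Importance")

plt.tight_layout()
plt.savefig("plots.png", dpi=150)
```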