I am currently working on the Wine Quality dataset and struggling with significant overfitting. My model performs well on the training set but fails to generalize to the test set.
I have already tried:
- Various algorithms (Random Forest, SVM, Logistic Regression).
- Tuning hyperparameters (GridSearch, adjusting max_depth, min_samples_leaf).
- Feature engineering (creating new ratios).
- Handling class imbalance (class_weight='balanced').
Despite these efforts, my training accuracy remains around 0.8-0.9 while my test accuracy stays stuck at 0.65.
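A minimal sketch of how a train/test gap like this can be measured across folds rather than on a single split. It assumes scikit-learn and uses a synthetic stand-in dataset from make_classification, not the actual Wine Quality data or the notebook's code. The helper name train_test_gap and all parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the Wine Quality features and labels (assumption):
# 11 features, 3 classes, and 20% randomly flipped labels to mimic label noise.
X, y = make_classification(
    n_samples=1000,
    n_features=11,
    n_informative=6,
    n_classes=3,
    flip_y=0.2,
    random_state=0,
)


def train_test_gap(model, X, y, n_splits=5):
    """Return (mean train accuracy, mean validation accuracy) over stratified folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = cross_validate(
        model, X, y, cv=cv, scoring="accuracy", return_train_score=True
    )
    return scores["train_score"].mean(), scores["test_score"].mean()


# Hypothetical hyperparameters, chosen only to illustrate the check.
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=0,
)

train_acc, val_acc = train_test_gap(model, X, y)
print(f"train={train_acc:.3f}  validation={val_acc:.3f}  gap={train_acc - val_acc:.3f}")
```

Averaging over folds shows whether the gap is stable or an artifact of one particular train/test split.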
Could anyone provide insights on whether this is due to inherent noise in the human-labeled data, or if I should change my strategy (e.g., switching to regression or advanced techniques like XGBoost)? Any advice or alternative approaches would be greatly appreciated.
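To make the regression option concrete, here is a rough sketch of treating the quality score as a numeric target and rounding the predictions back to integer grades. It assumes scikit-learn and NumPy, and it uses synthetic ordinal labels as a stand-in for the real data. The data-generating recipe, variable names, and hyperparameters are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic ordinal target (assumption): a noisy linear score binned into
# integer grades between 3 and 9, loosely imitating a quality scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 11))
w = rng.normal(size=11)
latent = X @ w / np.linalg.norm(w) + rng.normal(scale=0.7, size=1000)
y = np.clip(np.rint(6 + latent), 3, 9).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a regressor on the integer grades as if they were continuous.
reg = RandomForestRegressor(n_estimators=200, min_samples_leaf=5, random_state=0)
reg.fit(X_train, y_train)

# Round predictions to the nearest grade, staying inside the observed range.
raw_pred = reg.predict(X_test)
rounded_pred = np.clip(np.rint(raw_pred), y_train.min(), y_train.max()).astype(int)

mae = mean_absolute_error(y_test, raw_pred)
exact_acc = accuracy_score(y_test, rounded_pred)
within_one = np.mean(np.abs(rounded_pred - y_test) <= 1)

print(f"MAE={mae:.3f}  exact accuracy={exact_acc:.3f}  within one grade={within_one:.3f}")
```

Reporting MAE and a within-one-grade rate next to exact accuracy shows whether the errors are mostly off by a single grade, which would be consistent with noisy ordinal labels rather than a model that is badly wrong.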
Thank you!
https://colab.research.google.com/drive/1R0jhClimKn1EsfFRAV-612IwmTeypQK3?usp=sharing