[Machine Learning Diary] Day 6 — Don’t overfit
Overfitting is one of the most common issues in machine learning projects and data competitions, and one that almost every practitioner has to deal with.
What is overfitting
According to the book Hands-On Machine Learning, overfitting happens when the model is too complex relative to the amount and noisiness of the training data.
If you have ever entered a Kaggle competition, you may know that competitors’ rankings sometimes drop once their models are evaluated on the private dataset at the final stage. This usually happens when people rely on “magic features” that only work on the public dataset.
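To make this concrete, here is a minimal sketch (a toy example of my own, not from the book) using scikit-learn: a degree-15 polynomial fit to a handful of noisy points scores almost perfectly on the data it was trained on, but much worse on held-out data. The degrees and sample size here are arbitrary illustrative choices.

```python
# Toy demonstration of overfitting: a high-degree polynomial chases the
# noise in a small training set, so it scores far better on the data it
# was trained on than on held-out data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)  # noisy sine curve
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 15):  # a simple model vs. an over-complex one
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree:2d}: "
          f"train R^2 = {model.score(X_train, y_train):.3f}, "
          f"test R^2 = {model.score(X_test, y_test):.3f}")
```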
How to avoid overfitting
There are several techniques for combating overfitting:
- Regularization
- Data augmentation
- Dropout
- Bootstrap/bagging
- Ensembling
- Early stopping
- Exploiting invariances
- Bayesian methods
I will go through examples of each of these in later posts. To be continued!
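In the meantime, here is a minimal sketch of the first item, regularization (again a toy example of mine, not from the book). Ridge regression adds an L2 penalty on the weights, which keeps the same kind of over-complex polynomial from using huge coefficients to chase the noise; the alpha value is an arbitrary illustrative choice.

```python
# Regularization sketch: the same over-complex polynomial, with and without
# an L2 penalty (Ridge). The penalty shrinks the weights and usually narrows
# the gap between training and test scores.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, estimator in [("plain", LinearRegression()),
                        ("ridge", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(15), StandardScaler(), estimator)
    model.fit(X_train, y_train)
    print(f"{name}: train R^2 = {model.score(X_train, y_train):.3f}, "
          f"test R^2 = {model.score(X_test, y_test):.3f}")
```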
Apart from these techniques, we can also follow some common practices:
- Simplify the model (see the sketch after this list)
- Get more data
- Reduce noise in the training data
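As an illustration of the first practice, here is a minimal sketch of simplifying the model (a toy example of mine): capping a decision tree’s maximum depth reduces its capacity, and with some label noise injected, the unrestricted tree memorizes the training set while the shallow one generalizes better. The sample sizes, noise level, and depth are arbitrary illustrative choices.

```python
# "Simplify the model" sketch: an unrestricted decision tree memorizes the
# (noisy) training labels, while a depth-capped tree generalizes better.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 randomly flips 20% of the labels, i.e. injects label noise.
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (None, 3):  # unrestricted tree vs. a deliberately simple one
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train acc = {tree.score(X_train, y_train):.3f}, "
          f"test acc = {tree.score(X_test, y_test):.3f}")
```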
Conclusion
In essence, overfitting means reading too much into the data you have. The model may be accurate on your training dataset, but its accuracy will suffer on the test dataset and on real-world data.
Reference
Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O’Reilly Media.