Choosing a classification algorithm (or any algorithm, for that matter) in the supervised machine learning domain comes down to the bias-variance tradeoff, which is a central issue in the field.
- Bias is the error that comes from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- Variance, on the other hand, is the error that comes from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting). A quick sketch of both failure modes follows.
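Here is a minimal sketch of the two failure modes, assuming scikit-learn and a synthetic dataset from `make_classification` (both are my choices for illustration, not part of the original discussion). A depth-1 decision stump underfits (high bias: poor scores on both train and test folds), while an unpruned decision tree overfits (high variance: near-perfect train score, noticeably lower test score).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

models = [
    # High bias: too simple to capture the feature-target relations
    ("high bias (depth-1 stump)", DecisionTreeClassifier(max_depth=1, random_state=0)),
    # High variance: flexible enough to memorize noise in the training folds
    ("high variance (unpruned tree)", DecisionTreeClassifier(max_depth=None, random_state=0)),
]

for name, model in models:
    scores = cross_validate(model, X, y, cv=5, return_train_score=True)
    print(f"{name}: train={scores['train_score'].mean():.2f}, "
          f"test={scores['test_score'].mean():.2f}")
```

The interesting number is the gap between the train and test scores: small but low for the stump (underfitting), large for the unpruned tree (overfitting).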
Under similar conditions, the expected behavior of some common classification algorithms looks like this:
| Algorithm | Bias | Variance |
|---|---|---|
| Naive Bayes | High | Low |
| Logistic Regression | Low | High |
| Decision Tree | Low | High |
| Bagging | Low | High, but lower than a single Decision Tree |
| Random Forest | Low | High, but lower than a Decision Tree or Bagging |
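If you want to sanity-check this table yourself, one rough way (again assuming scikit-learn and a synthetic dataset, purely as a sketch) is to look at the gap between training and cross-validation accuracy for each algorithm; a large gap is a crude proxy for high variance.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),       # bagged decision trees
    "Random Forest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5, return_train_score=True)
    gap = scores["train_score"].mean() - scores["test_score"].mean()
    print(f"{name:20s} train/test gap = {gap:.3f}")
```

The exact numbers depend heavily on the dataset, so treat the output as a rough check of the table's ordering rather than a definitive ranking.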
In essence, if the choice is driven by dataset size, go with high-bias, low-variance models when data is scarce; with a large number of data points, you can experiment with the other classification algorithms, since they will typically give better results.
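One hedged way to see this effect is a learning curve: with few samples, a high-bias model like Gaussian Naive Bayes often generalizes about as well as (or better than) a low-bias, high-variance decision tree, and the flexible model only pulls ahead as the training set grows. The dataset and model pairing below are illustrative assumptions, not a prescription.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)

for name, model in [("Naive Bayes", GaussianNB()),
                    ("Decision Tree", DecisionTreeClassifier(random_state=0))]:
    # Cross-validated test accuracy at increasing training-set sizes
    sizes, _, test_scores = learning_curve(
        model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
    curve = {int(s): round(float(v), 2)
             for s, v in zip(sizes, test_scores.mean(axis=1))}
    print(name, curve)
```

If the Naive Bayes curve is competitive at the smallest sizes but flattens out while the tree keeps improving, that matches the rule of thumb above.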