Modelling Risks in Machine Learning – A Case for Dirichlet Distribution for Better Modelling on Training Dataset
In the previous part we talked about training, validation, and testing datasets and how to define a problem using your dataset. Along the way we covered a couple of techniques, at a high level, for identifying what your dataset contains. In this part we discuss some statistical issues related to the classification problem.
Risks in Modelling via Machine Learning
Two things affect the learning or generalization capability of a model: the choice of the algorithm (and its parameters) and the amount of training data. This ability to generalize can be estimated by various metrics, including the prediction error. The overall estimate of unseen error, or risk, of the model is given by:
$$E(\text{Risk}) = \text{Noise} + E_X[\text{Var}(G, n)] + E_X[\text{Bias}_X^2(G, n)]$$
Here, Noise is the stochastic noise. Var(G, n) is called the variance error and is a measure of how susceptible our hypothesis or algorithm (G) is when given different datasets of size n. Bias_X^2(G, n) is called the bias error and represents how far the best algorithm in the model (the average learner over all possible datasets) is from the optimal one.
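The decomposition can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the target function `x * x`, the noise level `sigma`, the test point `x0`, and the helper names `fit_line` and `decompose` are all hypothetical choices, and the learner is a plain least-squares line. It draws many training sets, fits the line on each, and compares the measured risk at `x0` with the sum of noise, variance, and squared bias.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (closed form)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def decompose(n=20, trials=2000, sigma=0.3, x0=0.9, seed=0):
    """Estimate noise, variance, squared bias, and risk at the point x0."""
    rng = random.Random(seed)

    def f(x):
        return x * x  # assumed "true" target function for this sketch

    preds, sq_errors = [], []
    for _ in range(trials):
        # A fresh training set of size n on every trial.
        xs = [rng.uniform(0, 1) for _ in range(n)]
        ys = [f(x) + rng.gauss(0, sigma) for x in xs]
        a, b = fit_line(xs, ys)
        p = a + b * x0
        preds.append(p)
        # Squared error against a new noisy observation at x0.
        y_new = f(x0) + rng.gauss(0, sigma)
        sq_errors.append((y_new - p) ** 2)

    noise = sigma ** 2
    variance = statistics.pvariance(preds)
    bias_sq = (statistics.fmean(preds) - f(x0)) ** 2
    risk = statistics.fmean(sq_errors)
    return noise, variance, bias_sq, risk


noise, variance, bias_sq, risk = decompose()
print(f"noise={noise:.4f} variance={variance:.4f} bias^2={bias_sq:.4f}")
print(f"sum={noise + variance + bias_sq:.4f} measured risk={risk:.4f}")
```

With enough trials the two printed totals should be close; the straight line cannot represent the curved target, which is what produces the non-zero bias term.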
Learning curves, as shown in the figures below, plot training and testing errors while keeping either the algorithm and its parameters constant or the training data size constant. They give an indication of underfitting or overfitting.
When the training data size is fixed, different algorithms or the same algorithms with different parameter choices can exhibit different learning curves. The Figure below shows two cases of algorithms on the same data size giving two different learning curves based on bias and variance.
The figure above shows the relationship between training data size and error rate when the model complexity is fixed; the different curves correspond to different choices of models.
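A learning curve of this kind can be reproduced on synthetic data. The following is a minimal sketch, assuming a fixed model (a least-squares line) and a made-up linear data source with Gaussian noise; the names `fit_line`, `mse`, and `learning_curve` are illustrative and not taken from any library. It varies only the training set size and averages the training and testing errors over several repeats.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (closed form)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mse(a, b, xs, ys):
    """Mean squared error of the line a + b * x on the given points."""
    return statistics.fmean([(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)])


def learning_curve(sizes=(5, 10, 20, 50, 200), repeats=200, sigma=0.5, seed=0):
    """Return (size, mean train error, mean test error) per training size."""
    rng = random.Random(seed)

    def sample(n):
        # Assumed data source: y = 1 + 2x plus Gaussian noise.
        xs = [rng.uniform(0, 1) for _ in range(n)]
        ys = [1.0 + 2.0 * x + rng.gauss(0, sigma) for x in xs]
        return xs, ys

    curve = []
    for n in sizes:
        train_err, test_err = [], []
        for _ in range(repeats):
            xs, ys = sample(n)
            a, b = fit_line(xs, ys)
            train_err.append(mse(a, b, xs, ys))
            test_err.append(mse(a, b, *sample(100)))
        curve.append((n, statistics.fmean(train_err), statistics.fmean(test_err)))
    return curve


for n, train_error, test_error in learning_curve():
    print(f"n={n:4d} train={train_error:.3f} test={test_error:.3f}")
```

In this setup the training error tends to rise and the testing error tends to fall as more data is added, with both approaching the noise level.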
The algorithm or model choice also impacts model performance. A complex algorithm, with more parameters to tune, can result in overfitting, while a simple algorithm with fewer parameters might underfit. The classic figure to illustrate the relationship between model performance and complexity when the training data size is fixed is as follows:
The figure above shows the relationship between model complexity and error rate on the training and testing data when the training data size is fixed.
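The same effect can be sketched in code by fixing the training size and varying the number of parameters. This example is an assumption-laden illustration, not the method behind the figure: it uses a piecewise-constant model with one parameter per bin, a sine-shaped target, and hypothetical helper names (`fit_bins`, `mse`, `complexity_curve`). More bins means a more complex model.

```python
import math
import random
import statistics


def fit_bins(xs, ys, bins):
    """Piecewise-constant model on [0, 1): one parameter (a mean) per bin."""
    overall = statistics.fmean(ys)
    groups = [[] for _ in range(bins)]
    for x, y in zip(xs, ys):
        groups[min(int(x * bins), bins - 1)].append(y)
    # Empty bins fall back to the overall mean (an arbitrary choice here).
    return [statistics.fmean(g) if g else overall for g in groups]


def mse(model, xs, ys):
    """Mean squared error of the binned model on the given points."""
    bins = len(model)
    return statistics.fmean(
        [(y - model[min(int(x * bins), bins - 1)]) ** 2 for x, y in zip(xs, ys)]
    )


def complexity_curve(bin_counts=(1, 2, 4, 8, 16, 32), n=40, repeats=50,
                     sigma=0.3, seed=0):
    """Return (bins, mean train error, mean test error) for a fixed n."""
    rng = random.Random(seed)

    def sample(size):
        # Assumed data source: a sine wave plus Gaussian noise.
        xs = [rng.random() for _ in range(size)]
        ys = [math.sin(2 * math.pi * x) + rng.gauss(0, sigma) for x in xs]
        return xs, ys

    totals = {b: [0.0, 0.0] for b in bin_counts}
    for _ in range(repeats):
        train, test = sample(n), sample(500)
        for b in bin_counts:
            model = fit_bins(*train, b)
            totals[b][0] += mse(model, *train) / repeats
            totals[b][1] += mse(model, *test) / repeats
    return [(b, tr, te) for b, (tr, te) in totals.items()]


for bins, train_error, test_error in complexity_curve():
    print(f"bins={bins:2d} train={train_error:.3f} test={test_error:.3f}")
```

The training error keeps dropping as bins are added, while the testing error is expected to bottom out at an intermediate bin count and then climb again, which is the overfitting region.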
Validation allows for exploring the parameter space to find the model that generalizes best. Regularization (to be discussed with linear models) and validation are two mechanisms that should be used for preventing overfitting. Sometimes the "k-fold cross-validation" process is used for validation, which involves creating k samples of the data, using k - 1 of them to train on and the remaining one to test, and repeating this k times to give an average estimate. The following figure shows 5-fold cross-validation as an example:
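The k-fold procedure can also be sketched in a few lines of code. This is a minimal illustration, assuming a simple shuffled split; `k_fold_indices` and `cross_val_score` are hypothetical helper names, and `fit` and `error` stand for whatever training and scoring functions you supply.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_val_score(xs, ys, fit, error, k=5, seed=0):
    """Average the held-out error over the k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(
            error(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
        )
    return sum(scores) / k


# Usage with a deliberately trivial model: always predict the training mean.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def mean_squared_error(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)


xs = list(range(20))
ys = [2.0 * x for x in xs]
print(cross_val_score(xs, ys, fit_mean, mean_squared_error, k=5))
```

Each sample lands in the test fold exactly once, so every data point contributes to the averaged estimate.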
The following are some commonly used techniques to perform data sampling, validation, and learning:
- Random split of training, validation, and testing: 60, 20, 20. Train on 60%, use 20% for validation, and then combine the train and validation datasets to train a final model that is used to test on the remaining 20%. Split may be done randomly, based on time, based on region, and so on.
- Training, cross-validation, and testing: Split into Train and Test two to one, do validation using cross-validation on the train set, train on whole two-thirds and test on one-third. Split may be done randomly, based on time, based on region, and so on.
- Training and cross-validation: Used when the training set is small and only model selection can be done, without much parameter tuning. Run cross-validation on the whole dataset and choose the best models, then train them on the entire dataset.
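The first technique in the list, a random 60/20/20 split, could look like the sketch below. The function name `train_val_test_split` and its parameters are illustrative assumptions; a split based on time or region would replace the shuffle with a sort on the relevant field.

```python
import random


def train_val_test_split(rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly split rows into training, validation, and testing subsets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to testing
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))

# After tuning on the validation subset, the final model is trained on
# train + val and evaluated once on test.
final_training_rows = train + val
```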
A Precursor to Multinomial Classification Problem
In this post we discussed the modelling risks we bear in machine learning. These risks may be compounded in multinomial, multi-dimensional, multi-label datasets. Multiclass classification is similar to binary classification, with the difference that there are several possible discrete outcomes instead of just two. What should the prior distribution of the overall dataset be if the dataset inherently contains subsets? Can one distribution represent the overall nature of the dataset? These are valid questions a data scientist encounters when dealing with multinomial datasets. How do you determine whether there are significant outliers in the dataset, and can we associate them as attributes in our training, validation, and testing datasets? Can box plots and parallel coordinates plots help in determining the algorithm choice?
Exploratory studies for the multinomial data have revealed a very interesting problem. In particular, the box plots, coupled with the parallel coordinates plot, suggest that a good choice of algorithm might be an ensemble method if there's enough data to fit it. The sets of attributes corresponding to one class or another apparently have a complicated boundary between them. Which algorithm will give the best predictive performance remains to be seen. The exploratory methods you may have seen have done their job. They have given a good understanding of the tradeoffs for this problem, leading to some guesses about what algorithm will give the best performance.
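Box plots conventionally flag outliers with a whisker rule: any value more than 1.5 times the interquartile range beyond the first or third quartile. The sketch below shows that rule for a single attribute; the function name `box_plot_outliers` is hypothetical, and the 1.5 multiplier is the common convention rather than something prescribed by this post.

```python
import statistics


def box_plot_outliers(values, whisker=1.5):
    """Return the values lying outside the box-plot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


attribute = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(box_plot_outliers(attribute))
```

For a multiclass dataset the same check would be run per attribute and per class, since a value that is ordinary in one class may be extreme in another.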
In our next post we will discuss the Dirichlet prior distribution to see how convenient it can be for multiclass, multinomial datasets.