Define The Problem by Understanding The Data Set – Part III
In the previous part we talked about the dataset and how to define the problem using it. Along the way we touched on a couple of techniques, at a high level, for identifying what your dataset contains and, more importantly, how machine learning can be used to investigate it and dig out important information in real time.
In the first part we discussed some aspects of the classification problem. In this post we continue with the classification problem and go into more detail.
Statistical Characterization of Categorical Attributes
In our first post we left off at the topic of applying ML to numeric attributes. But what about categorical attributes? You want to check how many categories they have and how many examples there are from each category. You want to check these things for a couple of reasons. A gender attribute might have two possible values (Male and Female), but if the attribute had been the state of the United States, there would have been 50 possible categories, or, for India, 29 states. As the number of categories grows, the complexity of dealing with them mounts. Most binary tree algorithms, which are the basis for ensemble methods, have a cutoff on how many categories they can handle. The popular Random Forests package written by Breiman and Cutler (the inventors of the algorithm) has a cutoff of 32 categories. If an attribute has more than 32 categories, you’ll need to aggregate them.
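As a rough illustration of the check described above, the sketch below counts the examples in each category and flags two situations: too many categories, and categories with very few examples. The function name `summarize_categories`, the `min_examples` threshold of 5, and the toy list of states are assumptions made up for this example; the cutoff of 32 is the one quoted in the text.

```python
from collections import Counter


def summarize_categories(values, max_categories=32, min_examples=5):
    """Count examples per category and flag potential problems.

    max_categories: cutoff quoted in the text for the Random Forests package.
    min_examples: hypothetical threshold below which a category is "rare".
    """
    counts = Counter(values)
    return {
        "counts": counts,
        "too_many_categories": len(counts) > max_categories,
        "rare_categories": sorted(c for c, n in counts.items() if n < min_examples),
    }


# Toy categorical attribute (invented data): Ohio has only two examples.
states = ["PA"] * 40 + ["IN"] * 25 + ["OH"] * 2
summary = summarize_categories(states)

for state, count in summary["counts"].most_common():
    print(state, count)
print("too many categories:", summary["too_many_categories"])
print("rare categories:", summary["rare_categories"])
```

Running a check like this before modeling shows which categories may need to be aggregated or handled specially.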
It is also important to remember that training often involves taking a random subset of the data and training a series of models on it. Suppose, for instance, that the category is the state of the United States and that Ohio has only two examples. A random draw of training examples might not get any from Ohio. You need to see those kinds of problems before they occur so that you can address them. In the case of the two Ohio examples, you might merge them with Pennsylvania or Indiana, you might duplicate them, or you might manage the random draw so that you ensure getting Ohio examples (a procedure called stratified sampling).
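A minimal sketch of stratified sampling, assuming the category labels are held in a plain Python list: the draw is made separately inside each category, so even a category with two examples contributes at least one. The helper `stratified_sample`, its "at least one per category" rule, and the toy data are illustrative assumptions, not a standard library routine.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, seed=0):
    """Return sorted row indices, sampling about `fraction` of each category.

    Every category contributes at least one example (an assumption of this
    sketch), so rare categories cannot vanish from the draw.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for index, label in enumerate(labels):
        by_category[label].append(index)

    chosen = []
    for indices in by_category.values():
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy data (invented): Ohio has only two examples.
states = ["PA"] * 40 + ["IN"] * 25 + ["OH"] * 2
sample = stratified_sample(states, fraction=0.5)
sampled_states = {states[i] for i in sample}
print(sorted(sampled_states))
```

A plain random draw of the same size could miss Ohio entirely; sampling per category rules that out by construction.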
Visualizing Attribute and Label Correlations Using a Heat Map
Calculating the correlations and printing them or drawing cross‐plots works fine for a few correlations, but it is difficult to get a grasp of a large table of numbers, and it is difficult to squeeze all the cross‐plots onto a page if the problem has 100 attributes. One way to check correlations with a large number of attributes is to calculate Pearson’s correlation coefficient for pairs of attributes, arrange those correlations into a matrix where the ij‐th entry is the correlation between the ith attribute and the jth attribute, and then plot the matrix as a heat map, with color encoding the strength of each correlation. Light areas along the diagonal of such a plot indicate that attributes close to one another in index have relatively high correlations. This comes from the way the data are generated: attributes that are close in index are sampled at short time intervals from one another and consequently have similar frequencies, and similar frequencies reflect off the targets similarly.
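The matrix described above can be sketched in a few lines. The example below computes Pearson's correlation coefficient for every pair of attribute columns using only the standard library; the function names and the three toy columns are assumptions invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Entry [i][j] is the correlation between attribute i and attribute j."""
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


# Toy attribute columns (invented): the first two rise together,
# the third tends to fall as the first rises.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.8],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
corr = correlation_matrix(columns)

for row in corr:
    print(["%.2f" % value for value in row])
```

To turn the matrix into a heat map, it can be handed to a plotting routine such as matplotlib's `imshow` or `pcolor`, assuming that library is installed.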
Training, Validation, and Test set
The Holy Grail of creating good classification models is to train on a set of good quality, representative data (training data), tune the parameters and find effective models (validation data), and finally, estimate the model’s performance by its behavior on unseen data (test data). The central idea behind this logical grouping is to make sure models are validated or tested on data that has not been seen during training. Otherwise, a simple “rote learner” that merely memorizes the training examples can outperform the algorithm. The generalization capability of the learning algorithm must be evaluated on a dataset which is different from the training dataset, but comes from the same population. There is a balance to strike. Removing too much data from training to increase the budget for validation and testing can result in models which suffer from “underfitting”, that is, not having enough examples to build patterns that can help in generalization. On the other hand, the extreme choice of allocating all the labeled data for training and not performing any validation or testing can lead to “overfitting”, that is, models that fit the examples too faithfully and do not generalize well enough.
Typically, in most machine learning challenges and real world customer problems, one is given a training set and a testing set upfront for evaluating the performance of the models. In these engagements, the only question is how to validate and find the most effective parameters given the training set. In some engagements, only the labeled dataset is given and you need to carve the training, validation, and testing sets out of it yourself to make sure your models do not overfit or underfit the data.
Three logical processes are needed for modeling and hence three logical datasets are needed, namely,
1. Training
2. Validation
3. Testing
The purpose of the training dataset is to give labeled data to the learning algorithm to build the models. The purpose of the validation set is to see the effect of different parameter settings by evaluating the models trained with them on the validation set. Finally, the best parameters or models are retrained on the combination of the training and validation sets to find an optimum model that is then tested on the blind test set.
In the next part we will cover how to generate training, validation, and test data, how to use them, and how to handle underfitting and overfitting.
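The three-step workflow can be sketched end to end as follows. Everything concrete here is an assumption chosen for illustration: the 60/20/20 split, the noisy one-dimensional toy data, and the use of a tiny k-nearest-neighbour classifier whose parameter `k` is the thing being tuned. It is not the method of any particular library.

```python
import random


def split_indices(n, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle row indices and cut them into train / validation / test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


def knn_predict(train_points, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train_points, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train_points, eval_points, k):
    """Fraction of eval_points classified correctly."""
    hits = sum(knn_predict(train_points, x, k) == y for x, y in eval_points)
    return hits / len(eval_points)


# Toy labeled data (invented): label is 1 when x > 0.5, with 10% label noise.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

train_idx, val_idx, test_idx = split_indices(len(data))
train = [data[i] for i in train_idx]
val = [data[i] for i in val_idx]
test = [data[i] for i in test_idx]

# 1. Train on the training set; 2. pick the parameter on the validation set.
candidate_ks = [1, 3, 5, 9]
best_k = max(candidate_ks, key=lambda k: accuracy(train, val, k))

# 3. Retrain on training + validation data, then score once on the test set.
test_accuracy = accuracy(train + val, test, best_k)
print("best k:", best_k, "test accuracy:", test_accuracy)
```

The test set is touched only once, after `k` has been fixed, so the reported accuracy is an estimate on data the model has not seen.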