Define The Problem by Understanding The Data Set – Part I
Every new dataset brings a new challenge for the data scientist, because what lies inside it is unknown. Along the way, data scientists face unknowns in attributes, labels, data types, mixed information, cluttered and unsorted data, disconnected data points, missing values, and so on, all of which add complexity to the design of data mining and data modelling algorithms. In this post we visit these challenges and see how they can affect your machine learning algorithm, how much tweaking and tuning may be required, and hence your overall approach to a new problem.
Let’s work through a simple example to review the basic problem structure, nomenclature, and characteristics of a machine learning data set. First we need to introduce some language for discussing the different types of function approximation problems, one by one. These problems illustrate common variations of machine learning problems, so that you’ll know how to recognize the variants when you see them and how to handle them (with code examples for them).
Different Types of Attributes and Labels Drive Modeling Choices
Attributes and labels go by a variety of names, and newcomers to machine learning can get tripped up when the names switch from one author to another, or even from one section to another by the same author. Attributes (the variables being used to make predictions) are also known as predictors, features, independent variables, or inputs.
Labels are also known as outcomes, targets, dependent variables, or responses.
One type of attribute is called a categorical or factor variable. Categorical variables have the property that there’s no order relation between the various values. Categorical variables can be two‐valued, like Boy or Girl, or multivalued, like states (MP, MH, AP . . . MIZ). Other distinctions can be drawn regarding attributes (integer versus float, for example), but they do not have the same impact on machine learning algorithms.
The reason for this is that many machine learning algorithms take numeric attributes only; they cannot handle categorical or factor variables. Penalized regression algorithms deal only with numeric attributes. The same is true for support vector machines, kernel methods, and K‐nearest neighbors. There are some methods for converting categorical variables to numeric variables. The nature of the variables will shape your algorithm choices and the direction you take in developing a predictive model, so it’s one of the things you need to pay attention to when you face a new problem.
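One common conversion is one-hot (indicator) encoding, where each category becomes its own 0/1 column. The following is a minimal sketch of that idea rather than the only possible method; the function name `one_hot_encode` and the sample state codes are illustrative assumptions.

```python
def one_hot_encode(values):
    """Turn a list of categorical values into rows of 0/1 indicator columns."""
    # Sort the distinct categories so the column order is reproducible.
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


# Hypothetical sample: a multivalued categorical attribute (state codes).
states = ["MP", "MH", "AP", "MH"]
categories, encoded = one_hot_encode(states)

for state, row in zip(states, encoded):
    print(state, row)
```

Each row contains exactly one 1, so the encoding does not impose any order on the categories, which is the property that distinguishes categorical variables from numeric ones.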
A similar dichotomy arises for the labels. When the labels are numeric, the problem is called a regression problem. When the labels are categorical, the problem is called a classification problem. If the categorical target takes only two values, the problem is called a binary classification problem. If it takes more than two values, the problem is called a multi-class classification problem. In many cases, the choice of problem type is up to the algorithm designer. A regression problem can be converted to a binary classification problem by a simple transformation of the labels, for example by asking whether each numeric label exceeds a chosen threshold. These are trade-offs that you may need to make as part of your attack on a problem.
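As a sketch of such a label transformation, the snippet below turns numeric labels into two classes by comparing them with a threshold. The helper name `binarize_labels`, the sample prices, and the threshold of 200 are all made up for illustration.

```python
def binarize_labels(labels, threshold):
    """Map numeric labels to 1 (above threshold) or 0 (at or below it)."""
    return [1 if label > threshold else 0 for label in labels]


# Hypothetical numeric labels for a regression problem.
prices = [120.0, 310.5, 95.0, 250.0]

# The same problem recast as binary classification: "is the price above 200?"
binary_labels = binarize_labels(prices, threshold=200.0)
print(binary_labels)
```

The choice of threshold is part of the problem design: it decides what question the classifier answers and how balanced the two classes are.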
Things To Check About Your Data Set
You’ll want to ascertain a number of other features of the data set as part of your initial inspection of the data. The following is a checklist and a sequence of things to learn about your data set to familiarize yourself with the data and to formulate the predictive model development steps that you want to follow. These are simple things to check and directly impact your next steps. In addition, the process gets you moving around the data and learning its properties.
Items to Check
Number of rows and columns
Number of categorical variables and number of unique values for each
Summary statistics for attributes and labels
One of the first things to check is the size and shape of the data. Read the data into a list of lists; then the dimension of the outer list is the number of rows, and the dimension of one of the inner lists is the number of columns. The next section shows the concrete application of this to one of the data sets that you’ll see used later to illustrate the properties of an algorithm that will be developed.
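The list-of-lists approach can be sketched as follows. To keep the example self-contained, it reads a small made-up CSV from an in-memory string; with a real data set you would open the file instead. The column names and values are assumptions.

```python
import csv
import io

# Hypothetical data; in practice this would be open("your_file.csv", newline="").
raw = io.StringIO(
    "height,weight,state\n"
    "1.70,65,MP\n"
    "1.82,80,MH\n"
    "1.65,,AP\n"
    "1.75,72,MH\n"
)

reader = csv.reader(raw)
header = next(reader)             # first line holds the column names
rows = [row for row in reader]    # outer list: one inner list per data row

n_rows = len(rows)                # length of the outer list
n_cols = len(rows[0])             # length of one inner list

print("rows:", n_rows, "columns:", n_cols)
```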
The next step in the process is to determine how many missing values there are in each row. The reason for doing it on a row‐by‐row basis is that the simplest way to deal with missing values is to throw away instances that aren’t complete (examples with at least one missing value). In many situations, this can bias the results, but just a few incomplete examples will not make a material difference. By counting the rows with missing data (in addition to the total number of missing entries), you’ll know how much of the data set you have to discard if you use the easy method.
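A minimal sketch of this row-by-row count is shown below. It assumes the data are still strings in a list of lists, and that missing entries appear as an empty string, "NA", or "?"; the actual markers depend on your data set.

```python
# Assumed markers for a missing entry; adjust to match your data.
MISSING_MARKERS = {"", "NA", "?"}


def count_missing(rows):
    """Return per-row missing counts, the number of incomplete rows, and the total."""
    per_row = [sum(1 for cell in row if cell.strip() in MISSING_MARKERS)
               for row in rows]
    incomplete_rows = sum(1 for count in per_row if count > 0)
    total_missing = sum(per_row)
    return per_row, incomplete_rows, total_missing


# Hypothetical rows with a few gaps.
rows = [
    ["1.70", "65", "MP"],
    ["1.82", "", "MH"],
    ["", "NA", "AP"],
    ["1.75", "72", "MH"],
]

per_row, incomplete_rows, total_missing = count_missing(rows)
print("rows to discard:", incomplete_rows, "of", len(rows))
print("missing entries:", total_missing)
```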
If you have a large number of rows, as you might if you’re collecting web data, the number you’ll lose may be small compared to the number of rows of data you have available. If you’re working on biological problems where the data are expensive and you have many attributes, you might not be able to afford to throw data out. In that case, you’ll have to figure out some ways to fill in the missing values or use an algorithm that can deal with them.
Filling them in is called imputation. The easiest way to impute the missing data is to fill in each missing entry with the average of the observed values in the same column (that is, the same attribute), or in the worst case with zero (though this is debatable). A more sophisticated method is to use one of the predictive methods that will be covered in our next post on penalized regression and ensemble techniques. To use a predictive method, you treat a column of attributes with missing values as though it were labels. Be sure to remove the original problem labels before undertaking this process.
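The simplest form of imputation can be sketched as below, here filling each missing entry with the mean of the observed values in the same column (attribute). The sketch assumes numeric data with missing entries represented as `None`, and that every column has at least one observed value; the function name is illustrative.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumes at least one observed value per column.
        means.append(sum(observed) / len(observed))
    return [[means[j] if row[j] is None else row[j] for j in range(n_cols)]
            for row in rows]


# Hypothetical numeric attributes with two missing entries.
data = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, None],
]

filled = impute_column_means(data)
print(filled)
```

Mean imputation keeps every row but shrinks the variance of the imputed column, which is one reason the predictive approach can be worth the extra effort.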
In the next several sections we go through the process outlined here and introduce some methods for characterizing your data set to help you decide how to attack the modeling process.
It starts with simple measurements of size and shape, reporting data types, counting missing values, and so forth. Then it moves on to statistical properties of the data and the interrelationships among attributes and between attributes and the labels.
Physical Characteristics of the Data Set
What difference does this make? The number of rows and columns has several impacts on how you proceed. First, the overall size gives you a rough idea of how long your training times are going to be. For a small data set training time will be less than a minute, which will facilitate iterating through the process of training and tweaking.
If the data set grows to 1,000 x 1,000, the training times will grow to a fraction of a minute for penalized linear regression and a few minutes for an ensemble method. As the data set gets to several tens of thousands of rows and columns, the training times will expand to 3 or 4 hours for penalized linear regression and 12 to 24 hours for an ensemble method. The larger training times will have an impact on your development time because you’ll iterate a number of times.
The second important observation regarding row and column counts is that if the data set has many more columns than rows, you may be more likely to get the best prediction with penalized linear regression; if it has many more rows than columns, an ensemble method may be more likely to do best.
At this point it is important to check how many of the columns of data are numeric and how many are categorical.
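One rough way to make this check on a list of lists of strings is to test whether every non-empty entry in a column parses as a number. This is a heuristic sketch, not a definitive rule: integer codes that really stand for categories would be misreported as numeric, and the helper names are assumptions.

```python
def is_number(text):
    """Return True if the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def column_types(rows):
    """Label each column 'numeric' or 'categorical', ignoring empty cells."""
    types = []
    for j in range(len(rows[0])):
        cells = [row[j] for row in rows if row[j] != ""]
        all_numeric = all(is_number(cell) for cell in cells)
        types.append("numeric" if all_numeric else "categorical")
    return types


# Hypothetical rows: two numeric attributes and one categorical attribute.
rows = [
    ["1.70", "65", "MP"],
    ["1.82", "", "MH"],
    ["1.65", "58", "AP"],
]

types = column_types(rows)

# Number of unique values in each categorical column, keyed by column index.
unique_counts = {j: len({row[j] for row in rows})
                 for j, kind in enumerate(types) if kind == "categorical"}

print(types)
print(unique_counts)
```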
Statistical Summaries of the Data Set
After determining which attributes are categorical and which are numeric, you’ll want some descriptive statistics for the numeric variables and a count of the unique categories in each categorical attribute. The first step is to calculate the mean and standard deviation for the chosen attribute. Knowing these will undergird your intuition as you’re developing models.
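These first summaries can be computed with Python's standard `statistics` and `collections` modules. The sketch below uses made-up values for one numeric attribute and one categorical attribute.

```python
import statistics
from collections import Counter

# Hypothetical numeric attribute.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(values)
population_std = statistics.pstdev(values)   # divides by n
sample_std = statistics.stdev(values)        # divides by n - 1

print("mean:", mean)
print("population std:", population_std)
print("sample std:", sample_std)

# Hypothetical categorical attribute: count how often each category occurs.
states = ["MP", "MH", "AP", "MH", "MP", "MH"]
category_counts = Counter(states)

print("unique categories:", len(category_counts))
print(category_counts.most_common())
```

Whether to use the population or the sample standard deviation matters little for large data sets, but it is worth knowing which one a given tool reports.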
The next step is to look at outliers. One way to reveal them is to divide a set of numbers into percentiles. The easiest way to visualize forming these groupings is to imagine the data sorted into numeric order, which makes it easy to see where the percentile boundaries go. Some often-used percentiles are given special names: the percentiles defined by dividing the set into equal quarters, fifths, and tenths are called, respectively, quartiles, quintiles, and deciles.
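The boundaries can be computed with `statistics.quantiles` (available in Python 3.8 and later). The sample below is invented and deliberately contains one unusually large value. Note that different tools interpolate percentile boundaries slightly differently, so exact values can vary between libraries.

```python
import statistics

# Hypothetical attribute values, already in numeric order, with one large value.
values = [0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 1.1, 1.2, 1.4, 9.8]

quartiles = statistics.quantiles(values, n=4)    # 3 boundaries, 4 groups
quintiles = statistics.quantiles(values, n=5)    # 4 boundaries, 5 groups
deciles = statistics.quantiles(values, n=10)     # 9 boundaries, 10 groups

print("quartile boundaries:", quartiles)
print("decile boundaries:", deciles)

# A top group much wider than the middle of the data hints at outliers.
middle_spread = quartiles[2] - quartiles[0]
top_spread = max(values) - quartiles[2]
print("middle spread:", middle_spread, "top spread:", top_spread)
```

Here the distance from the upper quartile boundary to the maximum is many times the spread of the middle half of the data, which is the kind of mismatch worth a closer look.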
Visualization of Outliers Using Quantile‐Quantile Plot
One way to study outliers in more detail is to plot the distribution of the data in question against some reasonable reference distribution to see whether the relative numbers match up. If the data being analyzed come from a Gaussian distribution, the plotted points will lie close to a straight line.
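The points of a Q-Q plot against a Gaussian can be computed with the standard library alone, as in the sketch below. It fits a normal distribution using the sample mean and standard deviation and pairs each sorted observation with the matching theoretical quantile. The plotting positions (i + 0.5) / n are one common convention among several, and the data are made up. The resulting pairs can be passed to any plotting tool.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Return (theoretical quantile, observed value) pairs against a fitted Gaussian."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n stay strictly between 0 and 1.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Hypothetical attribute values with one unusually large entry.
sample = [0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 1.1, 1.2, 1.4, 9.8]
points = qq_points(sample)

for theoretical, observed in points:
    print(round(theoretical, 2), observed)
```

If the data were Gaussian, the pairs would fall near the line where observed equals theoretical. Points that sit far from that line, especially at the ends, are candidates for outliers.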
What do you do with this information?
Outliers may cause trouble for either model building or prediction. After you’ve trained a model on this data set, you can look at the errors your model makes and see whether the errors are correlated with these outliers. If they are, you can then take steps to correct them. One way is to replicate the poor‐performing examples to force them to be more heavily represented. You can also segregate them and train on them as a separate class, or edit them out of the data if they represent an abnormality that won’t be present in the data your model will see when deployed. A reasonable process might be to generate quartile boundaries during the exploration phase and note potential outliers to get a feel for how much of a problem you might (or might not) have with them. Then, when you’re evaluating performance data, use quantile‐quantile (Q‐Q) plots to determine which points to call outliers for use in your error analysis.
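One way to note potential outliers from quartile boundaries is the interquartile-range rule (Tukey's fences), sketched below. This rule and its multiplier of 1.5 are a common convention rather than something prescribed in this post, and the data are made up.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Flag values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value < low or value > high for value in values]


# Hypothetical attribute values with one unusually large entry.
values = [0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 1.1, 1.2, 1.4, 9.8]
flags = flag_outliers(values)

# Keep the flagged examples aside for error analysis rather than deleting them outright.
outliers = [value for value, flagged in zip(values, flags) if flagged]
kept = [value for value, flagged in zip(values, flags) if not flagged]

print("potential outliers:", outliers)
print("remaining examples:", len(kept))
```

Keeping the flags alongside the data makes it easy to compare model errors on flagged and unflagged examples later, before deciding whether to replicate, segregate, or remove them.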