Define The Problem by Understanding The Data Set – Part II
In the previous part we talked about datasets and how to define a problem using your dataset. Along the way we covered a couple of techniques, at a high level, for identifying what your dataset contains, and how machine learning can be used to investigate and surface important information on a real-time basis.
Since machine learning is a complex and technically heavy subject, we shall discuss some aspects of machine learning before we delve into dataset issues and how they can affect your algorithms, coding patterns and outputs. There are many ways to design and develop code around machine learning problems; to distinguish the types of learning approach, we put machine learning into several baskets, namely:
- Supervised Learning
- Semi-Supervised Learning
- Unsupervised Learning
If your dataset is heavily focused on category labels, you face not just a classification problem but also the dilemma of dimension reduction. Here you need to decide up front what you should expect as an output. Dimension reduction helps in surfacing insights, but whether the reduced dimensions make sense is entirely the data scientist's decision. Technical considerations, such as linearity or non-linearity in the dataset and the ability to create a canonical variable from two or more existing variables, determine how you should target the dataset and devise your machine learning algorithms.
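To make dimension reduction concrete, here is a minimal sketch using PCA from scikit-learn on synthetic data. The data, variable names and component count are illustrative assumptions, not taken from the discussion above; the point is simply that a 10-feature dataset with an underlying 3-dimensional structure can be compressed while retaining most of its variance.

```python
# Minimal dimension-reduction sketch with PCA (illustrative, synthetic data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# 200 samples, 10 features that are really driven by 3 latent factors
base = rng.normal(size=(200, 3))
X = base @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (200, 3)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 for this synthetic data
```

Whether those three components correspond to anything meaningful in the business domain is, as noted above, a judgment the data scientist has to make by inspecting them.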
In Supervised Learning, the task is to infer a function from labeled training data. The training data consists of a set of training examples that act as a benchmark, a first step in devising your model. Each training example is a pair consisting of an input object (typically a vector) with its category label and a desired output value, the supervisory signal. This acts as a sort of mapping function, against which your algorithm can determine how far off its output is from the supervisory signals.
We shall discuss Supervised Learning in more depth, touching on topics such as blending, GBRT, RBFN and other neural network approaches, and regression techniques such as LARS and LASSO, in coming posts.
In Unsupervised Learning, the goal is to model the underlying structure or distribution in the data in order to learn more about it. It is called unsupervised because, unlike the supervised learning described above, there are no correct answers and no mapping function or benchmark available to determine the desired output. Algorithms are left to their own devices to discover and present the interesting structure in the data. Unsupervised learning problems can be further grouped into clustering and association problems.
- Clustering: When you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
- Association: When you want to discover rules that describe large portions of your data, such as "people who buy electronic items also tend to buy accessories."
Here we also encounter questions such as classification vs. predictive models; which to use is determined by the dataset and the client's business profile. Popular examples of unsupervised learning algorithms include k-means for clustering problems and the Apriori algorithm for association rule learning. A detailed discussion will be published in later posts.
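As a quick sketch of the clustering case mentioned above, here is k-means finding two groups in synthetic two-dimensional data, with no labels supplied at any point. The two "customer groups" and their locations are made-up illustrations.

```python
# Unsupervised learning sketch: k-means discovers groupings on its own,
# without any labels (synthetic, illustrative data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two well-separated blobs standing in for two customer segments
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # centers land near (0, 0) and (5, 5)
```

Note that we had to tell k-means how many clusters to look for; deciding whether the discovered groupings mean anything is still the data scientist's job.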
In Semi-Supervised Learning, the approach is to find hidden structure in data that is mostly unlabeled, aided by some predetermined benchmarks. Since many of the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. Skewness can be traced using the pre-supplied benchmarks, but the quality and usability of the training example sets is determined by the data scientist, so it remains a subjective decision. As we can see, these problems sit between supervised and unsupervised learning.
A good example is a photo archive where only some of the images are labeled (e.g. dog, cat, person) and the majority are unlabeled. Many real-world machine learning problems fall into this area, because labeling data can be expensive or time-consuming and may require access to domain experts, whereas unlabeled data is cheap and easy to collect and store. You can use unsupervised learning techniques to discover and learn the structure in the input variables. You can also use supervised learning techniques to make best-guess predictions for the unlabeled data, feed those predictions back into the supervised learning algorithm as training data, and use the resulting model to make predictions on new, unseen data.
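The "predict, then feed the predictions back" loop described above is essentially self-training, and scikit-learn ships a wrapper for it. The sketch below hides roughly 70% of the Iris labels (scikit-learn marks unlabeled points with -1) and lets the wrapper propagate its own confident predictions; the dataset, base model and hidden-label fraction are illustrative assumptions.

```python
# Semi-supervised sketch: most labels hidden, self-training fills them in.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(1)

y_partial = y.copy()
mask = rng.random(len(y)) < 0.7      # hide ~70% of the labels
y_partial[mask] = -1                 # -1 means "unlabeled" in scikit-learn

clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y_partial)
print(clf.score(X, y))               # accuracy against the full ground truth
```

In a real project you would of course not have the hidden labels to score against; this is just to show that a small labeled subset plus a pool of unlabeled examples can still train a usable model.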