One of the most well-known and essential sub-fields of data science is machine learning. The term machine learning was first used in 1959 by IBM researcher Arthur Samuel. From there, the field of machine learning gained much interest from others, especially for its use in classifications.
When you start your journey into learning and mastering the different aspects of data science, perhaps the first sub-field you come across is machine learning. Machine learning is the name used to describe a collection of computer algorithms that can learn and improve by gathering information while they are running.
Any machine learning algorithm is built upon some data. Initially, the algorithm uses some “training data” to build an intuition of solving a specific problem. Once the algorithm passes the learning phase, it can then use the knowledge it gained to solve similar problems based on different datasets.
In general, we categorize machine learning algorithms into 4 categories:
- Supervised algorithms: Algorithms that involve some supervision from the developer during the operation. To do that, the developer labels the training data and set strict rules and boundaries for the algorithm to follow.
- Unsupervised algorithms: Algorithms that do not involve direct control from the developer. In this case, the algorithms’ desired results are unknown and need to be defined by the algorithm.
- Semi-supervised algorithms: Algorithms that combines aspects of both supervised and unsupervised algorithms. For example, not all training data will be labeled, and not all rules will be provided when initializing the algorithm.
- Reinforcement algorithms: In these types of algorithms, a technique called exploration/exploitation is used. The gest of it is simple; the machine makes an action, observe the outcomes, and then consider those outcomes when executing the next action, and so on.
Each of these categories is designed for a purpose; for example, supervised learning is designed to scale the training data’s scope and make predictions of future or new data based on that. On the other hand, unsupervised algorithms are used to organize and filter data to make sense of it.
Under each of those categories lay various specific algorithms that are designed to perform certain tasks. This article will cover 5 basic algorithms every data scientist must know to cover machine learning basics.
Regression algorithms are supervised algorithms used to find possible relationships among different variables to understand how much the independent variables affect the dependent one.
You can think of regression analysis as an equation, For example, if I have the equation
y = 2x + z, y is my dependant variable, and x,z are the independent ones. Regression analysis finds how much do x and z affect the value of y.
The same logic applies to more advanced and complex problems. To adapt to the various problems, there are many types of regression algorithms; perhaps the top 5 are:
- Linear Regression: The simplest regression technique uses a linear approach for featuring the relationship between the dependent (predicted) and independent variables (predictors).
- Logistic Regression: This type of regression is used on binary dependent variables. This type of regressing is widely used to analyze categorical data.
- Ridge Regression: When the regression model becomes too complex, ridge regression corrects the model’s coefficients’ size.
- Lasso Regression: Lasso (Least Absolute Shrinkage Selector Operator) Regression is used to select and regularize variables.
- Polynomial Regression: This type of algorithm is used to fit non-linear data. Using it, the best prediction is not a straight line; it is a curve that tries to fit all data points.
Classification in machine learning is the process of grouping items into categories based on a pre-categorized training dataset. Classification is considered a supervised learning algorithm.
These algorithms use the training data’s categorization to calculate the likelihood that a new item will fall into one of the defined categories. A well-known example of classification algorithms is filtering incoming emails into spam or not-spam.
There are different types of classification algorithms; the top 4 ones are:
- K-nearest neighbor: KNN is an algorithm that uses training datasets to find the k closest data points in some datasets.
- Decision trees: You can think of it as a flow chart, classifying each data points into two categories at a time and then each to two more and so on.
- Naive Bayes: This algorithm calculates the probability that an item falls under a specific category using the conditional probability rule.
- Support Vector Machine (SVM): In this algorithm, the data is classified based on its degree of polarity, which can go beyond the X/Y prediction.
Ensembling algorithms are supervised algorithms made of combining the prediction of two or more other machine learning algorithms to produce more accurate results. Combining the results can either be done by voting or averaging the results. Voting is often used during classification and averaging during regression.
Ensembling algorithms have 3 basic types: Bagging, Boosting, and Stacking.
- Bagging: In bagging, the algorithms are run in parallel on different training sets, all equal in size. All algorithms are then tested using the same dataset, and voting is used to determine the overall results.
- Boosting: In the case of boosting, the algorithms are run sequentially. Then the overall results are chosen using weighted voting.
- Stacking: From the name, stacking has two-level stacked on top of each other, the base level is a combination of algorithms, and the top level is a meta-algorithm based on the base level results.
Clustering algorithms are a group of unsupervised algorithms used to group data points. Points within the same cluster are more similar to each other than to points in different clusters.
There are 4 types of clustering algorithms:
- Centroid-based Clustering: This clustering algorithm organizes the data into clusters based on initial conditions and outliers. k-means is the most knowledgeable and used centroid-based clustering algorithm.
- Density-based Clustering: In this clustering type, the algorithm connects high-density areas into clusters creating arbitrary-shaped distributions.
- Distribution-based Clustering: This clustering algorithm assumes the data is composed of probability distributions and then clusters the data into various versions of that distribution.
- Hierarchical Clustering: This algorithm creates a tree of hierarchical data clusters, and the number of clusters can be varied by cutting the tree at the correct level.
Association algorithms are unsupervised algorithms used to discover the probability of some items to occur together in a specific dataset. It is mostly used in the market-basket analysis.
The most used association algorithm is Apriori.
The Apriori algorithm is a mining algorithm used commonly used in transactional databases. Apriori is used to mine frequent itemsets and generate some association rules from those item sets.
For example, if a person buys milk and bread, then they are likely to also get some eggs. These insights are built upon previous purchases from various clients. Association rules are then formed according to a specific threshold for confidence set by the algorithm based on how frequently these items are brought together.
Machine learning is one of the most famous, well-researched sub-field of data science. New machine learning algorithms are always under development to reach better accuracy and faster execution. Regardless of the algorithm, it can generally be categorized as one of four categories: supervised, unsupervised, semi-supervised, and reinforced algorithms. Each one of these categories holds many algorithms that are used for different purposes.
In this article, I have gone through 5 types of supervised/ unsupervised algorithms that every machine learning beginner should be familiar with. These algorithms are well-studied and widely-used that you only need to understand how to use it rather than how to implement it.
Most famous Python machine learning modules — such as Scikit Learn — contain a pre-defined version of most — if not all — of these algorithms. So, my advice is, understand the mechanic, and master the usage and start building.
Article Written By: Sara A. Metwalli