
How AI Learns (Part 1): Machine Learning

Learn how AI learns through machine learning. Explore numerical vs categorical data, supervised vs unsupervised learning, and core ML algorithms.

TLDR - How AI Learns: Machine Learning Basics, Data Types & Algorithms

AI learns via machine learning (ML): training models on data to make predictions. Data types in ML: numerical (discrete, continuous) vs categorical (nominal, ordinal). Learning modes: supervised, unsupervised, semi-supervised, self-supervised, reinforcement.

Key algorithms: gradient descent, regression, clustering, association rule mining, dimensionality reduction. Applications: prediction, classification, anomaly detection, recommendation systems, robotics.

Recall that AI is a CS discipline that focuses on creating machines that mimic intelligent human behavior. Humans learn by trial and error, pattern recognition, and extrapolating past experiences. But how does AI learn? Machine learning (ML) is a branch of AI that explores this. In this article, we will dive into some core concepts and algorithms of ML that empower computers to learn from data and make predictions without explicit programming.

Data is highly important in machine learning, as it is used to train the models. There are two main types of data based on representation: numerical and categorical.

A diagram of numerical vs. categorical data types. Source: https://www.legac.com.au/blogs/further-mathematics-exam-revision/further-mathematics-unit-3-data-analysis-types-of-data (accessed 07/19/2025)

However, the raw dataset values are rarely used for the actual training. Instead, conversion is performed to turn numerical or categorical data into useful floating-point values. These floating-point values make up the feature vectors (containing the characteristics of data points) that are fed into a machine learning model for training.

Numerical data consists of integers or floating-point values that behave like numbers, meaning they are measurable, countable, or additive. Time series data are often considered numerical data when each data point in the series is a number, such as sensor readings.

Google's Machine Learning Crash Course explains that although US postal codes are composed of five-digit numbers, they neither behave like numbers nor represent mathematical relationships. Instead, they represent specific geographic areas and thus are considered categorical data.

Discrete: Represents numerical values that are countable and distinct, usually whole numbers with no values in between. Examples: Number of cars in a parking lot, number of defects in a product, or number of button clicks.

Continuous: Represents numerical values that are measurable, with an infinite number of possible values within a range. It can include values with decimal points. Examples: Temperature readings, stock prices, or race completion times.

Common feature engineering conversion techniques:

Normalization: Converts numerical values into a standard range, so that features are on a similar scale. Scaling allows the model to learn appropriate weights for each feature, rather than paying too much attention to features with wide spans and not enough attention to those with narrow spans.

Binning (or Bucketing): Converts numerical values into groups or bins of subranges. Binning is appropriate when the feature values are more clustered than linear.

Categorical data consists of labels or values that classify objects or individuals. There is a specific set of possible values.
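
As a rough illustration of the two numerical conversion techniques above (normalization and binning), here is a minimal Python sketch. The function names, the sample values, and the bucket boundaries are all hypothetical, and min-max scaling is only one of several normalization schemes.

```python
import bisect


def min_max_normalize(values):
    """Linearly rescale values into the range [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def bin_index(value, boundaries):
    """Return the bucket index for value.

    boundaries must be sorted ascending. Bucket 0 holds values below
    boundaries[0]; the last bucket holds values at or above boundaries[-1].
    """
    return bisect.bisect_right(boundaries, value)


# Hypothetical lot sizes in square feet, rescaled to [0, 1].
lot_sizes = [1200.0, 2500.0, 4000.0, 8800.0]
scaled = min_max_normalize(lot_sizes)

# Hypothetical temperatures in degrees Celsius, grouped into four bins:
# below 0, 0 to 15, 15 to 30, and 30 or above.
temperatures = [-3.5, 12.0, 21.4, 35.9]
buckets = [bin_index(t, [0.0, 15.0, 30.0]) for t in temperatures]

print(scaled)
print(buckets)
```

After scaling, the smallest lot maps to 0.0 and the largest to 1.0, so a feature measured in thousands no longer dwarfs one measured in single digits.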

There are two types of categorical values:

Nominal: Categories without any inherent order. Examples: Gender (Male/Female), car brands (Toyota, Tesla, BMW), or countries (United States, Japan, Spain).

Ordinal: Categories with a meaningful order, but without a consistent numerical difference. Examples: Customer satisfaction levels (Poor, Average, Good, Excellent) or education levels (High School, Bachelor’s, Master’s, Ph.D.).

Low number of possible categories:

Vocabulary Encoding: Assigns a unique integer index for each unique categorical value. For example, with a categorical feature named car_color, the assignment could be (Red → 0, Blue → 1, Green → 2). However, the encoded integers aren't directly used for training, as the model would incorrectly infer an ordinal relationship from them. Vocabulary encoding is often a preprocessing step for further encoding.

One-Hot Encoding: Converts each categorical feature into a vector with length equal to the number of possible categorical values. Building upon vocabulary encoding, a "1" at the unique integer index position corresponds to the assigned categorical value (Red → [1.0, 0.0, 0.0], Blue → [0.0, 1.0, 0.0], Green → [0.0, 0.0, 1.0]). Possible categorical values are treated as distinct and unrelated, avoiding misinterpretation of ordinal relationships.
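
The two encodings above can be sketched in a few lines of Python. This is a simplified illustration using the car_color example; the helper names `build_vocabulary` and `one_hot` are hypothetical, and indices are assigned in first-seen order, which is just one possible convention.

```python
def build_vocabulary(values):
    """Map each unique category to an integer index, in first-seen order."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


def one_hot(value, vocab):
    """Return a vector with 1.0 at the category's index and 0.0 elsewhere."""
    vector = [0.0] * len(vocab)
    vector[vocab[value]] = 1.0
    return vector


car_colors = ["Red", "Blue", "Green", "Blue", "Red"]
vocab = build_vocabulary(car_colors)
encoded = [one_hot(color, vocab) for color in car_colors]

print(vocab)
print(encoded[1])
```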

High number of possible categories:

Embedding: Transforms categorical data into numerical vectors that capture the relationships among various categories or objects. Unlike one-hot encoding, which is a fixed representation, embeddings are learned by projecting the initial data vectors from a high-dimensional space to a lower-dimensional space.

The main differentiator in learning modes is the use of labeled data (consisting of input-output pairs) for training. Labeled data allow for high-accuracy training, but are intensive in time and labor because the labeling tasks are manually performed by humans.

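Gradient ascent is the counterpart of gradient descent: instead of minimizing a cost function, it maximizes an objective such as a likelihood function or a reward. The algorithm iteratively adjusts the model's parameters in the direction that increases the objective, moving toward the parameters that yield the highest probability or reward.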
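
To make the iterative update concrete, here is a minimal sketch of gradient descent and gradient ascent on one-dimensional functions. The function name `optimize`, the learning rate, the step count, and the two toy objectives are assumptions chosen for illustration; real models update many parameters at once.

```python
def optimize(grad, x0, learning_rate=0.1, steps=200, maximize=False):
    """Repeatedly step along the gradient (ascent) or against it (descent)."""
    direction = 1.0 if maximize else -1.0
    x = x0
    for _ in range(steps):
        x += direction * learning_rate * grad(x)
    return x


# Descent: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = optimize(lambda x: 2.0 * (x - 3.0), x0=0.0)

# Ascent: maximize g(x) = -(x + 1)^2, whose gradient is -2 * (x + 1).
x_max = optimize(lambda x: -2.0 * (x + 1.0), x0=5.0, maximize=True)

print(x_min, x_max)
```

The only difference between the two modes is the sign of the step, which is why the same loop serves both.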

Supervised learning is a core machine learning approach where a model is trained using labeled data. A common convention is to use 80 percent of the labeled data for training and the remaining 20 percent for testing. Supervised learning solves two main problem types: regression (numerical continuous output) and classification (categorical output).
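
The 80/20 split described above might look like the following sketch. The helper `train_test_split` here is a hand-written, hypothetical function (not a library call), and the toy dataset of input-output pairs is made up.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and split them into train and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled data: (input, output) pairs.
labeled = [(x, 2 * x) for x in range(10)]
train, test = train_test_split(labeled)

print(len(train), len(test))
```

Shuffling before splitting helps keep the test set representative when the original data is sorted or grouped.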

A diagram of supervised learning. Source: https://www.enjoyalgorithms.com/blogs/supervised-unsupervised-and-semisupervised-learning (accessed 07/19/2025)

Linear regression is a simple statistical method that is used to predict a numerical continuous output. It seeks to find the best-fit line, which is a straight line that minimizes the difference (or error) between the provided output values and the predicted values. It assumes a linear relationship between the input and output. The Mean Squared Error (MSE) is commonly used as the cost function.

This method can make future predictions based on historical outcomes, applicable to industries such as sales, finance, and healthcare. For example, provided a dataset of house features (lot size, number of bedrooms, number of bathrooms, etc.) and price, linear regression can be used to learn the relationship between the house features and selling price. Once the relationship is established, it can be used to predict the price of other houses.
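
As a small worked example, the sketch below fits a best-fit line to a single feature using the closed-form least-squares solution, which is the line that minimizes the MSE. The data is hypothetical (house size in thousands of square feet against price in hundreds of thousands of dollars), and a real house-price model would use several features.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares best-fit line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared difference between actual and predicted outputs."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: size (thousand sq ft) and price (hundred thousand dollars).
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 10.0]

slope, intercept = fit_line(sizes, prices)
mse = mean_squared_error(sizes, prices, slope, intercept)

# Predict the price of an unseen house of size 5.0.
predicted_price = slope * 5.0 + intercept
print(slope, intercept, mse, predicted_price)
```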

Binomial Logistic Regression: Used for binary classification problems with only two possible categorical values (e.g., "yes" or "no").

Multinomial Logistic Regression: Used for three or more possible categorical values that are unordered (e.g., classifying eye color: "brown," "blue," or "green").

Ordinal Logistic Regression: Used for three or more possible categorical values that are ordered or ranked (e.g., classifying ratings: "low," "medium," or "high").

Logistic regression applies to many use cases across multiple industries. Examples include breast cancer diagnosis (binary classification of "benign" or "malignant"), handwritten digit recognition (multinomial classification of "0-9"), and customer service review (ordinal classification of "poor," "fair," "good," or "excellent").
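
The binomial case can be sketched with a single feature and plain gradient descent on the log-loss. Everything specific here is an assumption for illustration: the tumor sizes and labels are invented, the feature is centered on its mean before training, and the learning rate and epoch count are arbitrary.

```python
import math


def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight w and bias b by gradient descent on the average log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical tumor sizes in cm; label 1 = "malignant", 0 = "benign".
sizes = [1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 5.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Center the feature so the decision boundary sits near zero.
mean_size = sum(sizes) / len(sizes)
w, b = train_logistic([s - mean_size for s in sizes], labels)


def probability_malignant(size):
    return sigmoid(w * (size - mean_size) + b)


def predict(size):
    return 1 if probability_malignant(size) >= 0.5 else 0


print([predict(s) for s in sizes])
```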

Unsupervised learning uses only unlabeled data to find hidden patterns and relationships.

Clustering groups unlabeled input data based on similarities or differences. There are three types of traditional "hard" clustering methods:

K-Medoids: Similar to the K-Means algorithm, but uses actual data points as cluster centers. A medoid of a cluster is the data point whose total dissimilarity to all other points in the cluster is minimal. This algorithm is more robust to outliers than K-Means.
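
To show what makes a medoid different from a mean, the sketch below computes both for one small cluster that contains an outlier. It covers only the medoid definition, not the full K-Medoids algorithm; the points are made up, and Euclidean distance is assumed as the dissimilarity measure.

```python
import math


def medoid(points):
    """Return the point with the smallest total distance to all others."""
    def total_distance(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_distance)


def mean_point(points):
    """Return the coordinate-wise average (the K-Means style center)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


# Hypothetical cluster: four nearby points plus one far outlier.
cluster = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (1.8, 1.2), (25.0, 30.0)]

center_medoid = medoid(cluster)
center_mean = mean_point(cluster)
print(center_medoid, center_mean)
```

The medoid stays inside the dense group because it must be an actual data point, while the mean is dragged toward the outlier.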

Divisive (top-down): Repeatedly splits clusters into smaller ones until all clusters are singletons or a predetermined number of clusters is reached.

Some common association rule mining algorithms:

Apriori Algorithm: A bottom-up approach that starts with itemsets of size one. It iteratively expands frequent itemsets one item at a time while removing infrequent itemsets based on the minimum support threshold. It's a simple algorithm, but it can be computationally intensive for large datasets. The following are three key metrics:

Support: The frequency with which an item (or itemset) appears in the dataset.

Confidence: The likelihood that an item Y appears in transactions containing item X.

Lift: Measures how much more likely two items are to occur together compared to occurring independently. A lift greater than one indicates a positive association, and the larger the lift, the stronger the association.

Frequent Pattern Growth Algorithm (FP-Growth): It first compresses the dataset into a special structure known as the Frequent Pattern Tree (FP-Tree), which stores information about the itemsets and their frequencies without candidate generation. The FP-Tree is then mined for frequent patterns based on the minimum support threshold. Lastly, the rules and frequent itemsets are generated. This algorithm is more efficient and scalable for large datasets.

ECLAT Algorithm (Equivalence Class Clustering and bottom-up Lattice Traversal): Unlike the Apriori algorithm, ECLAT uses depth-first search and stores data in a vertical layout. Each item is linked to a list of transaction IDs to count the support metric for itemsets. This approach makes it faster and more efficient for datasets with many frequent itemsets.
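
The support, confidence, and lift metrics described above can be computed directly from a vertical layout like the one ECLAT uses, where each item maps to the set of transaction IDs that contain it. The sketch below is not a full mining algorithm; the shopping baskets and function names are hypothetical.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# Vertical layout: item -> set of IDs of the transactions containing it.
tidsets = {}
for tid, items in enumerate(transactions):
    for item in items:
        tidsets.setdefault(item, set()).add(tid)


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    shared_tids = set.intersection(*(tidsets[item] for item in itemset))
    return len(shared_tids) / len(transactions)


def confidence(x, y):
    """How often itemset y appears in transactions that contain itemset x."""
    return support(x | y) / support(x)


def lift(x, y):
    """Confidence of x -> y relative to how common y is on its own."""
    return confidence(x, y) / support(y)


rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
rule_lift = lift({"diapers"}, {"beer"})
print(rule_support, rule_confidence, rule_lift)
```

Counting support by intersecting transaction ID sets avoids rescanning every transaction for each candidate itemset.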

Dimensionality reduction is a process of lowering the number of dimensions (or features) in a dataset while preserving meaningful information. This can simplify complex datasets while also minimizing redundant features and noise. It's useful for preprocessing data fed into machine learning models and for data visualization purposes.

Some common dimensionality reduction methods:

Principal Component Analysis (PCA): PCA works by feature extraction, combining and transforming the dataset's original features to create new principal components. There's an ordering for the principal components: the first captures the largest variance in the dataset, the second captures the next largest (orthogonal to the first), and so forth. It is simple and fast, but only effective for linear relationships.
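
For intuition, here is a dependency-free sketch that finds only the first principal component of a tiny dataset. It centers the data, builds the covariance matrix, and uses power iteration to approximate the direction of largest variance. Power iteration is a simplification chosen for this example (production libraries typically use a full eigendecomposition or SVD), and the data points are invented.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (direction, projections) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix of the centered features.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Project each centered row onto the component: 2 features become 1.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections


# Hypothetical 2-D data lying close to the line y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, reduced = first_principal_component(data)
print(direction, reduced)
```

Each two-feature row is reduced to a single number, its position along the direction that captures the most variance.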

Semi-supervised learning is a hybrid approach that uses both labeled and unlabeled data for training. It's suitable when unlabeled data is abundant but labeling all of it manually would be costly or difficult. It can be used for classification tasks. Some applications include speech analysis (intensive to label audio files) and internet content classification (a massive amount of webpages).

Semi-supervised learning relies on the following assumptions:

Continuity Assumption: Data points that are close together are likely to have the same label.

Cluster Assumption: The data points can be organized into discrete clusters, and data points in the same cluster are likely to have the same label.

Manifold Assumption: The high-dimensional input data can be represented in a low-dimensional space (called the data manifold). So the labeling of a data point is based on the learned data manifold.

Co-Training: This method trains multiple base models to assign pseudo-labels. To add diversification, use a different supervised classification algorithm for each model, or allow each model to focus on a different subset of the dataset.

Transductive: Aims to produce label predictions for the unlabeled data only. It doesn't develop a general rule for unseen data points.

Label Propagation: This is a graph-based algorithm that assigns labels for the unlabeled data points based on their relative similarity or connectivity to the labeled data points (continuity and cluster assumptions).
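
A stripped-down version of label propagation can be sketched on a small graph. In this illustrative variant, every unlabeled node repeatedly averages the class scores of its neighbors while the labeled nodes stay fixed. The graph, the labels "cat" and "dog", and the unweighted averaging are assumptions made for the example, and every node is assumed to have at least one neighbor.

```python
def propagate_labels(edges, seed_labels, num_nodes, iterations=200):
    """Spread labels from labeled nodes to unlabeled ones over a graph."""
    classes = sorted(set(seed_labels.values()))
    neighbors = {node: set() for node in range(num_nodes)}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Labeled nodes start certain; unlabeled nodes start undecided.
    scores = {}
    for node in range(num_nodes):
        if node in seed_labels:
            scores[node] = {c: 1.0 if seed_labels[node] == c else 0.0
                            for c in classes}
        else:
            scores[node] = {c: 1.0 / len(classes) for c in classes}

    for _ in range(iterations):
        updated = {}
        for node in range(num_nodes):
            if node in seed_labels:
                updated[node] = scores[node]  # clamp known labels
            else:
                updated[node] = {
                    c: sum(scores[m][c] for m in neighbors[node]) / len(neighbors[node])
                    for c in classes
                }
        scores = updated

    return {node: max(classes, key=lambda c: scores[node][c])
            for node in range(num_nodes)}


# Two tightly connected groups (0-1-2 and 3-4-5) joined by one bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
seed_labels = {0: "cat", 5: "dog"}  # only two nodes are labeled

predicted = propagate_labels(edges, seed_labels, num_nodes=6)
print(predicted)
```

Because it only labels the nodes already in the graph, this sketch is transductive: it produces no rule for data points it has never seen.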

This approach is applicable to fields where large amounts of labeled data are difficult to obtain due to high cost and time, such as computer vision and natural language processing (NLP). However, self-supervised learning requires substantial compute power to train models on large datasets.

Self-Predictive Learning: Trains models to "fill-in-the-blanks" by predicting a part of the input with known information about the other parts. For instance, a computer vision model is provided with the top half of an image and asked to generate the bottom half. In NLP, a model might need to predict a masked word in an input sentence.
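
The masked-word idea can be illustrated by showing how training pairs are generated from unlabeled text without any human labeling. This sketch covers only the data preparation step, not the model; the `[MASK]` token, the whitespace tokenization, and the function name are assumptions.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (masked input, target word) pairs."""
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples


pairs = masked_examples("the cat sat on the mat")
for masked_sentence, target in pairs:
    print(masked_sentence, "->", target)
```

The label for each example comes from the sentence itself, which is what lets this kind of training scale to large unlabeled corpora.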

The Markov decision process (MDP) lays out the relationship between the agent and its environment. The agent interacts with its environment by understanding the current state and taking possible action(s). It then receives a reward (or penalty) signal and the updated state. Through trial and error, the agent learns which action(s) to take for a specified goal.

Also, the agent must balance the exploration-exploitation trade-off, deciding whether to explore the environment further or to exploit actions already known to yield rewards.

A diagram of the Markov decision process in reinforcement learning. Source: https://en.wikipedia.org/wiki/Reinforcement_learning (accessed 07/19/2025)
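
One common way to handle the exploration-exploitation trade-off is an epsilon-greedy rule: with a small probability the agent tries a random action, and otherwise it picks the action with the best estimated reward so far. The sketch below applies it to a simplified single-state problem (a multi-armed bandit). The payout means, noise level, epsilon, and seed are all made-up values.

```python
import random


def epsilon_greedy_bandit(arm_means, epsilon=0.1, steps=2000, seed=0):
    """Estimate each arm's average reward while mostly picking the best one."""
    rng = random.Random(seed)
    counts = [0] * len(arm_means)
    estimates = [0.0] * len(arm_means)

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(len(arm_means))
        else:
            # Exploit: pick the arm with the highest estimated reward.
            arm = max(range(len(arm_means)), key=lambda a: estimates[a])

        reward = rng.gauss(arm_means[arm], 0.1)  # noisy reward signal
        counts[arm] += 1
        # Incremental running average of the rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return counts, estimates


# Hypothetical average payouts of three actions; the agent does not know them.
counts, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(counts, estimates)
```

With epsilon set to zero the agent could lock onto the first action that ever paid out, which is why a small amount of exploration is kept throughout.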

With the model-based reinforcement learning approach, the agent first constructs an internal representation (or model) of its environment. This is suitable for well-defined and stable environments. For example, consider a vacuum robot learning to navigate a new house. At first, the robot roams freely to explore and build an internal map of the house. Afterwards, the robot can use that map to plan optimal paths.

Dyna: The agent learns from real-world data and simulated experiences via a learned model. This hybrid approach allows for sample efficiency by augmenting limited real-world data with ample simulated data. It's applicable for robotics, autonomous driving, financial trading, and any task that can be optimized by both real and simulated data.

PILCO (Probabilistic Inference for Learning Control): Utilizes probabilistic models (typically Gaussian processes) to model dynamics, accounting for uncertainty in the planning and optimization of policies. This method is suitable for continuous control tasks and when real-world trials are costly or risky. It's also extremely sample-efficient.

Dreamer: This algorithm builds a latent dynamics model from pixels, meaning a compressed predictive model from raw visual data. Policy learning and optimization occur within this simplified space. Dreamer is scalable for high-dimensional input and appropriate for vision-based control tasks.

Q-Learning: A widely used value-based algorithm that maintains a Q-table where each entry is an estimate of the expected long-term reward for the specific state-action pair. The table is updated using the Temporal Difference (TD) rule. It's simple and effective for discrete action spaces.
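
A minimal tabular Q-learning sketch is shown below on a toy environment: a five-cell corridor where the agent starts in cell 0 and earns a reward of 1 for reaching cell 4. The environment, the hyperparameters (alpha, gamma, epsilon), the episode count, and the step cap are all assumptions chosen to keep the example small.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)  # index 0 moves left, index 1 moves right


def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a Q-table with epsilon-greedy actions and the TD update rule."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]

    for _episode in range(episodes):
        state = 0
        for _step in range(100):  # cap the episode length
            if rng.random() < epsilon:
                a = rng.randrange(2)  # explore
            else:
                best = max(q[state])  # exploit, breaking ties at random
                a = rng.choice([i for i in range(2) if q[state][i] == best])

            next_state, reward, done = step(state, ACTIONS[a])

            # TD update: move Q(s, a) toward reward + discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])

            state = next_state
            if done:
                break
    return q


q_table = q_learning()
greedy_policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
print(greedy_policy)
```

Reading off the highest-valued action in each row of the Q-table gives the learned policy, which in this corridor is to keep moving toward the goal.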

Diana Cheung (ex-LinkedIn software engineer, USC MBA, and Codesmith alum) is a technical writer on technology and business. She is an avid learner and has a soft spot for tea and meows.