PCA vs t-SNE (Dimensionality Reduction techniques)

Rahul S
3 min read · May 4

PCA (Principal Component Analysis) and t-SNE (t-Distributed Stochastic Neighbor Embedding) are both dimensionality reduction techniques that can be used in machine learning and data analysis.

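PCA is a linear transformation method that finds new axes, called principal components, each of which is a linear combination of the original features. The components are ordered by how much of the variance in the data they explain, so keeping only the first few reduces the number of dimensions while retaining most of the variation, making the data easier to visualize and analyze. It works well for high-dimensional data and is computationally efficient. PCA does not require the data to be normally distributed, but it only captures linear relationships and is sensitive to the scale of the features, so it may not work well when the structure in the data is non-linear.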
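To make the idea of principal components concrete, here is a minimal sketch that computes them by hand with a singular value decomposition and cross-checks the result against scikit-learn. The variable names (`X_centered`, `explained_ratio`, `X_2d`) are illustrative assumptions, and the iris data is used only because it is small.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features

# PCA is defined on mean-centered data
X_centered = X - X.mean(axis=0)

# SVD of the centered data: each row of Vt is one principal direction,
# i.e. a set of weights over the four original features
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured by each component, and its share of the total
explained_variance = S**2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project the data onto the first two principal directions
X_2d = X_centered @ Vt[:2].T

# Cross-check against scikit-learn's implementation
pca = PCA(n_components=2).fit(X)
print("Manual ratio: ", explained_ratio[:2])
print("sklearn ratio:", pca.explained_variance_ratio_)
```

The two printed ratios should agree. The projected coordinates can differ from scikit-learn's by a sign flip per component, since the direction of each axis is arbitrary.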

t-SNE is a non-linear dimensionality reduction technique that is mainly used for visualizing high-dimensional data in two or three dimensions. It places similar data points close together and pushes dissimilar ones apart, preserving the local structure of the data. Because it can capture non-linear relationships between data points, it is often used to visualize clusters or patterns. However, t-SNE is computationally expensive and becomes slow on large datasets, and because it focuses on local neighborhoods, the distances between well-separated clusters in a t-SNE plot should not be over-interpreted.
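One way to check the claim about local structure is to measure how well each method keeps neighbors together. The sketch below uses scikit-learn's `trustworthiness` score for that. The choice of the digits dataset, the 500-sample subset, and the `perplexity` and `random_state` values are all assumptions made for illustration, not part of the original example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

# Assumption: a 500-sample subset of the digits data (64 features)
# keeps the t-SNE run short
X = load_digits().data[:500]

# Linear projection to 2-D
X_pca = PCA(n_components=2).fit_transform(X)

# Non-linear embedding to 2-D; perplexity and seed are illustrative choices
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Trustworthiness lies in [0, 1]; higher means the nearest neighbours of a
# point in the 2-D map were also its neighbours in the original space
pca_score = trustworthiness(X, X_pca, n_neighbors=5)
tsne_score = trustworthiness(X, X_tsne, n_neighbors=5)

print(f"PCA trustworthiness:   {pca_score:.3f}")
print(f"t-SNE trustworthiness: {tsne_score:.3f}")
```

On data with non-linear structure, t-SNE would typically be expected to score higher here, at the cost of a noticeably longer run time than PCA.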

Let’s see an example of how to use PCA and t-SNE on the famous iris dataset.

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Load the iris dataset
iris = load_iris()

# Convert data to dataframe
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

First, we will visualize the original data with a scatter plot of its first two features.

# Visualize the original data
plt.scatter(iris_df['sepal length (cm)'], iris_df['sepal width (cm)'], c=iris.target)
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.title('Iris Dataset')
[Figure: Original Data]

Now we will apply PCA to reduce the data from 4 dimensions to 2, and then visualize the PCA output with another scatter plot.

# Apply PCA
pca = PCA(n_components=2)
iris_pca = pca.fit_transform(iris.data)

# Visualize PCA output
plt.scatter(iris_pca[:,0], iris_pca[:,1], c=iris.target)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('PCA Output')
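
The example above covers PCA. Applying t-SNE to the same data could look like the following. This is a minimal sketch rather than the author's code: the `perplexity=30` and `random_state=42` values are assumptions chosen for illustration, and the block reloads the data so that it runs on its own.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# Load the iris dataset again so this block is self-contained
iris = load_iris()

# Apply t-SNE to reduce the dimensions from 4 to 2
# (perplexity and random_state are illustrative assumptions)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
iris_tsne = tsne.fit_transform(iris.data)

# Visualize t-SNE output on a fresh figure
plt.figure()
plt.scatter(iris_tsne[:, 0], iris_tsne[:, 1], c=iris.target)
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.title('t-SNE Output')
```

Unlike PCA, t-SNE is stochastic, so the layout changes with the random seed, and the axes have no direct interpretation in terms of the original features. When running this as a script instead of in a notebook, add `plt.show()` to display the figure.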

