Partial Least Squares (PLS): Taming High-Dimensional Data and Capturing Complex Relationships
Partial Least Squares (PLS) regression is a multivariate statistical technique for modeling the relationship between a set of predictor variables (X) and a response variable (Y).
It is particularly useful when dealing with datasets that have high dimensionality, multicollinearity, or noisy variables. PLS aims to find a set of latent variables, called components, that capture the most important information from both X and Y.
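Before turning to the mathematics, here is a minimal sketch of PLS regression in practice, using scikit-learn's PLSRegression. The synthetic data and the choice of n_components=2 are purely illustrative:

```python
# Minimal PLS regression example (illustrative synthetic data;
# n_components=2 is an arbitrary choice for this sketch).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))            # 100 observations, 10 predictors
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

pls = PLSRegression(n_components=2)       # extract 2 latent components
pls.fit(X, y)
y_pred = pls.predict(X)                   # predictions from the latent space
print(pls.score(X, y))                    # R^2 of the fit
```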
Mathematics of PLS Regression:
Let’s consider a scenario with N observations, a mean-centered matrix X of p predictor variables, and a single mean-centered response variable Y. For k = 1, 2, ..., PLS constructs a sequence of components with mutually orthogonal scores, each a linear combination of the original predictor variables:
- Weight Calculation: For component k, PLS finds a weight vector w_k (proportional to X_k^T y_k, normalized to unit length) that maximizes the covariance between the projected scores X_k w_k and y_k.
- Score Calculation: Once w_k is determined, the scores t_k are calculated by projecting X_k onto w_k: t_k = X_k w_k.
- Loading Calculation: The X-loadings p_k and the y-loading c_k are obtained by regressing X_k and y_k on the scores: p_k = X_k^T t_k / (t_k^T t_k) and c_k = y_k^T t_k / (t_k^T t_k).
- Deflation: The variation explained by the component is subtracted, leaving residual matrices X_{k+1} = X_k - t_k p_k^T and y_{k+1} = y_k - c_k t_k, on which the next component is computed.
This cycle of weights, scores, loadings, and deflation is repeated on the residual matrices to extract additional components, up to a chosen number K (typically selected by cross-validation); a minimal sketch of the procedure follows.
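As a concrete illustration of the steps above, here is a from-scratch sketch of the single-response (NIPALS-style PLS1) procedure in NumPy. The function name pls1_nipals and the variable names are my own, chosen to mirror the text; this is one standard formulation, not the only one:

```python
# From-scratch sketch of the PLS1 steps described above (illustrative;
# a production model would use a tested library implementation).
import numpy as np

def pls1_nipals(X, y, n_components):
    X = X - X.mean(axis=0)               # center predictors
    y = y - y.mean()                     # center response
    n, p = X.shape
    W = np.zeros((p, n_components))      # weight vectors w_k
    P = np.zeros((p, n_components))      # X-loadings p_k
    c = np.zeros(n_components)           # y-loadings c_k

    for k in range(n_components):
        w = X.T @ y                      # direction maximizing cov(Xw, y)
        w /= np.linalg.norm(w)
        t = X @ w                        # scores: project X onto w
        tt = t @ t
        p_k = X.T @ t / tt               # X-loadings: regress X on t
        c_k = y @ t / tt                 # y-loading: regress y on t
        X = X - np.outer(t, p_k)         # deflate X (remove explained part)
        y = y - c_k * t                  # deflate y
        W[:, k], P[:, k], c[k] = w, p_k, c_k

    # Regression coefficients mapping centered X to centered y
    B = W @ np.linalg.inv(P.T @ W) @ c
    return B
```

The returned coefficient vector B applies to centered data, so predictions are formed as y_hat = (X - X_mean) @ B + y_mean.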
Advantages of PLS Regression:
- Dealing with Collinearity: PLS can effectively handle multicollinearity among predictor variables, making it suitable for high-dimensional datasets.
- Noise Reduction: PLS focuses on the variation in X that covaries with Y, which can reduce the influence of noisy, irrelevant predictors.
- Dimensionality Reduction: PLS reduces the dimensionality of the data by creating a smaller set of components, which can improve model interpretability (see the sketch after this list).
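To make the collinearity and dimensionality-reduction points concrete, here is an illustrative sketch (synthetic data with arbitrary noise levels) in which 20 nearly identical, strongly collinear predictors are compressed into a single latent component:

```python
# Illustrative sketch: PLS on strongly collinear predictors, using
# transform() to inspect the reduced latent representation.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 1))
# 20 nearly identical columns -> severe multicollinearity
X = np.hstack([base + 0.01 * rng.normal(size=(200, 1)) for _ in range(20)])
y = base.ravel() + 0.1 * rng.normal(size=200)

pls = PLSRegression(n_components=1).fit(X, y)
T = pls.transform(X)                  # scores: 200 x 1 instead of 200 x 20
print(T.shape, pls.score(X, y))       # reduced dimension, stable fit
```

Where ordinary least squares would produce unstable coefficient estimates on such data, the single PLS component absorbs the shared variation and yields a stable, interpretable fit.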