Linear Discriminant Analysis LDA (LDA) is most commonly used as a dimension reudciton technique. For this project, we are going to reduce a dataset to lower dimenison to obtain better class seperability. At the same time we can avoid overfiitng and reduce computational costs.
Ronald A. Fisher formulated the Linear Discriminant in 1936 (The Use of Multiple Measurements in Taxonomic Problems), and it also has some practical uses as classifier. The original Linear discriminant was described for a 2-class problem, and it was then later generalized as “multi-class Linear Discriminant Analysis” or “Multiple Discriminant Analysis” by C. R. Rao in 1948 (The utilization of multiple measurements in problems of biological classification)
LDA is similar to PCA, but in addiont to finding the axes's that contain the most varaicne, we are also intersted in the axes that maximize separation between classes.
The goal of LDA is to project a feature space on to $k$, where $k <= n-1$ and maintaining class discrimanotry information. it can also be helpful to avoid overfitting by minimizing the error in parameter estimation (“curse of dimensionality”).
Both are linear transformation techniques that are commonly used for dimension reduction. PCA can be considered unsupervised and maximizes variacne in the dataset. LDA is supervises and copute the directions (linear discriminants) that represent axes that give the most separation.
LDA is superior to PCA for multiclassficatoin tasks.
How do we know what subsapce to reduce to from d to k, where k < d and how do we know if we have a feature space that represents our data well.
Further along this notebook we will compute the eigenvectors from out dataset using scatter matrices (the in-between-class scatter matrix and within-class scatter matrix).
And in the other scenario, if some of the eigenvalues are much much larger than others, we might be interested in keeping only those eigenvectors with the highest eigenvalues, since they contain more information about our data distribution. Vice versa, eigenvalues that are close to 0 are less informative and we might consider dropping those for constructing the new feature subspace.
We will generate our own data, but to keep things simple lets create 5 classes with 10 dimensions. The purpose of this notbook is to get familiar with mathematics behind LDA
import pandas as pd
import numpy as np
import random
import scipy as sp
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
sample_size=20000
features = 10
classes = 5
random_shift = 2*np.random.random(features)
X,y = make_classification(n_samples=sample_size, n_features=features,
n_informative=10, n_redundant=0,
n_repeated=0, n_classes=classes,
n_clusters_per_class=1,
weights=None, flip_y=0.05,
class_sep=4, hypercube=True,
shift=random_shift, scale=1.2,
random_state=42)
#create column labels
col_labels = [str('Feature_')+str(x) for x in range(1,X.shape[1]+1)]
df = pd.DataFrame(X,columns=col_labels)
df['label'] = y
df.head()
df['label'].value_counts().apply(lambda x: x/sample_size)
for i in range(0,9):
plt.figure(figsize=(8, 6))
for k in range(0,5):
index = df['label'] == k
column = list(df.columns)[0:9][i]
df1 = df[index]
plt.hist(df1[column],label=k,alpha=0.7)
plt.xlabel(column)
plt.legend(loc="upper left", fontsize=12)
plt.show()
There is sepeartion on each of the variables, but some are overlapping. In practice, instead of reducing the dimensionality via a projection (here: LDA), a good alternative would be a feature selection technique. For low-dimensional datasets like Iris, a glance at those histograms would already be very informative. We could also perform feature selection algoirthms, but for the purpose of this notebook we will not explore those paths in this notebook.
LDA assumes all features are independent:
“linear discriminant analysis frequently achieves good performances in the tasks of face and object recognition, even though the assumptions of common covariance matrix among groups and normality are often violated (Duda, et al., 2001)” (Tao Li, et al., 2006).
We will start by caculating the mean vectors for each of out classes $m_{i},(i = 0,1,2,3,4)$, since there are 5 classes
\begin{align} \pmb m_i = \begin{bmatrix} \mu_{\omega_i (\text{Feature_1)}}\\ \mu_{\omega_i (\text{Feature_2})}\\ .\\ \mu_{\omega_i (\text{Feature_N})}\\ \end{bmatrix} \; , \quad \text{with} \quad i = 0,1,2,4 \end{align}mean_vectors = []
for k in range(0,5):
#pull out X matrix
X = df.values[:,range(0,10)]
index = df['label'] == k
mean_vectors.append(np.mean(X[index],axis=0))
print("Mean vector for class",k,"is",mean_vectors[k])
len(mean_vectors[0])
Lets compute the $4 \times 4$ dimensional matrices. The within class and the between class scatter matrix
which is just the scatter matrix for every class
and $m_{i}$ is the mean vector
$m_{i} = \frac{1}{n_{i}} \sum_{x \in D_{i}}^{n} x_{k}$
S_W = np.zeros((10,10))
X = df.values[:,range(0,10)]
for cl,mv in zip(range(0,5),mean_vectors):
#scatter matrix for every class
class_sc_mat = np.zeros((10,10))
for row in X[df['label']==cl]:
row,mv = row.reshape(10,1),mv.reshape(10,1)
class_sc_mat += (row-mv).dot((row-mv).T)
S_W+=class_sc_mat
print(class_sc_mat.shape)
print(S_W.shape)
Alternatively, we could also compute the class covriances by adding a scaling factor $\frac{1}{N-1}$ to the within class scatter matrix
$\Sigma_{i} = \frac{1}{N_{i} -1} \sum_{x \in D_{i}}^{n}(x-m_{i})(m_{i} - m)^T$
where $m$ is the overall mean, and $m_{i}$ and $N_{i}$ are the sample mean and sizes
The between-class scatter matrix $S_{B}$ is copmuted by:
\begin{align} S_{B} = \sum_{i=1}^{c} N_{i}(m_{i} - m)(m_{i} - m)^T \end{align}overall_mean = np.mean(X,axis=0)
S_B = np.zeros((10,10))
for i,mean_vec in enumerate(mean_vectors):
#get sample size
n = X[df['label'] == i+1,:].shape[0]
mean_vec = mean_vec.reshape(10,1)
overall_mean = overall_mean.reshape(10,1)
S_B += n * (mean_vec - overall_mean).dot((mean_vec - overall_mean).T)
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))
for i in range(len(eig_vals)):
eigvec_sc = eig_vecs[:,i].reshape(10,1)
print('\nEigenvector {}: \n{}'.format(i+1, eigvec_sc.real))
print('Eigenvalue {:}: {:.2e}'.format(i+1, eig_vals[i].real))
The eigenvectors are the direciton of distorttion, and the eigenvalues are the scaling factors for the eigenvalues that describe the magnitube of the distortion.
A quick check that the eigenvector-eigenvalue calculation is correct and satisfy the equation:
\begin{align} Av = \lambda v \\ where \\ A = S_{w}^{-1}S_{B} \\ v = Eigenvector \\ \lambda = Eigenvalue \end{align}for i in range(len(eig_vals)):
eigv = eig_vecs[:,i].reshape(10,1)
np.testing.assert_array_almost_equal(np.linalg.inv(S_W).dot(S_B).dot(eigv),
eig_vals[i] * eigv,
decimal=6, err_msg='', verbose=True)
print('ok')
In order to decide which eigenvector(s) we want to drop for our lower-dimensional subspace, we have to take a look at the corresponding eigenvalues of the eigenvectors. The eigenvectors with the lowest eigenvalues bear the least information The common approach is to rank the eigenvectors from highest to lowest corresponding eigenvalue and choose the top k eigenvectors.
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs = sorted(eig_pairs, key=lambda k: k[0], reverse=True)
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in decreasing order:\n')
for i in eig_pairs:
print(i[0])
Now let's express the explaine variance as a percentage.
print('Variance explained:\n')
eigv_sum = sum(eig_vals)
for i,j in enumerate(eig_pairs):
print('eigenvalue {0:}: {1:.2%}'.format(i+1, (j[0]/eigv_sum).real))
We reduce down to two, using the first two eigenvectors
W = np.hstack((eig_pairs[0][1].reshape(10,1), eig_pairs[1][1].reshape(10,1)))
print('Matrix W:\n', W.real)
Use to rotaion matrix to reduce dimension:
lda_2 = np.zeros((20000,2))
X_reduced = np.matmul(X,W)
for i in range(0,X_reduced.shape[0]):
for j in range(0,X_reduced.shape[1]):
lda_2[i,j] = np.sum(X_reduced[i,j])
lda_2_df = pd.DataFrame(lda_2,columns= ['LDA1','LDA2'])
lda_2_df['label'] =y
lda_2_df.head()
plt.figure(figsize=(12,10))
sns.scatterplot(x="LDA1", y="LDA2", data=lda_2_df,hue='label',palette='rainbow')
The scatter plot above represents our new feature subspace that we constructed via LDA. We can see that the first linear discriminant “LD1” separates the classes quite nicely. However, the second discriminant, “LD2”, does not add much valuable information, which we’ve already concluded when we looked at the ranked eigenvalues is step 4.