Introduction

Linear Discriminant Analysis LDA (LDA) is most commonly used as a dimension reudciton technique. For this project, we are going to reduce a dataset to lower dimenison to obtain better class seperability. At the same time we can avoid overfiitng and reduce computational costs.

Ronald A. Fisher formulated the Linear Discriminant in 1936 (The Use of Multiple Measurements in Taxonomic Problems), and it also has some practical uses as classifier. The original Linear discriminant was described for a 2-class problem, and it was then later generalized as “multi-class Linear Discriminant Analysis” or “Multiple Discriminant Analysis” by C. R. Rao in 1948 (The utilization of multiple measurements in problems of biological classification)

LDA is similar to PCA, but in addiont to finding the axes's that contain the most varaicne, we are also intersted in the axes that maximize separation between classes.

The goal of LDA is to project a feature space on to $k$, where $k <= n-1$ and maintaining class discrimanotry information. it can also be helpful to avoid overfitting by minimizing the error in parameter estimation (“curse of dimensionality”).

PCA vs LDA

Both are linear transformation techniques that are commonly used for dimension reduction. PCA can be considered unsupervised and maximizes variacne in the dataset. LDA is supervises and copute the directions (linear discriminants) that represent axes that give the most separation.

LDA is superior to PCA for multiclassficatoin tasks.

LDAvPCA.png

What is a good feature subspace?

How do we know what subsapce to reduce to from d to k, where k < d and how do we know if we have a feature space that represents our data well.

Further along this notebook we will compute the eigenvectors from out dataset using scatter matrices (the in-between-class scatter matrix and within-class scatter matrix).

And in the other scenario, if some of the eigenvalues are much much larger than others, we might be interested in keeping only those eigenvectors with the highest eigenvalues, since they contain more information about our data distribution. Vice versa, eigenvalues that are close to 0 are less informative and we might consider dropping those for constructing the new feature subspace.

Summarizing the LDA approach in 5 steps

  1. Compute the $d$-dimensional mean vectors for the different classes from the dataset.
  2. Compute the scatter matrices (in-between-class and within-class scatter matrix).
  3. Compute the eigenvectors $(e_{1},e_{2}...e_{d})$ and corresponding eigenvalues $(\lambda_{1},\lambda_{2}...\lambda_{d})$ for the scatter matrices.
  4. Sort the eigenvectors by decreasing eigenvalues and choose $k$eigenvectors with the largest eigenvalues to form a $d\times k$ dimensional matrix W (where every column represents an eigenvector).
  5. Use this $d \times k$ eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the matrix multiplication: $YY=XX \times WW$ (where XX is a n×d-dimensional matrix representing the n samples, and yy are the transformed n×k-dimensional samples in the new subspace).

Dataset

We will generate our own data, but to keep things simple lets create 5 classes with 10 dimensions. The purpose of this notbook is to get familiar with mathematics behind LDA

Dependencies

In [1]:
import pandas as pd
import numpy as np
import random
import scipy as sp
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
%matplotlib inline

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Create the data

In [7]:
sample_size=20000
features = 10
classes = 5
random_shift = 2*np.random.random(features)



X,y = make_classification(n_samples=sample_size, n_features=features, 
                    n_informative=10, n_redundant=0, 
                    n_repeated=0, n_classes=classes, 
                    n_clusters_per_class=1, 
                    weights=None, flip_y=0.05, 
                    class_sep=4, hypercube=True,
                    shift=random_shift, scale=1.2,
                          random_state=42)

#create column labels
col_labels = [str('Feature_')+str(x) for x in range(1,X.shape[1]+1)]

df = pd.DataFrame(X,columns=col_labels)
df['label'] = y
df.head()
Out[7]:
Feature_1 Feature_2 Feature_3 Feature_4 Feature_5 Feature_6 Feature_7 Feature_8 Feature_9 Feature_10 label
0 2.799958 7.075322 -1.138638 -3.402424 -4.226043 -4.913339 -0.882808 7.724663 -3.234090 5.239540 0
1 7.445766 3.342201 -4.747831 -5.298116 -1.563867 -1.414268 -5.858736 4.324602 -5.019894 7.821710 0
2 -3.725767 4.927482 6.396648 -6.465097 -2.923763 -1.766202 -3.961678 7.671291 3.847920 -6.111916 3
3 6.842833 8.991320 -1.237123 -5.815169 -1.163520 -6.866526 -3.281723 3.135688 -6.514901 3.037920 0
4 -2.359387 6.142606 7.652743 -2.815592 -2.542573 -5.468284 -0.801147 6.080205 5.159048 -2.798980 3

show distributions across all variables

In [8]:
df['label'].value_counts().apply(lambda x: x/sample_size)
Out[8]:
1    0.20105
2    0.20065
3    0.19980
0    0.19960
4    0.19890
Name: label, dtype: float64
In [9]:
for i in range(0,9):
    plt.figure(figsize=(8, 6))
    for k in range(0,5):
        index = df['label'] == k
        column = list(df.columns)[0:9][i]
        df1 = df[index]
        plt.hist(df1[column],label=k,alpha=0.7)
    plt.xlabel(column)
    plt.legend(loc="upper left", fontsize=12)
    plt.show()
    

There is sepeartion on each of the variables, but some are overlapping. In practice, instead of reducing the dimensionality via a projection (here: LDA), a good alternative would be a feature selection technique. For low-dimensional datasets like Iris, a glance at those histograms would already be very informative. We could also perform feature selection algoirthms, but for the purpose of this notebook we will not explore those paths in this notebook.

Normality assumptions

LDA assumes all features are independent:

“linear discriminant analysis frequently achieves good performances in the tasks of face and object recognition, even though the assumptions of common covariance matrix among groups and normality are often violated (Duda, et al., 2001)” (Tao Li, et al., 2006).

LDA in 5 steps

Step 1: Computing the d dimensional mean vectors

We will start by caculating the mean vectors for each of out classes $m_{i},(i = 0,1,2,3,4)$, since there are 5 classes

\begin{align} \pmb m_i = \begin{bmatrix} \mu_{\omega_i (\text{Feature_1)}}\\ \mu_{\omega_i (\text{Feature_2})}\\ .\\ \mu_{\omega_i (\text{Feature_N})}\\ \end{bmatrix} \; , \quad \text{with} \quad i = 0,1,2,4 \end{align}
In [35]:
mean_vectors = []
for k in range(0,5):
    #pull out X matrix
    X = df.values[:,range(0,10)]
    index = df['label'] == k
    mean_vectors.append(np.mean(X[index],axis=0))
    print("Mean vector for class",k,"is",mean_vectors[k])
    
Mean ve for class 0 is [ 5.73445068  6.73562645 -2.2264301  -3.71492666 -2.43003418 -4.0727708
 -2.46440441  6.21318027 -3.90866407  5.29424604]
Mean ve for class 1 is [ 5.67109647  6.68571824 -2.22056517  5.41916611  6.7194605   5.06277741
 -2.38273369 -2.89439221  5.19542365 -3.80774731]
Mean ve for class 2 is [-3.36716687 -2.34544044  6.84426403 -3.76438256 -2.39577772  5.02709756
  6.60229742  6.19608736  5.2414826   5.3084233 ]
Mean ve for class 3 is [-3.3759827   6.7843741   6.87130624 -3.67792947 -2.43698018 -4.02559975
 -2.41505155  6.09641948  5.27185454 -3.83825044]
Mean ve for class 4 is [ 5.71801924  6.75168127  6.92510158 -3.72893801 -2.42307653 -4.05866316
 -2.49328195 -3.0274157  -3.8566625   5.2460743 ]
In [32]:
len(mean_vectors[0])
Out[32]:
10

Step 2: Computing the scatter matrices

Lets compute the $4 \times 4$ dimensional matrices. The within class and the between class scatter matrix

2.1 Within-class scatter matrix $S_{W}$

\begin{align} S_{W} = \sum_{i=1}^{c} S_{i} \\ where \\ S_{i} = \sum_{x \in D_{i}}^{n} (x - m_{i})(x-m_{i})^T \end{align}

which is just the scatter matrix for every class

and $m_{i}$ is the mean vector

$m_{i} = \frac{1}{n_{i}} \sum_{x \in D_{i}}^{n} x_{k}$

In [51]:
S_W = np.zeros((10,10))
X = df.values[:,range(0,10)]
for cl,mv in zip(range(0,5),mean_vectors):
    #scatter matrix for every class
    class_sc_mat = np.zeros((10,10))
    for row in X[df['label']==cl]:
        row,mv = row.reshape(10,1),mv.reshape(10,1)
        class_sc_mat += (row-mv).dot((row-mv).T)
    S_W+=class_sc_mat
In [55]:
print(class_sc_mat.shape)
print(S_W.shape)
(10, 10)
(10, 10)

Alternatively, we could also compute the class covriances by adding a scaling factor $\frac{1}{N-1}$ to the within class scatter matrix

$\Sigma_{i} = \frac{1}{N_{i} -1} \sum_{x \in D_{i}}^{n}(x-m_{i})(m_{i} - m)^T$

where $m$ is the overall mean, and $m_{i}$ and $N_{i}$ are the sample mean and sizes

The between-class scatter matrix $S_{B}$ is copmuted by:

\begin{align} S_{B} = \sum_{i=1}^{c} N_{i}(m_{i} - m)(m_{i} - m)^T \end{align}
In [59]:
overall_mean = np.mean(X,axis=0)
S_B = np.zeros((10,10))
for i,mean_vec in enumerate(mean_vectors):
    #get sample size
    n = X[df['label'] == i+1,:].shape[0]
    mean_vec = mean_vec.reshape(10,1)
    overall_mean = overall_mean.reshape(10,1)
    S_B += n * (mean_vec - overall_mean).dot((mean_vec - overall_mean).T)

Step 3: Solving the generalized eigenvalue problem for the matrix $S_{W}^{-1}S_{B}$

In [62]:
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))

for i in range(len(eig_vals)):
    eigvec_sc = eig_vecs[:,i].reshape(10,1)   
    print('\nEigenvector {}: \n{}'.format(i+1, eigvec_sc.real))
    print('Eigenvalue {:}: {:.2e}'.format(i+1, eig_vals[i].real))
Eigenvector 1: 
[[ 0.19467575]
 [ 0.41399085]
 [-0.37981556]
 [ 0.36241023]
 [ 0.56708083]
 [ 0.1783642 ]
 [-0.20212563]
 [-0.17152441]
 [-0.28796805]
 [-0.09228463]]
Eigenvalue 1: 6.53e+00

Eigenvector 2: 
[[ 0.15659832]
 [ 0.20787897]
 [-0.27192275]
 [-0.3738506 ]
 [-0.16559165]
 [-0.5669349 ]
 [-0.23193547]
 [ 0.2585294 ]
 [-0.49685511]
 [ 0.04716732]]
Eigenvalue 2: 4.91e+00

Eigenvector 3: 
[[ 0.33431719]
 [-0.20166378]
 [-0.37440391]
 [-0.04940941]
 [ 0.01872972]
 [ 0.2953377 ]
 [ 0.53618436]
 [ 0.02365608]
 [-0.25145828]
 [ 0.51583375]]
Eigenvalue 3: 3.41e+00

Eigenvector 4: 
[[ 0.21783413]
 [-0.04675576]
 [ 0.63970871]
 [ 0.06488126]
 [-0.14976335]
 [-0.08603653]
 [-0.16939168]
 [-0.63506834]
 [-0.11346757]
 [ 0.24942697]]
Eigenvalue 4: 6.04e-01

Eigenvector 5: 
[[-0.24926891]
 [ 0.62572847]
 [-0.12537963]
 [-0.30028898]
 [ 0.35860388]
 [-0.0557218 ]
 [ 0.47377372]
 [-0.12412517]
 [ 0.07978868]
 [ 0.20968276]]
Eigenvalue 5: -3.83e-16

Eigenvector 6: 
[[-0.24926891]
 [ 0.62572847]
 [-0.12537963]
 [-0.30028898]
 [ 0.35860388]
 [-0.0557218 ]
 [ 0.47377372]
 [-0.12412517]
 [ 0.07978868]
 [ 0.20968276]]
Eigenvalue 6: -3.83e-16

Eigenvector 7: 
[[-0.11474056]
 [ 0.30300631]
 [-0.066212  ]
 [ 0.50267131]
 [-0.24733607]
 [-0.27062699]
 [ 0.40124302]
 [-0.06769109]
 [ 0.12701482]
 [ 0.18166397]]
Eigenvalue 7: 5.65e-16

Eigenvector 8: 
[[-0.11474056]
 [ 0.30300631]
 [-0.066212  ]
 [ 0.50267131]
 [-0.24733607]
 [-0.27062699]
 [ 0.40124302]
 [-0.06769109]
 [ 0.12701482]
 [ 0.18166397]]
Eigenvalue 8: 5.65e-16

Eigenvector 9: 
[[ 0.22929352]
 [ 0.47556483]
 [-0.07159105]
 [ 0.08102956]
 [-0.17689262]
 [-0.26589599]
 [ 0.50273889]
 [-0.07097056]
 [ 0.53788876]
 [ 0.24591115]]
Eigenvalue 9: 2.57e-16

Eigenvector 10: 
[[-0.18075961]
 [-0.40005424]
 [ 0.21071789]
 [ 0.00317083]
 [ 0.35062561]
 [ 0.23399326]
 [-0.50976713]
 [ 0.20832388]
 [-0.51401251]
 [-0.13307554]]
Eigenvalue 10: 1.84e-16

The eigenvectors are the direciton of distorttion, and the eigenvalues are the scaling factors for the eigenvalues that describe the magnitube of the distortion.

Checking the eigenvector-eigenvalue calculation

A quick check that the eigenvector-eigenvalue calculation is correct and satisfy the equation:

\begin{align} Av = \lambda v \\ where \\ A = S_{w}^{-1}S_{B} \\ v = Eigenvector \\ \lambda = Eigenvalue \end{align}
In [63]:
for i in range(len(eig_vals)):
    eigv = eig_vecs[:,i].reshape(10,1)
    np.testing.assert_array_almost_equal(np.linalg.inv(S_W).dot(S_B).dot(eigv),
                                         eig_vals[i] * eigv,
                                         decimal=6, err_msg='', verbose=True)
print('ok')
ok

Step4: Selecting the linear discriminants for the new feature subspace

4.1 Sorting the eigenvectors by decreasing eigenvalues

In order to decide which eigenvector(s) we want to drop for our lower-dimensional subspace, we have to take a look at the corresponding eigenvalues of the eigenvectors. The eigenvectors with the lowest eigenvalues bear the least information The common approach is to rank the eigenvectors from highest to lowest corresponding eigenvalue and choose the top k eigenvectors.

In [64]:
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs = sorted(eig_pairs, key=lambda k: k[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues

print('Eigenvalues in decreasing order:\n')
for i in eig_pairs:
    print(i[0])
Eigenvalues in decreasing order:

6.5320520567686735
4.907809636047333
3.4116418920183196
0.6041540310215024
6.641195096607845e-16
6.641195096607845e-16
3.965656159214618e-16
3.965656159214618e-16
2.565893966560799e-16
1.8426152589037725e-16

Now let's express the explaine variance as a percentage.

In [66]:
print('Variance explained:\n')
eigv_sum = sum(eig_vals)
for i,j in enumerate(eig_pairs):
    print('eigenvalue {0:}: {1:.2%}'.format(i+1, (j[0]/eigv_sum).real))
    
Variance explained:

eigenvalue 1: 42.26%
eigenvalue 2: 31.75%
eigenvalue 3: 22.07%
eigenvalue 4: 3.91%
eigenvalue 5: 0.00%
eigenvalue 6: 0.00%
eigenvalue 7: 0.00%
eigenvalue 8: 0.00%
eigenvalue 9: 0.00%
eigenvalue 10: 0.00%

4.2 Choosing k eigenvectors with the largest eigenvalues

We reduce down to two, using the first two eigenvectors

In [68]:
W = np.hstack((eig_pairs[0][1].reshape(10,1), eig_pairs[1][1].reshape(10,1)))
print('Matrix W:\n', W.real)
Matrix W:
 [[ 0.19467575  0.15659832]
 [ 0.41399085  0.20787897]
 [-0.37981556 -0.27192275]
 [ 0.36241023 -0.3738506 ]
 [ 0.56708083 -0.16559165]
 [ 0.1783642  -0.5669349 ]
 [-0.20212563 -0.23193547]
 [-0.17152441  0.2585294 ]
 [-0.28796805 -0.49685511]
 [-0.09228463  0.04716732]]

Step 5: Transforming the samples onto the new subspace

Use to rotaion matrix to reduce dimension:

In [88]:
lda_2 = np.zeros((20000,2))
X_reduced = np.matmul(X,W)
for i in range(0,X_reduced.shape[0]):
    for j in range(0,X_reduced.shape[1]):
        lda_2[i,j] = np.sum(X_reduced[i,j])
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:5: ComplexWarning: Casting complex values to real discards the imaginary part
  """
In [95]:
lda_2_df = pd.DataFrame(lda_2,columns= ['LDA1','LDA2'])
lda_2_df['label'] =y
lda_2_df.head()
Out[95]:
LDA1 LDA2 label
0 -1.298015 11.032056 0
1 2.743437 11.533250 0
2 -6.490084 3.305896 3
3 3.253511 14.488700 0
4 -6.368863 2.462695 3
In [102]:
plt.figure(figsize=(12,10))
sns.scatterplot(x="LDA1", y="LDA2", data=lda_2_df,hue='label',palette='rainbow')
Out[102]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2b373940>

The scatter plot above represents our new feature subspace that we constructed via LDA. We can see that the first linear discriminant “LD1” separates the classes quite nicely. However, the second discriminant, “LD2”, does not add much valuable information, which we’ve already concluded when we looked at the ranked eigenvalues is step 4.

In [ ]: