PCA (Principal Component Analysis)

#pca #machinelearning #datascience #deeplearning #artificialintelligence

Kriti Sinha Jun 20 2022 · 1 min read

Principal Component Analysis (PCA) is a multivariate statistical technique for dimensionality reduction, introduced by Karl Pearson (an English mathematician and biostatistician). It finds the inter-relationships between variables in high-dimensional data.

It addresses a major problem with high-dimensional datasets: they are hard to visualize and understand. It also helps avoid the overfitting that a classifier can suffer when given too many dimensions, and it speeds up the training process. How does PCA provide these solutions?

PCA first standardizes the data and computes its covariance matrix; it then calculates the eigenvalues and eigenvectors of that matrix and orders the resulting principal components by descending eigenvalue, so the leading components capture the most variance.
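
To make that concrete, here is a minimal sketch of the same procedure using plain NumPy on a small synthetic matrix (illustrative only; the article's dataset and scikit-learn code follow below).

import numpy as np

# toy data: 10 samples, 4 features (illustrative only)
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 4))

# 1. standardize each feature
Z = (data - data.mean(axis=0)) / data.std(axis=0)

# 2. covariance matrix of the standardized data
cov = np.cov(Z, rowvar=False)

# 3. eigen-decomposition, then sort components by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. project the data onto the top 2 principal components
projected = Z @ eigvecs[:, :2]
print(eigvals)            # variance captured by each component, largest first
print(projected.shape)    # (10, 2)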

Below, we will see how PCA helps to visualize a high-dimensional dataset, reduce computation time, and avoid overfitting.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# dataset downloaded from:
# https://www.kaggle.com/dipayanbiswas/parkinsons-disease-speech-signal-features

df = pd.read_csv(r"D:/pd_speech_features.csv")
df.head()
[df.head() output — first 5 rows of 755 columns: id, gender, PPE, DFA, RPDE, numPulses, numPeriodsPulses, meanPeriodPulses, stdDevPeriodPulses, locPctJitter, ..., tqwt_kurtosisValue_dec_28 through tqwt_kurtosisValue_dec_36, class]

5 rows × 755 columns

# we have 755 dimensions in this dataset
df.shape

(756, 755)

# the class column has two values, 0 and 1
df['class'].value_counts()

1    564
0    192
Name: class, dtype: int64

# Standardization of the dataset is a must before applying PCA,
# because PCA is quite sensitive to features that have a high variance in their values.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df)
X = scaler.transform(df)
X

array([[-1.72519117,  0.96874225,  0.62764391, ..., -0.81472704,
        -0.36659507,  0.58345997],
       [-1.72519117,  0.96874225,  0.12161952, ..., -0.58297219,
         0.40039616,  0.58345997],
       [-1.72519117,  0.96874225,  0.61795018, ..., -0.8043897 ,
        -0.7809355 ,  0.58345997],
       ...,
       [ 1.72519117, -1.03226633,  0.81336154, ..., -0.79017671,
        -0.77287314, -1.71391365],
       [ 1.72519117, -1.03226633,  0.54105055, ..., -0.82631929,
        -0.81173208, -1.71391365],
       [ 1.72519117, -1.03226633,  0.3945807 , ..., -0.84098293,
        -0.82811405, -1.71391365]])
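
As a quick sanity check (not part of the original notebook), each standardized column should now have a mean of roughly 0 and a standard deviation of roughly 1; the first five columns are enough to confirm:

print(X.mean(axis=0)[:5].round(3))   # ~0.0 for every column
print(X.std(axis=0)[:5].round(3))    # ~1.0 for every column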



# the array above is already standardized, so use it directly
# (calling scaler.transform on it again would scale the data twice)
X_Scale = X


# applying PCA to the entire dataset to reduce it to two components.
# PCA converts the high-dimensional dataset into a low-dimensional one
# and lets us visualize the data using the top 2 components.
pca2 = PCA(n_components=2)
principalComponents = pca2.fit_transform(X_Scale)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])
finalDf = pd.concat([principalDf, df[['class']]], axis=1)
finalDf.head()

   principal component 1  principal component 2  class
0          -1.086105e+20          -7.165452e+19      1
1           1.973075e+20          -2.860094e+19      1
2          -1.835117e+19           3.116259e+19      1
3           3.916000e+20          -8.053567e+19      1
4           1.647250e+20          -1.198711e+19      1


# scatter plot of the top 2 components
plt.figure(figsize=(7, 7))
plt.scatter(finalDf['principal component 1'], finalDf['principal component 2'],
            c=finalDf['class'], cmap='prism', s=5)
plt.xlabel('pc1')
plt.ylabel('pc2')


Text(0, 0.5, 'pc2')
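
To see how much of the original variance these two components actually retain, we can inspect the fitted PCA object's explained_variance_ratio_ attribute (an extra check, not shown in the original run):

print(pca2.explained_variance_ratio_)         # share of variance per component
print(pca2.explained_variance_ratio_.sum())   # total variance retained by the 2 components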


# applying PCA to the entire dataset to reduce it to three components.
# PCA converts the high-dimensional dataset into a low-dimensional one
# and lets us visualize the data using the top 3 components.
pca3 = PCA(n_components=3)
principalComponents = pca3.fit_transform(X_Scale)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2',
                                    'principal component 3'])
finalDf = pd.concat([principalDf, df[['class']]], axis=1)
finalDf.head()

   principal component 1  principal component 2  principal component 3  class
0          -1.086105e+20          -7.165452e+19          -3.086562e+18      1
1           1.973075e+20          -2.860094e+19           4.586191e+19      1
2          -1.835117e+19           3.116259e+19           6.469442e+18      1
3           3.916000e+20          -8.053567e+19           1.310858e+20      1
4           1.647250e+20          -1.198711e+19           1.443065e+19      1


# visualizing the three PCA components with the help of a 3-D scatter plot
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(9, 9))
axes = Axes3D(fig)
axes.set_title('PCA Representation', size=14)
axes.set_xlabel('PC1')
axes.set_ylabel('PC2')
axes.set_zlabel('PC3')
axes.scatter(finalDf['principal component 1'], finalDf['principal component 2'],
             finalDf['principal component 3'], c=finalDf['class'], cmap='prism', s=10)

<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x25eafe01c40>

# Here we separate the dependent label column into y
# and all remaining columns into X.
X = df.drop('class', axis=1).values
y = df['class'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
# Fit on the training set only.
scaler.fit(X_train)

StandardScaler()

# Apply the transform to both the training set and the test set,
# then project the standardized features onto the top principal components.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Creating a Logistic Regression model with PCA
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train_pca, y_train)

LogisticRegression()

%%time
y_train_hat = logisticRegr.predict(X_train_pca)
train_accuracy = accuracy_score(y_train, y_train_hat) * 100
print('Accuracy for our Training dataset with PCA is: %.4f %%' % train_accuracy)

"Accuracy for our Training dataset with PCA is: 76.5595 %
Wall time: 2 ms


y_test_hat = logisticRegr.predict(X_test_pca)
test_accuracy = accuracy_score(y_test, y_test_hat) * 100
print("Accuracy for our Testing dataset with PCA is : {:.3f}%".format(test_accuracy))

Accuracy for our Testing dataset with PCA is : 77.974%
Wall time: 2 ms

# Creating a Logistic Regression model without PCA
# Here we train the model on all of the standardized features and can see that it has terribly overfitted:
# the training accuracy is 100% and the testing accuracy is about 84%.

%%time
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train_scaled, y_train)
y_train_hat = logisticRegr.predict(X_train_scaled)
train_accuracy = accuracy_score(y_train, y_train_hat) * 100
print('Accuracy for our Training dataset without PCA is: %.4f %%' % train_accuracy)


"Accuracy for our Training dataset with PCA is: 100.0000 %
Wall time: 47 ms


y_test_hat = logisticRegr.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_test_hat) * 100
print("Accuracy for our Testing dataset without PCA is : {:.3f}%".format(test_accuracy))


Accuracy for our Testing dataset without PCA is : 84.141%
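
The same with-PCA workflow can also be written as a single scikit-learn Pipeline, which keeps the scaler, the PCA projection, and the classifier fitted on the training data only. This is a sketch of an equivalent setup rather than code from the original notebook; the component count simply mirrors the choice used above.

from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scale', StandardScaler()),      # standardize the features
    ('pca', PCA(n_components=3)),     # project onto the top components
    ('clf', LogisticRegression()),    # classify in the reduced space
])
pipe.fit(X_train, y_train)
print("Pipeline test accuracy: {:.3f}%".format(
    accuracy_score(y_test, pipe.predict(X_test)) * 100))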


# Conclusion: we learned how PCA helps to visualize a high-dimensional dataset,
# reduce computation time, and avoid overfitting.


