Calculation of the Pearson correlation coefficient

I have two files, each loaded into a DataFrame, and I want to compute the correlation between them.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
data= pd.read_csv('close_prices.csv', header = None)
bata= pd.read_csv('djia_index.csv', header = None)
X = data.drop([0], axis =1)
X = X.drop([0], axis =0)
B = bata.drop([0], axis =1)
B = B.drop([0], axis =0)
pca = PCA(n_components=10)
pca.fit(X)
t = pca.transform(X)
dj=pca.transform(B)

c = np.corrcoef(t[:0], dj[:0])[0,1]

print (c)

I get:

----> c = np.corrcoef(t[:0], dj[:0])[0,1]
IndexError: index 0 is out of bounds for axis 0 with size 0
Author: MaxU, 2019-05-25

1 answer

You are trying to calculate the correlation for two empty vectors:

In [78]: a = np.random.rand(374, 10) * 10

In [79]: a[:0]
Out[79]: array([], shape=(0, 10), dtype=float64)  # <--- the matrix is empty!

Reproducing the error:

First, we create a second matrix that is linearly dependent on the matrix a:

In [81]: b = a * np.pi

In [82]: b.shape
Out[82]: (374, 10)

Then we compute the correlation of the empty slices:

In [83]: c = np.corrcoef(a[:0], b[:0])

In [84]: c.shape
Out[84]: (0, 0)

Next, your code tries to access the second element (index == 1) of the first row (index == 0) of the 2D array, but the array is empty:

In [85]: c[0,1]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-85-cfb1d490a583> in <module>
----> 1 c[0,1]

IndexError: index 0 is out of bounds for axis 0 with size 0
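
The difference between the two slices can be sketched as follows. This is only an illustration on random data: the arrays a and b stand in for t and dj from the question, and it assumes the intent was to correlate the first column (first component) of each matrix, which the question does not state explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((374, 10)) * 10   # stand-in for the PCA output `t`
b = a * np.pi                    # stand-in for `dj`, linearly dependent on `a`

empty = a[:0]        # rows before index 0: nothing, shape (0, 10)
first_col = a[:, 0]  # all rows, column 0: shape (374,)

# For two 1D vectors corrcoef returns a 2x2 matrix, so [0, 1] is valid here
r = np.corrcoef(a[:, 0], b[:, 0])[0, 1]
print(empty.shape, first_col.shape, r)
```

With a[:, 0] the result is a proper 2x2 correlation matrix, so indexing it with [0, 1] no longer raises IndexError.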

To calculate the Pearson correlation coefficients for two matrices of the same shape, you can use DataFrame.corrwith(other, axis=0, drop=False, method='pearson').

By default (axis=0), the correlation is calculated between columns with the same name. To calculate it between rows with the same index values, explicitly pass axis=1.

Example:

In [86]: d1 = pd.DataFrame(a)

In [87]: d2 = pd.DataFrame(b)

In [88]: c = d1.corrwith(d2)

In [89]: c
Out[89]:
0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
5    1.0
6    1.0
7    1.0
8    1.0
9    1.0
dtype: float64

Now, to check, we will break the linear dependence for the first column:

In [91]: d2.iloc[:, 0] = np.random.rand(374)

In [92]: c = d1.corrwith(d2)

In [93]: c
Out[93]:
0    0.024225   # <--- NOTE !
1    1.000000
2    1.000000
3    1.000000
4    1.000000
5    1.000000
6    1.000000
7    1.000000
8    1.000000
9    1.000000
dtype: float64
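
The row-wise variant (axis=1) mentioned earlier can be sketched the same way. The frames below are small hypothetical examples, not the data from the question; d2 is built as a linear function of d1 so that every coefficient should come out as 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
d1 = pd.DataFrame(rng.random((5, 10)))
d2 = d1 * 2 + 1   # linearly dependent on d1, both column-wise and row-wise

by_col = d1.corrwith(d2)          # axis=0: one coefficient per column (10 values)
by_row = d1.corrwith(d2, axis=1)  # axis=1: one coefficient per row (5 values)
print(len(by_col), len(by_row))
```

The result is indexed by column labels for axis=0 and by row labels for axis=1, so the length of the output shows which direction was used.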
Author: MaxU, 2019-05-26 09:31:05