Calculation of the Pearson correlation coefficient
I have two files (loaded as data frames) and want to compute the correlation between them.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
data= pd.read_csv('close_prices.csv', header = None)
bata= pd.read_csv('djia_index.csv', header = None)
X = data.drop([0], axis =1)
X = X.drop([0], axis =0)
B = bata.drop([0], axis =1)
B = B.drop([0], axis =0)
pca = PCA(n_components=10)
pca.fit(X)
t = pca.transform(X)
dj=pca.transform(B)
c = np.corrcoef(t[:0], dj[:0])[0,1]
print (c)
I get:
----> c = np.corrcoef(t[:0], dj[:0])[0,1]
IndexError: index 0 is out of bounds for axis 0 with size 0
1 answer
You are trying to calculate the correlation for two empty arrays, because the slice [:0] selects zero rows:
In [78]: a = np.random.rand(374, 10) * 10
In [79]: a[:0]
Out[79]: array([], shape=(0, 10), dtype=float64) # <--- the matrix is empty!
To reproduce the error, first create a second matrix b that is linearly dependent on the matrix a:
In [81]: b = a * np.pi
In [82]: b.shape
Out[82]: (374, 10)
Reproducing the error:
In [83]: c = np.corrcoef(a[:0], b[:0])
In [84]: c.shape
Out[84]: (0, 0)
Next, you try to access the second (index == 1) element of the first (index == 0) row of the 2D array, but the array is empty:
In [85]: c[0,1]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-85-cfb1d490a583> in <module>
----> 1 c[0,1]
IndexError: index 0 is out of bounds for axis 0 with size 0
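If the intent was to correlate the first columns of the two arrays, the slice was probably meant to be [:, 0] rather than [:0]. That is an assumption about the question, not something it states. A minimal sketch on synthetic data (the seeded generator and the arrays a and b stand in for the real inputs):

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded only to make the sketch repeatable
a = rng.random((374, 10)) * 10       # stand-in for the first matrix
b = a * np.pi                        # linearly dependent on a

# [:0] selects zero rows, [:, 0] selects the first column
print(a[:0].shape)     # (0, 10)
print(a[:, 0].shape)   # (374,)

# Two 1-D inputs give a 2x2 correlation matrix; [0, 1] is the coefficient
c = np.corrcoef(a[:, 0], b[:, 0])[0, 1]
print(c)               # close to 1.0, since b is a scaled copy of a
```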
To calculate Pearson correlation coefficients for two matrices of the same shape, you can use DataFrame.corrwith(other, axis=0, drop=False, method='pearson').
By default (axis=0), the correlation is calculated between columns with the same name. To calculate it between rows with the same index values, explicitly pass axis=1.
Example:
In [86]: d1 = pd.DataFrame(a)
In [87]: d2 = pd.DataFrame(b)
In [88]: c = d1.corrwith(d2)
In [89]: c
Out[89]:
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
dtype: float64
Now, as a check, break the linear dependence for the first column:
In [91]: d2.iloc[:, 0] = np.random.rand(374)
In [92]: c = d1.corrwith(d2)
In [93]: c
Out[93]:
0 0.024225 # <--- NOTE: no longer 1.0
1 1.000000
2 1.000000
3 1.000000
4 1.000000
5 1.000000
6 1.000000
7 1.000000
8 1.000000
9 1.000000
dtype: float64
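The row-wise variant (axis=1) can be sketched the same way. The small 5x10 frames below are illustrative stand-ins, not the data from the question; the dependence is broken for the first row only, so that row should be the only one whose coefficient differs from 1.0:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)               # seeded only for repeatability
d1 = pd.DataFrame(rng.random((5, 10)))       # illustrative data
d2 = d1 * np.pi                              # every row linearly dependent on d1

# Break the linear dependence for the first row
d2.iloc[0, :] = rng.random(10)

# axis=1 correlates rows that share the same index label
row_corr = d1.corrwith(d2, axis=1)
print(row_corr)
```

The result is a Series with one coefficient per row label rather than per column name.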