How to efficiently build a contingency table for a categorical attribute?

I receive data with 40 categorical attributes as input. The data contains missing values, the number of categories per attribute is not known in advance, and the categories are strings. The task: compute the association of each attribute with the binary target variable using Cramér's V, which takes a contingency table as input. I currently build it as follows:

# Computed correlation values for the features
categorical_corrs = list()
for column in data.columns:
    # For each feature, get the list of unique values,
    # excluding missing cells
    categories = data[column].dropna().unique()
    confusion_matrix = [[], []]
    for category in categories:
        # For each category, count the occurrences for target values 0 and 1
        confusion_matrix[0].append(
            len(data.loc[(labels[0] == 0) & (data[column] == category), column])
        )
        confusion_matrix[1].append(
            len(data.loc[(labels[0] == 1) & (data[column] == category), column])
        )
    result = cramers_stat(np.array(confusion_matrix))
    # Check for exceptional cases
    if result == -1:
        print(column, categories, confusion_matrix)
    categorical_corrs.append(result)

Each attribute has 40,000 entries (including missing values). The code above takes quite a long time to run. Is there a more efficient way to compute the contingency table?

PS The data can be downloaded from here (the "small" dataset)

Author: MaxU, 2019-01-01

2 answers

Try the function for calculating Cramér's V from the linked answer:

import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """ Calculate Cramér's V statistic for categorical-categorical association.
        Uses the bias correction from Bergsma,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

To build the confusion_matrix, you can use the pd.crosstab() function.

Example:

import pandas as pd

try:
    from pathlib import Path
except ImportError:
    from pathlib2 import Path

WORK_DIR = Path(r'D:\data\927487')

train = pd.read_csv(WORK_DIR / 'orange_small_train.data', sep='\t')
labels = pd.read_csv(WORK_DIR / 'orange_small_train_appetency.labels',
                     header=None, squeeze=True, dtype='int8')

In [51]: confusion_mx = pd.crosstab(labels, train['Var1'])

In [52]: confusion_mx
Out[52]:
Var1  0.0    8.0    16.0   24.0   32.0   40.0   48.0   56.0   64.0   72.0   80.0   120.0  128.0  152.0  360.0  392.0  536.0  680.0
0
-1      371    134     80     46     21      9      6      5      1      3      1      0      2      1      1      1      1      1
 1        9      4      1      0      2      1      0      0      0      0      0      1      0      0      0      0      0      0

In [53]: cramers_corrected_stat(confusion_mx)
Out[53]: 0.20395161570145692
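
To get the correlations for all 40 attributes, the two calls can be combined in one loop. A minimal sketch, assuming train and labels are loaded as above and that the categorical attributes are the string-typed columns; degenerate tables with fewer than two rows or columns are skipped, since the corrected statistic would divide by zero there:

# Sketch: bias-corrected Cramér's V for every categorical column.
# pd.crosstab drops NaN cells by default, matching the dropna() in the question.
categorical_corrs = {}
for column in train.select_dtypes(include='object').columns:
    confusion_mx = pd.crosstab(labels, train[column])
    # Guard against degenerate tables, where min(kcorr-1, rcorr-1) == 0
    if confusion_mx.shape[0] < 2 or confusion_mx.shape[1] < 2:
        continue
    categorical_corrs[column] = cramers_corrected_stat(confusion_mx)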

PS Assessing the correlation of categorical data is a complex process and often requires a good understanding of the domain/business data.

 1
Author: MaxU, 2019-01-02 11:15:40

The pandas Python library has a crosstab function. It builds exactly the contingency tables you need. Try it; most likely its implementation is faster than yours, since pandas' internals are written in optimized C/Cython rather than pure Python.
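
For example, the entire inner loop over categories in the question could be replaced by a single call. A sketch, assuming data and labels are defined as in the question:

# Hypothetical replacement for the manual per-category counting:
# rows are the label values (0/1), columns are the categories of the attribute.
# NaN cells are excluded automatically, as with dropna() in the original code.
confusion_matrix = pd.crosstab(labels[0], data[column])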

 1
Author: passant, 2019-01-02 09:19:22