Cramér's V for categorical variables in Python

Project description

A simple library to calculate correlation between variables. Currently provides correlation between nominal variables.

Based on statistical measures such as Cramér's V and Tschuprow's T, it allows you to gauge the correlation between categorical variables. The ability to plot the correlation as a heatmap is also provided.

Usage example

import pandas as pd
from pycorrcat.pycorrcat import plot_corr, corr_matrix

df = pd.DataFrame([('a', 'b'), ('a', 'd'), ('c', 'b'), ('e', 'd')],
                  columns=['dogs', 'cats'])

correlation_matrix = corr_matrix(df, ['dogs', 'cats'])
plot_corr(df, ['dogs', 'cats'])

Development setup

Create a virtualenv and install dependencies:

  • pip install -r requirements.dev.txt
  • pip install -r requirements.txt

Then install the pre-commit hooks with pre-commit install and continue with your code changes.

Run pre-commit locally to check files

pre-commit run --all-files

Release History

  • 0.1.4
    • CHANGE: Changed the documentation (no code change)
  • 0.1.3
    • ADD: Ability to pass dataframe to get correlation matrix
    • ADD: Ability to plot the correlation in form of heatmap
  • 0.1.2
    • Added as first release
  • 0.1.1
    • Test release

Author and Contributor

Anurag Kumar Mishra – connect on GitHub or drop a mail.

Distributed under the GNU license. See LICENSE for more information.

GitHub repo: https://github.com/MavericksDS/pycorr

Contributing

  1. Fork it (https://github.com/MavericksDS/pycorr)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request


Cramér’s V is a number between 0 and 1 that indicates how strongly two categorical variables are associated. If we'd like to know whether two categorical variables are associated, our first option is the chi-square independence test. A p-value close to zero means that our variables are very unlikely to be completely unassociated in some population. However, this does not mean the variables are strongly associated; a weak association in a large sample may also result in p = 0.000.

Cramér’s V - Formula

A measure that does indicate the strength of the association is Cramér’s V, defined as

$$\phi_c = \sqrt{\frac{\chi^2}{N(k - 1)}}$$

where

  • \(\phi_c\) denotes Cramér’s V;
  • \(\chi^2\) is the Pearson chi-square statistic from the aforementioned test;
  • \(N\) is the sample size involved in the test and
  • \(k\) is the lesser number of categories of either variable.
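As a quick illustration in Python, the sketch below computes this formula with scipy.stats.chi2_contingency; the helper name cramers_v and the example table are made up for illustration and are not part of the pycorrcat API.

import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V for a contingency table of observed counts (rows x columns)."""
    table = np.asarray(table)
    # Plain Pearson chi-square (no Yates continuity correction), matching the formula above
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    n = table.sum()          # N: total sample size
    k = min(table.shape)     # k: the lesser number of categories
    return np.sqrt(chi2 / (n * (k - 1)))

# Example: a 2 x 3 table of observed counts
print(round(cramers_v([[10, 20, 30], [30, 20, 10]]), 3))  # ≈ 0.408

Passing correction=False keeps the plain Pearson chi-square used in the definition above; for small 2 by 2 tables you may prefer the default Yates correction.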

Cramér’s V - Examples

A scientist wants to know if music preference is related to study major. He asks 200 students, resulting in the contingency table shown below.

[Contingency table: observed frequencies of music preference by study major]

These raw frequencies are just what we need for all sorts of computations, but they don't show much of a pattern. The association, if any, between the variables is easier to see if we inspect row percentages instead of raw frequencies. Things become even clearer if we visualize our percentages in stacked bar charts.

Cramér’s V - Independence

In our first example, the variables are perfectly independent: \(\chi^2\) = 0. According to our formula, chi-square = 0 implies that Cramér’s V = 0. This means that music preference “does not say anything” about study major. The associated table and chart make this clear.

[Table and stacked bar chart: row percentages, first sample (independence)]

Note that the frequency distribution of study major is identical in each music preference group. If we'd like to predict somebody’s study major, knowing his music preference does not help us in the least. Our best guess is always law or “other”.

Cramér’s V - Moderate Association

A second sample of 200 students shows a different pattern. The row percentages are shown below.

[Table: row percentages of study major by music preference, second sample]

This table shows considerable association between music preference and study major: the frequency distributions of study majors differ across the music preference groups. For instance, 60% of all students who prefer pop music study psychology. Those who prefer classical music mostly study law. The chart below visualizes our table.

[Stacked bar chart: row percentages, second sample]

Note that music preference says quite a bit about study major: knowing the former helps a lot in predicting the latter. For these data

  • \(\chi^2 \approx\) 113;
  • our sample size N = 200 and
  • our variables have 4 and 5 categories, so k = 4 and (k - 1) = 3.

It follows that

$$\phi_c = \sqrt{\frac{113}{200(3)}} = 0.43.$$

which is substantial but not super high since Cramér’s V has a maximum value of 1.
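As a quick sanity check, the same arithmetic in Python, using the numbers from the example above:

import math

chi2, n, k = 113, 200, 4             # chi-square, sample size, lesser number of categories
v = math.sqrt(chi2 / (n * (k - 1)))
print(round(v, 2))                   # 0.43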

Cramér’s V - Perfect Association

In a third, and last, sample of students, music preference and study major are perfectly associated. The table and chart below show the row percentages.

[Table and stacked bar chart: row percentages, third sample (perfect association)]

If we know a student’s music preference, we know his study major with certainty. This implies that our variables are perfectly associated. Do notice, however, that it doesn't work the other way around: we can't tell someone’s music preference with certainty from his study major, but this is not necessary for perfect association. For these data, \(\chi^2\) = 600, so

$$\phi_c = \sqrt{\frac{600}{200(3)}} = 1,$$

which is the very highest possible value for Cramér’s V.

Alternative Measures

  • An alternative association measure for two nominal variables is the contingency coefficient. However, it's better avoided since its maximum value depends on the dimensions of the contingency table involved [3, 4].
  • For two ordinal variables, a Spearman correlation or Kendall’s tau are preferable over Cramér’s V.
  • For two metric variables, a Pearson correlation is the preferred measure.
  • If both variables are dichotomous (resulting in a 2 by 2 table) use a phi coefficient, which is simply a Pearson correlation computed on dichotomous variables.

Cramér’s V - SPSS

In SPSS, Cramér’s V is available from Analyze > Descriptive Statistics > Crosstabs. Next, fill out the dialog as shown below: under Statistics, check Phi and Cramer's V.

[Screenshot: SPSS Crosstabs dialog with the Statistics options]

Warning: for tables larger than 2 by 2, SPSS returns nonsensical values for phi without throwing any warning or error. These are often > 1, which isn't even possible for Pearson correlations. Oddly, you can't request Cramér’s V without getting these crazy phi values.

Final Notes

Cramér’s V is also known as Cramér’s phi (coefficient) [5]. It is an extension of the aforementioned phi coefficient to tables larger than 2 by 2, hence its notation as \(\phi_c\). It has been suggested that \(\phi\) was replaced by “V” because old computers couldn't print the letter \(\phi\) [3].

Thank you for reading.

References

  1. Van den Brink, W.P. & Koele, P. (2002). Statistiek, deel 3 [Statistics, part 3]. Amsterdam: Boom.
  2. Field, A. (2013). Discovering Statistics with IBM SPSS. Newbury Park, CA: Sage.
  3. Howell, D.C. (2002). Statistical Methods for Psychology (5th ed.). Pacific Grove CA: Duxbury.
  4. Slotboom, A. (1987). Statistiek in woorden [Statistics in words]. Groningen: Wolters-Noordhoff.
  5. Sheskin, D. (2011). Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, FL: Chapman & Hall/CRC.

How is Cramer's V calculated in Python?

Let us calculate Cramer's V for a 3 × 3 table, where

  • X2 is the chi-square statistic;
  • N is the total sample size;
  • R is the number of rows and
  • C is the number of columns.

With these symbols, the formula above reads

$$V = \sqrt{\frac{\chi^2}{N \cdot \min(R - 1, C - 1)}}$$

since \(\min(R - 1, C - 1) = k - 1\).
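A minimal, self-contained sketch of this calculation; the 3 × 3 counts below are hypothetical, chosen only to illustrate the steps.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3 x 3 contingency table of observed counts
observed = np.array([[30, 15,  5],
                     [10, 25, 15],
                     [ 5, 10, 35]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)  # X2
n = observed.sum()                   # N: total sample size
r, c = observed.shape                # R rows, C columns
v = np.sqrt(chi2 / (n * min(r - 1, c - 1)))
print(round(float(v), 3))            # ≈ 0.431 for these made-up counts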

Should I use phi or Cramer's V?

Cramer's V is used to examine the association between two categorical variables when the contingency table is larger than 2 × 2 (e.g., 2 × 3). In these more complicated designs, phi is not appropriate, but Cramer's statistic is. Cramer's V represents the association or correlation between the two variables.

How do you find the correlation between two categorical variables in Python?

If a categorical variable has only two values (e.g., true/false), we can convert it into a numeric variable (0 and 1). Since it then becomes numeric, we can find the correlation using the DataFrame.corr() function.
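A minimal sketch of that approach, assuming two made-up true/false columns:

import pandas as pd

df = pd.DataFrame({
    "is_smoker":   ["true", "false", "true", "false", "true"],
    "has_disease": ["true", "false", "false", "false", "true"],
})

# Map the two-valued categories to 0/1 so an ordinary Pearson correlation applies
encoded = df.apply(lambda s: s.map({"true": 1, "false": 0}))
print(encoded.corr())   # for 2 x 2 data this equals the phi coefficient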

What is the chi-square effect size?

Cramér's V is an effect size measure for the chi-square test of independence. It measures how strongly two categorical fields are associated. The effect size is calculated in the following manner: determine which field has the fewest categories, subtract 1 from that number, multiply it by the total number of records, divide the chi-square value by the result, and take the square root.