Cohen’s kappa coefficient in Python


If you have some experience in machine learning and have evaluated classification results, you have certainly used metrics such as accuracy, precision or F1-score. However, there are also other, less popular metrics that are worth using. One of them is Cohen’s kappa. Today I want to bring you closer to this metric, convince you to use it and show you how to do it easily in Python.

Theoretical introduction

Cohen’s kappa is defined as the degree of agreement between two measurements of the same variable made under different conditions. The variable can be measured by two different raters, or one rater can measure it twice. The coefficient is determined for dependent categorical variables. It is expressed by the formula κ = (pₒ − pₑ) / (1 − pₑ), where pₒ is the relative observed agreement among raters and pₑ is the hypothetical (expected) agreement that would occur by chance. The value of the coefficient is less than or equal to 1. A value of 1 is interpreted as full agreement, and a value of 0 means the agreement is no better than chance.
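To make the definition concrete, here is a minimal sketch that computes pₒ and pₑ by hand and combines them as (pₒ − pₑ) / (1 − pₑ). The function name cohen_kappa and the two small lists of ratings are made up for illustration; they are not part of any library.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # p_o: fraction of items on which the two raters agree
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # p_e: agreement expected by chance, from each rater's label frequencies
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one and the same label for everything
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of ten items by two raters
rater_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
rater_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]

kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 3))  # p_o = 0.8, p_e = 0.52, so kappa = 0.28 / 0.48 ≈ 0.583
```

The raters agree on 8 of 10 items, but because both say "yes" 60% of the time, more than half of that agreement would be expected by chance alone, which is why kappa is noticeably lower than 0.8.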

Cohen’s kappa in machine learning

In machine learning, this measure tells us how much better the tested classifier is than a classifier that simply guesses at random according to the frequency of each class. Accuracy is a popular and very intuitive metric for checking classification performance, but it can be misleading on imbalanced datasets, where always predicting the majority class already gives a high score. Cohen’s kappa, on the other hand, handles this kind of dataset. One can say that it is a normalized measure of accuracy, and it allows the model to predict the minority class even when the predicted probability of that class is not the highest[1].
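The difference between the two metrics is easiest to see on a toy imbalanced problem. The sketch below uses invented labels (95 negatives, 5 positives) and a hypothetical helper accuracy_and_kappa that treats kappa as accuracy rescaled against the accuracy a frequency-matched random guesser would achieve.

```python
def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, kappa) for two equally long lists of class labels."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Accuracy expected from guessing with the same label frequencies
    labels = set(y_true) | set(y_pred)
    chance = sum(y_true.count(c) * y_pred.count(c) for c in labels) / (n * n)

    if chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    kappa = (accuracy - chance) / (1 - chance)
    return accuracy, kappa


# Invented test set: 95 samples of class 0 and 5 samples of class 1
y_true = [0] * 95 + [1] * 5

# A "classifier" that always predicts the majority class
always_majority = [0] * 100
acc_major, kappa_major = accuracy_and_kappa(y_true, always_majority)
print(acc_major, kappa_major)  # 0.95 0.0

# A classifier that finds 4 of the 5 positives at the cost of 3 false alarms
useful = [0] * 92 + [1] * 3 + [1] * 4 + [0] * 1
acc_useful, kappa_useful = accuracy_and_kappa(y_true, useful)
print(acc_useful, round(kappa_useful, 3))  # 0.96 0.646
```

Accuracy barely separates the two models (0.95 versus 0.96), while kappa shows that the first one is no better than guessing and the second one carries real information about the minority class.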


Example in Python

If you want to use this metric in Python, you can do that with the sklearn library. Just import the following function:

from sklearn.metrics import cohen_kappa_score

Let me skip the implementation of the classification algorithm and move on to assessing the classification with the metric discussed today. You can find a sample implementation of the classifier in my article about the Naive Bayes Classifier, which I published some time ago on Medium.
Let’s suppose that you want to use the metric to evaluate a Naive Bayes classifier. You need an array with the test class labels and an array with the predictions.

cohen_kappa_score(labels_of_testing_set, predictions)
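For readers who want something runnable end to end, here is a minimal sketch of how the two arrays might be produced. It is not the classifier from the article mentioned above: the synthetic dataset from make_classification, the 90/10 class weights, the choice of GaussianNB and the random_state values are all assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced binary problem (roughly 90% class 0, 10% class 1)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Stratify so the test set keeps the same class proportions
X_train, X_test, y_train, labels_of_testing_set = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Train a Gaussian Naive Bayes model and predict the held-out labels
model = GaussianNB().fit(X_train, y_train)
predictions = model.predict(X_test)

# Report both metrics side by side
accuracy = accuracy_score(labels_of_testing_set, predictions)
kappa = cohen_kappa_score(labels_of_testing_set, predictions)
print("accuracy:", accuracy)
print("kappa:", kappa)
```

The exact numbers depend on the generated data and the sklearn version, so treat the printed values as something to inspect rather than to reproduce.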


In this article I wanted to introduce you to the Cohen’s kappa metric. Of course, I don’t diminish the importance of accuracy, but it’s always worth checking the results against different metrics. Cohen’s kappa is especially useful for the imbalanced datasets I mentioned earlier.


  1. Rosario Delgado, Xavier-Andoni Tibau, Why Cohen’s Kappa should be avoided as performance measure in classification, PLoS ONE, 2019
