Credit card transaction classification with Bernoulli Naive Bayes

Kaggle has provided a dataset containing credit card transactions, some of which are fraudulent. A natural question that arises is whether we can classify new transactions as regular or fraudulent based on the data collected so far. This is a binary classification problem, so we can choose a classifier which is suitable for this type of problem.

I have heard great things about naive Bayes in general, but never had the opportunity to test it on a dataset of sufficient size. From my previous tests (mainly with random forests) I learned that small datasets are not sufficient to obtain high classification accuracy, the metric that tells us what fraction of the predicted labels match the actual ones. Accuracy allows us to compare different classifiers on the same data. Naive Bayes is said to be fast and scalable, quickly arriving at acceptable results; other classifiers can do better, but often require more CPU time. This allows us to use it as a baseline classifier and compare others against it.

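In the case of the credit card fraud dataset, we will use Bernoulli Naive Bayes. From the scikit-learn documentation we can see that "BernoulliNB is designed for binary/boolean features." Features that are not already binary are thresholded according to its binarize parameter, which defaults to 0.0. Here is the code: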

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.metrics import accuracy_score

    df = pd.read_csv('creditcard.csv')
    labels = df['Class']
    del df['Class']  # now df contains only the observations

    fraudulent_transactions = labels[labels == 1].shape[0]
    all_transactions = df.shape[0]

    # Use 20% of the dataset for testing
    X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.20, random_state=42)

    # Train on the training set, then predict labels for the test set
    bnb = BernoulliNB()
    bnb.fit(X_train, y_train)
    y_pred = bnb.predict(X_test)

    print('Percentage of fraudulent transactions: {0:.5f}%'.format((fraudulent_transactions / all_transactions) * 100))
    print('Classification accuracy: {0:.5f}%'.format(accuracy_score(y_test, y_pred) * 100))

    """
    Percentage of fraudulent transactions: 0.17275%
    Classification accuracy: 99.91047%
    """
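Since BernoulliNB is meant for binary features, it may help to see what it does with continuous inputs. The sketch below uses synthetic data (an assumption for illustration, not the Kaggle file) and compares the default thresholding with binarizing the features by hand before training.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic continuous features and labels, invented for this illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Default behaviour: each feature is thresholded at binarize=0.0 internally.
implicit = BernoulliNB(binarize=0.0).fit(X, y)

# The same step done by hand: binarize first, then disable the internal thresholding.
X_binary = (X > 0.0).astype(float)
explicit = BernoulliNB(binarize=None).fit(X_binary, y)

same_predictions = np.array_equal(implicit.predict(X), explicit.predict(X_binary))
print('Identical predictions:', same_predictions)
```

In other words, with the default settings the classifier only sees whether each feature value is above zero. Whether that threshold is a sensible choice depends on how the features are scaled.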

We use Pandas to load our data into memory. df stands for dataframe, which is a 2D table of data, usually having labeled columns. Each column is accessible with df['Column name'], which makes it very convenient for exploratory analysis of large datasets, even on a slow machine. In many cases we can type a line like df = df[['Column name1', 'Column name2', 'Column name3']] to limit the data to the columns we need for further analysis. To select multiple columns we use double brackets; to select a single column, single brackets are sufficient. To get the rows matching a given criterion, for instance an amount greater than 100, we can type df[df['Amount'] > 100]. For more information, please refer to the Pandas documentation.
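The three selection patterns described above can be tried on a tiny hand-made table. The values below are invented, and the 'Hour' column is a hypothetical addition used only to have something to leave out.

```python
import pandas as pd

# Small invented table; 'Amount' and 'Class' follow the names used in this post,
# 'Hour' is a hypothetical extra column.
df = pd.DataFrame({
    'Hour': [9, 13, 22, 23],
    'Amount': [149.62, 2.69, 378.66, 99.99],
    'Class': [0, 0, 1, 0],
})

amounts = df['Amount']              # single brackets: one column, returned as a Series
subset = df[['Amount', 'Class']]    # double brackets: several columns, returned as a DataFrame
large = df[df['Amount'] > 100]      # boolean mask: only the rows where the condition holds

print(type(amounts).__name__)   # Series
print(list(subset.columns))     # ['Amount', 'Class']
print(len(large))               # 2
```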

The train-test split is used to divide the data into a training set and a test set. We train the classifier on the training set and evaluate it on the test set. The fit method is used for the training (here it accepts the training observations and training labels), whereas the predict method is used for the testing (accepting the test observations). The predict method returns the predictions, which can then be compared against the true labels.

As an overview, we can see that the percentage of actual fraudulent transactions is quite small, below 0.2%. In our case we achieve a classification accuracy of 99.91%, which means that 99.91% of the labels in the test set match the predictions of the classifier. This figure needs some context: because fewer than 0.2% of the transactions are fraudulent, a classifier that labels every transaction as regular would already score roughly 99.8% on this data, so accuracy alone says little about how many fraudulent transactions are actually caught. With that in mind, the result suggests that we can apply this classifier to new, unseen data of the same type and expect similar accuracy (the test set artificially plays the role of "unseen data"). It also helps that this dataset is both large and feature-rich, which is relatively rare. In most cases we don't have enough data, or it is relatively expensive to obtain or create by hand.
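Because fraud is so rare in this dataset, one way to put an accuracy figure into context is to compare it with a trivial predictor that never flags anything. The sketch below uses invented labels (1000 transactions, 2 of them fraudulent), not the Kaggle data, and reports recall and the confusion matrix alongside accuracy.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Invented labels: 998 regular transactions (class 0) and 2 fraudulent ones (class 1).
y_true = [0] * 998 + [1] * 2

# A "classifier" that labels every transaction as regular.
y_always_regular = [0] * 1000

baseline_accuracy = accuracy_score(y_true, y_always_regular)
baseline_recall = recall_score(y_true, y_always_regular)  # share of fraud cases caught

print('Baseline accuracy: {0:.1f}%'.format(baseline_accuracy * 100))        # 99.8%
print('Baseline recall on fraud: {0:.1f}%'.format(baseline_recall * 100))   # 0.0%

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_always_regular))
```

The same three calls can be applied to y_test and y_pred from the listing above to see how far the Bernoulli Naive Bayes result sits above this kind of baseline.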