An attempt to classify butterflies

Suppose that you have a set of images containing butterflies and you want to automatically assign each of them a label according to the class it belongs to, without having to look at all the images manually. The butterflies dataset provided by the Ponce Group contains many such images of different sizes, taken under a variety of lighting conditions and viewing angles. To attempt such a classification, we first need to preprocess the images so that they all have the same size. In our case we choose 150x103 pixels, which avoids upscaling as much as possible. We might think that this would be sufficient to construct our 2D matrix of samples and features, but it turns out that one of the images contains data that doesn't have the shape all the others have, despite having the same size. This breaks the code, and it may take hours before we find the problematic image (mch023.jpg) and exclude it from our data.
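
As a rough sketch of this preprocessing step (assuming the images sit in a local `butterflies/` directory and that the class is encoded in the filename prefix; both are details of my hypothetical setup, not guarantees of the dataset):

```python
import os

import numpy as np
from PIL import Image

IMAGE_DIR = "butterflies"  # hypothetical location of the downloaded images
TARGET_SIZE = (150, 103)   # (width, height) in pixels

images, labels = [], []
for fname in sorted(os.listdir(IMAGE_DIR)):
    if not fname.lower().endswith(".jpg"):
        continue
    arr = np.asarray(Image.open(os.path.join(IMAGE_DIR, fname)).resize(TARGET_SIZE))
    # One file (mch023.jpg) decodes to a different shape than the rest;
    # checking here is much cheaper than debugging a crash hours later.
    if arr.shape != (103, 150, 3):
        print(f"Skipping {fname}: unexpected shape {arr.shape}")
        continue
    images.append(arr)
    labels.append(fname.split(".")[0].rstrip("0123456789"))  # e.g. "mch023" -> "mch"
```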

This dataset is relatively small (618 images), so there aren't many examples of each class to use during the training phase. This means our classification results will likely be less accurate than they would be on a larger dataset. If we use 50% of the images for training and 50% for testing, our classifier gets to see only 309 images before it has to make predictions.

Since each image consists of many pixels, each an (R, G, B) tuple, one way to turn it into a row of our 2D matrix is to ravel the data, which gives 150x103x3 = 46350 individual values per image. We could reduce the dimensionality further with PCA, but this is a lossy process that may hurt our classification accuracy, so we avoid it here.
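
Concretely, the flattening and the 50/50 split from the previous paragraph could look like this (a sketch continuing from the `images` and `labels` lists above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Each (103, 150, 3) image becomes one row of 150 * 103 * 3 = 46350 features.
X = np.array([img.ravel() for img in images])
y = np.array(labels)

# 50% for training, 50% for testing: the classifier sees only ~309
# images before it has to make predictions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)
```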

There are many different classifiers we could potentially test on this data: AdaBoost, k-nearest neighbors, random forests, extra trees, support vector machines (with or without grid search), even convolutional neural networks. With default settings, AdaBoost obtains a classification accuracy of 0.313916 on this dataset, which seems strikingly low compared to its results on some other datasets; this suggests that we likely need more data. Gradient boosting didn't finish in a reasonable time. A random forest with default settings achieved 0.381877, a slightly better result. Extra trees with default settings achieved 0.446602, again a little more. Once we use more trees (in our case 2000, found by trial and error), the result improves to 0.546926, which is still quite low. A support vector machine requires a lot of parameter tuning, and when the adjustments are slightly off, the accuracy can drop dramatically. We can again do better by trial and error, but this is very time-consuming, unintuitive, and amounts to exactly the kind of manual supervision we want to avoid. Even with an exhaustive grid search, the best result we could obtain was ≈0.48. We might start thinking that we aren't using the best possible tool for the task.
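
For reference, here is roughly how these baselines can be reproduced with scikit-learn (a sketch assuming the `X_train`/`X_test` split from above; exact accuracies will vary with the split and random seeds, and the SVM grid shown is only illustrative, not the one I actually searched):

```python
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

classifiers = {
    "AdaBoost (default)": AdaBoostClassifier(),
    "Random forest (default)": RandomForestClassifier(),
    "Extra trees (default)": ExtraTreesClassifier(),
    # 2000 trees: the value found above by trial and error
    "Extra trees (2000)": ExtraTreesClassifier(n_estimators=2000, n_jobs=-1),
    # An illustrative (and much smaller) parameter grid for the SVM
    "SVM (grid search)": GridSearchCV(
        SVC(), {"C": [1, 10, 100], "gamma": ["scale", 1e-3, 1e-4]}, n_jobs=-1),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, clf.predict(X_test)):.6f}")
```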

Some people have mentioned that for visual tasks like image classification, convolutional neural networks usually outperform other methods. So let's look at one possible layer configuration (I tried multiple):

Fig. 1: Structure of a sample convolutional neural network (CNN) for image classification

There are three convolutional layers, each of whose neurons uses ReLU as its activation function, and each convolutional layer is followed by a max pooling layer.
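
In Keras this structure could be expressed roughly as follows; the filter counts, kernel sizes, and dense-layer width are illustrative assumptions, since only the overall 3x(convolution + max pooling) layout is fixed here:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(103, 150, 3)),
    # Three convolution + max pooling stages, every neuron using ReLU.
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(7, activation="softmax"),  # one output per butterfly class
])

model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
```

Training it would additionally require the unflattened image array and one-hot encoded labels (e.g. via `tensorflow.keras.utils.to_categorical`).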

Compiling a model with this structure using categorical crossentropy as the loss function, Adam as the optimizer, and accuracy as the metric, we observe the following over the first four epochs:

Fig. 2: Classification accuracy obtained at each of the four epochs of training the previously shown CNN

Again, this result is very far from satisfactory, considering that in the original paper where the butterflies dataset was first presented, Svetlana Lazebnik et al. reported a total classification accuracy of slightly over 90%. And as you can see, the more complex we make the layer structure, the longer the training takes. At 323s per epoch, completing all four epochs took approximately 22 minutes. And our model consists of only 3 convolutional layers; in practice, convolutional neural networks with hundreds of layers running for hundreds of epochs become prohibitively expensive to train on a regular machine. We may also notice that beyond a certain point, adding more epochs doesn't improve the result and may even adversely impact it.

This raises the question of how useful CNNs are relative to the amount of computation they require. Reporting slightly "better" results after some machine in the background has worked its way through a complex model for a very long time doesn't seem like a sustainable improvement model. Humans also require quite a lot of training time, particularly on the theory that the neural network will have to use; this time can't be neglected either and must be taken into account when deciding on a particular approach. If the neural network improves the result by 10-15% but requires 10-20x the amount of computation, its use may not be justified.

That said, the nice thing about the extra trees classifier is that it gives respectable results in a very short time compared to the alternatives. (What is surprising is that this holds for images as well.) It works well by default and doesn't have too many parameters to tune, which means that by design it is harder to misuse. Below you can see the classification of the images from the test set, where properly classified images are shown in green and misclassified ones in red. Click on the image if you want to see all butterflies.

Fig. 3: Classification of the butterflies in the test set
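
An overview like this can be drawn with matplotlib along these lines (a sketch assuming the fitted 2000-tree extra trees model from the earlier snippet; it marks correctness via colored titles rather than whatever styling the original figure uses, and shows only the first 40 test images):

```python
import matplotlib.pyplot as plt

best = classifiers["Extra trees (2000)"]
y_pred = best.predict(X_test)

fig, axes = plt.subplots(5, 8, figsize=(16, 10))
for ax, row, true, pred in zip(axes.ravel(), X_test, y_test, y_pred):
    ax.imshow(row.reshape(103, 150, 3).astype("uint8"))
    # Green title for a correct prediction, red for a wrong one.
    ax.set_title(pred, color="green" if pred == true else "red", fontsize=8)
    ax.axis("off")
plt.tight_layout()
plt.show()
```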