Helping Machines Visualize Our World

Headless Convolutional Neural Network as Feature Extractor

Haoming Koo
8 min read · May 22, 2021
Helping machines “see” by using digital images and deep learning models

Helping machines understand the contents of digital images allows for all sorts of applications, such as image segmentation, object detection, facial recognition, edge detection, pattern detection, image classification, feature matching, and many more! In this article, we will focus on feature matching, where features are learnt from an image and matched against other images for image retrieval. Note that the code used here can also be found on my GitHub.

This is a series of Medium posts written by four NUS SCALE master's students (MSc in Industry 4.0) taking ISY5004 Intelligent Sensing Systems. Here's a snapshot of what we'll be sharing:

Transfer Learning

A pre-trained convolutional neural network (CNN) is a great starting point for a computer vision task, since developing a neural network from scratch takes a vast amount of time and resources. Using transfer learning, such models can be repurposed for a second task; in this project, transfer learning was used for feature extraction. The model's top layers, which normally pool the convolutional output into a 1D vector for classification, were taken off, and the remaining portion of the model was used for feature extraction with weights trained on ImageNet as a starting point. An image is passed through the network's weights to give a feature vector, which can be used for distance matching against another image. Shorter distances suggest a closer match.
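
To make the distance-matching idea concrete, here is a minimal sketch (not from the original post) of comparing two feature vectors with an L2 distance; the vector names and the 2048-dimension size are purely illustrative:

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L2 distance between two flattened feature vectors; smaller means a closer match."""
    return float(np.linalg.norm(a - b))

# Illustrative stand-ins for feature vectors produced by a headless CNN
query_vec = np.random.rand(2048)
candidate_vec = np.random.rand(2048)
print(euclidean_distance(query_vec, candidate_vec))
```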

Pre-Processing

Let's load a dataset of your choice and get the paths to its images. The dataset used for this short write-up has the following structure:

hipvan_image/
beds/
decor/
kids/
....
ikea_image/
Acuoustic panels & acoustic arts/
Baby product/
Children/
....

Hence, the following code iterates through the folders to create a list of file paths.
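
The original code gist is not embedded here, so below is a minimal sketch of how that iteration might look; the `build_path_list` name and the image-extension filter are illustrative assumptions:

```python
import os

def build_path_list(root_dirs):
    """Walk each dataset root and collect image paths, labelled by class folder."""
    file_paths, labels = [], []
    for root in root_dirs:
        for class_name in sorted(os.listdir(root)):
            class_dir = os.path.join(root, class_name)
            if not os.path.isdir(class_dir):
                continue
            for fname in os.listdir(class_dir):
                if fname.lower().endswith((".jpg", ".jpeg", ".png")):
                    file_paths.append(os.path.join(class_dir, fname))
                    labels.append(class_name)
    return file_paths, labels

file_paths, labels = build_path_list(["hipvan_image", "ikea_image"])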

Splitting the Train/Test Sets

Here, we will prepare our train/test sets for the performance evaluation that will be used to select the ideal model for our image retrieval use case:

  • Drop classes with fewer than 5 objects, as they will cause trouble later when splitting the dataset.
  • Split the data into an 80/20 train/test split (a sketch of both steps follows below).
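
A minimal sketch of both steps, assuming the file_paths and labels lists from the previous step and scikit-learn's train_test_split:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Drop classes with fewer than 5 images so a stratified split is possible
counts = Counter(labels)
keep = [i for i, lab in enumerate(labels) if counts[lab] >= 5]
file_paths = [file_paths[i] for i in keep]
labels = [labels[i] for i in keep]

# 80/20 stratified train/test split
train_paths, test_paths, train_labels, test_labels = train_test_split(
    file_paths, labels, test_size=0.2, stratify=labels, random_state=42
)
```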

Keras Feature Extractor

Now that we have all the data prepared, let's get to the fun part. Running the following will request a user input that is used to generate a headless model. By headless, we mean there is no classifier at the end of the neural network; what we get instead is a feature vector that can be used for matching in an image retrieval setting.
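
The gist itself is not shown here, so below is a minimal sketch of what such a headless-model builder might look like; the `builders` mapping and the pooling="avg" choice (which yields the 2D conv_features mentioned later) are assumptions on my part:

```python
import tensorflow as tf

def build_headless_model():
    """Prompt for a Keras Applications model name and return it without its classifier head."""
    name = input("Model name (e.g. xception, vgg16, inceptionv3, efficientnetb7): ").strip().lower()
    builders = {
        "xception": tf.keras.applications.Xception,
        "vgg16": tf.keras.applications.VGG16,
        "inceptionv3": tf.keras.applications.InceptionV3,
        "efficientnetb7": tf.keras.applications.EfficientNetB7,
    }
    # include_top=False drops the classification head; pooling="avg" collapses
    # the final feature map so predictions come out as a 2D (batch, features) array
    model = builders[name](weights="imagenet", include_top=False, pooling="avg")
    model.summary()
    return model

model = build_headless_model()
```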

Go ahead and key in a model name, such as xception, vgg16, inceptionv3, or efficientnetb7. For more available models, and for an idea of their performance on ImageNet, refer to https://keras.io/api/applications/

One thing to note: models with a larger number of parameters run much slower.

Functions for Feature Extraction

If the summary of the model is displayed, you have successfully loaded the model! To extract the feature vector for each image, we will iterate through the list of file paths generated earlier. The conv_features are 2D for the headless models presented above.
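
As a sketch, an extraction function along these lines would work, assuming the headless model built above; the generic imagenet_utils preprocessing is a simplification, since each Keras application also ships its own preprocess_input:

```python
import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.imagenet_utils import preprocess_input

def extract_features(img_path, model, target_size=(224, 224)):
    """Load one image, run it through the headless model, and return a flat feature vector."""
    img = image.load_img(img_path, target_size=target_size)
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)  # add the batch dimension
    x = preprocess_input(x)        # generic ImageNet preprocessing (a simplification)
    conv_features = model.predict(x)  # 2D array of shape (1, n_features) with pooling="avg"
    return conv_features.flatten()
```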

Performing the Feature Extraction

The earlier defined functions will be called, and the feature vector of each image will be stored in a NumPy array, which we can save for future use!
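
A minimal sketch, assuming the extract_features function and the train/test path lists from earlier:

```python
import numpy as np

# Extract a feature vector for every train/test image and stack them into arrays
train_features = np.array([extract_features(p, model) for p in train_paths])
test_features = np.array([extract_features(p, model) for p in test_paths])

# Save to disk so extraction does not need to be repeated
np.save("train_features.npy", train_features)
np.save("test_features.npy", test_features)
```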

Performance Measures

Five popular models, EfficientNetB7, InceptionV3, NASNetLarge, ResNet50, and VGG16, were evaluated over three runs to average the evaluation metrics. To pick the best model for the use case, it was necessary to use well-established evaluation metrics, plus modified versions to handle the multi-label approach. The chosen metrics were accuracy, precision, and parameter count. Macro soft F1 and macro F1 were also evaluated but, due to the multi-class problem, were dropped; this is discussed later in this section.

Defining the Performance Metrics

These will be elaborated part by part in the subsequent sections.
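
Since the metric gists are not embedded here, below is a sketch of what the single-label versions of these metrics could look like; the multi-label (modified) variants would compare against each object's full label set instead:

```python
import numpy as np

def top5_matches(query_vec, gallery_vecs):
    """Return indices of the 5 gallery vectors closest (L2) to the query vector."""
    dists = np.linalg.norm(gallery_vecs - query_vec, axis=1)
    return np.argsort(dists)[:5]

def top5_accuracy(target_label, retrieved_labels):
    """1 if the target label appears among the 5 nearest matches, else 0."""
    return int(target_label in retrieved_labels)

def top5_precision(target_label, retrieved_labels):
    """Fraction of the 5 nearest matches whose label equals the target label."""
    return sum(lab == target_label for lab in retrieved_labels) / 5
```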

Generating the Performance Metrics

The following code executes the earlier defined performance metrics to determine the accuracy and precision of the models.
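
A sketch of that evaluation loop, assuming the feature arrays and label lists from the previous steps:

```python
import numpy as np

# Evaluate every test image against the training gallery
acc_scores, prec_scores = [], []
for vec, label in zip(test_features, test_labels):
    idx = top5_matches(vec, train_features)
    retrieved = [train_labels[i] for i in idx]
    acc_scores.append(top5_accuracy(label, retrieved))
    prec_scores.append(top5_precision(label, retrieved))

print("Top 5 Mean Accuracy:", np.mean(acc_scores))
print("Top 5 Mean Precision:", np.mean(prec_scores))
```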

Accuracy

Accuracy is the number of images with the right label matched, out of the total number of predictions. Top 5 accuracy was chosen due to the multi-label challenge.

Top 5 Accuracy measures how often the target class falls within the top 5 predicted classes; in an image classification problem, these are the top 5 values of the softmax distribution. For our use case, it measures whether a matching label appears among the top 5 images with the shortest distance to the single target label being looked up.

Top 5 Accuracy = No. of Correct Single Class Predictions / No. of Predictions

In this write-up, a slightly modified version of Top 5 accuracy was used, where the predicted labels are matched against the target object's multi-labels.

Modified Top 5 Accuracy = No. of Correct Multi-Label Predictions / No. of Predictions

Image augmentation overall improved the accuracy, as the misclassified items were observed to come predominantly from labels with smaller datasets. The team tried including class weights to balance the labels, but this did not improve the models' performance. Based on the Top 5 Mean Accuracy and Modified Accuracy shown in Fig 1 and compiled in Table 4, EfficientNetB7 outperformed the other models on this metric, suggesting that it is the model most likely to return an accurate label within the top 5 closest matches. NASNetLarge and ResNet50 were the next best models and were comparable to one another.

Fig 1: Bar Charts for Top 5 Mean Accuracy and Top 5 Modified Mean Accuracy

Precision

Precision, also known as positive predictive value, is the proportion of relevant instances amongst the retrieved instances. This is important for users to receive relevant search results. The team proposes using the precision of the top 5 closest matches' labels against the target's single label, alongside a slightly modified version matching the top 5 closest matches' labels against the target's multi-labels.

Top 5 Precision = True Positives (single label) / (True Positives + False Positives) among the 5 retrieved matches

Top 5 Modified Precision = True Positives (multi-label) / (True Positives + False Positives) among the 5 retrieved matches

EfficientNetB7 tops the precision metrics, with ResNet50 and NASNetLarge as the next best models, as seen in the bar charts of Fig 2.

Fig 2: Bar Charts for Top 5 Mean Precision and Top 5 Modified Mean Precision

Parameters

Lower computational cost, in floating point operations (FLOPs), is needed for close-to-real-time inference; this enables quicker searches and less computation cost. Hence, Fig 3 shows scatter plots of the accuracy and precision metrics above against each model's parameter count; the most ideal models fall towards the top left of the scatter plots. The team chose EfficientNetB7, which had a moderate number of parameters while achieving high accuracy and precision. The worst performing model in our set of experiments was VGG16, with lower accuracy and precision while requiring a high number of parameters.

Fig 3: Scatter plots for Top1/5 Mean Accuracy/Precision vs Parameters

The last metric the team attempted was the macro F1-score: the harmonic mean of precision (p) and recall (r), F1 = 2pr / (p + r), calculated per label and averaged across all labels. The values were stored in Table 6 and appear to have saturated. Dropping objects with more than 5 labels worsened the macro F1-score. The original intent was to achieve a high enough macro F1-score before fine-tuning the weights. Note that the team also tried weighted classes, but this did not improve the macro F1-score. Thus, the team selected the CNN feature extractor based on accuracy, precision, and parameters.

Conclusion

In this post, we covered how to extract features using the CNN models available in Keras Applications. We also evaluated the models' performance, where EfficientNetB7 transferred well to the scraped dataset and outperformed various CNN models in terms of accuracy, precision, and efficiency, as discussed in the experimental results above. Hence, EfficientNetB7 was chosen as the CNN for feature extraction, producing much more relevant matches than several popular models.

Now, with the right model on hand, let's see how to integrate it into your own Telegram bot and host it on the cloud. Onwards to part 5!

Content Page


Haoming Koo

Senior Process and Equipment Engineer at Micron, implementing data-driven solutions for Semiconductor Wet Processes