Introduction

Recently, the amount of available data has increased considerably owing to the development of the Internet of Things, technological devices, and computational machines. Because of the widespread usage of these ubiquitous technologies, high volumes of diverse data, such as digital images, texts, speech, and various combinations of these, have been generated. Among these types of data, images constitute a large portion of the available data1.

Because of the accessibility of digital image data from cameras and sensors, these data need to be processed and analyzed to obtain meaningful results. As digital image data are large in volume and usually complex, sophisticated digital image analysis techniques, such as machine learning (ML) and deep learning (DL), have been used to handle them efficiently2. Several tasks are performed on digital images, such as image classification, semantic segmentation3,4,5, object detection6,7, and instance segmentation8,9. Image classification is a crucial part of digital image analysis and a basic component of the other computer vision tasks because image classification models serve as backbones for the abovementioned more advanced tasks10,11.

Image classification involves the extraction of useful features from a digital image and the classification of the image into one of the pre-defined classes based on the extracted features12,13. Manual verification and classification of digital images can be a laborious and monotonous process; thus, automating the image analysis process by using image classification methods is more efficient and less time-consuming14,15. Recent advances in these methods have facilitated the usage of image classification in several real-world applications, such as medical imaging16,17, face recognition18, human activity recognition19, and traffic control systems20,21.

Numerous studies have been conducted on the usage and importance of image classification. Before the emergence of DL, several traditional methods were used to analyze images. For example, statistical methods, such as maximum likelihood, minimum distance, and parallelepiped classification, are among the most common traditional techniques for image classification22,23. Moreover, a few ML methods, such as k-nearest neighbors, support vector machines, and random forest, have been used24,25. However, traditional image classification methods became obsolete after the introduction of DL methods, which are faster, more efficient, and more accurate. DL methods used for image classification already surpass human-level accuracy when abundant labeled data are available for training26,27.

However, manually labeling millions of available images is a time-consuming and laborious task; thus, obtaining a large amount of manually annotated data for training image classification models is challenging19,28. Consequently, DL-based classification networks have limitations in learning useful features from labeled datasets with a limited number of images. The insufficiency of training data is apparent in many fields, such as medicine and fault detection29,30. The complex structure and large number of trainable parameters of state-of-the-art classification models31,32,33,34 often result in overfitting35. Additionally, the existing state-of-the-art classification networks cannot appropriately extract useful features from small-scale images and often exhibit poor performance on these data. Because the existing methods primarily focus on large-scale images and a prolonged training process, they typically lead to poor generalizability and unsatisfactory outcomes on small-scale images12. Although several lightweight models focus on efficient computation by reducing the number of trainable parameters36,37,38,39, they still encounter the underfitting problem3. These models cannot appropriately learn useful image features, leading to poor classification performance. Moreover, the existing DL-based image classification methods are neither flawless nor fast40; therefore, faster, more efficient, more precise, and more generalizable image classification models are being developed41,42.

By studying the currently available methods for image classification, we identified that these models can be improved in terms of accuracy and speed. Thus, in this study, we propose a novel image classification system called CMSFL-Net; it uses elaborate preprocessing and a carefully designed model architecture. The proposed model benefits from multiscale feature extraction and consecutive feature learning by using various feature maps with different receptive fields (RFs) to achieve better performance in terms of speed and accuracy when compared with the existing state-of-the-art methods. In general, the contributions of this study are as follows:

  • The proposed method employs consecutive propagation of extracted features from various RFs, thus obtaining better classification accuracy using an efficient computation-based small-sized model.

  • The proposed method utilizes an elaborate pre-processing stage and improved consecutive multiscale feature learning that enables it to achieve a better and faster training process.

  • The proposed method exhibits high inference speed owing to an efficient computation-based lightweight model that uses few trainable parameters, which allows the model to be used in real-time applications.

  • The proposed method exhibits excellent generalizability and performance in limited, small-scale, and large-scale image datasets.

  • The proposed method can be employed as a backbone model for other computer vision tasks, such as semantic segmentation, object detection, and instance segmentation, owing to its superiority in feature learning over the existing state-of-the-art DL-based classification models.

The remainder of the manuscript is organized as follows. Section 2 presents a thorough discussion of the existing methods for image classification and their weaknesses. Section 3 explains the proposed methodology in detail. Section 4 provides detailed information on the conducted experiments and their results. Section 5 discusses the findings, including the ablation studies. Finally, Section 6 concludes this study and defines future study topics.

Related works

As discussed in Sect. 1, a vast number of traditional and DL-based approaches have been proposed for image classification. In this section, we focus only on DL-based techniques because traditional approaches are rarely applied to such data owing to their inferior speed and accuracy. The currently available DL-based approaches can be classified into computationally expensive but powerful models and efficient, lightweight models.

Computationally expensive and powerful DL-based models

One of the earliest and most powerful DL-based convolutional neural network (CNN) models for image classification is the residual network (ResNet)31. In that study, the authors reformulated the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions. By learning residual functions, the model is easier to optimize and obtains better accuracy from the increased depth of the network; consequently, it successfully addresses the problem of training deep neural networks (DNNs) and the vanishing gradient problem. Huang et al. proposed a dense convolutional network (DenseNet) that connects each layer to every other layer in a feed-forward fashion32. The authors employed \(L(L+1)/2\) direct connections instead of the traditional L connections of networks with L layers. These direct connections allow the network to handle the vanishing gradient problem, ensure feature sharing, and significantly reduce the number of trainable parameters. Moreover, Xie et al. introduced a highly modularized DL-based classification model architecture that uses repetitive building blocks aggregating a set of transformations with the same topology and introduces a “cardinality” dimension that serves as a crucial factor in addition to the depth and width dimensions33. In this way, they further improved ResNet using more convolutional operations with various filters while retaining the same computational complexity.

Moreover, Gao et al. proposed a novel building block for CNNs by constructing hierarchical residual connections within a single residual block34. The resulting Res2Net represents multiscale features at a granular level and increases the RFs of each layer of the network. Mansilla et al. proposed a method that incorporates anatomical priors in the form of global constraints into the learning process to boost the realism of warped images after registration; the method learns global nonlinear representations of image anatomy using segmentation masks and uses them to constrain the registration step17. Oregi et al. developed a system to address the issue of adversarial attacks by extracting color gradient features from input images at various sensitivity levels to detect manipulations. This technique employs a DCNN to classify an image, whereas a discrimination model analyzes the extracted color gradient features as sequence data to identify the legitimacy of input images2. Wei et al. formulated an interactive visual model that uses self-interaction, mutual interaction, multi-interaction, and adaptive interaction, forming the first interactive completeness of the visual interaction network; they also employ an adaptive adjustment mechanism to enhance the performance of the DCNN model12. Although the aforementioned models achieve state-of-the-art accuracy in image classification, they suffer from inefficient computation and slow training and inference speed because of their extensive numbers of trainable parameters and floating point operations (FLOPs). Moreover, these models generalize poorly to small-scale images because they cannot fully learn useful features from such images within a short period of training. Although DenseNet reduced the number of trainable parameters and the modified versions of ResNet improved feature extraction and accuracy, they remain considerably slower than the lightweight models introduced in the next subsection. Res2Net, in turn, extracts features from splits of the input feature maps at every layer rather than learning features from the initial inputs. Because the inputs to subsequent layers lose information as they propagate through the network, the model can only partially exploit the useful features of the original input, which degrades the classification performance of the network.

Efficient and lightweight DL-based models

ShuffleNet, MobileNet, and MnasNet are the most widely employed lightweight DL-based classification models. They are mainly used in devices with limited computational power because of their efficient computation and small memory requirements. ShuffleNet employs pointwise group convolution and channel shuffle, which allow the model to significantly reduce computational expenses while retaining competitive accuracy43. Ma et al. further improved the original ShuffleNet by introducing ShuffleNet V237. The model considers not only an indirect metric of computational complexity, such as FLOPs, but also direct metrics, such as required memory and device characteristics.

Regarding the other efficient and lightweight models, MobileNet V1 employs a streamlined architecture that utilizes depth-wise separable convolution operations to formulate a lightweight network. The authors of MobileNet V1 introduced two hyper-parameters that allow an engineer to select an appropriate model size based on the problem characteristics. However, MobileNet V1 is still outperformed by standard CNN architecture-based models; to address this issue, MobileNet V2 was proposed44. The model benefits from an inverted residual structure in which the shortcut connections are between the thin bottleneck layers, while the intermediate expansion layer employs depthwise convolution operations to filter features as a source of nonlinearity. MnasNet is based on the MobileNet V2 architecture and introduces lightweight attention modules based on squeeze and excitation into the bottleneck structure38. These modules are placed after the depthwise filters so that attention is applied to the largest image representation. Qian et al. improved MobileNet V2 and proposed MobileNet V3, which uses modified swish nonlinearities by replacing the original sigmoid function with the hard sigmoid to alleviate the vanishing gradient problem and ensure better accuracy39. In general, lightweight models achieve a good trade-off between speed and accuracy; however, they exhibit poorer feature learning ability than vanilla deep CNN networks. Consequently, these models cannot achieve the desired accuracy when trained using limited or small-scale image data, causing an underfitting problem.

To address the aforementioned problems of the existing powerful models (parallel feature extraction) and efficient models (underfitting on limited data), we propose a novel model that uses consecutive multiscale feature learning from the original input features and sequentially propagates these features to decrease the number of trainable parameters and the model size. Moreover, the proposed method exhibits a simplified model structure that enables improved feature extraction and, owing to the consecutive feature learning method, leads to better classification performance.

Proposed methodology

In this section, we describe the proposed CMSFL-Net system in detail. An overall graphical overview of the proposed method is provided in Fig. 1. Specifically, the CMSFL-Net comprises three main steps: data pre-processing, data learning, and inference.

Figure 1. General overview of the proposed CMSFL-Net system.

Data pre-processing

In the pre-processing stage, dataset images are represented as tensors to make the computation in the training process more convenient and efficient. Specifically, the images are read from directories and represented as tensors, which ensure a natural representation of multidimensional data. The resulting tensor is 4D—\(X\in {\mathbb {R}}^{M\times {C}\times {H}\times {W}}\), where M, C, H, and W are the total number of images, number of channels, image height, and image width, respectively. After obtaining the image tensors, the images are resized to match the input size of the deep CNN (DCNN) used later in the data learning phase. Then, the resized image pixel values are standardized to follow a standard normal distribution using (1) as follows.

$$\begin{aligned} X_{std} = \frac{X - \displaystyle \frac{1}{M}\displaystyle \sum _{i=1}^{M} x_{i}}{\sqrt{\displaystyle \frac{1}{M}\displaystyle \sum _{i=1}^{M} {\left( x_{i} - \displaystyle \frac{1}{M}\displaystyle \sum _{i=1}^{M} x_{i} \right) }^2 }} \end{aligned}$$
(1)

In (1), X and \(X_{std}\) are the original and standardized data, while \(x_{i}\) and M are a particular data point and the total number of instances, respectively. Notably, data standardization of the validation and test data is performed using the training data distribution to avoid overfitting to the training set and to increase the generalization ability of the DCNN model.
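For concreteness, the sketch below shows how this standardization can be implemented in PyTorch; the per-channel statistics, tensor shapes, and variable names are our assumptions rather than details specified by Eq. (1).

```python
import torch

def standardize(train_x: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Standardize a (M, C, H, W) tensor using statistics computed
    on the training set only, following Eq. (1)."""
    # Per-channel mean and standard deviation over images and spatial dimensions.
    mean = train_x.mean(dim=(0, 2, 3), keepdim=True)
    std = train_x.std(dim=(0, 2, 3), keepdim=True)
    return (x - mean) / (std + 1e-8)  # small epsilon avoids division by zero

# Training statistics are reused for the validation/test data.
train_images = torch.rand(100, 3, 32, 32)   # hypothetical training tensor
test_images = torch.rand(20, 3, 32, 32)     # hypothetical test tensor
train_std = standardize(train_images, train_images)
test_std = standardize(train_images, test_images)
```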

Finally, data augmentation techniques are applied to increase the number of training images, which improves multiscale feature learning during training and the generalization ability of the proposed model during inference. Based on the characteristics of the dataset images, we apply various image augmentation methods as follows:

$$\begin{aligned} \begin{bmatrix} x_{atr}\\ y_{atr}\\ 1 \end{bmatrix}&= \begin{bmatrix} \cos \alpha & -\sin \alpha & 0\\ \sin \alpha & \cos \alpha & 0\\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{org}\\ y_{org}\\ 1 \end{bmatrix} \qquad \begin{bmatrix} x_{atr}\\ y_{atr}\\ 1 \end{bmatrix} = \begin{bmatrix} s_{x} & 0 & 0\\ 0 & s_{y} & 0\\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{org}\\ y_{org}\\ 1 \end{bmatrix} \\ \begin{bmatrix} x_{atr}\\ y_{atr}\\ 1 \end{bmatrix}&= \begin{bmatrix} -1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{org}\\ y_{org}\\ 1 \end{bmatrix} \qquad \begin{bmatrix} x_{atr}\\ y_{atr}\\ 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{org}\\ y_{org}\\ 1 \end{bmatrix} \end{aligned}$$
(2)

Specifically, we employ affine transformations to rotate the images in the 2D plane along the X and Y coordinates, change the scale of the images using the \(s_x\) and \(s_y\) parameters, and mirror the images across the X and Y axes.
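As an illustration, the transformations in (2) can be approximated with standard torchvision transforms; the rotation angle and scale ranges below are placeholders, not the values used in our experiments.

```python
import torchvision.transforms as T

# Rotation, scaling, and mirroring corresponding to the matrices in Eq. (2).
# The angle and scale ranges here are illustrative placeholders.
augment = T.Compose([
    T.RandomRotation(degrees=15),                 # rotation by an angle alpha
    T.RandomAffine(degrees=0, scale=(0.9, 1.1)),  # scaling by s_x and s_y
    T.RandomHorizontalFlip(p=0.5),                # mirroring across the Y axis
    T.RandomVerticalFlip(p=0.5),                  # mirroring across the X axis
])
```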

Data learning

After the data pre-processing stage is completed, useful image features are extracted using a consecutive multiscale feature learning-based model, the CMSFL-Net.

Network architecture

The model is a combination of consecutive multiscale feature learning (CMSFL) modules for extracting features from an image, max-pooling operations for decreasing the spatial dimensions of an image, and a fully connected dense layer for linearly classifying an image into one of the pre-defined classes based on the features learned in the CMSFL modules, inspired by45,46,47. The complete network architecture of the CMSFL-Net is provided in Fig. 2.

Figure 2. Detailed visualization of the proposed CMSFL-Net. II and O represent the input image and output; MP and FCL stand for the max-pooling operation and a fully connected layer, respectively.

As shown in Fig. 2, every CMSFL module is followed by a max-pooling operation, which decreases the spatial dimensions of its input by a factor of two by retaining only the most striking pixel, i.e., the one with the highest value in its neighborhood. Despite being an efficient method for reducing the computational complexity of a DCNN model, the max-pooling operation results in considerable information loss48,49. The problem is illustrated in Fig. 3.

Figure 3. Graphical illustration of the information loss problem resulting from applying a max-pooling operation in a DCNN model. Red, green, and blue boxes represent specific areas of a region and the 2 \(\times\) 2 pooling kernel. The output feature map on the left contains no information from the green-boxed area, which clearly demonstrates the information loss after applying the max-pooling operation.
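The loss can be reproduced with a toy example; the pixel values below are arbitrary and serve only to show that three of every four values in a 2 \(\times\) 2 neighborhood are discarded.

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 3., 4.],
                    [5., 6., 7., 8.],
                    [9., 1., 2., 3.],
                    [4., 5., 6., 7.]]]])   # shape (1, 1, 4, 4)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x))
# tensor([[[[6., 8.],
#           [9., 7.]]]]) -- only the maximum of each 2x2 block survives
```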

CMSFL module

To address the aforementioned information loss problem, we aim to learn as much useful information as possible from the input image before applying the max-pooling operation. For this purpose, we formulate a CMSFL module that benefits from several convolutional layers and concatenation operations. Figure 4 illustrates a detailed graphical overview of the CMSFL module.

Figure 4. Thorough visual explanation of the proposed CMSFL module. IV and OV represent the input and output volumes, while RF, MB, and SB correspond to the receptive field, main branch, and secondary branch. Conv, BN, and Act stand for convolution operation, batch normalization, and activation function, respectively. All convolution operations are performed using 32 \(3 \times 3\) filters.

The CMSFL module aims to extract as many useful features as possible from the input volume by applying a few convolutional layers with various receptive fields. Every convolution operation has a kernel size of 3\(\times\)3 and is followed by batch normalization (BN) and an activation function. For smooth training and better generalizability, we employ the rectified linear unit activation function together with its corresponding weight initialization scheme50. Another characteristic of the CMSFL module is that it retains the original features of the input volume and concatenates the useful information from the secondary branch (SB) with every output of the main branch (MB) except that of the first convolution layer. This concatenation helps retain a better representation of the useful features because the input volume to the CMSFL module carries the full information, which is steadily lost as the convolution operations are applied. Therefore, to address the information loss, we consecutively concatenate the output volumes of the convolution operation in the SB with the output feature maps from the MB.

$$\begin{aligned} O_V^{[l]} = Conv_{MB}(I_V)^\frown Conv_{SB}(I_V) \end{aligned}$$
(3)

In (3), \(O_V^{[l]}\) and \(I_V\) denote the output volume of the \(l^{th}\) layer and the input volume, respectively, and \(^\frown\) denotes channel-wise concatenation. This concatenation also helps to efficiently increase the RF size of the convolution kernels. As can be seen, the RF size of the kernels in the convolution layers of the MB gradually increases as the features propagate through the module. The CMSFL module thus represents an efficient approach to increasing the RF size by applying only a single 3 \(\times\) 3 convolution operation in the SB and concatenating its output with the output feature map of a 3 \(\times\) 3 convolution operation in the MB. A graphical illustration of this efficient way of increasing the RF size in the CMSFL module, compared with the traditional approaches, is given in Fig. 5.
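A possible PyTorch realization of the concatenation scheme in (3) is sketched below; the number of MB layers, the channel width, and the position of the final concatenation are our assumptions based on Fig. 4 and should not be read as the exact CMSFL-Net configuration.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in: int, c_out: int) -> nn.Sequential:
    """3x3 convolution followed by batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class CMSFLModule(nn.Module):
    """Sketch of a consecutive multiscale feature learning (CMSFL) module.

    The secondary branch (SB) applies a single 3x3 convolution to the module
    input; its output is concatenated with every main-branch (MB) output
    except the first one, as in Eq. (3)."""

    def __init__(self, in_channels: int, width: int = 32, depth: int = 3):
        super().__init__()
        self.sb = conv_bn_act(in_channels, width)            # secondary branch
        layers = [conv_bn_act(in_channels, width)]           # first MB layer
        for _ in range(depth - 1):                           # remaining MB layers
            layers.append(conv_bn_act(2 * width, width))
        self.mb = nn.ModuleList(layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sb_out = self.sb(x)                                  # full-information features
        out = self.mb[0](x)
        for layer in self.mb[1:]:
            out = layer(torch.cat([out, sb_out], dim=1))     # consecutive concatenation
        return torch.cat([out, sb_out], dim=1)

# A module followed by max-pooling, as in Fig. 2:
block = nn.Sequential(CMSFLModule(3), nn.MaxPool2d(2))
y = block(torch.rand(1, 3, 64, 64))                          # -> (1, 64, 32, 32)
```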

Figure 5. Comparison of the increasing RF size in the convolutional layers of the proposed CMSFL module (left) and traditional methods (right). The example shows the number of convolution operations required to increase the RF from (7, 5, 3) to (9, 7, 5, 3).

Increasing the RF

As shown in Fig. 5, to increase the RF from (7, 5, 3) to (9, 7, 5, 3), the traditional methods require four 3 \(\times\) 3 convolution operations or a single 11 \(\times\) 11 convolution operation. However, the same increase in the RF size can be achieved by employing only two 3 \(\times\) 3 convolution operations in the proposed CMSFL module. The computation of the total number of 3 \(\times\) 3 convolution operations in the traditional and proposed methods is provided in Eqs. (4) and (5) as follows.

$$\begin{aligned} RF:(3)&\Longrightarrow 3 \times 3 \rightarrow 1 \times [3 \times 3] \\ RF:(5, 3)&\Longrightarrow 5 \times 5 \rightarrow 2 \times [3 \times 3] \\ RF:(7, 5, 3)&\Longrightarrow 7 \times 7 \rightarrow 3 \times [3 \times 3] \\ RF:(9, 7, 5, 3)&\Longrightarrow 11 \times 11 \rightarrow 4 \times [3 \times 3] \end{aligned}$$
(4)
$$\begin{aligned} RF:(3)&\Longrightarrow 1 \times [3 \times 3] \\ RF:(5, 3)&\Longrightarrow 2 \times [3 \times 3] \\ RF:(7, 5, 3)&\Longrightarrow 2 \times [3 \times 3] \\ RF:(9, 7, 5, 3)&\Longrightarrow 2 \times [3 \times 3] \end{aligned}$$
(5)

As shown in Eqs. (4) and (5), the traditional methods require numerous convolution operations with a kernel size of \(3 \times 3\) (or larger kernels) to increase the RF at every convolutional layer, whereas the proposed CMSFL module demands only two \(3 \times 3\) convolution operations per increase. Specifically, to obtain the RF of (9, 7, 5, 3), the traditional methods require ten \(3 \times 3\) convolutions in total, whereas the proposed method achieves the same RF size with only seven \(3 \times 3\) convolution operations.

Moreover, the proposed CMSFL module requires significantly fewer trainable parameters and FLOPs, which helps train the model efficiently, address overfitting, and achieve better generalizability to test data. The number of trainable parameters and FLOPs in the \(l^{th}\) convolutional layer can be computed using the following equation.

$$\begin{aligned} P_{tr}^{[l]}&= ks^{[l]} \times ks^{[l]} \times cs^{[l-1]} \times cs^{[l]} + b^{[l]}\\ FLOPs^{[l]}&= IV_{H} \times IV_{W} \times cs^{[l-1]} \times ks^{[l]} \times ks^{[l]} \times cs^{[l]} \end{aligned}$$
(6)

In Eq. (6), ks, cs, and b correspond to the kernel size, channel size, and bias, while \(IV_{H}\) and \(IV_{W}\) denote the input volume height and width, respectively. Thus, the traditional convolution operations need [\(3 \times 3 + 5 \times 5 + 7 \times 7 + 11 \times 11\)] weights multiplied by the number of filters in a specific layer l to achieve the RF of (9, 7, 5, 3). In contrast, the proposed model requires only [\(7 \times 3 \times 3\)] weights multiplied by the number of filters in layer l to obtain the aforementioned RF.
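A quick back-of-the-envelope check of Eq. (6) illustrates the saving; the channel width of 32 is illustrative only, and bias terms are omitted.

```python
def conv_params(ks: int, c_in: int, c_out: int, bias: int = 0) -> int:
    """Trainable parameters of one convolutional layer, following Eq. (6)."""
    return ks * ks * c_in * c_out + bias

c = 32  # illustrative channel width
# Traditional path to an RF of (9, 7, 5, 3): 3x3, 5x5, 7x7, and 11x11 kernels.
traditional = sum(conv_params(k, c, c) for k in (3, 5, 7, 11))
# Proposed path: seven 3x3 convolutions in total, as in Eq. (5).
proposed = 7 * conv_params(3, c, c)
print(traditional, proposed)   # 208896 vs. 64512 weights
```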

Loss function

Considering that the proposed method may be trained on datasets exhibiting a data imbalance problem, we implement a loss function that alleviates this issue. Specifically, the UCIR loss function sets larger and smaller weights for over-represented and under-represented classes, respectively, by \(l^2\)-normalizing both the weights and the activations; this amounts to employing the cosine similarity rather than the dot product. For each class c, the last layer is modified as follows:

$$\begin{aligned} \omega _c = \frac{\exp (\alpha \cos (\varepsilon _c))}{\sum _j{\exp (\alpha \cos (\varepsilon _j))}} \end{aligned}$$
(7)

In Eq. (7), \(\alpha\), \(\varepsilon _c\), and \(\cos ()\) are the learned scaling parameter, the last-layer weights for class c, and the cosine similarity, respectively. After addressing the class imbalance problem, we formulate the loss function for training the proposed method. For this purpose, we employ a weighted categorical cross-entropy loss, which is formulated as follows:

$$\begin{aligned} \begin{aligned} L_{f} = -\frac{1}{M} \sum _{j=1}^{J}\sum _{i=1}^{M}\omega _j \times y_i^j \times \log (DCNN(x_i, j)) \end{aligned} \end{aligned}$$
(8)

In (8), M, J, \(y_i^j\), \(x_i\), and DCNN are the total number of images, the number of classes, the ground truth of training example i for class j, the \(i^{th}\) training image, and the deep convolutional neural network, respectively.
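A minimal PyTorch sketch of the weighted loss in Eq. (8) follows; the class weights \(\omega _j\) are assumed to be precomputed, for example via the cosine-normalized scheme in Eq. (7).

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits: torch.Tensor,
                           targets: torch.Tensor,
                           class_weights: torch.Tensor) -> torch.Tensor:
    """Weighted categorical cross-entropy, Eq. (8).
    logits: (M, J) raw network outputs, targets: (M,) class indices,
    class_weights: (J,) precomputed weights omega_j."""
    log_probs = F.log_softmax(logits, dim=1)                  # log class probabilities
    picked = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return -(class_weights[targets] * picked).mean()          # average over M images

# Hypothetical example with 4 classes.
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
weights = torch.tensor([0.9, 1.1, 1.0, 1.0])
loss = weighted_cross_entropy(logits, targets, weights)
```

Note that this reduces to the standard categorical cross-entropy when all \(\omega _j\) are equal to one.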

In general, the data learning process aims to extract as many useful features as possible from the original image before this information is lost through the convolution and max-pooling operations.

Inference

After completing the second step of the proposed system and obtaining a trained CMSFL-Net model, we can classify images using this model in the inference stage. In this stage, the raw data pass through the same pre-processing operations as in the training stage, except for data augmentation. Specifically, a test set of a dataset or real-life images are represented as tensors, resized, and standardized using (1). For standardization, X must be the training data, i.e., the same data used in the training and validation stages, to ensure that the data in the inference stage follow the same distribution. The images are then input into the trained model, which classifies them into one of the pre-defined categories.
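The inference stage can be summarized in a few lines, as sketched below; the resize target and the shapes of the stored training statistics are placeholders that depend on the dataset at hand.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify(model: torch.nn.Module,
             images: torch.Tensor,
             train_mean: torch.Tensor,
             train_std: torch.Tensor,
             size: int = 224) -> torch.Tensor:
    """Inference: resize, standardize with *training* statistics, predict.
    train_mean and train_std are expected to broadcast over (M, C, H, W)."""
    model.eval()
    x = F.interpolate(images, size=(size, size), mode='bilinear', align_corners=False)
    x = (x - train_mean) / train_std          # same distribution as in training
    logits = model(x)
    return logits.argmax(dim=1)               # predicted class indices
```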

Experiments and results

In this section, we describe the conducted experiments in detail and share the results by comparing the performance of the proposed system with that of the existing state-of-the-art methods.

Benchmarking datasets

To illustrate the generalizability of the proposed method, we tested the performance of the CMSFL-Net on various real-life datasets that contain small-scale and large-scale images, as well as datasets with a limited number of images. The overall information on the datasets is presented in Table 1.

Table 1 General information on the datasets for the experiments.

Training details

In this subsection, we provide detailed information about the conducted experiments, such as experimental setup, baseline methods, and evaluation metrics.

Experimental setup

We implemented the baseline and proposed methods using Python version 3.6.9 and PyTorch library version 1.4.0. We initialized the weight parameters using a Gaussian distribution and did not use bias parameters. We used \(L_{f}\) (discussed in the “Loss function” section) as the function to be minimized and the Adam optimizer with \(\eta =0.0001\) and \(\gamma =0.9\) as the parameter optimizer for the proposed method. The experiments were conducted using a 32 GB NVIDIA Tesla V100-SXM2 GPU with CUDA 10.0 and mini-batch sizes of 64 and 16 for the small-scale and large-scale datasets, respectively. The models were trained for 50 epochs because the considered methods converged within this period and did not show further improvements in performance when trained for more epochs.
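The training loop under these settings can be sketched as follows; the stand-in model, random data, and plain cross-entropy are placeholders for the full CMSFL-Net, the benchmark datasets, and Eq. (8), and interpreting \(\gamma\) as Adam's first momentum coefficient is our assumption.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Tiny stand-in model and dataset; in practice the full CMSFL-Net and the
# benchmark datasets described above are used.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
criterion = nn.CrossEntropyLoss()                 # stands in for Eq. (8)
# eta = 0.0001; treating gamma = 0.9 as Adam's first momentum term is assumed.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

loader = DataLoader(TensorDataset(torch.rand(64, 3, 32, 32),
                                  torch.randint(0, 10, (64,))), batch_size=64)
for epoch in range(50):                           # the models converged within 50 epochs
    for images, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
```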

Evaluation metrics

We employ various evaluation metrics to assess the performance of the proposed model against the baseline methods from different angles. Specifically, we use the accuracy score (AS) and F1 score (F1) to evaluate model performance.

$$\begin{aligned} AS&= \frac{1}{M}\sum _{i=1}^{M} \mathbb {1}\left[ {\hat{y}}_i = y_i\right] \\ F1&= \frac{2 \times TP / (TP + FP) \times TP / (TP + FN)}{TP / (TP + FP) + TP / (TP + FN)} \end{aligned}$$
(9)

In Eq. (9), \({\hat{y}}_i\) and \(y_i\) denote the predicted output and ground truth for the \(i^{th}\) image, while TP, FP, and FN correspond to true positives, false positives, and false negatives, respectively.
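For completeness, a small sketch of both metrics is given below; the F1 implementation treats one class as positive, and how per-class scores are aggregated (e.g., macro averaging) is an assumption not specified by Eq. (9).

```python
import torch

def accuracy_score(preds: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of correctly classified images (AS)."""
    return (preds == targets).float().mean().item()

def f1_score(preds: torch.Tensor, targets: torch.Tensor, positive: int) -> float:
    """F1 for one class treated as the positive class, following Eq. (9)."""
    tp = ((preds == positive) & (targets == positive)).sum().item()
    fp = ((preds == positive) & (targets != positive)).sum().item()
    fn = ((preds != positive) & (targets == positive)).sum().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```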

Figure 6. Experimental results using the baseline and proposed models on the validation sets of the small-scale and large-scale datasets: (a) CIFAR-10, (b) STL-10, and (c) ImageNet100.

Baseline models

To show the efficiency of the proposed system, we selected two categories of state-of-the-art methods, namely powerful methods that achieve high accuracy scores and lightweight methods that are efficient and fast. We compare the performance of the CMSFL-Net with that of the aforementioned models to demonstrate its good performance in terms of both efficiency and accuracy. As baseline models, we selected Res2Net: A new multiscale backbone architecture (Res2Net)34, Learning Deformable Registration of Medical Images with Anatomical Constraints (LDR)17, Robust Image Classification Against Adversarial Attacks using Elastic Similarity Measures between Edge Count Sequences (ESM)2, Visual Interaction Networks (VIN)12, ShuffleNet V237, MnasNet38, and MobileNetV339. Because we discussed these methods in Section 2, we do not dive into their details in this section. All baseline models and the proposed method were trained and evaluated under the same conditions, as described in the “Experimental setup” subsection.

Experimental results with regard to computational efficiency

After formulating the baseline and proposed models for the experiments in accordance with the previous subsection, we compare them in terms of computation efficiency by focusing on the number of trainable parameters, FLOPs, model size, training, and inference time. The results of the comparison are shown in Table 2.

Table 2 Comparison of the baseline and proposed models in terms of computational efficiency, memory, and time*.

As indicated in Table 2, the proposed model significantly outperformed the powerful baseline models, such as Res2Net, LDR, ESM, and VIN, and achieved comparable performance with the lightweight models, such as ShuffleNetV2, MnasNet, and MobileNetV3. Specifically, the CMSFL-Net required considerably fewer trainable parameters, approximately 30% fewer than its closest peer, ShuffleNetV2. Moreover, the proposed model achieved the best performance in FLOPs and trained model size as well. More precisely, the CMSFL-Net exhibited up to two times fewer FLOPs than the powerful models, while its model size was only 3.44 MB, the smallest among the considered models. Regarding training and inference time, the proposed model was the fastest among the considered models, requiring 0.47 min per epoch for training and 7.53 s for inference on the training and test sets of the BreakHis dataset, respectively.

Experimental results on small-scale and large-scale image datasets

The results on the validation sets of the CIFAR-10, STL-10, and ImageNet-100 datasets in terms of accuracy-related metrics are provided in Fig. 6. For example, LDR showed a steady increase in AS and F1 as the training progressed; however, it achieved the lowest accuracy-related scores among the considered models on these datasets. The baseline models achieved similar scores in loss, AS, and F1 on the CIFAR-10 and STL-10 datasets, where VIN and MobileNetV3 attained the second-best scores. The highest AS and F1 on the small-scale image datasets were obtained by the proposed system; the CMSFL-Net significantly outperformed its peers by obtaining approximately 7% higher results when evaluated using AS and F1. On the large-scale ImageNet-100 dataset, the proposed method demonstrated stable performance during the training process and achieved the second-best results when assessed using accuracy-related metrics. Throughout the training, except for the last several epochs, the performance of the proposed method was similar to that of the best-performing ESM method, which exhibited a sudden increase in AS and F1 in the final epochs of training and thereby outperformed the proposed system.

Experimental results on datasets with limited number of data

Figure 7 shows the comparison of the models' performance on the validation sets of the COVID-CT, BreakHis, and Br35H datasets when assessed with the loss value, AS, and F1.

Figure 7. Experimental results using the baseline and proposed models on the validation sets of the datasets with a limited number of data: (a) COVID-CT, (b) BreakHis, and (c) Br35H.

The accuracy-related metrics are noisy because the datasets contain a small number of training samples, which results in fluctuations during training. Overall, the performance of baseline models such as VIN, ShuffleNetV2, Res2Net, and ESM was comparable with that of the proposed model, while the MnasNet and LDR methods obtained significantly lower AS and F1 than the CMSFL-Net. The proposed method obtained the best results in AS, reaching 0.732, 0.995, and 0.773 on the validation sets of the COVID-CT, BreakHis, and Br35H datasets, respectively. Regarding F1, the results of the proposed method were similar to those obtained with the AS metric, reaching 0.733, 0.993, and 0.736, respectively.

Discussion

This section discusses the results of the conducted experiments using the test sets of the considered datasets and shares the results of ablation studies. Moreover, it exhibits a qualitative comparison of the baseline and proposed methods and enumerates the limitations of the proposed method.

Generalizability of the considered models on the small and large-scale image datasets

We tested the performance of the baseline and proposed methods on the test set of the considered datasets to compare their generalization ability on unseen data during inference in terms of loss, AS, and F1. The results of the experiments are shown in Table 3.

Table 3 Comparison of the baseline and proposed models on the test sets of the small- and large-scale image datasets in terms of loss and accuracy*.

As presented in the table, the proposed system outperformed the baseline methods on both small-scale datasets, CIFAR-10 and STL-10, except for the loss value on CIFAR-10, where the ESM method achieved the lowest loss of 1.040, which is only 0.6% better than that of the proposed method. In the other accuracy-related metrics, the proposed model generalized better to the unseen data. On the large-scale ImageNet-100 dataset, the proposed method achieved the second-best results, with scores that were 1.79%, 1.25%, and 0.99% behind those of the best-performing ESM method in loss, AS, and F1, respectively. Considering that the ESM model is a computationally expensive network, the proposed method demonstrated the best accuracy-efficiency trade-off on the widely used ImageNet benchmark dataset.

Generalizability of the considered models on the datasets with a limited number of data

Table 4 shows the experimental results of the inference step on the test sets of the COVID-CT, BreakHis, and Br35H datasets.

Table 4 Comparison of the baseline and proposed models on the test sets of the datasets with limited training data in terms of loss and accuracy*.

The table shows that ESM achieves the lowest loss and the highest accuracy on the COVID-CT dataset, significantly outperforming its peers in terms of generalizability to unseen data. The proposed method attains satisfactory results by ranking second on this dataset. However, on the other two medical image datasets, namely BreakHis and Br35H, the proposed method obtains the best scores in all evaluation metrics: loss, AS, and F1. Specifically, the proposed method achieves near-perfect accuracy on the test set of Br35H, reaching 0.991 and 0.990 in AS and F1, respectively. In general, the proposed approach shows good efficiency and better generalizability to unseen data than its peers.

Ablation studies of the CMSFL-Net

We also conducted extensive ablation studies and tested different versions of the proposed method to determine the best trade-off between speed and accuracy. The results of the studies are shown in Table 5.

Table 5 Ablation studies of the proposed model on the test sets of the considered datasets*.

For these studies, we modified the network using different numbers of CMSFL modules ranging from 4 to 10 and compared the results using the loss, AS, and F1 evaluation metrics as well as the number of trainable parameters, training time, and inference time. Overall, the network with the fewest CMSFL modules was faster in training and inference but achieved lower accuracy on the test sets of the considered datasets. When the number of CMSFL modules was increased by two, to six, the CMSFL-Net showed a significant decrease in loss and an increase in the accuracy-related metrics. However, further increasing the number of CMSFL modules did not provide significant improvements in the performance of the proposed method. Although there was a slight increase in accuracy with eight CMSFL modules, it resulted in a considerable increase in training and inference time, and increasing the number of CMSFL modules to ten decreased the network performance. Considering these findings, we selected the model architecture with six CMSFL modules as the default network because it achieved the best accuracy and speed trade-off in the conducted ablation studies.