INTRODUCTION
Osteoporosis is highly prevalent among the elderly and women. It leads to the deterioration of bone mineral density (BMD) and micro-architecture, which significantly impacts the likelihood of fractures10). The vertebral column is the most common site for osteoporotic fractures. Additionally, osteoporosis frequently results in complications such as screw loosening and non-union during spinal surgeries. Diagnosis typically involves dual energy X-ray absorptiometry (DXA), with osteoporosis confirmed by a T-score of -2.5 or lower. Patients with spinal conditions can readily and affordably undergo lumbar X-ray examinations, which are commonly performed.
The field of artificial intelligence (AI) is evolving rapidly, with machine learning and deep learning algorithms demonstrating remarkable accuracy and effectiveness. Deep learning algorithms are now being developed with the goal of matching human-level error rates. In the medical field, machine learning is increasingly applied to image-based diagnostics and other diagnostic processes. AI-based diagnosis of osteoporosis has been attempted using X-ray, DXA, computed tomography (CT), and magnetic resonance images. DXA, however, has certain limitations, such as its high cost and issues of insurance coverage. AI-based diagnosis of osteoporosis through the analysis of radiographs can be a cost-effective alternative to DXA. Studies are also being conducted to predict the occurrence of fractures in patients with osteoporosis. In these studies, various deep learning architectures are being used to improve diagnostic accuracy4).
This study aims to explore whether it is possible to predict and diagnose osteoporosis using transfer learning with deep learning algorithms, employing X-rays commonly taken from patients with spinal conditions. Previous research utilizing X-rays has been limited in scope, and there have also been studies using CT scans. We selected and evaluated four renowned deep-learning algorithms for this purpose.
MATERIALS AND METHODS
1. Patients and Dataset
The study protocol was approved by the Institutional Review Board (IRB) of Inje University Haeundae Paik Hospital (IRB no. 2024-07-008). The requirement for informed consent was waived due to the retrospective nature of this study. We retrospectively evaluated 2,300 consecutive patients who underwent DXA and lumbar sagittal plain X-rays between 2013 and 2021. Patients with an interval of less than one year between DXA and X-ray examinations were selected. Osteoporosis was diagnosed with a T-score ≤ -2.5 according to the World Health Organization criteria, and osteopenia was classified as normal6). Vertebrae L1 to L3 from lumbar spine sagittal X-rays were selected. L1 to L3 were chosen for two reasons: first, bone density increases toward L5; second, image quality at L4/5 is poor because of overlap with the pelvic bone17). The exclusion criteria were: (1) an interval of more than one year between DXA and X-ray examinations; (2) vertebrae that had undergone vertebroplasty; (3) absence of spine anterior-posterior DXA; and (4) images that could not be evaluated. In total, 254 patients (images) were enrolled in this study. The dataset was divided into a training group (213 images), a validation group (18 images), and a test group (23 images). In the training group, 101 images were classified as osteoporosis and 112 as normal. In the validation group, 10 images were classified as osteoporosis and 8 as normal. In the test group, 13 images were classified as osteoporosis and 10 as normal (Fig. 1).
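The labeling and interval rules described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the study's actual pipeline: the record fields, the helper names, and the choice to treat "one year" as 365 days (inclusive) are assumptions, and the example records are hypothetical.

```python
from datetime import date

WHO_OSTEOPOROSIS_T_SCORE = -2.5  # threshold stated in the text
MAX_INTERVAL_DAYS = 365          # assumption: "one year" taken as 365 days


def label_from_t_score(t_score: float) -> str:
    """Binary label; osteopenia is grouped with normal, as in the study."""
    return "osteoporosis" if t_score <= WHO_OSTEOPOROSIS_T_SCORE else "normal"


def within_one_year(dxa_date: date, xray_date: date) -> bool:
    """True when the two examinations are at most MAX_INTERVAL_DAYS apart."""
    return abs((dxa_date - xray_date).days) <= MAX_INTERVAL_DAYS


# Hypothetical records for illustration only (not study data).
records = [
    {"t_score": -2.5, "dxa": date(2019, 3, 1), "xray": date(2019, 6, 15)},
    {"t_score": -1.8, "dxa": date(2018, 1, 10), "xray": date(2018, 2, 1)},
    {"t_score": -3.1, "dxa": date(2015, 5, 5), "xray": date(2017, 5, 5)},
]

# Apply the interval criterion first, then assign the binary label.
included = [r for r in records if within_one_year(r["dxa"], r["xray"])]
labels = [label_from_t_score(r["t_score"]) for r in included]
print(labels)
```

In this sketch the third record is dropped because its examinations are two years apart, and a T-score of exactly -2.5 is labeled as osteoporosis.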
2. Comparison of Deep Learning Models
In this study, we compared image classification performance using four convolutional neural network (CNN) models. First, we used two models from the Visual Geometry Group (VGG) series of deep CNNs, VGG19 and VGG1613). Second, we employed the 50-layer Residual Neural Network (ResNet50), a model designed by He et al.2) that won the classification task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015. Generally, increasing the depth of network layers improves image identification accuracy; however, excessively deep networks can reduce accuracy. ResNet addresses this issue with a learning approach known as residual learning, which allows networks to be extended up to 152 layers. The ResNet family includes versions with 18, 50, and 152 layers. Third, we used Xception, an Inception-derived CNN developed at Google that comprises 71 layers1). Each model has demonstrated excellent results and low recognition error rates on the ILSVRC image classification benchmark.
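The idea behind residual learning can be illustrated with a deliberately simplified sketch. It works on plain Python lists rather than convolutional feature maps, omits the activation that real ResNet blocks apply after the addition, and the function names are hypothetical. It shows only the shortcut connection: a block outputs F(x) + x, so if the stacked layers learn F(x) = 0, the block reduces to the identity and added depth need not degrade the signal.

```python
def residual_block(x, transform):
    """Return transform(x) + x elementwise (the shortcut connection)."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("shortcut connection requires matching shapes")
    return [f + xi for f, xi in zip(fx, x)]


def zero_transform(x):
    """Stand-in for stacked layers that have learned a zero residual."""
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.5]
out = x
# Stack ten residual blocks; with a zero residual the input passes through.
for _ in range(10):
    out = residual_block(out, zero_transform)
print(out)  # [1.0, -2.0, 3.5]
```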
3. Data Processing
Image data were obtained from lumbar spine X-ray images stored in Digital Imaging and Communications in Medicine (DICOM) format. We selected sagittal X-ray images that included lumbar vertebrae 1, 2, and 3, excluding lumbar 4 and 5. To account for variations in X-ray scan parameters, we performed a series of grayscale normalizations, which included adjustments to window width and window level as well as pixel normalization. The ImageDataGenerator class of the Keras library generates transformed images from the provided data and incorporates them into the training set. It is especially effective for augmenting the dataset with additional image data.
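A window-level and window-width normalization of the kind mentioned above might look like the following minimal sketch. It uses a simplified linear window (clip, then rescale to the range 0 to 1) on a nested list of pixel values; the function name, the example pixel values, and the window settings are hypothetical, since the study's actual parameters are not reported.

```python
def apply_window(pixels, level, width):
    """Clip to [level - width/2, level + width/2] and rescale to [0, 1]."""
    if width <= 0:
        raise ValueError("window width must be positive")
    low = level - width / 2.0
    high = level + width / 2.0
    normalized = []
    for row in pixels:
        # Clip each pixel into the window, then map the window onto [0, 1].
        normalized.append(
            [(min(max(p, low), high) - low) / (high - low) for p in row]
        )
    return normalized


# Hypothetical 2x3 image with raw detector values.
image = [[0, 1000, 2000], [3000, 4000, 5000]]
normalized = apply_window(image, level=2000, width=2000)
print(normalized)  # [[0.0, 0.0, 0.5], [1.0, 1.0, 1.0]]
```

Values below the window map to 0, values above it map to 1, and intermediate values scale linearly, which puts images acquired with different scan parameters on a common grayscale range.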
4. Statistical Analysis
Descriptive statistics for categorical variables were presented as numbers (percentages), and continuous variables were reported as means with standard deviations. A two-sample t-test was performed for continuous variables that satisfied the assumption of equal variances. Statistical significance was set at a p-value of less than 0.05. To evaluate the diagnostic performance of the models for osteoporosis, we used the receiver operating characteristic (ROC) curve. This curve is generated by plotting the true positive rate (sensitivity) against the false positive rate (1-specificity) while varying the predicted probability threshold of the model; from it, we calculated the area under the curve (AUC). A confusion matrix is a table used to assess the performance of a classification algorithm. Precision is the ratio of correctly identified positive samples to all samples labeled as positive by the model. Recall is the ratio of correctly identified positive samples to the actual positive samples. The F1-score is the harmonic mean of precision and recall, offering a balance between the two metrics in evaluating the performance of the classification model.
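The metrics defined above can be computed as in the following sketch. It is an illustration under stated assumptions rather than the study's analysis code: the function names, the toy labels and scores, and the 0.5 decision threshold are hypothetical. The AUC is computed through its rank interpretation, that is, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count as one half), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1-score)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (not study results).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
auc = roc_auc(y_true, scores)
print(tp, fp, fn, tn, round(precision, 3), round(recall, 3), round(f1, 3), round(auc, 3))
```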
RESULTS
1. Patient Characteristics and Radiological Parameters
The average age of the 124 osteoporosis patients was 80.4 ± 9.6 years, while the 130 non-osteoporosis patients averaged 74.1 ± 11.0 years; the study included 189 women and 65 men (p<0.01). The body mass index for the osteoporosis group was 23.0 ± 3.1, compared to 24.8 ± 3.9 for the non-osteoporosis group (p<0.01). The BMD score for the osteoporosis group was -3.31 ± 0.63, and for the non-osteoporosis group it was 0.32 ± 1.12 (p<0.025) (Table 1). The sagittal X-ray images (254 in total) focused on the L1-3 vertebral column (Fig. 2).
2. Model Training and Evaluation
Figs. 3 to 5 and Tables 2 and 3 display the performance and diagnostic predictability of the CNN models in classifying osteoporosis and non-osteoporosis based on lumbar spine sagittal X-ray images. The most accurate CNN model was ResNet50, with an accuracy of 0.95 on the training dataset, 0.67 on the validation dataset, and 0.82 on the test dataset (Fig. 3, Tables 2 and 3). Table 3 details the evaluation of the CNN models on the test dataset images. ResNet50 showed the best performance, with an accuracy of 0.82, precision of 0.80, recall of 0.86, and F1-score of 0.83 (Table 3). Fig. 4 presents the AUC results; ResNet50 achieved the highest value, 0.76, among the CNN models. The confusion matrix for ResNet50 shows that 12 of the 14 osteoporosis images in the test data were predicted as osteoporosis (Fig. 5).
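As a quick arithmetic check using only the values reported above, the recall follows from the confusion matrix counts and the F1-score from the reported precision and recall. TP and FN here denote true positives and false negatives.

```latex
\[
\text{Recall} = \frac{TP}{TP + FN} = \frac{12}{14} \approx 0.86,
\qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2 \times 0.80 \times 0.86}{0.80 + 0.86} \approx 0.83
\]
```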
DISCUSSION
Osteoporosis has become a global public health concern as populations age and life expectancy increases. It is estimated that over 200 million people have been diagnosed with this condition14). Treatments such as teriparatide, romosozumab, and denosumab have been developed and are currently in use3,9). DXA is considered the gold standard for diagnosing osteoporosis, utilizing spectral imaging to measure differences in energy levels from two X-ray beams11). Unlike other bones, the lumbar vertebral column consists of approximately 66% to 75% cancellous bone. In osteoporosis, the vertebrae exhibit decreased BMD, leading to reduced thickness and number of trabeculae in the cancellous bone15).
Additionally, previous studies have indicated that cortical thickness or trabecular patterns can predict BMD12). While several studies have predicted BMD-based diagnoses using X-ray images and deep learning techniques5,7,8,16,18), fewer have used lumbar images, even though lumbar radiography is the easiest and most commonly available clinical method. Research has also been conducted using dental and femoral neck images18). Moreover, clinical research with deep learning and transfer learning models is less commonly carried out by clinicians than by experts in computer engineering7).
In this study, we selected four deep learning models (VGG 16, VGG 19, ResNet50, Xception) that have demonstrated good results in the ILSVRC and trained them using lumbar sagittal images. The VGG 16 and VGG 19 architectures consist of five blocks of convolutional layers followed by three fully connected layers. VGG16 has approximately 138 million parameters and VGG19 has approximately 143 million parameters. Most of these parameters (approximately 100 million) are in the first fully connected layer, and it was later found that these fully connected layers could be removed without degrading performance, significantly reducing the number of necessary parameters13). The Xception architecture is an extension of the Inception architecture that replaces the standard Inception modules with depthwise separable convolutions. Xception slightly outperforms InceptionV3 on the ImageNet dataset and vastly outperforms it on a larger image classification dataset with 17,000 classes. Xception has 22,855,952 trainable parameters1). The ResNet50 architecture is a deep convolutional network whose basic idea is to skip blocks of convolutional layers by using shortcut connections, forming units named residual blocks. These stacked residual blocks greatly improve training efficiency and largely resolve the degradation problem present in deep networks. The total number of weighted layers is 50, with 23,534,592 trainable parameters2). In addition, unlike several other studies, this study applied the transfer learning technique to VGG 16, VGG 19, ResNet50, and Xception. ResNet50 emerged as the top performer in this study, achieving an AUC value of 0.76 (Fig. 4). In other research, Jang et al.5) used dental images (n=800) as a training dataset; their self-developed deep neural network based on VGG 16 showed an AUC value of 0.7 on the test dataset (n=117).
Zhang et al.18) used anterior and sagittal X-ray images of the lumbar spine (n=1,616) as a training dataset, and their custom-developed deep CNN yielded an AUC value of 0.767 on the test dataset (n=198). Lee et al.7) used dental panoramic images (n=680) for training and applied transfer learning with fine-tuning based on VGG-16, which resulted in an AUC value of 0.858 on the test dataset (n=137). Sukegawa et al.16) employed dental panoramic images (n=778) as a training dataset and combined EfficientNet and ResNet models (18, 50, 152) with an ensemble technique to enhance accuracy, resulting in an AUC value of 0.911 on the test dataset (n=156) (Table 4). These varied outcomes suggest that ensemble models tend to yield the best performance. Beyond simple deep learning techniques, modifications tailored to the image type and research objectives, such as transfer learning and ensemble methods, significantly affect results. This study aimed to assess the accuracy of deep learning techniques using straightforward images from a clinician’s perspective. When the number of images is small, as in this study, various deep learning data processing methods can be used to mitigate the problem. Moreover, transfer learning and fine-tuning within the CNN layers can yield effective results even with limited data7). These methods are expected to significantly benefit clinicians in specialized fields of medicine in the future.
AI technology, particularly deep learning, is increasingly playing a supportive role in diagnostics within the medical field. This technology represents a significant advancement. Until recently, clinicians faced challenges accessing complex computer coding, programming, and mathematical theories. However, with the emergence of AI technologies like ChatGPT, greater support is now available, facilitating collaboration across various fields and enhancing deep learning technology. The transfer learning method, in particular, makes deep learning technology more accessible to clinicians. Through fine-tuning, clinicians can achieve slightly improved results tailored to their specific expertise.
The first limitation of this study is the small size of the image dataset. While other studies have utilized datasets ranging from hundreds to thousands of images, this study was restricted to test results within one year to enhance the correlation between BMD and X-ray outcomes. The second limitation is related to the ensemble model, which could have shown improved results with more detailed fine-tuning techniques; however, this was not implemented due to technical constraints and shortcomings during the planning phase of data collection. The third limitation is the lack of a more detailed BMD scoring segmentation. The BMD scores between the two groups show a difference of about -3.0, indicating a need for more precise techniques to categorize into three groups: osteoporosis, osteopenia, and normal. This aspect was overlooked in the study design.
CONCLUSION
In this study, ResNet50 showed the best performance in both the training and test sets. In the role of diagnostic assistance, AI technology employing deep learning techniques is steadily approaching human capabilities. The diagnosis of osteoporosis using BMD is expected to evolve into a comprehensive diagnostic aid or decision-making tool with the integration of AI in the future.