Continuous and bimonthly publication
ISSN (on-line): 1806-3756

Creative Commons License
Open Access Peer-Reviewed
Original Article

Machine learning algorithms applied to the diagnosis of COVID-19 based on epidemiological, clinical, and laboratory data

Silvia Elaine Cardozo Macedo1, Marina de Borba Oliveira Freire1, Oscar Schmitt Kremer2, Ricardo Bica Noal1, Fabiano Sandrini Moraes2, Mauro André Barbosa Cunha2

ABSTRACT

Objective: To predict COVID-19 in hospitalized patients with SARS in a city in southern Brazil by using machine learning algorithms. Methods: The study sample consisted of patients ≥ 18 years of age admitted to the emergency department with SARS and hospitalized in the Hospital Escola - Universidade Federal de Pelotas between March and December of 2020. Epidemiological, clinical, and laboratory data were processed by machine learning algorithms in order to identify patterns. Mean AUC values were calculated for each combination of model and oversampling/undersampling technique during cross-validation. Results: Of a total of 100 hospitalized patients with SARS, 78 had information for RT-PCR testing for SARS-CoV-2 infection and were therefore included in the analysis. Most (58%) of the patients were female, and the mean age was 61.4 ± 15.8 years. Regarding the machine learning models, the random forest model had a slightly higher median performance when compared with the other models tested and was therefore adopted. The most important features to diagnose COVID-19 were leukocyte count, PaCO2, troponin levels, duration of symptoms in days, platelet count, multimorbidity, presence of band forms, urea levels, age, and D-dimer levels, with an AUC of 87%. Conclusions: Artificial intelligence techniques represent an efficient strategy to identify patients with high clinical suspicion, particularly in situations in which health care systems face intense strain, such as in the COVID-19 pandemic.

Keywords: COVID-19/diagnosis; Artificial intelligence; Machine learning.

INTRODUCTION
 
COVID-19 has been the most important health problem in the world since 2020. Following its emergence in December of 2019, in Wuhan, China, the disease spread quickly across the world and, in March of 2020, the WHO declared it a pandemic because of its global impact.(1)
 
There are currently around 750 million confirmed cases of COVID-19 and 7 million COVID-19–related deaths worldwide. In Brazil, the numbers of COVID-19 cases and deaths were extremely high during the pandemic, and the disease had a harmful impact on the public health system.(2)
 
Viral infections such as SARS-CoV-2 infection are dangerous because they spread very quickly; therefore, early detection and diagnosis have a positive impact on health strategies.(3) Early in the COVID-19 pandemic, there were few diagnostic tests available in many countries, including Brazil; therefore, there was a need to select clinical and laboratory variables that could predict COVID-19 in order to proceed to nasal swab collection for RT-PCR to detect SARS-CoV-2 infection.(4) Although COVID-19 mortality has declined, the existence of other circulating viruses makes it necessary to establish the correct diagnosis and reduce the risk of transmission.
 
Artificial intelligence (AI) has been deployed at various levels of the health care system, including diagnosis,(5-7) public health, clinical decision making, and therapeutics. Particularly, AI algorithms have been shown to be effective in improving the diagnosis and prognosis of COVID-19 through the creation of models including clinical and epidemiological characteristics, as well as biochemical data.(8-10) The present study evaluated clinical and laboratory data to predict COVID-19 in hospitalized patients with SARS in a city in southern Brazil by using machine learning algorithms.
 
METHODS
 
The present study was conducted in the city of Pelotas, Brazil, which is the fourth most populated city in the state of Rio Grande do Sul, with a population of 325,685 inhabitants.(11) The city of Pelotas is the largest of the 22 municipalities in the Third Regional Health District. Therefore, patients from some of the other municipalities are referred to health care facilities in Pelotas, and this was especially true during the COVID-19 pandemic.
 
During the data collection period, the emergency department became the point of entry into the public health care system for patients from the city of Pelotas (and other municipalities) presenting with SARS. After undergoing an initial evaluation and RT-PCR for COVID-19, patients meeting the criteria for hospital admission were transferred to a public hospital able to receive them. In this context, the Hospital Escola-Universidade Federal de Pelotas became the most important center for receiving and treating patients with SARS during the COVID-19 pandemic.
 
The study sample consisted of patients ≥ 18 years of age admitted to the emergency department with SARS and hospitalized in the Hospital Escola - Universidade Federal de Pelotas between March and December of 2020. Because the data were collected retrospectively, the requirement for written informed consent was waived. The study project was approved by the Brazilian National Research Ethics Committee (Protocol no. 37337720.2.0000.5317).
 
We collected data on patient characteristics, including demographics (sex and age); comorbidities (e.g., obesity, diabetes, hypertension, cancer, and chronic respiratory disease); symptoms (e.g., cough, shortness of breath, chest pain, sore throat, and headache); vital signs (HR, RR, systolic blood pressure, diastolic blood pressure, and axillary temperature); and laboratory test results (e.g., hemoglobin level, leukocyte count, platelet count, and creatinine level). Table 1 shows the clinical and epidemiological characteristics of the patients suspected of having COVID-19. The missing values in the dataset were imputed by using the mean value of each feature. The 100 rows were randomly divided into 70% for training and 30% for testing. The continuous variables were normalized on the basis of the mean and standard deviation of the training sample. For the discrete variables, all values were in the interval between 0 and 1; therefore, no normalization was applied.
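The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the data are randomly generated, the variable names are hypothetical, and the imputation is assumed to precede the split because that is the order in which the steps are reported.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 rows, 4 continuous features, roughly 10% missing values.
X = rng.normal(loc=50.0, scale=10.0, size=(100, 4))
X[rng.random(X.shape) < 0.10] = np.nan

# Mean imputation: replace each missing value with the mean of its feature.
# (Assumption: the means are computed over all rows, before the split.)
feature_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), feature_means, X)

# Random 70/30 split of the rows.
order = rng.permutation(len(X_imputed))
n_train = int(round(0.7 * len(X_imputed)))
X_train = X_imputed[order[:n_train]]
X_test = X_imputed[order[n_train:]]

# Normalize with the mean and standard deviation of the training rows only,
# and apply the same transformation to the test rows.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma
```

Computing the normalization statistics on the training rows alone keeps information from the test rows out of the fitted transformation. Computing the imputation means on the training rows alone would be a stricter alternative.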
 
Several machine learning algorithms were examined in the present study, including support vector machines, gradient boosting, multilayer perceptron (MLP), adaptive boosting, and decision trees. All these algorithms are known as supervised learning algorithms. In supervised learning, the model observes input-output pairs and the learning algorithm finds the optimal configuration of parameters resulting in a function that maps from input to output while minimizing a certain loss function.(12,13)
 
The choice to use several algorithms allows one to cover different methods, from more classic and simple statistical methods such as decision trees to ensemble learning, convex optimization, and gradient-based methods such as MLPs. Each approach has advantages and peculiarities that might or might not be suitable for the problem tackled in the present study. Therefore, a cross-validation step allowed us to verify which methods optimized a certain metric.
 
In addition to the classification methods, because the collected dataset was imbalanced, techniques for oversampling were used in order to improve the overall performance of the system. Class imbalance poses serious problems for machine learning techniques. Some of the most conventional approaches to these problems are undersampling and oversampling. Oversampling consists in creating artificial data on the basis of the statistical behavior of the elements of the minority class, whereas undersampling consists in sampling the majority class in such a way that the dataset becomes balanced.(14)
 
Synthetic Minority Oversampling Technique (SMOTE) is one of the most notable oversampling methods available.(15) SMOTE works by taking each sample in the minority class and creating synthetic samples along the line segments that connect the sample to each of its k nearest neighbors. Other oversampling methods include a variation of SMOTE, known as Borderline-SMOTE, and an adaptive synthetic sampling approach for imbalanced learning,(15-17) both of which were tested in the present study.
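The interpolation idea behind SMOTE can be illustrated with a simplified sketch. The function name smote_sketch, its parameters, and the data are hypothetical. Unlike the original algorithm, which iterates over every minority sample, this version draws the base samples at random. It is not the imbalanced-learn implementation used in the study.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority samples."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest neighbors of each sample (requires k < len(X_min)).
    neighbors = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))        # pick a minority sample
        nb = neighbors[base, rng.integers(k)]  # pick one of its k nearest neighbors
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic


# Hypothetical minority class with 20 samples and 3 features.
X_min = np.random.default_rng(1).normal(size=(20, 3))
X_new = smote_sketch(X_min, n_new=30)
```

Because every synthetic row lies on a segment between two existing minority samples, the new rows stay within the region already occupied by the minority class.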
 
Each combination of model and sampling technique was optimized by using a grid search approach. In grid search, a list of possible values for each hyperparameter is created, and all combinations of these values are examined. The cross-validation method used was k-fold, with k = 5. The programming language used was Python 3.6, and the following libraries were used: NumPy 1.19.5; imbalanced-learn 0.4.3; pandas 0.22.0; scikit-learn 0.21.0; SciPy 1.4.1; and statsmodels 0.9.0.
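A grid search with 5-fold cross-validation might look like the sketch below, written with scikit-learn. The dataset is synthetic, the hyperparameter grid is hypothetical (the grids used in the study are not reported here), and the use of stratified folds is an assumption made so that each fold contains both classes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical imbalanced dataset standing in for the clinical data.
X, y = make_classification(
    n_samples=100, n_features=10, weights=[0.7, 0.3], random_state=0
)

# Hypothetical grid: every combination of these values is evaluated (2 x 2 = 4).
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",  # mean AUC across the folds
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The attribute best_score_ holds the mean cross-validated AUC of the best combination, and cv_results_ holds the scores of every combination tested.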
 
After finding the best model, we were able to analyze other metrics, such as sensitivity, specificity, precision, and confusion matrix. We analyzed the relevance of the variables present in the data for the classification of the models. This allowed us to determine the importance of the variables used and the level of agreement between the model and previously established knowledge.
 
RESULTS
 
A total of 78 patients had information for RT-PCR testing for SARS-CoV-2 infection and were therefore included in the analysis. The study sample did not differ from the original sample (n = 100) with regard to the baseline characteristics (Table 1). The median value and interquartile range for each variable are shown in Table 1. Of the sample as a whole (N = 78), 42% were male, and the mean age was 61.4 ± 15.8 years. Nearly 60% of the study participants had two or more comorbidities, with hypertension and diabetes being the most prevalent (in 58.5% and 44.3%, respectively). One quarter of the study participants were current smokers. The median time elapsed since the onset of symptoms was 9 days (IQR: 3-14 days), the most common symptoms being shortness of breath (in 66%), cough (in 59%), fever (in 44%), and muscle or joint pain (in 43%).


 
For a comprehensive analysis, each combination of classification model and oversampling/undersampling method was trained and cross-validated 30 times. This approach allowed the construction of a performance distribution for each pair. Given the imbalanced nature of the dataset, the performance was evaluated by means of the AUC metric. Models such as MLP, the random forest method, and gradient boosting achieved similar performance levels. Of those, the random forest model without any oversampling method showed the highest median performance, although it was only slightly higher than the median performance of the other models. Given that this combination not only yielded the best performance but also entailed lower computational costs than did the other two methods, it was selected for further analysis.
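One way to build such a performance distribution is sketched below. It is an illustration under stated assumptions, not the procedure used in the study: the data are synthetic, only two models are compared, no resampling step is included, and each of the 30 runs is assumed to be a 5-fold cross-validation with a different random shuffle.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced dataset standing in for the clinical data.
X, y = make_classification(
    n_samples=100, n_features=10, weights=[0.7, 0.3], random_state=0
)

# Two illustrative models; the study compared more models and sampling methods.
models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

n_runs = 30
auc_runs = {name: [] for name in models}
for run in range(n_runs):
    # A different shuffle in each run gives a different partition into folds.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=run)
    for name, model in models.items():
        fold_aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
        auc_runs[name].append(fold_aucs.mean())  # mean AUC of this run

# Median of the 30 mean AUC values for each model.
median_auc = {name: float(np.median(values)) for name, values in auc_runs.items()}
print(median_auc)
```

Comparing medians over repeated runs, rather than a single cross-validation score, reduces the influence of any one favorable or unfavorable partition, which matters with a sample this small.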
 
Feature selection is also an important step in the implementation of machine learning models. Additionally, when considering the practical aspects of implementing a process that will be directly dependent on data collection, it is useful to find the best trade-off between performance and number of features. First, to select which features can be useful for classification, one must have a measure of their importance in the overall performance of the algorithm. Different approaches can be used in order to extract feature importance in machine learning models, including permutation importance, Shapley additive explanations (SHAP), and mean decrease in impurity (MDI), the last being particularly suitable for models such as the random forest. MDI, also known as the Gini importance, explores the structure of the random forest to evaluate feature importance. Given that a random forest is an ensemble learning algorithm based on decision trees, MDI sums the decrease in impurity obtained each time a feature is used to split a node in a tree, weighted by the proportion of samples reaching that node, and averages the result across all trees. This allows us to identify how relevant a certain feature is for generating a prediction. By evaluating the MDI for the trained model, we identified the ten most important features (Figure 1).
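In scikit-learn, the impurity-based (MDI) importances of a fitted random forest are exposed through the feature_importances_ attribute. The sketch below ranks features on a synthetic dataset. The feature names are hypothetical placeholders and do not correspond to the clinical variables of the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical dataset with 20 features, 5 of which carry signal.
X, y = make_classification(
    n_samples=100, n_features=20, n_informative=5, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, normalized so that they sum to 1.
importances = forest.feature_importances_

# Indices of the ten most important features, in decreasing order of importance.
top10 = np.argsort(importances)[::-1][:10]
for idx in top10:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```

The model can then be retrained on X[:, top10] to compare its performance with that of the model trained on all features.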


 
The process of analyzing feature relevance helps reduce computational cost, and, by reducing the number of features, it is possible to decrease the probability of introducing undesired bias due to the size of the training dataset. However, it is still important to evaluate the performance of the model with different numbers of features. As can be seen in Figure 2A, ROC curves were plotted for three different scenarios: all features; the five most relevant features; and the ten most relevant features. The best performance in terms of AUC was achieved by the model containing the ten most relevant features. As can be seen in Figure 2B, a confusion matrix of the model containing the ten most relevant features shows the relationship between the output of the model and the RT-PCR results, highlighting each type of correct and incorrect prediction.


 
Table 2 presents key metrics that highlight the performance of the model across different feature sets. Notably, when the ten most relevant features were used, the performance of the model improved not only in terms of AUC (as can be seen in Figure 2A) but also in terms of sensitivity. It is important to emphasize that the output of the model can be interpreted similarly to a probability, which allows the definition of a threshold (a value between 0 and 1) to determine whether a numerical output results in a positive or negative result. This provides flexibility to balance sensitivity against precision: lowering the threshold reduces false negatives at the cost of more false positives, and raising it has the opposite effect. For the results presented herein, a threshold of 0.5 was considered.
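The effect of the threshold can be illustrated with a small worked example. The function threshold_metrics, the labels, and the scores below are hypothetical and are not taken from the study data.

```python
def threshold_metrics(y_true, y_score, threshold=0.5):
    """Return confusion counts and derived metrics for a given threshold."""
    y_pred = [1 if score >= threshold else 0 for score in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Hypothetical RT-PCR results (1 = positive) and model outputs.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.6, 0.4, 0.3, 0.55, 0.1, 0.8, 0.2]

at_050 = threshold_metrics(y_true, y_score, threshold=0.5)
at_035 = threshold_metrics(y_true, y_score, threshold=0.35)
print(at_050)
print(at_035)
```

In this example, lowering the threshold from 0.5 to 0.35 recovers the one positive case that was missed, so sensitivity rises from 0.75 to 1.0 while specificity stays at 0.75.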


 
DISCUSSION
 
Since the WHO declared COVID-19 a pandemic on March 11, 2020, health care systems worldwide have faced intense strain. This raised the need for exploring new and emerging technologies to meet the increasing health demand. One important challenge was the scarcity of medical supplies and diagnostic tools, especially in the first year of the pandemic. The limited availability of resources, including COVID-19 diagnostic tests, highlighted the need for developing tools to identify patients with high clinical suspicion of COVID-19. In this context, AI techniques represent an efficient strategy for detection, severity assessment, and therapeutic approach.
 
In the last three years, many studies have investigated the role of imaging tests such as X-rays, CT scans, and ultrasound examination in the early diagnosis of COVID-19 through AI techniques.(18) On the other hand, the integration of clinical data into AI algorithms has been less studied and could represent an effective strategy to face the challenges of COVID-19, particularly in scenarios in which imaging tests are not readily available. Previous studies evaluating the use of AI in COVID-19 diagnosis showed accuracy values of approximately 85%.(19) Ahamad et al. reported that the most relevant predictive symptoms were fever (41.1%), cough (30.3%), lung infection (13.1%), and runny nose (8.43%).(20) Similarly, we found an accuracy of 84% when we included ten variables in the model. However, the most relevant predictors in our study were leukocyte count, PaCO2, troponin levels, duration of symptoms in days, platelet count, multimorbidity, presence of band forms, urea levels, age, and D-dimer levels. In this context, Silveira found an association between blood count and COVID-19 diagnosis through a gradient boosting model, with an accuracy of 80.0%, a sensitivity of 75.6%, and a specificity of 82.0%.(21) The variables that had the greatest influence on the predictive decision were basophil count, eosinophil count, and leukocyte count.(21) It is important to highlight that our objective was to predict the probability of a COVID-19 diagnosis in patients hospitalized with SARS.
 
One of the main limitations of the present study is the relatively small number of samples, especially in comparison with most machine learning applications.(18-20) However, despite this limitation, the achieved performance demonstrates the value of the method as a useful tool for the health care system. To enhance the performance of the model, it is crucial to expand data collection to different municipalities. This would not only increase the size of the dataset but also improve the generalization capability of the model. Additionally, continuous retraining of the model would enable it to adapt to the evolving effects of the virus on the population. Such efforts would not only increase the impact of the model but also provide a deeper understanding of the long-term effects and behavior of the COVID-19 pandemic.
 
Although AI-based tools do not replace medical evaluation, their contribution is unequivocal in improving the management of several issues and health problems. Particularly in pandemic situations, AI-based tools can help to make rapid decisions related to the diagnosis and prevention of disease spreading. Thus, given that in the future the health care system might be faced with other pandemics, there is a need for continued improvement of AI technologies. Future studies should focus on strengthening current technologies to detect, monitor, and diagnose emerging and potentially life-threatening medical conditions.
 
ACKNOWLEDGMENTS
 
We are grateful to the staff and department heads at the emergency department and COVID-19 ward at the Hospital Escola - Universidade Federal de Pelotas, in the city of Pelotas, Brazil.
 
AUTHOR CONTRIBUTIONS
 
SECM, MBOF, OSK and MABC participated in the design of the study; in the analysis and interpretation of data; and in the writing and critical review of the article. RBN and FSM participated in the conception and design of the study; and in the acquisition, analysis, and interpretation of data. All authors approved the final version to be published.
 
CONFLICTS OF INTEREST
 
None declared.
 
REFERENCES
 
1. Pan American Health Organization (PAHO) [homepage on the Internet]. Washington, DC: PAHO; [cited 2024 May 27]. History of the COVID-19 pandemic. Available from: https://www.paho.org/pt/covid19/historico-da-pandemia-covid-19
2. World Health Organization (WHO) [homepage on the Internet]. Geneva: WHO; [cited 2024 May 27]. Coronavirus disease (COVID-19) Epidemiological Updates and Monthly Operational Updates. Available from: https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports
3. Sharma A, Ahmad Farouk I, Lal SK. COVID-19: A Review on the Novel Coronavirus Disease Evolution, Transmission, Detection, Control and Prevention. Viruses. 2021;13(2):202. https://doi.org/10.3390/v13020202
4. Brasil. Ministério da Saúde. Secretaria de Ciência, Tecnologia, Inovação e Insumos Estratégicos em Saúde. Diretrizes para Diagnóstico e Tratamento da COVID-19. Versão 4. Brasília: o Ministério; 2020.
5. Billah M, Washeed S. Gastrointestinal polyp detection in endoscopic images using an improved feature extraction method. Biomed Eng Lett. 2017;8(1):69-75. https://doi.org/10.1007/s13534-017-0048-x
6. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA. 2016;316(22):2402-2410. https://doi.org/10.1001/jama.2016.17216
7. Lahmiri S, Dawson DA, Shmuel A. Performance of machine learning methods in diagnosing Parkinson’s disease based on dysphonia measures. Biomed Eng Lett. 2017;8(1):29-39. https://doi.org/10.1007/s13534-017-0051-2
8. Batista AFM, Miraglia JL, Donato THR, Chiavegatto Filho ADP. COVID-19 diagnosis prediction in emergency care patients: a machine learning approach. medRxiv 2020.04.04.20052092. https://doi.org/10.1101/2020.04.04.20052092
9. Rodríguez-Rodríguez I, Rodríguez JV, Shirvanizadeh N, Ortiz A, Pardo-Quiles DJ. Applications of Artificial Intelligence, Machine Learning, Big Data and the Internet of Things to the COVID-19 Pandemic: A Scientometric Review Using Text Mining. Int J Environ Res Public Health. 2021;18(16):8578. https://doi.org/10.3390/ijerph18168578
10. Mei X, Lee HC, Diao KY, Huang M, Lin B, Liu C, et al. Artificial intelligence-enabled rapid diagnosis of patients with COVID-19. Nat Med. 2020;26(8):1224-1228. https://doi.org/10.1038/s41591-020-0931-3
11. Brasil. Instituto Brasileiro de Geografia e Estatística (IBGE) [homepage on the Internet]. Rio de Janeiro: IBGE; [cited 2024 Jun 2]. Cidades e Estados - RS Pelotas. Available from: https://www.ibge.gov.br/cidades-e-estados/rs/pelotas.html
12. Russell S, Norvig P. Artificial Intelligence: A Modern Approach. 3rd ed. Pearson; 2016.
13. Alpaydin E. Introduction to Machine Learning. 4th ed. Cambridge: MIT Press; 2020.
14. Gosain A, Sardana S. Handling Class Imbalance Problem using Oversampling Techniques: A Review. Proceedings of the 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). 2017 Sep 13-16; Udupi, India. Piscataway, NJ: IEEE; 2017. p. 79-85. https://doi.org/10.1109/ICACCI.2017.8125820
15. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321-357. https://doi.org/10.1613/jair.953
16. Han H, Wang WY, Mao BH. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. Advances in Intelligent Computing. ICIC 2005. Lecture Notes in Computer Science, vol 3644. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11538059_91
17. He H, Bai Y, Garcia EA, Li S. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). 2008 Jun 1-8; Hong Kong. Piscataway, NJ: IEEE; 2008. p. 1322-1328. https://doi.org/10.1109/IJCNN.2008.4633969
18. Gudigar A, Raghavendra U, Nayak S, Ooi CP, Chan WY, Gangavarapu MR, et al. Role of Artificial Intelligence in COVID-19 Detection. Sensors (Basel). 2021;21(23):8045. https://doi.org/10.3390/s21238045
19. Comito C, Pizzuti C. Artificial intelligence for forecasting and diagnosing COVID-19 pandemic: A focused review. Artif Intell Med. 2022;128:102286. https://doi.org/10.1016/j.artmed.2022.102286
20. Ahamad MM, Aktar S, Rashed-Al-Mahfuz M, Uddin S, Liò P, Xu H, et al. A machine learning model to identify early stage symptoms of SARS-Cov-2 infected patients. Expert Syst Appl. 2020;160:113661. https://doi.org/10.1016/j.eswa.2020.113661
21. Silveira EC. Prediction of COVID-19 from hemogram results and age using machine learning. Front Health Inform. 2020;9(1):39. https://doi.org/10.30699/fhi.v9i1.234


© All rights reserved 2025 - Jornal Brasileiro de Pneumologia