Researchers in the UK developed and evaluated a scalable, privacy-preserving federated learning solution for COVID-19 screening in hospitals, built on affordable microcomputers, as published in The Lancet Digital Health.
In the field of medical artificial intelligence (AI) research, the use of patient data presents ethical, legal, and technical challenges. Federated learning provides a privacy-preserving approach, allowing the development of AI models without sharing data across organizations, making it ideal for COVID-19 screening in hospitals.
However, real hospital implementations of federated learning are rare and require technical expertise and data separation from clinical systems. Further research is needed to refine and validate the federated learning approach in different healthcare environments and address implementation challenges for broader acceptance in real clinical settings.
About the Study
The study involved the development and testing of a federated learning solution for COVID-19 screening in UK hospitals. Researchers selected four hospital groups from the National Health Service (NHS) and deployed Raspberry Pi 4 Model B devices as low-cost client hardware. This setup allowed each hospital to train, calibrate, and evaluate AI models locally using anonymized patient data, preserving privacy.
NHS trusts were provided with inclusion and exclusion criteria for data extraction from electronic health records. The anonymization of data was strictly carried out by clinical teams or NHS informaticians. The study utilized pre-pandemic control cohorts and a COVID-19-positive cohort for training, including vital signs, demographic data, and blood test results. Data extracts were loaded onto client devices for joint training, calibration, and evaluation.
Federated training used logistic regression and deep neural network classifiers. Features were pre-processed into a common format, and missing values were imputed using each site's local medians. The FedAvg algorithm enabled cross-hospital training: clients transmitted model parameters, rather than patient data, to a central server for aggregation. Local models were then calibrated to a target sensitivity threshold, with evaluation results aggregated by the server.
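The two mechanics described here, local median imputation and FedAvg parameter averaging, can be sketched as follows. This is a minimal illustration; the function names, array shapes, and weighting are assumptions, not the study's actual code.

```python
import numpy as np

def impute_local_median(X):
    """Fill missing values with this site's own per-feature medians,
    so imputation never requires data from other hospitals (sketch)."""
    X = X.astype(float).copy()
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X

def fedavg(site_params, site_sizes):
    """FedAvg aggregation: average each parameter array across sites,
    weighted by each site's local sample count."""
    total = sum(site_sizes)
    return [
        sum(p[i] * n for p, n in zip(site_params, site_sizes)) / total
        for i in range(len(site_params[0]))
    ]
```

In each round, every client would train briefly on its local extract, send its updated parameters to the server, and receive the weighted average back.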
The federated evaluation used prospective cohorts from different hospitals. Calibration and imputation strategies varied depending on whether the sites participated in both training and evaluation or only in evaluation. Site-specific model optimization tested the adaptability of the global model, and centralized server-side evaluation verified the accuracy of federated evaluation. The study also examined the influence of individual features on model predictions.
The statistical analysis focused on comparing model performance across different configurations and training methods using measures such as the area under the receiver operating characteristic curve (AUROC), sensitivity, and specificity.
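These three measures can be sketched for any score-producing classifier. The labels and scores below are synthetic, assumed purely for illustration:

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC as the probability that a random positive outscores a
    random negative (ties count half) -- the Mann-Whitney formulation."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def sens_spec(y_true, y_pred):
    """Sensitivity (true-positive rate) and specificity (true-negative
    rate) from binary predictions."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)
```

AUROC summarizes ranking quality across all thresholds, while sensitivity and specificity describe behavior at one chosen operating point.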
The study showed a marked increase in the AUROC of the logistic regression model under federated training. For example, AUROC rose from 0.685 to 0.829 at OUH and from 0.731 to 0.865 at PUH. The deep neural network models showed even larger gains, with AUROC increasing from 0.574 to 0.872 at OUH and from 0.622 to 0.876 at PUH.
Three NHS trusts – OUH, UHB, and PUH – participated in this federated training and contributed data from a large patient cohort. The federated evaluation included data from patients admitted during the second wave of the pandemic, with varying COVID-19 prevalence rates and average ages at the participating sites.
External evaluation of the final global models showed that both the logistic regression and deep neural network models achieved high classification performance. Federated calibration yielded sensitivities of 83.4% for the logistic regression model and 89.7% for the deep neural network model.
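Calibrating a model to a target sensitivity, as the sites did with local data, amounts to choosing a decision threshold from held-out calibration scores. A minimal sketch, with the function name and the exact rule assumed for illustration (the study's precise calibration procedure may differ):

```python
import numpy as np

def threshold_for_sensitivity(y_true, y_score, target=0.85):
    """Return the highest score threshold that still flags at least
    `target` of the known positives in local calibration data."""
    pos = np.sort(np.asarray(y_score)[np.asarray(y_true) == 1])
    # The threshold must sit at or below the (1 - target) quantile
    # of positive scores to catch the required fraction of positives
    k = int(np.floor((1.0 - target) * len(pos)))
    return pos[k]
```

Because each site calibrates on its own data, the same global model can be tuned to a common sensitivity target without any data sharing.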
The performance of these models remained stable across evaluation sites. Notably, the deep neural network model gained more from federation than the logistic regression model, with its performance plateauing after approximately 75-100 rounds of federated training.
Site-specific tuning of the global models resulted in a slight improvement in the deep neural network model at PUH. However, no significant improvement was observed for the logistic regression model, indicating a high degree of generalizability of the global models and minimal shifts in predictor distribution between sites.
Analysis of the global logistic regression model highlighted several important predictors, such as granulocyte counts and albumin concentrations, consistent with previous studies emphasizing their role in the inflammatory response. Additionally, analysis of the deep neural network model using Shapley additive explanations (SHAP) revealed eosinophil count as a highly influential predictor.
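The Shapley-value attribution that SHAP approximates can be illustrated with a brute-force computation on a toy model: each feature's contribution is its average marginal effect over all feature coalitions, with absent features held at a baseline. Everything below is a teaching sketch, not the SHAP library the study would have used in practice.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for one prediction of `predict` at
    point `x`, replacing features outside each coalition with `baseline`."""
    n = len(x)

    def value(S):
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                # Standard Shapley weight |S|! (n - |S| - 1)! / n!
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(set(S) | {i}) - value(set(S)))
        phi.append(total)
    return phi
```

The attributions sum to the difference between the prediction at `x` and at the baseline; for a linear model each feature's value is simply its coefficient times its deviation from baseline. Real SHAP tooling approximates this, since the exact sum is exponential in the number of features.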