The lack of representative COVID-19 data is a bottleneck for reliable and generalizable machine learning. Data sharing alone is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases arising from data source variability in COVID-19 machine learning. In this work, we used the publicly available nCov2019 dataset, which includes patient-level data from several countries, aiming to discover and classify severity subgroups based on symptoms and comorbidities.
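As a minimal sketch of what subgroup discovery over binary symptom and comorbidity indicators can look like, the snippet below clusters hypothetical patient records with a tiny k-modes procedure (centroids are per-feature majority votes). The feature names, patient values, and the choice of k-modes are illustrative assumptions, not the method or data of the original study.

```python
# Hypothetical patient records: binary symptom/comorbidity indicators.
# Feature names and values are illustrative, not taken from nCov2019.
patients = [
    {"fever": 1, "cough": 1, "diabetes": 0, "hypertension": 0},
    {"fever": 1, "cough": 1, "diabetes": 0, "hypertension": 1},
    {"fever": 0, "cough": 0, "diabetes": 1, "hypertension": 1},
    {"fever": 0, "cough": 1, "diabetes": 1, "hypertension": 1},
    {"fever": 1, "cough": 0, "diabetes": 0, "hypertension": 0},
    {"fever": 0, "cough": 0, "diabetes": 1, "hypertension": 0},
]
features = ["fever", "cough", "diabetes", "hypertension"]

def hamming(a, b):
    """Number of differing binary features between two patients."""
    return sum(a[f] != b[f] for f in features)

def k_modes(data, k, iters=10):
    """Minimal k-modes: assign to the nearest centroid, then update
    each centroid as the per-feature majority vote of its members."""
    centroids = data[:k]  # naive initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in data:
            idx = min(range(k), key=lambda i: hamming(p, centroids[i]))
            clusters[idx].append(p)
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = {
                    f: int(sum(m[f] for m in members) * 2 >= len(members))
                    for f in features
                }
    return clusters

subgroups = k_modes(patients, k=2)
for i, group in enumerate(subgroups):
    print(f"subgroup {i}: {len(group)} patients")
```

In practice a clustering of this kind would be followed by characterising each subgroup's severity profile; here the sketch only shows the grouping step.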
In our work published in JAMIA, we showed that cases from the two countries with the highest prevalence were split into separate subgroups with distinct severity manifestations. Such variability can reduce the representativeness of training data with respect to the model's target populations and increase model complexity, raising the risk of overfitting. We conclude that data source variability is a potential contributor to bias in distributed research networks, and we call for the systematic assessment and reporting of data source variability and data quality in COVID-19 data sharing, as key information for reliable and generalizable machine learning.
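A first step toward assessing this kind of source variability is simply cross-tabulating subgroup membership by data source. The sketch below does so with stdlib tools over hypothetical (source, subgroup) assignments; the labels and counts are illustrative and do not reproduce the JAMIA results.

```python
from collections import Counter

# Hypothetical (data_source, severity_subgroup) assignments.
# Labels and counts are illustrative only.
records = [
    ("country_1", "A"), ("country_1", "A"), ("country_1", "A"), ("country_1", "B"),
    ("country_2", "B"), ("country_2", "B"), ("country_2", "B"), ("country_2", "A"),
]

# Cross-tabulate subgroup membership by source.
table = Counter(records)
sources = sorted({src for src, _ in records})
groups = sorted({g for _, g in records})

# Per-source subgroup shares: large differences between sources hint
# that subgroups are driven by the data source rather than by pathology.
shares = {}
for src in sources:
    total = sum(table[(src, g)] for g in groups)
    shares[src] = {g: table[(src, g)] / total for g in groups}
    print(src, shares[src])
```

A fuller assessment would replace this contingency-table check with a formal distributional comparison between sources, but the table already makes source-concentrated subgroups visible.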
Our analysis tool, developed within the BDSLab at UPV, is available at http://covid19sdetool.upv.es/?tab=ncov2019