Data Science Group COVID-19 (Comunitat Valenciana)

Since April 2020, I have been a member of the Data Science Group COVID-19 of the Valencia Region. This is a multidisciplinary team of volunteers who work side by side with the General Director of Analysis and Public Policies of the Presidency of the Valencia Region Government. The COVID-19 analysis is coordinated with the Ministry of Health and the other regional ministries involved. The working group is led by Nuria Oliver, Commissioner of the Presidency of the Generalitat for the Valencian Strategy for Artificial Intelligence and, in particular, for the coordination of data intelligence in response to the COVID-19 epidemic in the Valencia Region.

The group includes experts from the Jaume I University, the University of Valencia, the Polytechnic University of Valencia, the Miguel Hernández University, the University of Alacant, the CEU Cardenal Herrera University, Fisabio, and Microsoft, with the collaboration of Esri, the INE, the Secretary of State for Artificial Intelligence, and the three main mobile phone operators in the country.

The group is organized into three priority areas, each with its own work coordinators: (1) analysis, visualization, and modeling of mobility data, (2) epidemiological models, and (3) data science applied to COVID-19. I work on epidemiological models with Antonio Falcó, Miguel Rebollo, Miguel A. Lozano, Emilio Sansano, Xavier Barber, and Francisco Escolano.

The web page of our data research group is http://infocoronavirus.gva.es/es/grup-de-ciencies-de-dades-del-covid-19-de-la-comunitat-valenciana

Paper published in JAMIA: Potential limitations in COVID-19 machine learning due to data source variability: a case study in the nCov2019 dataset

The lack of representative COVID-19 data is a bottleneck for reliable and generalizable machine learning. Data sharing is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases from data source variability for COVID-19 machine learning. In this work, we used the publicly available nCov2019 dataset, which includes patient-level data from several countries. We aimed to discover and classify severity subgroups using symptoms and comorbidities.

In our work published in JAMIA, we showed that cases from the two countries with the highest prevalence were divided into separate subgroups with distinct severity manifestations. This variability can reduce the representativeness of training data with respect to the model's target populations and increase model complexity, with the risk of overfitting. We conclude that data source variability is a potential contributor to bias in distributed research networks. We call for systematic assessment and reporting of data source variability and data quality in COVID-19 data sharing, as key information for reliable and generalizable machine learning.
Our analysis tool, developed within BDSLab at UPV, can be found at http://covid19sdetool.upv.es/?tab=ncov2019
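
To give a concrete feel for what "data source variability" means, here is a minimal sketch that compares how a symptom variable is distributed in two data sources. It uses the Jensen-Shannon divergence, which is one common way to quantify such differences and is an assumption here, not necessarily the method used in the paper or in the tool. The symptom categories, the two sources, and the helper names `distribution` and `js_divergence` are hypothetical, and the records are synthetic toy data, not taken from nCov2019.

```python
import math
from collections import Counter


def distribution(records, categories):
    """Relative frequency of each category in a list of labels."""
    counts = Counter(records)
    total = len(records)
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete
    distributions; 0 means identical, 1 means no overlap."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2


# Synthetic toy records (illustration only, not real patient data).
categories = ["fever", "cough", "pneumonia"]
source_a = ["fever"] * 6 + ["cough"] * 3 + ["pneumonia"] * 1
source_b = ["fever"] * 2 + ["cough"] * 2 + ["pneumonia"] * 6

p = distribution(source_a, categories)
q = distribution(source_b, categories)
jsd = js_divergence(p, q)
print(f"Jensen-Shannon divergence between sources: {jsd:.3f}")
```

A value near 0 would suggest the two sources describe similar populations for this variable, while larger values flag a difference that a model trained on pooled data might end up learning instead of the clinical signal.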