We present first results on COVID-19 patient subgroups defined by symptoms and comorbidities, obtained from the open nCov-2019 dataset. The nCov-2019 dataset comprises a collection of publicly available information on worldwide cases confirmed during the ongoing nCoV-2019 outbreak.
Materials: We analyzed the raw nCov-2019 dataset release of 2020-05-11. We included those cases where at least one symptom and an outcome were available. Then, we fixed duplicates and homogenized the values of outcomes, comorbidities and symptoms. We mapped the latter to ICD-10 terms. The final sample included 170 cases.
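The inclusion and homogenization steps can be sketched on a toy table as follows. This is only an illustration, not the pipeline from the repository: the column names (`symptoms`, `outcome`, `age`), the `;` separator, the `OUTCOME_SYNONYMS` table and the treatment of duplicates as fully identical rows are all assumptions, and the ICD-10 mapping is left out.

```python
import pandas as pd

# Toy rows standing in for the raw release; column names are assumptions.
raw = pd.DataFrame({
    "symptoms": ["fever; cough", None, "Fever", "fever; cough", "dyspnea"],
    "outcome": ["Died", "discharged", " discharged ", "Died", None],
    "age": ["70", "35", "52", "70", "61"],
})

# Keep cases with at least one symptom and a recorded outcome.
cases = raw.dropna(subset=["symptoms", "outcome"])

# Assumption: "duplicates" means fully identical rows.
cases = cases.drop_duplicates().copy()

# Homogenize free-text values: trim whitespace and lowercase.
cases["outcome"] = cases["outcome"].str.strip().str.lower()
cases["symptoms"] = cases["symptoms"].str.lower().str.split(";").apply(
    lambda items: sorted({s.strip() for s in items if s.strip()})
)

# Hypothetical synonym table used to merge equivalent outcome labels.
OUTCOME_SYNONYMS = {"died": "deceased", "death": "deceased"}
cases["outcome"] = cases["outcome"].replace(OUTCOME_SYNONYMS)

# One 0/1 column per symptom, ready for a categorical embedding.
indicator = cases["symptoms"].str.join("|").str.get_dummies()
print(cases[["symptoms", "outcome"]])
print(indicator)
```

Each row of the resulting `indicator` table is one retained case, and each column marks the presence or absence of one homogenized symptom.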
Methods: We applied a 3-dimensional Multiple Correspondence Analysis embedding of symptoms and outcomes, followed by hierarchical clustering. The proper number of clusters for both the age-independent and the age-group analyses was selected by supervised inspection of group consistency.
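A minimal sketch of this kind of analysis, assuming binary (present/absent) clinical variables, is given below. It computes a Multiple Correspondence Analysis through the singular value decomposition of the standardized indicator matrix and then clusters the 3-dimensional row coordinates. The toy data, the helper name `mca_embedding`, the Ward linkage and the fixed choice of two clusters are assumptions made for illustration; the source does not state the linkage criterion, and in the study the number of clusters was chosen by inspection.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def mca_embedding(binary, n_components=3):
    """Row principal coordinates of an MCA on binary (0/1) variables."""
    binary = np.asarray(binary, dtype=float)
    # Complete disjunctive table: one column for "present", one for "absent".
    indicator = np.hstack([binary, 1.0 - binary])
    # Drop categories that never occur, since their column mass would be zero.
    indicator = indicator[:, indicator.sum(axis=0) > 0]
    # Correspondence matrix with row and column masses.
    P = indicator / indicator.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of the rows (cases) on the leading axes.
    return (U[:, :n_components] * sigma[:n_components]) / np.sqrt(r)[:, None]


# Toy data: 6 hypothetical cases described by 5 binary variables.
toy = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 0],
])

embedding = mca_embedding(toy, n_components=3)

# Agglomerative clustering on the embedded cases (Ward linkage is an assumption).
tree = linkage(embedding, method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")
print(embedding.round(3))
print(labels)
```

Cutting the same `tree` at several values of `t` and comparing the resulting groups is one way to support the manual selection of the number of clusters.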
Results: We found clinically meaningful patient subgroups based on symptoms and comorbidities, both within specific age groups and in the age-independent analysis. However, cases from the two most prevalent source countries were divided into separate subgroups with different manifestations of severity.
For further details, see our publication:
Carlos Sáez, Nekane Romero, J Alberto Conejero, Juan M García-Gómez. Potential limitations in COVID-19 machine learning due to data source variability: a case study in the nCov2019 dataset. Journal of the American Medical Informatics Association. Accepted manuscript. doi: 10.1093/jamia/ocaa258
Code for replication is available in our COVID-19 Subgroup Discovery tool GitHub repository.