NEWS

Scientists call for fully open sharing of coronavirus genome data

Other researchers say that restrictions at the largest SARS-CoV-2 genome platform encourage fast sharing while protecting data providers’ rights.

A visualization of 56 SARS-CoV-2 genomes. Credit: Martin Krzywinski/SPL

Hundreds of scientists are urging that SARS-CoV-2 genome data should be shared more openly to help analyse how viral variants are spreading around the world.

Researchers have posted huge numbers of SARS-CoV-2 genome sequences online since January 2020. The most popular data-sharing platform, called GISAID, now hosts more than 450,000 viral genomes; Soumya Swaminathan, the chief scientist at the World Health Organization (WHO), has called it a “game changer” in the pandemic. But it doesn’t allow sequences to be reshared publicly, which is hampering efforts to understand the coronavirus and the rapid rise of new variants, argues Rolf Apweiler, co-director of the European Bioinformatics Institute (EBI) near Cambridge, UK, which hosts its own large genome database that includes SARS-CoV-2 sequences.

“The openness of SARS-CoV-2 sequence data is crucial for the rapid response against the biggest health threat to humankind in a very, very long time,” says Apweiler.

In a letter released on 29 January, Apweiler and others call for researchers to post their genome data in one of a triad of databases that don’t place any restrictions on data redistribution: the US GenBank, the EBI’s European Nucleotide Archive (ENA) and the DNA Data Bank of Japan, which are collectively known as the International Nucleotide Sequence Database Collaboration (INSDC).

Anyone can anonymously access the INSDC’s data and use them as they want, but GISAID requires that users confirm their identity and agree not to republish the site’s genomes without permission from the data provider. This means that studies building on GISAID data — such as those that create evolutionary trees analysing how SARS-CoV-2 variants are related — can’t publish full data so that others can easily check their analyses or further build on their data set. They must direct readers back to the GISAID site.

The letter says the scientific community should “remove barriers that restrain effective data sharing”, but doesn’t mention GISAID specifically. It is signed by more than 500 scientists, including the 2020 chemistry Nobel laureate Emmanuelle Charpentier, and the head of the COVID-19 Genomics UK Consortium, Sharon Peacock. Where scientists have already established submissions to other databases, the letter states, “these submissions should continue in parallel”.

Feature not flaw

Many researchers who work with GISAID say that its terms of access are a benefit, because they encourage hesitant researchers to share data online speedily, without fear that others will use the results without credit. “The reason so many labs have provided SARS-CoV-2 genomes to GISAID is precisely because of the data-access agreement that restricts public resharing,” says Sebastian Maurer-Stroh, a bioinformatician at Singapore’s Agency for Science, Technology and Research. GISAID has worked with many labs to assist them to share data, he says.

GISAID stands for the Global Initiative on Sharing Avian Influenza Data; an international consortium of researchers helped to set it up as a non-profit foundation in 2008, to address researchers’ reluctance to share data on influenza strains. Some nations, including Indonesia, a hotspot for avian flu, feared that pharmaceutical firms would create drugs and vaccines using the sequence data without crediting the original data providers or sharing the benefits of the work with them. But they were persuaded to share sequences rapidly on GISAID; in March 2013, for instance, China published sequences of H7N9 avian flu in the database on the same day it informed the WHO of three infections in people. “GISAID encourages and incentivizes real-time data sharing by parties who would otherwise be reluctant to share, by ensuring that they retain their rights in their data,” says a spokesperson for the initiative.

“This issue is not only about science, but also about sovereignty and equity,” says Marie-Paule Kieny, a vaccine researcher at INSERM, the French national health-research institute in Paris. “GISAID empowers the rapid flow of SARS-CoV-2 sequence data with maximal impact,” she says, because scientists depositing sequences can trust that their rights will be respected by data users.

Senjuti Saha, a microbiologist who works on SARS-CoV-2 genomes at the Child Health Research Foundation in Dhaka, says that she appreciates the call for open data beyond what GISAID offers, but worries that it might further dissuade researchers in low- and middle-income countries (LMICs) from uploading data until they have analysed them. During the pandemic, she says, some LMICs have started doing more viral sequencing, although labs often lack computational infrastructure. She says that she’s seen LMIC coronavirus data taken out of context by academics in wealthier countries who don’t consult or credit the data providers. “We really want to share our data, but it is heart-breaking and demotivating when we know we worked so hard to generate data, but we don’t get the credit for it,” she says.

The letter, says Kieny, “seems to me like an initiative from European and high-income countries not fully informed on the critical need to ensure that low-resource countries accept to share sequences freely, so that the public-health impact of sequencing of pathogens such as SARS-CoV-2 is maximized”.

ENA head Guy Cochrane says the EBI is aware of the global issues around data and benefit sharing, and is actively involved in finding benefit-sharing mechanisms that empower countries in the global south and keep data open. But even well-resourced European countries could do more to share their data openly, he says.

Data challenges

Some researchers told Nature that besides arguments about equity and openness, there is an issue with GISAID’s differential control over how registered users can download its data. Some users must download files in small batches, for instance, but others can get an entire data set in bulk with GISAID approval. The GISAID spokesperson says that’s because the initiative needs to know who is using its data and for what reason, so that nothing is erroneously redistributed.

Cochrane adds that another challenge with GISAID’s platform is that researchers post ‘assemblies’ — or reconstructions — of viral genomes from the chunks of data read off sequencing machines, rather than the raw data. Assembly always involves some interpretation of inevitable errors in the sequencing process, Cochrane says, and this can lead to what look like mutations in a genome that are in fact artefacts of sequencing. Access to the raw data of many genomes helps scientists dig into these issues, and Cochrane says researchers should share their raw and assembled sequencing data, which they can do at the INSDC even if they also post on GISAID. Maurer-Stroh, however, says that GISAID is aware of such issues and already provides quality-control checks to flag potential mistakes in submitted genomes. Cochrane says such processes can only reduce, not eliminate, artefact errors.

An EBI-hosted data portal that brings together fully open COVID-19 data sets submitted to the INSDC currently hosts more than 270,000 raw SARS-CoV-2 sequences and 55,000 assembled genomes — fewer than GISAID. “We have a fog of incomplete knowledge,” says Apweiler. He says that some scientists might think, incorrectly, that submitting data to GISAID means that the results will automatically be shared openly at the INSDC — and he hopes that the call to share data without restriction will boost the INSDC’s data trove.

But telling scientists to resubmit their SARS-CoV-2 data to the INSDC is complex, says David Haussler, who directs a genomics institute working with INSDC and GISAID data at the University of California, Santa Cruz. Bioinformaticians are in crisis mode, rushing to get genome data and analyse them in detail, and want to share as much as they are permitted to publish about key new mutations in sequences, he says. He did not sign the open letter — although he supports restriction-free data sharing — because he hopes instead that GISAID can temporarily drop some of its access terms during the pandemic, perhaps to coordinate with the INSDC.

Kieny, however, says that could lead to some scientists losing trust in GISAID and not filing their sequences with the database so quickly. “There is no obstacle, for those who want to do it, to deposit their sequences into the INSDC,” she says.

Nature 590, 195-196 (2021)
