
Section 4.1
Treat Biological Data as a Strategic Resource
Chapter 04
4.1A
Congress must authorize the Department of Energy (DOE) to create a Web of Biological Data (WOBD), a single point of entry for researchers to access high-quality data.
4.1B
Congress should authorize the National Institute of Standards and Technology (NIST) to create standards that researchers must meet to ensure that U.S. biological data is ready for use in AI models.
4.1C
Congress should authorize and fund the Department of the Interior (DOI) to create a Sequencing Public Lands Initiative to collect new data from U.S. public lands that researchers can use to drive innovation.
4.1D
Congress should authorize the National Science Foundation (NSF) to establish a network of “cloud labs,” giving researchers state-of-the-art tools to make data generation easier.
Biological data lie at the heart of emerging biotechnologies and are defined by the National Institute of Standards and Technology (NIST) as “the information, including associated descriptors, derived from the structure, function, or process of a biological system(s) that is either measured, collected, or aggregated for analysis.”228
Biological data include a wide variety of human data as well as data from animals, plants, fungi, bacteria, and viruses that comprise the rich biological landscape of the United States. These biological data enable scientists to discover, design, and optimize everything from individual components of cells to the behavior of whole groups of organisms to the inputs and outputs of biomanufacturing processes.
Biological data are especially important for unlocking AI’s potential. Just as large language model (LLM) chatbots such as ChatGPT are trained on vast amounts of text from the internet, biological design tools and scientific language models are trained on troves of biological data from research efforts.
If the United States is to cement its global lead in biotechnology, it must do more to develop high-quality data. The country has failed to provide high-quality data in a usable way, address gaps in data holdings, invest in automated biological data collection, or build the infrastructure needed to ensure that the United States fully leverages its wealth of biological data. The federal government has even failed to maximize the scientific discoveries and innovations already held in its existing collections of biological specimens.
U.S. natural history collections alone house an estimated 800 million to 1 billion biological specimens, ripe for opportunities to collect different types of biological data, including genomic data, but the samples are mostly untouched by researchers.229
China’s approach to biological data involves accessing and exploiting publicly available data from around the world, including from the United States, while harvesting its own domestic datasets and closing them off to the rest of the world.230 This approach gives China an asymmetric advantage in exploiting biological data and highlights its lack of data-sharing reciprocity. Many Chinese Communist Party (CCP) policies explicitly state that the government intends to prioritize the collection and use of biological data, as do statements from China’s medical AI industry.231 Accordingly, the U.S. government must ensure that China cannot obtain bulk and sensitive biological data from the United States.
Recommendation 4.1A
Congress must authorize the Department of Energy (DOE) to create a Web of Biological Data (WOBD), a single point of entry for researchers to access high-quality data.
Currently, U.S. biological data are generated from a wide variety of sources and organized with different purposes in mind. These data are structured differently across organizations in academia, government, and industry, and even across individual labs within the same organization.232
This uncoordinated approach makes collating large datasets a burdensome process for researchers, slowing potential discoveries. It might take months to answer a single question, assuming the information exists in the first place.
There are several noteworthy examples of biological databases created by federal departments and agencies, but each is incomplete for a future that requires data for new AI models. For example, the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) is one of the most comprehensive genomic databases in the world.233 But its datasets are in reality spread over different databases and data types and are not designed to be used comprehensively, a key requirement for training AI models. Targeted programs to make biological data more compatible would help to ensure that efforts such as the NCBI drive the future of biotechnology. The Joint Genome Institute at Lawrence Berkeley National Laboratory leads an exemplary data program on microbial sequences and ecosystems, but the program is focused on a small subset of microbiome data.234 Expanding efforts like this to include a larger class of organisms and other types of biological information, such as protein data, would add valuable tools needed for the future of biotechnology.
Having the ability to standardize, combine, and analyze biological data generated from different places, organisms, or experiments is critical to advancing research and training AI models. In many cases, the combination of different datasets is more valuable than the individual parts.
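As a rough illustration of what standardizing and combining datasets involves, the sketch below maps records from two hypothetical labs onto one shared schema before merging them. The field names, units, and values are assumptions made up for this example and do not describe any real database.

```python
# Hypothetical records: two labs describe the same measurement with
# different field names and units. All names and values are illustrative.
LAB_A_ROWS = [{"organism": " Example Organism A ", "genome_bp": "4600000"}]
LAB_B_ROWS = [{"species": "example organism b", "size_mb": "4.2"}]


def normalize_lab_a(row):
    """Lab A (assumed) reports genome size in base pairs under 'genome_bp'."""
    return {
        "organism": row["organism"].strip().lower(),
        "genome_size_bp": int(row["genome_bp"]),
        "source": "lab_a",
    }


def normalize_lab_b(row):
    """Lab B (assumed) reports genome size in megabases under 'size_mb'."""
    return {
        "organism": row["species"].strip().lower(),
        # Convert megabases to base pairs so both labs share one unit.
        "genome_size_bp": int(round(float(row["size_mb"]) * 1_000_000)),
        "source": "lab_b",
    }


def combine(lab_a_rows, lab_b_rows):
    """Return one list of records that all follow the shared schema."""
    combined = [normalize_lab_a(r) for r in lab_a_rows]
    combined += [normalize_lab_b(r) for r in lab_b_rows]
    return combined


combined = combine(LAB_A_ROWS, LAB_B_ROWS)
for record in combined:
    print(record)
```

Once every record follows the same schema, the combined list can be filtered, aggregated, or fed to a model without per-lab special cases, which is the practical payoff of standardization.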
The creation of a resource that combines biological datasets in a usable way would allow researchers to spend less time curating biological data and more time testing hypotheses, training models, and designing novel biological functions. Such a resource would:
- serve as a single point of entry for researchers to access different sources of biological data, all of which would be standardized, usable, and interoperable;
- enable discovery with advanced computational methods; and
- protect and control access to U.S. biological data.
To create this resource, Congress must authorize the Department of Energy (DOE) to create the Web of Biological Data (WOBD), a comprehensive central biological data infrastructure that would serve as a single point of entry for accessing biological data, have built-in security and access controls, and provide opportunities for advanced computation and analysis. The WOBD would start with data collected from federally funded efforts and have the potential to expand to collect other sources of data.
The WOBD would:
- serve as an access point for high-quality biological data from different locations;
- host new biological data;
- develop and maintain tools for using these biological data such as bioinformatics pipelines, models, and ontologies (i.e., the categories, properties, and relationships between concepts and conventions that define a field); and
- have a requirement that any datasets included on the platform must be standardized.
This centralized resource would have the added benefit of incorporating cybersecurity and access controls into the earliest stages of its design and development. There are many considerations when designing security and access controls for biological data. For example, plant genome sequences from basic research projects would need different access controls and cybersecurity protocols than sensitive medical records or human genomic data. The WOBD would be meant to encompass many different types of biological data, and as it expands, it would need to carefully build in security and take into account all appropriate privacy laws.
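One way to picture differentiated access controls is a tiered policy in which each category of data requires a minimum level of user authorization. The sketch below is only an illustration: the tier names, the data categories, and the default-to-most-restrictive rule are assumptions for this example, not features of any proposed WOBD design.

```python
from enum import IntEnum


class AccessTier(IntEnum):
    """Hypothetical authorization levels, ordered from least to most restricted."""
    OPEN = 0         # anyone may read
    REGISTERED = 1   # identified researchers
    CONTROLLED = 2   # vetted users with approved purposes


# Assumed mapping from data category to the minimum tier needed to read it.
REQUIRED_TIER = {
    "plant_genome": AccessTier.OPEN,
    "microbial_sequence": AccessTier.REGISTERED,
    "human_genomic": AccessTier.CONTROLLED,
    "medical_record": AccessTier.CONTROLLED,
}


def can_access(user_tier, data_category):
    """Return True if the user's tier meets the category's requirement.

    Unknown categories fall back to the most restrictive tier, so a new
    data type is never exposed by accident.
    """
    required = REQUIRED_TIER.get(data_category, AccessTier.CONTROLLED)
    return user_tier >= required


print(can_access(AccessTier.OPEN, "plant_genome"))
print(can_access(AccessTier.REGISTERED, "human_genomic"))
```

Defaulting unknown categories to the strictest tier reflects the text's point that security should be built in as the resource expands to new data types.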
Implementation for the WOBD within its first two years would include:
- assigning a National Laboratory to serve as the manager of the WOBD;
- having that National Laboratory work with existing datasets and collaborate with the NIST to stress-test the digital infrastructure and develop frameworks for interoperability; and
- requiring the DOE to report to Congress on the progress it has made on these tasks.
After the first two years, the WOBD should start establishing connections to all existing biological data from federally funded sources. The ultimate goal is for the WOBD to connect as many sources of biological data as possible through a single point of entry.
The WOBD would also have an R&D arm that would support human-centered design and ensure that its interface is user-friendly. As researchers and other users begin incorporating the WOBD into daily research life, it would grow and evolve with the field.
Recommendation 4.1B
Congress should authorize the National Institute of Standards and Technology (NIST) to create standards that researchers must meet to ensure that U.S. biological data is ready for use in AI models.
National infrastructure, metrology (the science of measurement), and standards for biological data are critical to advancing the field and maintaining American leadership, especially when it comes to AI-ready data. But the lack of universal standards, centralized access systems, or even a common language for biological data has exacerbated the current disconnected approach.
The federal government can fix this problem by building national infrastructure and frameworks for biological data that make it easier to combine biological datasets into resources that are greater than the sum of their parts. Creating standards and frameworks for data would also require the NIST to expand its portfolio of work related to biometrology, which is the science of measurement and standards as applied to biotechnology. Taken together, these steps would help create usable biological datasets that would reduce the amount of time and effort researchers spend curating biological data. The resulting datasets could be used to train advanced AI models that could provide novel biological insights at unprecedented levels of performance.
An expanded biotechnology portfolio at the NIST should include expanded capabilities for biometrology. These should include additional instrumentation and research that would translate into usable frameworks, metrics, and units, all built in collaboration with the biotechnology industry. These capabilities would support the building of AI-ready requirements. (For more details on biometrology and the expanded NIST portfolio, see Appendix C.) The NIST is well positioned to take on this mission because it leads the establishment of national standards for critical and emerging technologies such as AI and semiconductors. Indeed, it has already undertaken some efforts to standardize biological data, such as hosting the Genome in a Bottle Consortium, which seeks to characterize human genome data.235 While such efforts are helpful, there is still a need for a concentrated focus on developing AI-ready data. In particular, there is a need to maximize the potential of biological research by requiring that recipients of federal funding collect AI-ready data.
Congress should authorize the NIST to develop standards and frameworks for biological data, prioritizing the establishment of a definition of, and parameters for, AI-ready biological data. The NIST should design standards that support interoperability between new and existing U.S. biological datasets and that support the use of biological data in AI models.
To develop the AI-ready biological data definitions and frameworks, there should be a two-phased approach that would complement other work on standards as part of an expanded biotechnology portfolio at the NIST.
A phased approach is critical because developing a definition of AI-ready biological data is a complicated process due to the sheer number and breadth of biological data types. Accordingly, it is important to establish initial evaluation criteria before fully implementing an AI-ready biological data requirement.
Phase I: Define AI-Ready Biological Data and Pressure-Test Frameworks
Phase I would occur over the first two years, during which the NIST would define AI-ready data and pressure-test the definition to ensure it does not impose an undue burden on the research community. The NIST should create a definition in consultation with key federal, academic, and industry stakeholders. The definition, at a minimum, should specify that AI-ready biological data:
- are compatible with the WOBD (see Recommendation 4.1A);
- are accessible via an application programming interface (API) within one year of collection;
- include machine-readable metadata that enables reusability;
- can be normalized to support aggregation with other biological datasets;
- include all data controls and outputs; and
- are available in a raw, unprocessed format.
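The minimum criteria above lend themselves to an automated checklist. The sketch below shows one possible form such a check could take; the record fields, the function name, and the reading of "within one year" as 365 days are all assumptions for illustration, since the actual definition would be set by the NIST in consultation with stakeholders.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical self-description of a dataset (all fields are assumed)."""
    name: str
    collected_on: date
    api_available_on: Optional[date]  # None if not yet exposed via an API
    wobd_compatible: bool
    machine_readable_metadata: bool
    normalizable: bool
    includes_controls_and_outputs: bool
    raw_format_available: bool


def ai_ready_gaps(record: DatasetRecord) -> List[str]:
    """Return the criteria the dataset does not yet meet (empty if none)."""
    gaps = []
    if not record.wobd_compatible:
        gaps.append("not compatible with the WOBD")
    # Assumption: "within one year of collection" is read as 365 days.
    deadline = record.collected_on + timedelta(days=365)
    if record.api_available_on is None or record.api_available_on > deadline:
        gaps.append("not accessible via an API within one year of collection")
    if not record.machine_readable_metadata:
        gaps.append("missing machine-readable metadata")
    if not record.normalizable:
        gaps.append("cannot be normalized for aggregation")
    if not record.includes_controls_and_outputs:
        gaps.append("missing data controls or outputs")
    if not record.raw_format_available:
        gaps.append("raw, unprocessed format unavailable")
    return gaps


example = DatasetRecord(
    name="example-soil-microbiome",  # illustrative name only
    collected_on=date(2025, 1, 15),
    api_available_on=None,
    wobd_compatible=True,
    machine_readable_metadata=True,
    normalizable=True,
    includes_controls_and_outputs=True,
    raw_format_available=False,
)
for gap in ai_ready_gaps(example):
    print(gap)
```

Returning a list of unmet criteria, rather than a single pass or fail flag, would let a funding recipient see exactly what remains to be done, in keeping with the goal of not imposing an undue burden on the research community.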
Phase II: Fully Implement AI-Ready Data Requirement
In Phase II of the program, which would take place over the next three years, the NIST would expand its work to provide data management resources for biological data, build complete cybersecurity frameworks, hire a dedicated staff, and coordinate with relevant federal funding agencies on AI-ready data requirements. In this phase, the NIST would fully implement the requirements.
In parallel with developing these guidelines, the NIST should work with departments that are members of the Federal Acquisition Regulation (FAR) Council to update the FAR to incorporate a base-level requirement that federally funded research produce AI-ready biological data. This requirement should be applicable to large biological datasets, with thresholds defined by the NIST. Updates to the FAR should apply to all relevant agencies, while allowing for authorized exemptions on a case-by-case basis. The NIST should serve as a hub for helping recipients of federal funding that are subject to AI-ready provisions ensure that their data are indeed AI-ready.
Recommendation 4.1C
Congress should authorize and fund the Department of the Interior (DOI) to create a Sequencing Public Lands Initiative to collect new data from U.S. public lands that researchers can use to drive innovation.
Efforts to collect biological data in the United States are not strategically planned and executed, leaving gaps in biological data holdings and preventing researchers from understanding what data are needed. The United States would benefit from data collection in a number of different sectors, including healthcare, agriculture, and biomanufacturing. While the Commission identified many gaps in U.S. biological data collection, there is a particular need for non-human biological data, including data from animals, plants, microbes, and fungi, in order to better understand the breadth of America’s biological landscape.
The United States has one of the most extensive and varied public lands systems in the world, encompassing vast preserved ecosystems and the organisms that live in them. The national parks alone cover 85 million acres, including extreme landscapes such as Death Valley, with its record-breaking heat, and Gates of the Arctic, with its glacial wilderness.236
The national parks are home to unique organisms and ecosystems, including the coral formations at Dry Tortugas National Park in Florida, many different species of salamanders at Great Smoky Mountains National Park in North Carolina and Tennessee, and the gypsum dune fields and endemic moth species of White Sands National Park in New Mexico.237
Genomic data from plants, fungi, animals, and microorganisms are essential resources for research in genetics, evolution, and biochemistry, as well as for applied purposes such as medicine, food, and conservation. Genomic data collected from organisms living in extreme environments, such as the hydrothermal sites in Yellowstone National Park, could provide insights into how organisms adapt to live in these extreme environments. Similar to how penicillin was discovered by studying a fungus that produced the antibiotic for its own survival, studying a wide range of different organisms from public lands could contribute to biotechnology innovations.238
There is no coordinated federal effort to catalog the genomic landscape of U.S. federal lands. While there are efforts to collect genomic sequence data, these are tailored to the missions of specific departments and agencies, and they lack interoperability, collaboration, overarching data standards, and shared interagency goals.
Congress should authorize and fund the Department of the Interior (DOI) to create a Sequencing Public Lands Initiative to collect data from U.S. public lands that researchers can use to drive innovation. This major initiative would seek to sequence and catalogue the genomes of animals, plants, fungi, and bacteria across the United States.
The biological data collected from this initiative, which would be made available through the WOBD (see Recommendation 4.1A), would help protect National Park lands, allow researchers to learn from nature to develop innovations, and enhance broad educational opportunities.
The Sequencing Public Lands Initiative should proceed in three phases, so that the project is carefully executed and gradually expanded, culminating in an opportunity to sequence a wide variety of organisms from different federally managed lands.
Phase I: Selecting Five National Parks
The Sequencing Public Lands Initiative should start with a two-year initial phase in which five national parks are selected through a competitive process based on four criteria:
- Biological Resources: Each park should conduct an inventory of its own biological resources, including information on the breadth of known species and the rarity of present species.
- Implementation Plan: Each park should devise an implementation plan that includes input from experts on regional organisms, genomic sequencing, and taxonomy. These experts would coordinate sampling and collection logistics, as well as a proposed sequencing timeline.
- Education and Outreach Plans: Each park should have plans to establish partnerships with local public universities to provide opportunities for recent graduates to work on sample collection and processing. Furthermore, parks should have plans for outreach and public education efforts.
- Specific Research Questions: Each park should feature scientist-generated research questions particular to that park and its unique biome.
A newly established office in the DOI would work with the selected national parks to establish how to safely and appropriately collect samples, who would perform the collection, what training would be necessary, and how to work with the NIST to establish data standards. The DOI would also work with the U.S. Department of Agriculture (USDA) and the Smithsonian Institution to establish best practices for storing samples. Phase I would require that the DOI report to Congress with an implementation plan for the entire initiative and give an annual update on progress. It is critical to set up the systems that make up Phase I before moving on to Phase II.
Phase II: Sequencing Twenty National Parks
The DOI would expand the initiative to 20 additional national parks. Each additional park should be required to conduct a survey of the breadth of biological organisms within its boundaries and create implementation and education and outreach plans, as well as scientist-led research questions.
Phase III: Sequencing Public Lands
The final phase would entail the full realization of the program, which would expand to more federal lands and seek to capture a holistic picture of the biological landscape of the United States. Land managed by the DOI’s Bureau of Land Management, its Fish and Wildlife Service, and the USDA’s U.S. Forest Service would be included, and genome sequencing would fit into the previously established infrastructure and pipelines. The outcome of this initiative would consist of biological data, such as whole genome sequences, and the metadata necessary to ensure the data are AI-ready. These data would comprise a database within an established data storage system, namely the proposed WOBD (see Recommendation 4.1A).
The Sequencing Public Lands Initiative would require close collaboration with local communities and landowners. At every step, program coordinators would have to consult with the Assistant Secretary for Indian Affairs and other relevant partners to incorporate their views and expertise into the project.
Education and outreach would be key components of the Sequencing Public Lands Initiative. The initiative would provide an opportunity to engage with scientists, students, and broader communities on the environment and its inhabitants, as well as on the importance of basic science and genomic data. This initiative would also offer opportunities for students, recent graduates, and postdoctoral fellows to gain technical experience in the research pipeline, from collecting samples to assembling and annotating genomes. The Sequencing Public Lands Initiative could serve as a springboard for bioliteracy across the country. National parks could develop curricula for local students from elementary through high school to learn about topics such as ecology, molecular biology, and computer science, all while working on projects that feature real biological systems in their area. While these genomic data would become a valuable resource for scientists, the discoveries from these biological data could also be incorporated into education and outreach materials that the parks could use to generate further interest in the United States’ rich ecosystems.
Recommendation 4.1D
Congress should authorize the National Science Foundation (NSF) to establish a network of “cloud labs,” giving researchers state-of-the-art tools to make data generation easier.
To gain an advantage in AI capabilities related to biotechnology, the United States needs more high-quality training data for AI models. Currently, however, there are limited research opportunities for biological data collection using robotics and automation. Robotics and automation are redefining what is possible for large-scale, high-throughput biotechnology research and data collection.
The costs to build a highly automated laboratory are significant, as are the costs of sustaining a highly specialized workforce to keep the laboratory operational. Several commercial automated laboratory facilities, called “cloud labs,” provide such resources, but there are significant barriers to entry for both building and using a cloud lab, mostly related to cost.239
Given these costs and the benefits that come from the massive quantity of high-quality data that a cloud lab can generate, automated laboratories should be viewed as an opportunity to invest in economies of scale. The United States could create an opportunity for researchers to generate large amounts of high-quality biological data through new and existing automated instrumentation infrastructure. The resulting data would be critical for the future of biological AI models.240
Congress should authorize the National Science Foundation (NSF) to establish a network of cloud labs. To give researchers access to state-of-the-art automated instrumentation for biotechnology data collection and experimentation, the NSF would coordinate the different capabilities of existing cloud lab facilities in addition to establishing new cloud lab infrastructure.
This program should be executed in three phases, including a careful initial planning phase.
Phase I: Assessment and Planning
The NSF, in consultation with the National Biotechnology Coordination Office (NBCO) (see Recommendation 1.1A), the DOE, and the NIST, would assess the state of existing cloud lab infrastructure in the United States. The NSF would also develop an implementation plan for the program, in consultation with relevant public and private sector stakeholders, including a plan for creating new cloud lab facilities in the United States.
Phase II: Initial Awards for New Cloud Laboratories
The NSF would award grants on a competitive basis to develop and operate at least two new cloud labs, while continuing to update and maintain its network of existing cloud labs.
Phase III: Additional Awards for New Cloud Laboratories
The NSF would award grants to develop and operate at least three additional cloud labs.

“Biotechnology has held promise for decades as the revolutionary frontier of tomorrow, but I firmly believe we are at the most critical inflection point. We are barreling toward never-before-seen capabilities: AI and related cutting-edge technologies are supercharging our ability to discover biobased products. The need is clear: Emerging national security threats, such as supply chain insecurity, the strengthening of adversaries, and public health threats require innovative, world-leading solutions. The biotechnology mindset is shifting: Our companies, universities, and leaders are increasingly realizing biotechnology product-market fit which will impact all Americans, whether it be service members, farmers, or families.
All of this is happening right now at an unprecedented pace, and it’s happening around the world. I’m excited that our recommendations will make it easy for the United States to run the fastest and win this race at home. We can’t afford to let up.”
Commissioner Alexander Titus