Publications by the DNALC

Achieving STEM diversity: Fix the classrooms

Jo Handelsman, Sarah Elgin, Mica Estrada, Shan Hays, Tracy Johnson, Sarah Miller, Vida Mingo, Christopher Shaffer, and Jason Williams

Science, 2 Jun 2022, Vol 376, Issue 6597, pp. 1057-1059 | DOI: 10.1126/science.abn9515


Achieving equity in science, technology, engineering, and mathematics (STEM) requires attracting and retaining college students from diverse backgrounds. Despite decades of calls for action, change has been slow. Recommendations have largely focused on members of underrepresented groups themselves rather than on fixing the classrooms that drive many students out of STEM. Without removing such barriers, funding and programs directed toward underrepresented groups will not transform STEM. Instead, we must fix the classrooms where many students from historically excluded communities (HECs) are discouraged from pursuing STEM. Here, we outline areas that need change and identify steps that can be taken by instructors, academic leadership, and government agencies to drive change at scale. The DNA Learning Center has an active role and important responsibility in contributing to this work, and this paper highlights how course-based research (such as the DNALC barcoding programs) can contribute to making science accessible to all.

Asking the Wrong Questions About American Science Education: Insights from a Longitudinal Study of High School Biotechnology Lab Instruction

Dave Micklos, Lindsay Barone

bioRxiv, November 29, 2021


The discussion of American science education is often framed by the questions: Why do American precollege students do poorly on international science assessments, and what are we doing wrong? Rather, we need to ask: Why do so many international students come to US universities for science, what are we doing right in science, and how do we stay ahead in science education? Poor scores on international assessments belie the fact that the U.S. has the best science education system in the world. Our study of 6,200 high school teachers in 1998 and 2018 documented striking success in retooling classrooms for lab-based instruction in biotechnology and provided a pre-COVID-19 snapshot of what is right with American biology education. However, it also highlights the need to revitalize our precollege teaching resource with a renewed National Science Foundation commitment to in-service training.

Management, Analyses, and Distribution of the MaizeCODE Data on the Cloud

Liya Wang¹, Zhenyuan Lu¹, Melissa delaBastide¹, Peter Van Buren¹, Xiaofei Wang¹, Cornel Ghiban¹, Michael Regulski¹, Jorg Drenkow¹, Xiaosa Xu¹, Carlos Ortiz-Ramirez², Cristina F. Marco¹, Sara Goodwin¹, Alexander Dobin¹, Kenneth D. Birnbaum², David P. Jackson¹, Robert A. Martienssen¹, William R. McCombie¹, David A. Micklos¹, Michael C. Schatz¹³, Doreen H. Ware¹⁴* and Thomas R. Gingeras¹*
¹ Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, United States, ² New York University, New York, NY, United States, ³ Johns Hopkins University, Baltimore, MD, United States, ⁴ USDA-ARS Robert W. Holley Center for Agriculture and Health, Ithaca, NY, United States

Front. Plant Sci., March 31, 2020


MaizeCODE is a project aimed at identifying and analyzing functional elements in the maize genome. In its initial phase, MaizeCODE assayed up to five tissues from four maize strains (B73, NC350, W22, TIL11) by RNA-Seq, ChIP-Seq, RAMPAGE, and small RNA sequencing. To facilitate reproducible science and provide both human and machine access to the MaizeCODE data, we enhanced SciApps, a cloud-based portal, for analysis and distribution of both raw data and analysis results. Based on the SciApps workflow platform, we generated new components to support the complete cycle of MaizeCODE data management. These include publicly accessible scientific workflows for the reproducible and shareable analysis of various functional data, a RESTful API for batch processing and distribution of data and metadata, a searchable data page that lists each MaizeCODE experiment as a reproducible workflow, and integrated JBrowse genome browser tracks linked with workflows and metadata. The SciApps portal is a flexible platform that allows the integration of new analysis tools, workflows, and genomic data from multiple projects. Through metadata and a ready-to-compute cloud-based platform, the portal experience improves access to the MaizeCODE data and facilitates its analysis.

Barriers to integration of bioinformatics into undergraduate life sciences education: A national study of US life sciences faculty uncover significant barriers to integrating bioinformatics into undergraduate instruction

Jason J. Williams, Jennifer C. Drew, Sebastian Galindo-Gonzalez, Srebrenka Robic, Elizabeth Dinsdale, William R. Morgan, Eric W. Triplett, James M. Burnette III, Samuel S. Donovan, Edison R. Fowlks, Anya L. Goodman, Nealy F. Grandgenett, Carlos C. Goller, Charles Hauser, John R. Jungck, Jeffrey D. Newman, William R. Pearson, Elizabeth F. Ryder, Michael Sierk, Todd M. Smith, Rafael Tosado-Acevedo, William Tapprich, Tammy C. Tobin, Arlín Toro-Martínez, Lonnie R. Welch, Melissa A. Wilson, David Ebenbach, Mindy McWilliams, Anne G. Rosenwald, Mark A. Pauley

PLoS ONE 14(11): e0224288. November 18, 2019


Bioinformatics, a discipline that combines aspects of biology, statistics, mathematics, and computer science, is becoming increasingly important for biological research. However, bioinformatics instruction is not yet generally integrated into undergraduate life sciences curricula. To understand why, we studied how bioinformatics is being included in biology education in the US by conducting a nationwide survey of faculty at two- and four-year institutions. The survey asked several open-ended questions that probed barriers to integration, the answers to which were analyzed using a mixed-methods approach. The barrier most frequently reported by the 1,260 respondents was lack of faculty expertise/training, but other deterrents (lack of student interest, overly full curricula, and lack of student preparation) were also common. Interestingly, the barriers faculty face depended strongly on whether they are members of an underrepresented group and on the Carnegie Classification of their home institution. We were surprised to discover that the cohort of faculty who were awarded their terminal degree most recently reported the most preparation in bioinformatics but teach it at the lowest rate.

Double triage to identify poorly annotated genes in maize: The missing link in community curation

Marcela K. Tello-Ruiz, Cristina F. Marco, Fei-Man Hsu, Rajdeep S. Khangura, Pengfei Qiao, Sirjan Sapkota, Michelle C. Stitzer, Rachael Wasikowski, Hao Wu, Junpeng Zhan, Kapeel Chougule, Lindsay M. Barone, Cornel Ghiban, Demitri Muna, Andrew C. Olson, Liya C. Wang, Doreen C. Ware, David A. Micklos

PLoS ONE 14(10): e0224086. October 28, 2019


The sophistication of gene prediction algorithms and the abundance of RNA-based evidence for the maize genome may suggest that manual curation of gene models is no longer necessary. However, quality metrics generated by the MAKER-P gene annotation pipeline identified 17,225 of 130,330 (13%) protein-coding transcripts in the B73 Reference Genome V4 gene set with models of low concordance to available biological evidence. Working with eight graduate students, we used the Apollo annotation editor to curate 86 transcript models flagged by quality metrics and a complementary method using the Gramene gene tree visualizer. All of the triaged models had significant errors – including missing or extra exons, non-canonical splice sites, and incorrect UTRs. A correct transcript model existed for about 60% of genes (or transcripts) flagged by quality metrics; we attribute this to the convention of elevating the transcript with the longest coding sequence (CDS) to the canonical, or first, position. The remaining 40% of flagged genes resulted in novel annotations and represent a manual curation space of about 10% of the maize genome (~4,000 protein-coding genes). MAKER-P metrics have a specificity of 100% and a sensitivity of 85%; the gene tree visualizer has a specificity of 100%. Together with the Apollo graphical editor, our double triage provides an infrastructure to support the community curation of eukaryotic genomes by scientists, students, and potentially even citizen scientists.

Peer-reviewed publishing of results from Citizen Science projects

Gabriele Gadermaier, Daniel Dörler, Florian Heigl, Stefan Mayr, Johannes Rüdisser, Robert Brodschneider and Christine Marizzi

JCOM September 26, 2018


Citizen science (CS) refers to the active participation of the general public in scientific research activities. With increasing amounts of information generated by citizen scientists, best practices to go beyond science communication and publish these findings to the scientific community are needed. This letter is a synopsis of authors' personal experiences when publishing results from citizen science projects in peer-reviewed journals, as presented at the Austrian Citizen Science Conference 2018. Here, we address authors' selection criteria for publishing CS data in open-access, peer-reviewed scientific journals as well as barriers encountered during the publishing process. We also outline factors that influence the probability of publication using CS data, including 1) funding to cover publication costs; 2) quality, quantity and scientific novelty of CS data; 3) recommendations to acknowledge contributions of citizen scientists in scientific, peer-reviewed publications; 4) citizen scientists' preference of the hands-on experience over the product (publication) and 5) bias among scientists for certain data sources and the scientific jargon. These experiences show that addressing these barriers could greatly increase the rate of CS data included in scientific publications.

DNA barcoding Brooklyn (New York): A first assessment of biodiversity in Marine Park by citizen scientists

Christine Marizzi, Antonia Florio, Melissa Lee, Mohammed Khalfan, Cornel Ghiban, Bruce Nash, Jenna Dorey, Sean McKenzie, Christine Mazza, Fabiana Cellini, Carlo Baria, Ron Bepat, Lena Cosentino, Alexander Dvorak, Amina Gacevic, Cristina Guzman-Moumtzis, Francesca Heller, Nicholas Alexander Holt, Jeffrey Horenstein, Vincent Joralemon, Manveer Kaur, Tanveer Kaur, Armani Khan, Jessica Kuppan, Scott Laverty, Camila Lock, Marianne Pena, Ilona Petrychyn, Indu Puthenkalam, Daval Ram, Arlene Ramos, Noelle Scoca, Rachel Sin, Izabel Gonzalez, Akansha Thakur, Husan Usmanov, Karen Han, Andy Wu, Tiger Zhu, David Andrew Micklos

PLoS ONE 13(7): e0199015. July 18, 2018


DNA barcoding is both an important research and science education tool. The technique allows for quick and accurate species identification using only minimal amounts of tissue samples taken from any organism at any developmental phase. DNA barcoding has many practical applications including furthering the study of taxonomy and monitoring biodiversity. In addition to these uses, DNA barcoding is a powerful tool to empower, engage, and educate students in the scientific method while conducting productive and creative research. The study presented here provides the first assessment of Marine Park (Brooklyn, New York, USA) biodiversity using DNA barcoding. New York City citizen scientists (high school students and their teachers) were trained to identify species using DNA barcoding during a two-week-long institute. By performing NCBI GenBank BLAST searches, students taxonomically identified 187 samples (1 fungus, 70 animals and 116 plants) and also published 12 novel DNA barcodes on GenBank. Students also identified 7 ant species and demonstrated the potential of DNA barcoding for identification of this especially diverse group when coupled with traditional taxonomy using morphology. Here we outline how DNA barcoding allows citizen scientists to make preliminary taxonomic identifications and contribute to modern biodiversity research.

Bioinformatics Core Competencies for Undergraduate Life Sciences Education

Melissa A. Wilson Sayres, Charles Hauser, Michael Sierk, Srebrenka Robic, Anne G. Rosenwald, Todd M. Smith, Eric W. Triplett, Jason J. Williams, Elizabeth Dinsdale, William Morgan, James M. Burnette III, Sam S. Donovan, Jennifer C. Drew, Sarah C. R. Elgin, Edison R. Fowlks, Sebastian Galindo-Gonzalez, Anya L. Goodman, Neal F. Grandgenett, Carlos C. Goller, John Jungck, Jeffrey D. Newman, William R. Pearson, Elizabeth Ryder, Rafael Tosado-Acevedo, William Tapprich, Tammy C. Tobin, Arlín Toro-Martínez, Lonnie R. Welch, Robin Wright, David Ebenbach, Kimberly C. Olney, Mindy McWilliams, Mark A. Pauley

PLoS ONE 13(6): e0196878. June 5, 2018


Bioinformatics is becoming increasingly central to research in the life sciences. However, despite its importance, bioinformatics skills and knowledge are not well integrated in undergraduate biology education. This curricular gap prevents biology students from harnessing the full potential of their education, limiting their career opportunities and slowing genomic research innovation. To advance the integration of bioinformatics into life sciences education, a framework of core bioinformatics competencies is needed. To that end, we here report the results of a survey of life sciences faculty in the United States about teaching bioinformatics to undergraduate life scientists. Responses were received from 1,260 faculty representing institutions in all fifty states with a combined capacity to educate hundreds of thousands of students every year. Results indicate strong, widespread agreement that bioinformatics knowledge and skills are critical for undergraduate life scientists, as well as considerable agreement about which skills are necessary. Perceptions of the importance of some skills varied with the respondent’s degree of training, time since degree earned, and/or the Carnegie classification of the respondent’s institution. To assess which skills are currently being taught, we analyzed syllabi of courses with bioinformatics content submitted by survey respondents. Finally, we used the survey results, the analysis of syllabi, and our collective research and teaching expertise to develop a set of bioinformatics core competencies for undergraduate life sciences students. These core competencies are intended to serve as a guide for institutions as they work to integrate bioinformatics into their life sciences curricula.

DNA Barcoding for Identification of Consumer-Relevant Fungi Sold in New York: A Powerful Tool for Citizen Scientists?

Emily Jensen-Vargas and Christine Marizzi

Foods. 2018 Jun 8;7(6). pii: E87


Although significant progress has been made in our understanding of fungal diversity, identification based on phenotype can be difficult, even for trained experts. Fungi typically have a cryptic nature and can have a similar appearance to distantly related species. Moreover, the appearance of industrially processed mushrooms complicates species identification, as they are often sold sliced and dried. Here we present a small-scale citizen science project, wherein the participants generated and analyzed DNA sequences from fruiting bodies of dried and fresh fungi that were sold for commercial use in New York City supermarkets. We report positive outcomes and the limitations of a youth citizen scientist, aiming to identify dried mushrooms, using established DNA barcoding protocols and exclusively open-access data analysis tools for species identification. Our results indicate that the single-locus nuclear ribosomal internal transcribed spacer (ITS) DNA barcoding approach allowed for identification of only a subset of all of the samples at the species level, although the generated high-quality DNA barcodes were submitted to three different databases. Our results highlight the need for a curated, centralized, and open access ITS reference database that allows rapid third-party annotations for the benefit of both traditional research as well as the emerging citizen science community.

AgBioData consortium recommendations for sustainable genomics and genetics databases for agriculture

Lisa Harper, Jacqueline Campbell, Ethalinda K S Cannon, Sook Jung, Monica Poelchau, Ramona Walls, Carson Andorf, Elizabeth Arnaud, Tanya Z Berardini, Clayton Birkett, Steve Cannon, James Carson, Bradford Condon, Laurel Cooper, Nathan Dunn, Christine G Elsik, Andrew Farmer, Stephen P Ficklin, David Grant, Emily Grau, Nic Herndon, Zhi-Liang Hu, Jodi Humann, Pankaj Jaiswal, Clement Jonquet, Marie-Angélique Laporte, Pierre Larmande, Gerard Lazo, Fiona McCarthy, Naama Menda, Christopher J Mungall, Monica C Munoz-Torres, Sushma Naithani, Rex Nelson, Daureen Nesdill, Carissa Park, James Reecy, Leonore Reiser, Lacey-Anne Sanderson, Taner Z Sen, Margaret Staton, Sabarinath Subramaniam, Marcela Karey Tello-Ruiz, Victor Unda, Deepak Unni, Liya Wang, Doreen Ware, Jill Wegrzyn, Jason Williams, Margaret Woodhouse, Jing Yu, Doreen Main

Database, Volume 2018, 1 January 2018, Pages 1–32


The future of agricultural research depends on data. The sheer volume of agricultural biological data being produced today makes excellent data management essential. Governmental agencies, publishers and science funders require data management plans for publicly funded research. Furthermore, the value of data increases exponentially when they are properly stored, described, integrated and shared, so that they can be easily utilized in future analyses. AgBioData is a consortium of people working at agricultural biological databases, data archives and knowledgebases who strive to identify common issues in database development, curation and management, with the goal of creating database products that are more Findable, Accessible, Interoperable and Reusable. We strive to promote authentic, detailed, accurate and explicit communication between all parties involved in scientific data. As a step toward this goal, we present the current state of biocuration, ontologies, metadata and persistence, database platforms, programmatic (machine) access to data, communication and sustainability with regard to data curation. Each section describes challenges and opportunities for these topics, along with recommendations and best practices.

Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators

Lindsay Barone, Jason Williams, David Micklos

PLoS Comput Biol 13(10): e1005755. October 19, 2017


In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high performance computing (HPC), bioinformatics support, multistep workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC—acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

A vision for collaborative training infrastructure for bioinformatics

Williams, Jason J. and Teal, Tracy K.

Annals of the New York Academy of Sciences. 07 September 2016


In biology, a missing link connecting data generation and data‐driven discovery is the training that prepares researchers to effectively manage and analyze data. National and international cyberinfrastructure along with evolving private sector resources place biologists and students within reach of the tools needed for data‐intensive biology, but training is still required to make effective use of them. In this concept paper, we review a number of opportunities and challenges that can inform the creation of a national bioinformatics training infrastructure capable of servicing the large number of emerging and existing life scientists. While college curricula are slower to adapt, grassroots startup‐spirited organizations, such as Software and Data Carpentry, have made impressive inroads in training on the best practices of software use, development, and data analysis. Given the transformative potential of biology and medicine as full‐fledged data sciences, more support is needed to organize, amplify, and assess these efforts and their impacts.

Lessons from a Science Education Portal

David Micklos, Susan Lauter, Amy Nisselle

Science 23 Dec 2011: Vol. 334, Issue 6063, pp. 1657-1658

SPORE, Science Prize for Online Resources in Education

When Cold Spring Harbor Laboratory's DNA Learning Center (DNALC) launched its Web site in 1996, we did not foresee that it would grow into a portal for 18 content sites reaching more than seven million visitors per year. The evolution of our multimedia efforts and the challenges along the way provide lessons for building learning resources or to attract larger audiences....

Building Modern Internet Sites for Science Education: Insights from Science, Technology, and Education

John Connolly, Harouna Ba, Danielle Sixsmith, David Micklos

This paper condenses insights gained during a three-day workshop of 30 experts and opinion leaders from diverse fields – including neuroscience, cognitive science, network theory, knowledge management, science education, and technology convergence. The quick insights are a useful laundry list for anyone creating a modern Internet site on science education, while the deeper insights give a sense of what is on the minds of people leading the effort to use the Internet to connect people in real-time communities of common interest.