Publications by the DNALC

Double triage to identify poorly annotated genes in maize: The missing link in community curation

Marcela K. Tello-Ruiz, Cristina F. Marco, Fei-Man Hsu, Rajdeep S. Khangura, Pengfei Qiao, Sirjan Sapkota, Michelle C. Stitzer, Rachael Wasikowski, Hao Wu, Junpeng Zhan, Kapeel Chougule, Lindsay M. Barone, Cornel Ghiban, Demitri Muna, Andrew C. Olson, Liya C. Wang, Doreen C. Ware, David A. Micklos

PLoS ONE 14(10): e0224086. Posted October 28, 2019
https://doi.org/10.1371/journal.pone.0224086

Abstract

The sophistication of gene prediction algorithms and the abundance of RNA-based evidence for the maize genome may suggest that manual curation of gene models is no longer necessary. However, quality metrics generated by the MAKER-P gene annotation pipeline identified 17,225 of 130,330 (13%) protein-coding transcripts in the B73 Reference Genome V4 gene set with models of low concordance to available biological evidence. Working with eight graduate students, we used the Apollo annotation editor to curate 86 transcript models flagged by quality metrics and a complementary method using the Gramene gene tree visualizer. All of the triaged models had significant errors – including missing or extra exons, non-canonical splice sites, and incorrect UTRs. A correct transcript model existed for about 60% of genes (or transcripts) flagged by quality metrics; we attribute this to the convention of elevating the transcript with the longest coding sequence (CDS) to the canonical, or first, position. The remaining 40% of flagged genes resulted in novel annotations and represent a manual curation space of about 10% of the maize genome (~4,000 protein-coding genes). MAKER-P metrics have a specificity of 100%, and a sensitivity of 85%; the gene tree visualizer has a specificity of 100%. Together with the Apollo graphical editor, our double triage provides an infrastructure to support the community curation of eukaryotic genomes by scientists, students, and potentially even citizen scientists.

Peer-reviewed publishing of results from Citizen Science projects

Gabriele Gadermaier, Daniel Dörler, Florian Heigl, Stefan Mayr, Johannes Rüdisser, Robert Brodschneider and Christine Marizzi

JCOM September 26, 2018
https://doi.org/10.22323/2.17030101

Abstract

Citizen science (CS) terms the active participation of the general public in scientific research activities. With increasing amounts of information generated by citizen scientists, best practices to go beyond science communication and publish these findings to the scientific community are needed. This letter is a synopsis of authors' personal experiences when publishing results from citizen science projects in peer-reviewed journals, as presented at the Austrian Citizen Science Conference 2018. Here, we address authors' selection criteria for publishing CS data in open-access, peer-reviewed scientific journals as well as barriers encountered during the publishing process. We also outline factors that influence the probability of publication using CS data, including 1) funding to cover publication costs; 2) quality, quantity and scientific novelty of CS data; 3) recommendations to acknowledge contributions of citizen scientists in scientific, peer-reviewed publications; 4) citizen scientists' preference of the hands-on experience over the product (publication) and 5) bias among scientists for certain data sources and the scientific jargon. These experiences show that addressing these barriers could greatly increase the rate of CS data included in scientific publications.

DNA barcoding Brooklyn (New York): A first assessment of biodiversity in Marine Park by citizen scientists

Christine Marizzi, Antonia Florio, Melissa Lee, Mohammed Khalfan, Cornel Ghiban, Bruce Nash, Jenna Dorey, Sean McKenzie, Christine Mazza, Fabiana Cellini, Carlo Baria, Ron Bepat, Lena Cosentino, Alexander Dvorak, Amina Gacevic, Cristina Guzman-Moumtzis, Francesca Heller, Nicholas Alexander Holt, Jeffrey Horenstein, Vincent Joralemon, Manveer Kaur, Tanveer Kaur, Armani Khan, Jessica Kuppan, Scott Laverty, Camila Lock, Marianne Pena, Ilona Petrychyn, Indu Puthenkalam, Daval Ram, Arlene Ramos, Noelle Scoca, Rachel Sin, Izabel Gonzalez, Akansha Thakur, Husan Usmanov, Karen Han, Andy Wu, Tiger Zhu, David Andrew Micklos

PLoS ONE 13(7): e0199015. July 18, 2018
https://doi.org/10.1371/journal.pone.0199015

Abstract

DNA barcoding is both an important research and science education tool. The technique allows for quick and accurate species identification using only minimal amounts of tissue samples taken from any organism at any developmental phase. DNA barcoding has many practical applications including furthering the study of taxonomy and monitoring biodiversity. In addition to these uses, DNA barcoding is a powerful tool to empower, engage, and educate students in the scientific method while conducting productive and creative research. The study presented here provides the first assessment of Marine Park (Brooklyn, New York, USA) biodiversity using DNA barcoding. New York City citizen scientists (high school students and their teachers) were trained to identify species using DNA barcoding during a two-week-long institute. By performing NCBI GenBank BLAST searches, students taxonomically identified 187 samples (1 fungus, 70 animals and 116 plants) and also published 12 novel DNA barcodes on GenBank. Students also identified 7 ant species and demonstrated the potential of DNA barcoding for identification of this especially diverse group when coupled with traditional taxonomy using morphology. Here we outline how DNA barcoding allows citizen scientists to make preliminary taxonomic identifications and contribute to modern biodiversity research.

Bioinformatics Core Competencies for Undergraduate Life Sciences Education

Melissa A. Wilson Sayres, Charles Hauser, Michael Sierk, Srebrenka Robic, Anne G. Rosenwald, Todd M. Smith, Eric W. Triplett, Jason J. Williams, Elizabeth Dinsdale, William Morgan, James M. Burnette III, Sam S. Donovan, Jennifer C. Drew, Sarah C. R. Elgin, Edison R. Fowlks, Sebastian Galindo-Gonzalez, Anya L. Goodman, Neal F. Grandgenett, Carlos C. Goller, John Jungck, Jeffrey D. Newman, William R. Pearson, Elizabeth Ryder, Rafael Tosado-Acevedo, William Tapprich, Tammy C. Tobin, Arlín Toro-Martínez, Lonnie R. Welch, Robin Wright, David Ebenbach, Kimberly C. Olney, Mindy McWilliams, Mark A. Pauley

PLoS ONE 13(6): e0196878. June 5, 2018
https://doi.org/10.1371/journal.pone.0196878

Abstract

Bioinformatics is becoming increasingly central to research in the life sciences. However, despite its importance, bioinformatics skills and knowledge are not well integrated in undergraduate biology education. This curricular gap prevents biology students from harnessing the full potential of their education, limiting their career opportunities and slowing genomic research innovation. To advance the integration of bioinformatics into life sciences education, a framework of core bioinformatics competencies is needed. To that end, we here report the results of a survey of life sciences faculty in the United States about teaching bioinformatics to undergraduate life scientists. Responses were received from 1,260 faculty representing institutions in all fifty states with a combined capacity to educate hundreds of thousands of students every year. Results indicate strong, widespread agreement that bioinformatics knowledge and skills are critical for undergraduate life scientists, as well as considerable agreement about which skills are necessary. Perceptions of the importance of some skills varied with the respondent’s degree of training, time since degree earned, and/or the Carnegie classification of the respondent’s institution. To assess which skills are currently being taught, we analyzed syllabi of courses with bioinformatics content submitted by survey respondents. Finally, we used the survey results, the analysis of syllabi, and our collective research and teaching expertise to develop a set of bioinformatics core competencies for undergraduate life sciences students. These core competencies are intended to serve as a guide for institutions as they work to integrate bioinformatics into their life sciences curricula.

DNA Barcoding for Identification of Consumer-Relevant Fungi Sold in New York: A Powerful Tool for Citizen Scientists?

Emily Jensen-Vargas and Christine Marizzi

Foods. 2018 Jun 8;7(6). pii: E87
https://doi.org/10.3390/foods7060087

Abstract

Although significant progress has been made in our understanding of fungal diversity, identification based on phenotype can be difficult, even for trained experts. Fungi typically have a cryptic nature and can have a similar appearance to distantly related species. Moreover, the appearance of industrially processed mushrooms complicates species identification, as they are often sold sliced and dried. Here we present a small-scale citizen science project, wherein the participants generated and analyzed DNA sequences from fruiting bodies of dried and fresh fungi that were sold for commercial use in New York City supermarkets. We report positive outcomes and the limitations of a youth citizen scientist, aiming to identify dried mushrooms, using established DNA barcoding protocols and exclusively open-access data analysis tools for species identification. Our results indicate that the single-locus nuclear ribosomal internal transcribed spacer (ITS) DNA barcoding approach allowed for identification of only a subset of all of the samples at the species level, although the generated high-quality DNA barcodes were submitted to three different databases. Our results highlight the need for a curated, centralized, and open access ITS reference database that allows rapid third-party annotations for the benefit of both traditional research as well as the emerging citizen science community.

AgBioData consortium recommendations for sustainable genomics and genetics databases for agriculture

Lisa Harper, Jacqueline Campbell, Ethalinda K S Cannon, Sook Jung, Monica Poelchau, Ramona Walls, Carson Andorf, Elizabeth Arnaud, Tanya Z Berardini, Clayton Birkett, Steve Cannon, James Carson, Bradford Condon, Laurel Cooper, Nathan Dunn, Christine G Elsik, Andrew Farmer, Stephen P Ficklin, David Grant, Emily Grau, Nic Herndon, Zhi-Liang Hu, Jodi Humann, Pankaj Jaiswal, Clement Jonquet, Marie-Angélique Laporte, Pierre Larmande, Gerard Lazo, Fiona McCarthy, Naama Menda, Christopher J Mungall, Monica C Munoz-Torres, Sushma Naithani, Rex Nelson, Daureen Nesdill, Carissa Park, James Reecy, Leonore Reiser, Lacey-Anne Sanderson, Taner Z Sen, Margaret Staton, Sabarinath Subramaniam, Marcela Karey Tello-Ruiz, Victor Unda, Deepak Unni, Liya Wang, Doreen Ware, Jill Wegrzyn, Jason Williams, Margaret Woodhouse, Jing Yu, Doreen Main

Database, Volume 2018, 1 January 2018, Pages 1–32
https://doi.org/10.1093/database/bay088

Abstract

The future of agricultural research depends on data. The sheer volume of agricultural biological data being produced today makes excellent data management essential. Governmental agencies, publishers and science funders require data management plans for publicly funded research. Furthermore, the value of data increases exponentially when they are properly stored, described, integrated and shared, so that they can be easily utilized in future analyses. AgBioData (https://www.agbiodata.org) is a consortium of people working at agricultural biological databases, data archives and knowledgebases who strive to identify common issues in database development, curation and management, with the goal of creating database products that are more Findable, Accessible, Interoperable and Reusable. We strive to promote authentic, detailed, accurate and explicit communication between all parties involved in scientific data. As a step toward this goal, we present the current state of biocuration, ontologies, metadata and persistence, database platforms, programmatic (machine) access to data, communication and sustainability with regard to data curation. Each section describes challenges and opportunities for these topics, along with recommendations and best practices.

Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators

Lindsay Barone, Jason Williams, David Micklos

PLoS Comput Biol 13(10): e1005755. October 19, 2017
https://doi.org/10.1371/journal.pcbi.1005755

Abstract

In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high performance computing (HPC), bioinformatics support, multistep workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC—acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

Pre-print: Barriers to Integration of Bioinformatics into Undergraduate Life Sciences Education

Jason Williams, Jennifer Drew, Sebastian Galindo-Gonzalez, Srebrenka Robic, Elizabeth Dinsdale, William Morgan, Eric Triplett, James Burnette, Sam Donovan, Sarah Elgin, Edison Fowlks, Anya Goodman, Neal Grandgenett, Carlos Goller, Charles Hauser, John R. Jungck, Jeffrey Newman, William Pearson, Elizabeth Ryder, Melissa Wilson Sayres, Michael Sierk, Todd Smith, Rafael Tosado-Acevedo, William Tapprich, Tammy Tobin, Arlin Toro-Martínez, Lonnie Welch, Robin Wright, David Ebenbach, Mindy McWilliams, Anne Rosenwald, Mark Pauley

bioRxiv 204420; Posted October 19, 2017
https://doi.org/10.1101/204420

Abstract

Bioinformatics, a discipline that combines aspects of biology, statistics, and computer science, is increasingly important for biological research. However, bioinformatics instruction is rarely integrated into life sciences curricula at the undergraduate level. To understand why, the Network for Integrating Bioinformatics into Life Sciences Education (NIBLSE, “nibbles”) recently undertook an extensive survey of life sciences faculty in the United States. The survey responses to open-ended questions about barriers to integration were subjected to keyword analysis. The barrier most frequently reported by the ~1,260 respondents was lack of faculty training. Faculty at associate’s-granting institutions report the least training in bioinformatics and the least integration of bioinformatics into their teaching. Faculty from underrepresented minority groups (URMs) in STEM reported training barriers at a higher rate than others, although the number of URM respondents was small. Interestingly, the cohort of faculty with the most recently awarded PhD degrees reported the most training but were teaching bioinformatics at a lower rate than faculty who earned their degrees in previous decades. Other barriers reported included lack of student interest in bioinformatics; lack of student preparation in mathematics, statistics, and computer science; already overly full curricula; and limited access to resources, including hardware, software, and vetted teaching materials. The results of the survey, the largest to date on bioinformatics education, will guide efforts to further integrate bioinformatics instruction into undergraduate life sciences education.

A vision for collaborative training infrastructure for bioinformatics

Jason J. Williams and Tracy K. Teal

Annals of the New York Academy of Sciences. 07 September 2016
https://doi.org/10.1111/nyas.13207

Abstract

In biology, a missing link connecting data generation and data‐driven discovery is the training that prepares researchers to effectively manage and analyze data. National and international cyberinfrastructure along with evolving private sector resources place biologists and students within reach of the tools needed for data‐intensive biology, but training is still required to make effective use of them. In this concept paper, we review a number of opportunities and challenges that can inform the creation of a national bioinformatics training infrastructure capable of servicing the large number of emerging and existing life scientists. While college curricula are slower to adapt, grassroots startup‐spirited organizations, such as Software and Data Carpentry, have made impressive inroads in training on the best practices of software use, development, and data analysis. Given the transformative potential of biology and medicine as full‐fledged data sciences, more support is needed to organize, amplify, and assess these efforts and their impacts.

Lessons from a Science Education Portal

David Micklos, Susan Lauter, Amy Nisselle

Science 23 Dec 2011: Vol. 334, Issue 6063, pp. 1657-1658
https://doi.org/10.1126/science.1197074

SPORE, Science Prize for Online Resources in Education

When Cold Spring Harbor Laboratory's DNA Learning Center (DNALC) launched its Web site in 1996, www.dnalc.org, we did not foresee that it would grow into a portal for 18 content sites reaching more than seven million visitors per year. The evolution of our multimedia efforts and the challenges along the way provide lessons for building learning resources or to attract larger audiences....

Building Modern Internet Sites for Science Education: Insights from Science, Technology, and Education

John Connolly, Harouna Ba, Danielle Sixsmith, David Micklos

2006

This paper condenses insights gained during a three-day workshop of 30 experts and opinion leaders from diverse fields – including neuroscience, cognitive science, network theory, knowledge management, science education, and technology convergence. The quick insights are a useful laundry list for anyone creating a modern Internet site on science education, while the deeper insights give a sense of what is on the minds of people leading the effort to use the Internet to connect people in real-time communities of common interest.