Details: |
Abstract
High-throughput DNA sequencing has generated a deluge of data requiring novel statistical and computational approaches to appropriately account for technical noise and artifacts. One particular type of high-throughput DNA sequencing, whole metagenomic shotgun (WMS) sequencing, has provided unprecedented insight into microbial communities and the interactions between their members. While data generation is no longer a challenge, statistical and computational challenges remain in analyzing the associated Big Data. \\
One of the major issues in analyzing these data is the amount of missing data, potentially due to the technology. We first motivate the goals of analyzing this type of data using a leading cause of death in the developing world, diarrhea, and highlight how our zero-inflated Gaussian and zero-inflated log-normal parameterizations lead to meaningful results. Lastly, we analyze a Chinese gut microbiome data set comparing individuals with and without Type II diabetes, and compare our results to those of several standard methods used in the field for marker-gene survey, metagenomic, and RNA-seq data, demonstrating that a zero-inflated log-normal model performs best. |