Results

NERRs eDNA

Metabarcoding analysis of NERRs sites sampled quarterly. Metadata for all sites can be found here and include SWMP data when available. SWMP data were downloaded from https://cdmo.baruch.sc.edu

OTU Abundance

Interactive barplots of relative abundance can be found here. Non-interactive plots broken up by region can be found here: gulf

Core diversity metrics

PCoA is performed on the distance matrices for the beta diversity metrics below (PCoA seems to handle missing data better than PCA does). More complete descriptions are available here: https://docs.onecodex.com/en/articles/4150649-beta-diversity

Alpha diversity
- Rarefaction
- Shannon’s diversity index (a quantitative measure of community richness)
- Faith’s Phylogenetic Diversity (a qualitative measure of community richness that incorporates phylogenetic relationships between the features): Kruskal-Wallis
- Observed OTUs (a qualitative measure of community richness): Kruskal-Wallis
- Evenness (or Pielou’s Evenness; a measure of community evenness)
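
As a rough illustration of the non-phylogenetic alpha diversity metrics listed above, here is a minimal sketch in plain Python. The read counts are hypothetical, and the choice of log base 2 for Shannon's index is an assumption made for this example; it is not a reimplementation of the QIIME 2 pipeline.

```python
import math

def alpha_metrics(counts):
    """Return (observed features, Shannon index in bits, Pielou's evenness)."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    observed = len(present)
    # Shannon index: -sum(p * log2(p)) over features that are present.
    shannon = -sum((c / total) * math.log2(c / total) for c in present)
    # Pielou's evenness: Shannon divided by its maximum, log2(observed).
    evenness = shannon / math.log2(observed) if observed > 1 else float("nan")
    return observed, shannon, evenness

# Hypothetical read counts for two samples (not NERRs data).
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1, 0]

print(alpha_metrics(even_sample))
print(alpha_metrics(skewed_sample))
```

Both samples contain four observed features, but the skewed sample has a much lower Shannon index and evenness because one feature dominates the reads.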
Beta diversity
- Jaccard distance (a qualitative measure of community dissimilarity based on presence/absence: the percentage of taxa not found in both samples): jaccard emperor
- Bray-Curtis distance (a quantitative measure of community dissimilarity that takes into account abundance as well as presence/absence): bray curtis emperor
- Unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features: the percentage of phylogenetic branch length not found in both samples): unweighted unifrac emperor
- Weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features; similar to Bray-Curtis, but it takes phylogenetic relationships into account): weighted unifrac emperor
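
To make the difference between the qualitative and quantitative distances above concrete, here is a minimal sketch of Jaccard and Bray-Curtis for two samples. The counts are hypothetical, and the UniFrac variants are omitted because they also need a phylogenetic tree.

```python
def jaccard_distance(a, b):
    """Fraction of taxa (present in either sample) that are not shared."""
    present_a = {i for i, c in enumerate(a) if c > 0}
    present_b = {i for i, c in enumerate(b) if c > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)

def bray_curtis_distance(a, b):
    """Sum of absolute abundance differences divided by total abundance."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0

# Hypothetical counts for the same four taxa in two samples.
site_a = [10, 0, 5, 0]
site_b = [1, 0, 50, 3]

print(jaccard_distance(site_a, site_b))
print(bray_curtis_distance(site_a, site_b))
```

The two samples share two of three detected taxa, so the Jaccard distance is modest, while the large abundance differences give a much higher Bray-Curtis distance.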

Core diversity metrics for samples that have SWMP data

Beta diversity
- Jaccard distance (a qualitative measure of community dissimilarity based on presence/absence: the percentage of taxa not found in both samples): jaccard emperor
- Bray-Curtis distance (a quantitative measure of community dissimilarity that takes into account abundance as well as presence/absence): bray curtis emperor
- Unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features: the percentage of phylogenetic branch length not found in both samples): unweighted unifrac emperor
- Weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features; similar to Bray-Curtis, but it takes phylogenetic relationships into account): weighted unifrac emperor

CAP

Sal

Temp

UniFrac PCoA performed on the unweighted UniFrac distance matrix

Samples are colored by the minimum salinity in SWMP data collected within X days of eDNA sample collection. An interactive version of this plot can be found here

PCoAs after removal of taxa with extreme 18S copy number

Dissimilarity: bray curtis, jaccard

Phylogenetic: unweighted unifrac, weighted unifrac

Classify samples from ASVs/OTUs

QIIME 2 sample classification predicts a categorical sample metadata column using a supervised learning classifier. It splits the input data into training and test sets. The training set is used to train and test the estimator using a stratified k-fold cross-validation scheme, with optional steps for automated feature extraction and hyperparameter optimization. The test set is then used to validate the classification accuracy of the optimized estimator, and classification results are output for the test set. For more details on the learning algorithm, see http://scikit-learn.org/stable/supervised_learning.html
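
The train/test split described above can be sketched as follows. This is a simplified, standard-library illustration of a stratified hold-out split, not the code the plugin runs; the function name, the sample IDs, and the 0.2 test fraction are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample IDs into train/test so each class keeps its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample_id, label in labels.items():
        by_class[label].append(sample_id)
    train, test = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        # Hold out at least one sample per class.
        n_test = max(1, round(len(ids) * test_fraction))
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test

# Hypothetical metadata: 10 samples labelled NE and 10 labelled SE.
labels = {f"s{i}": ("NE" if i < 10 else "SE") for i in range(20)}
train_ids, test_ids = stratified_split(labels)

print(len(train_ids), len(test_ids))
```

Stratifying keeps rare classes (for example, a reserve with few samples) represented in both sets, which matters when accuracy is reported per class.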

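We used the default random forest estimators (‘RandomForestClassifier’ for categorical metadata columns and ‘RandomForestRegressor’ for continuous ones). A random forest builds a collection of decision trees, each trained on a different random subset of the data (bagging) and a random subset of features. The final prediction combines the individual trees: the class favored across the trees for classification, or the average of the tree predictions for regression.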

18S copy number in Bacillariophyta, Ciliophora, and Dinophyceae can be orders of magnitude higher than in other taxa, resulting in extreme read counts for these taxa that can swamp the signal from other groups. One strategy to account for 18S copy number is to multiply read counts by a coefficient derived from the ratio between 18S copy number and biomass (as performed by Martin et al. 2022, DOI 10.3897/mbmg.6.85794).
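
A minimal sketch of the correction strategy described above, assuming one multiplicative coefficient per taxon. The coefficient values and read counts below are placeholders chosen for illustration; they are not taken from Martin et al. 2022 or from the NERRs data.

```python
# Hypothetical per-taxon correction coefficients (placeholders only).
correction = {"Bacillariophyta": 0.01, "Ciliophora": 0.05, "Dinophyceae": 0.02}

def correct_counts(read_counts, coefficients, default=1.0):
    """Multiply each taxon's read count by its coefficient (1.0 if none given)."""
    return {taxon: n * coefficients.get(taxon, default)
            for taxon, n in read_counts.items()}

def relative_abundance(counts):
    """Convert counts to proportions of the sample total."""
    total = sum(counts.values())
    return {taxon: c / total for taxon, c in counts.items()}

# Hypothetical sample dominated by diatom reads.
reads = {"Bacillariophyta": 9000, "Cryptomonadales": 500, "Chlorophyta": 500}
corrected = correct_counts(reads, correction)

print(relative_abundance(reads))
print(relative_abundance(corrected))
```

After correction the diatom fraction drops sharply and the other groups make up most of the sample, which is the intended effect of down-weighting high copy number taxa.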

Note: ASV IDs will be replaced with the taxon name and ASV number (e.g., 1 of 4)

Heatmaps:

Sample classification with removal of taxa with extreme 18S copy number:

Among all reserves, which ASVs predict: Region, NERR, Site

The Site plot shows that the Cryptomonadales_47of100 ASV is common among GTM, JC, and AB, but in no other region (confirm on region heatmaps). This ASV is not common at MA. Also, g__Cryptomonas_17of92 is common at freshwater sites in all regions, but not in marine/brackish clusters.

Among all reserves, which OTUs predict: Region, NERR, Site from Species, Site from Genus, Site from Order

OTUs that predict sites within a single NERR: SF, GTM

Sample classification without removal of taxa with extreme 18S copy number

Unfiltered OTUs that predict Quarter: GTM

Unfiltered ASVs that predict sites within a NERR: GTM

OTUs predicting continuous data: Min Salinity, Min pH

Accuracy Predictions: Min Salinity, Min pH

Prediction including diatoms: OTUs Regions, ASVs Regions, ASVs Salinity

Phylogenetic RPCA and CTF

“In order to account for the correlation among samples from the same subject we will employ compositional tensor factorization (CTF). CTF builds on the ability to account for compositionality and sparsity using the robust center log-ratio transform … but restructures and factors the data as a tensor.”

“… robust principal-component analysis (RPCA) addresses sparsity and compositionality; compositional tensor factorization (CTF) addresses sparsity, compositionality, and repeated measure study designs; and UniFrac incorporates phylogenetic information. Here we introduce a strategy of incorporating phylogenetic information into RPCA and CTF. The resulting methods, phylo-RPCA, and phylo-CTF, provide substantial improvements over state-of-the-art methods in terms of discriminatory power of underlying clustering”

For a tutorial on CTF with Qiime’s Gemelli plugin, see here

Quick explanations of:

Gemelli: “Gemelli is a tool box for running Robust Aitchison PCA (RPCA), Joint Robust Aitchison PCA (Joint-RPCA), TEMPoral TEnsor Decomposition (TEMPTED), and Compositional Tensor Factorization (CTF) on sparse compositional omics datasets. All these methods are unsupervised and aim to describe sample/subject variation and the biological features that separate them.”

The preprocessing transform for both RPCA and CTF is the robust centered log-ratio transform (rclr), which accounts for sparse data (i.e., many missing/zero values). RPCA and CTF then perform a matrix or tensor factorization on only the observed values after the rclr transformation, similar to Aitchison PCA performed on dense data.
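
The robust centered log-ratio transform can be sketched for a single sample as below. This is a simplified illustration of the idea (log-transform only the observed values and center them on their own mean), not Gemelli's implementation; the function name and counts are hypothetical.

```python
import math

def rclr(sample_counts):
    """Log-transform nonzero counts and center them on the mean of those logs.

    Zeros are treated as unobserved and returned as None rather than imputed.
    """
    logs = [math.log(c) for c in sample_counts if c > 0]
    if not logs:
        return [None] * len(sample_counts)
    center = sum(logs) / len(logs)
    return [math.log(c) - center if c > 0 else None for c in sample_counts]

# Hypothetical sparse sample: two of five features have no reads.
sample = [0, 10, 100, 0, 1000]
transformed = rclr(sample)

print(transformed)
```

Because the zeros stay missing instead of receiving a pseudocount, the factorization step only has to fit the entries that were actually observed.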

The Qurro interactive plots are used to explore the log-fold change in abundance of the features loading on each axis of the ordination. The features can be plotted for groups of samples (grouped by a metadata column) or along a continuous variable (e.g., salinity).
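
The quantity plotted for each sample in this kind of view is a log-ratio between two sets of features. Here is a minimal sketch, assuming a natural log of summed counts and that samples with a zero numerator or denominator are dropped; the feature names, counts, and salinity values are hypothetical.

```python
import math

def log_ratio(counts, numerator, denominator):
    """Natural log of summed numerator counts over summed denominator counts."""
    top = sum(counts.get(f, 0) for f in numerator)
    bottom = sum(counts.get(f, 0) for f in denominator)
    if top == 0 or bottom == 0:
        return None  # ratio undefined; such samples are left out of the plot
    return math.log(top / bottom)

# Hypothetical samples along a salinity gradient.
salinity = {"low": 0.5, "mid": 15.0, "high": 32.0}
counts = {
    "low": {"asv_fresh": 90, "asv_marine": 10},
    "mid": {"asv_fresh": 50, "asv_marine": 50},
    "high": {"asv_fresh": 0, "asv_marine": 100},
}

ratios = {s: log_ratio(c, ["asv_fresh"], ["asv_marine"]) for s, c in counts.items()}
for s in salinity:
    print(s, salinity[s], ratios[s])
```

Plotting these ratios against salinity (or grouping them by a categorical column) shows whether the chosen features shift in relative abundance along the gradient.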

Explore the phylogenetic tree of ASVs alongside the rpca ordination: all-sites, SE, NE, N-Pacific, Pacific-Island

Explore the rpca feature loadings with Qurro here: all-sites, SE, NE, N-Pacific, Pacific-Island

Explore the features loading in CTF with Qurro: NE, SE, N-Pacific, Pacific-Island, NO-island

Longitudinal Volatility

Interactive line plots assess how volatile a dependent variable (ASV or taxonomic group) is over a continuous, independent variable (e.g., time) in one or more groups. Select which ASV or taxa to plot on the y-axis to examine how variance in diversity and other metadata changes across time (set with the state-column parameter) in groups of samples and in individual subjects (set with the individual-id-column parameter)

Longitudinal OTU volatility: ASVs, genus, family

Regional longitudinal ASV Volatility SE, N-Pacific, PacIsland, NE

Regional OTU (Genus) Volatility SE, N-Pacific, PacIsland, NE

State Subject Volatility Ordination

Identify features that are predictive of a numeric metadata column, state_column (e.g., time), and plot their relative frequencies across states using interactive feature volatility plots. A supervised learning regressor is used to identify important features and assess their ability to predict sample states. state_column will typically be a measure of time, but any numeric metadata column can be used. Here, the Gemelli CTF output is passed to QIIME 2 longitudinal volatility.

With diatoms: SE, N-Pacific, PacIsland, NE, All NERRs

Beta-group-significance (unfiltered):

Group samples by a metadata column to determine whether they are significantly different from one another using a permutation-based statistical test. At the national scale,

Beta-permanova with predictor: salinity, region, NERR, Quarter, Site
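
The permutation-based test behind these comparisons can be sketched as a one-factor PERMANOVA: compute a pseudo-F statistic from the distance matrix, then compare it with the values obtained after shuffling the group labels. This is a simplified illustration under assumed toy data, not the QIIME 2 implementation.

```python
import random
from itertools import combinations

def pseudo_f(dist, groups):
    """Pseudo-F: among-group over within-group sums of squared distances."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for label in labels:
        members = [i for i in range(n) if groups[i] == label]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    a = len(labels)
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(dist, groups, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Hypothetical 1-D "community positions": two well-separated groups.
positions = [0.0, 0.5, 1.0, 1.5, 2.0, 10.0, 10.5, 11.0, 11.5, 12.0]
groups = ["fresh"] * 5 + ["marine"] * 5
dist = [[abs(p - q) for q in positions] for p in positions]

f_stat, p_value = permanova(dist, groups)
print(f_stat, p_value)
```

A large pseudo-F with a small p-value indicates that samples within a group are closer to each other than to samples in other groups, relative to what random labelling would produce.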

Longitudinal pairwise distance

The pairwise-distances visualizer also assesses changes between paired samples from two different “states”, but instead of taking a metadata column or artifact as input, it operates on a distance matrix to assess the distance between “pre” and “post” sample pairs, and tests whether these paired differences are significantly different between different groups, as specified by the group-column parameter (QIIME 2 documentation). For our data, this tests whether the effect of season differs between regions. We’d expect northern climates to have a greater seasonal effect. Each comparison was performed using the unweighted UniFrac distance matrix.
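
The pairing step can be sketched as follows: for each site, look up the distance between its samples from two quarters, then collect those distances by region. The sample IDs, metadata fields, and distance values are hypothetical, and the significance test between groups is left out.

```python
from collections import defaultdict
from statistics import mean

def paired_distances(distances, metadata, state_1, state_2):
    """Collect per-site distances between two quarters, grouped by region."""
    by_site = defaultdict(dict)
    for sample_id, row in metadata.items():
        by_site[row["site"]][row["quarter"]] = sample_id
    result = defaultdict(list)
    for site, samples in by_site.items():
        # Only sites sampled in both quarters form a pair.
        if state_1 in samples and state_2 in samples:
            pair = frozenset((samples[state_1], samples[state_2]))
            region = metadata[samples[state_1]]["region"]
            result[region].append(distances[pair])
    return dict(result)

# Hypothetical metadata: two northern and two southern sites, quarters 1 and 3.
sites = [("N1", "north"), ("N2", "north"), ("S1", "south"), ("S2", "south")]
metadata = {f"{site}_q{q}": {"site": site, "quarter": q, "region": region}
            for site, region in sites for q in (1, 3)}

# Hypothetical distances between each site's quarter 1 and quarter 3 samples.
distances = {
    frozenset(("N1_q1", "N1_q3")): 0.8,
    frozenset(("N2_q1", "N2_q3")): 0.7,
    frozenset(("S1_q1", "S1_q3")): 0.3,
    frozenset(("S2_q1", "S2_q3")): 0.4,
}

paired = paired_distances(distances, metadata, 1, 3)
means = {region: mean(values) for region, values in paired.items()}
print(means)
```

In this toy data the northern sites change more between quarters than the southern ones, which is the pattern the seasonal hypothesis predicts.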

Pairwise Comparisons: Region_1-3, Region_2-4, North_South-1_3, North_South-2_4

These plots appear to support greater distances among northern samples between the 1st and 3rd quarters, compared to southern samples over the same timeframe.

Gemelli CTF distance matrix (not sure if this is appropriate): North_South-2_4-gemelli

Linear mixed effects models by: [region](https://view.qiime2.org/visualization/?src=https://jthmiller.github.io/files/results/nerrs/all-sites/linear-mixed-effects-region.qzv), [salinity](https://view.qiime2.org/visualization/?src=https://jthmiller.github.io/files/results/nerrs/all-sites/linear-mixed-effects-salinity.qzv), phylo-salinity

Regional Analysis:

New England

Gemelli CTF on regions, where a single point represents a single site within a NERR across all four quarters. If FW sites (and likewise SW sites) are more similar to one another across NERRs in the region than to other sites in their own NERR, points would group by site salinity classification (FW or SW) rather than by NERR.

State subject biplot by region: NE accuracy_results

Results:

Bar-plot images of sites broken up by regions here

Network plots of sites broken up by regions here

Picocyanobacteria

Bar plot

Jeffrey T. Miller, Ph.D.