Results

NERRs eDNA

Metabarcoding analysis for NERRs sites sampled quarterly. Metadata for all sites are available here, including SWMP data when available. SWMP data were downloaded from https://cdmo.baruch.sc.edu

OTU Abundance

Interactive barplots of relative abundance can be found here. Non-interactive plots broken up by region can be found here: gulf

Core diversity metrics

PCoA is performed on distance matrices for the metrics below (seems to better handle missing data than PCA does). More complete descriptions here: https://docs.onecodex.com/en/articles/4150649-beta-diversity

Unifrac PCoA performed on Unweighted UniFrac distance matrix

Unweighted UniFrac PCoA: samples are colored by minimum salinity from SWMP data collected within X days of eDNA sample collection. An interactive version of this plot can be found here

Classify samples from ASVs/OTUs (removed taxa with high 18s copy number)

Qiime Sample Classification predicts a categorical sample metadata column using a supervised learning classifier. It splits the input data into training and test sets. The training set is used to train and test the estimator using a stratified k-fold cross-validation scheme, with optional steps for automated feature extraction and hyperparameter optimization. The test set validates the classification accuracy of the optimized estimator, and classification results are output for the test set. For more details on the learning algorithm, see http://scikit-learn.org/stable/supervised_learning.html

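We used the default random forest estimators: ‘RandomForestClassifier’ for categorical metadata columns and ‘RandomForestRegressor’ for continuous ones. A random forest builds a collection of decision trees, each trained on a different random subset of the data (bagging) and a random subset of features. For regression, the final prediction is the average of the predictions made by the individual trees; for classification, the trees' predictions are combined into a single class label.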
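How a forest combines its trees depends on the task. The sketch below is a textbook simplification, not the scikit-learn implementation: it averages numeric predictions for regression and takes a majority vote for classification (scikit-learn's classifier averages per-tree class probabilities rather than counting hard votes). The per-tree predictions and function names are hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the most common class label among the per-tree predictions."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the mean of the per-tree numeric predictions."""
    return mean(tree_predictions)


# Hypothetical predictions from five trees for a single sample.
region_votes = ["SE", "SE", "NE", "SE", "N-Pacific"]
salinity_predictions = [28.0, 30.0, 29.0, 31.0, 27.0]

predicted_region = aggregate_classification(region_votes)
predicted_salinity = aggregate_regression(salinity_predictions)
```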

18s copy number in Bacillariophyta, Ciliophora, and Dinophyceae can be orders of magnitude higher than in other taxa, resulting in extreme read counts for these taxa, which can swamp the signals from other groups. One strategy to account for 18s copy number is to multiply read counts by a coefficient derived from the ratio between 18s copy number and biomass (as performed by Martin et al. 2022, DOI 10.3897/mbmg.6.85794).
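As a rough illustration of the coefficient-based correction, the sketch below multiplies each taxon's read count by a per-taxon coefficient and then recomputes relative abundance. The coefficients are invented placeholders, not the values from Martin et al. 2022, and the function names are hypothetical.

```python
# Hypothetical correction coefficients (NOT the published values).
CORRECTION = {"Bacillariophyta": 0.1, "Ciliophora": 0.05, "Dinophyceae": 0.02}


def correct_counts(read_counts, coefficients, default=1.0):
    """Multiply each taxon's read count by its coefficient.

    Taxa without a coefficient are left unchanged (default of 1.0).
    """
    return {taxon: count * coefficients.get(taxon, default)
            for taxon, count in read_counts.items()}


def relative_abundance(counts):
    """Scale counts so they sum to 1 within the sample."""
    total = sum(counts.values())
    return {taxon: value / total for taxon, value in counts.items()}


# Hypothetical raw read counts for one sample.
sample = {"Bacillariophyta": 9000, "Ciliophora": 2000, "Chlorophyta": 500}

corrected = correct_counts(sample, CORRECTION)
raw_fractions = relative_abundance(sample)
corrected_fractions = relative_abundance(corrected)
```

In this toy sample the Chlorophyta fraction rises after correction because the high copy number taxa no longer dominate the total.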

PCoA on sample dissimilarity matrices: bray curtis, jaccard
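For reference, the two non-phylogenetic dissimilarities can be sketched in a few lines. This is a minimal illustration on hypothetical count vectors, not the code used to build the matrices: Bray-Curtis uses abundances, while Jaccard here uses presence/absence only.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors (abundance-based).

    Assumes at least one of the two samples has a nonzero count.
    """
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


def jaccard(u, v):
    """Jaccard distance on presence/absence: 1 - |shared| / |union|.

    Assumes at least one feature is present in one of the samples.
    """
    present_u = {i for i, a in enumerate(u) if a > 0}
    present_v = {i for i, b in enumerate(v) if b > 0}
    return 1 - len(present_u & present_v) / len(present_u | present_v)


# Hypothetical counts for four features in two samples.
sample_a = [10, 0, 5, 5]
sample_b = [5, 5, 0, 10]

bc = bray_curtis(sample_a, sample_b)
jc = jaccard(sample_a, sample_b)
```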

Phylogenetic PCoAs: unweighted unifrac, weighted unifrac

Heatmaps of the ASVs that predict category: Region, NERR, Site

Heatmaps of the OTUs that predict category: Region, NERR, Site

Heatmaps for each site: SF, GTM

Unfiltered heatmaps for each site: GTM ASVs, GTM OTUs Quarter


Heatmaps of the OTUs that predict continuous data: Sal_Min-Heatmap, pH_Min-Heatmap

Accuracy Predictions: Sal_Min-accuracy, pH_Min-accuracy

Predictions including diatoms: OTUs Regions, ASVs Regions, ASVs Salinity

RPCA:

CTF:

GEMELLI:

Phylogenetic RPCA and CTF.

“In order to account for the correlation among samples from the same subject we will employ compositional tensor factorization (CTF). CTF builds on the ability to account for compositionality and sparsity using the robust center log-ratio transform … but restructures and factors the data as a tensor.”

“… robust principal-component analysis (RPCA) addresses sparsity and compositionality; compositional tensor factorization (CTF) addresses sparsity, compositionality, and repeated measure study designs; and UniFrac incorporates phylogenetic information. Here we introduce a strategy of incorporating phylogenetic information into RPCA and CTF. The resulting methods, phylo-RPCA, and phylo-CTF, provide substantial improvements over state-of-the-art methods in terms of discriminatory power of underlying clustering” (https://pmc.ncbi.nlm.nih.gov/articles/PMC9238373/)
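The robust centered log-ratio transform named in the first quotation above can be sketched for a single sample as follows. This is a simplified illustration, assuming zeros are treated as missing values rather than replaced with pseudocounts; the function name and counts are hypothetical and this is not Gemelli's own code.

```python
import math


def rclr(counts):
    """Robust centered log-ratio of one sample.

    Zeros are treated as missing (None). Only observed (nonzero) counts are
    log-transformed, then centered on the mean of those logs.
    Assumes the sample has at least one nonzero count.
    """
    logs = [math.log(c) for c in counts if c > 0]
    center = sum(logs) / len(logs)
    return [math.log(c) - center if c > 0 else None for c in counts]


# Hypothetical counts for four features in one sample.
sample = [1, 0, 100, 10]
transformed = rclr(sample)
observed = [value for value in transformed if value is not None]
```

The observed values sum to zero after centering, and the zero count stays missing instead of being forced to a log value.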

For a tutorial on CTF with Qiime’s Gemelli plugin, see here

The qurro interactive plots are for exploring the log fold change in abundance of the features loading on each axis of the PCoA. The features can be plotted for groups of samples (grouped by a metadata column) or along a continuous variable (e.g., salinity).
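The kind of per-sample log-ratio explored in these plots can be sketched as the natural log of summed numerator features over summed denominator features. This is a simplified illustration with hypothetical feature IDs and counts, not Qurro's implementation; samples where either sum is zero are returned as None because the log-ratio is undefined.

```python
import math


def log_ratio(sample_counts, numerator, denominator):
    """Natural log of summed numerator counts over summed denominator counts.

    Returns None when either sum is zero, since the ratio is undefined.
    """
    top = sum(sample_counts.get(feature, 0) for feature in numerator)
    bottom = sum(sample_counts.get(feature, 0) for feature in denominator)
    if top == 0 or bottom == 0:
        return None
    return math.log(top / bottom)


# Hypothetical ASV counts for three samples.
samples = {
    "low_salinity": {"asv1": 80, "asv2": 20, "asv3": 10},
    "high_salinity": {"asv1": 5, "asv2": 5, "asv3": 100},
    "undefined": {"asv1": 0, "asv2": 0, "asv3": 7},
}

ratios = {name: log_ratio(counts, ["asv1", "asv2"], ["asv3"])
          for name, counts in samples.items()}
```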

Explore the phylogenetic tree of ASVs alongside the rpca ordination: all-sites, SE, NE, N-Pacific, Pacific-Island

Explore the rpca feature loadings with qurro here: all-sites, SE, NE, N-Pacific, Pacific-Island

Explore the features loading in CTF with Qurro: NE, SE, N-Pacific, Pacific-Island, NO-island

Longitudinal Volatility

Interactive line plots assess how volatile a dependent variable (ASV or taxonomic group) is over a continuous, independent variable (e.g., time) in one or more groups. Select which ASV or taxa to plot on the y-axis to examine how variance in diversity and other metadata changes across time (set with the state-column parameter) in groups of samples and in individual subjects (set with the individual-id-column parameter).

Longitudinal volatility: ASVs, genus, family

Regional ASV Volatility SE, N-Pacific, PacIsland, NE

Regional Genus Volatility SE, N-Pacific, PacIsland, NE

State Subject Volatility Ordination

Identify features that are predictive of a numeric metadata column, state_column (e.g., time), and plot their relative frequencies across states using interactive feature volatility plots. A supervised learning regressor is used to identify important features and assess their ability to predict sample states. state_column will typically be a measure of time, but any numeric metadata column can be used.

With diatoms: SE, N-Pacific, PacIsland, NE

This ordination accounts for the correlation among samples from the same subject (a site within a NERR), so each point is a site rather than a sample: rf-state_subject_ordination

Here, each point represents a site (subject) rather than an individual sample, which makes it easier to look for groupings by salinity or other features: state-subject-ordination

beta-group-significance: Group samples by a metadata column to determine whether they are significantly different from one another using a permutation-based statistical test. At the national scale:

beta-permanova: salinity, region, NERR, Quarter, Site
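A permutation-based test of this kind can be sketched as a PERMANOVA: compute a pseudo-F statistic from the distance matrix, then shuffle the group labels to see how often a value at least as large arises by chance. The sketch below is a minimal illustration on hypothetical data, assuming 999 permutations; it is not the implementation used by the plugin.

```python
import random
from itertools import combinations


def pseudo_f(dist, labels):
    """PERMANOVA pseudo-F from a square distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2
                   for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for group in groups:
        members = [i for i, label in enumerate(labels) if label == group]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(dist, labels, permutations=999, seed=0):
    """Return the observed pseudo-F and a permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # break any real link between labels and samples
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical samples placed on a line: two tight, well-separated groups.
positions = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
labels = ["north"] * 4 + ["south"] * 4
dist = [[abs(a - b) for b in positions] for a in positions]

f_stat, p_value = permanova(dist, labels)
```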

Longitudinal pairwise distance

The pairwise-distances visualizer also assesses changes between paired samples from two different “states”, but instead of taking a metadata column or artifact as input, it operates on a distance matrix to assess the distance between “pre” and “post” sample pairs, and tests whether these paired differences are significantly different between different groups, as specified by the group-column parameter (Qiime doc). For our data, this tests whether the effect of season differs between regions. We expect northern climates to have a greater seasonal effect. Each comparison was performed using the unweighted UniFrac distance matrix.
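The lookup step behind this comparison can be sketched as follows: for each subject, pull the distance between its two paired samples out of the distance matrix, then summarize those distances per group. The sample IDs, distances, and group assignments are hypothetical, and the significance test that the visualizer runs on the grouped distances is omitted here.

```python
from statistics import median


def paired_distances(dist, sample_ids, pairs):
    """Look up the distance between each subject's two paired samples."""
    index = {sample_id: i for i, sample_id in enumerate(sample_ids)}
    return {subject: dist[index[first]][index[second]]
            for subject, (first, second) in pairs.items()}


# Hypothetical distance matrix for two sites sampled in quarters 1 and 3.
sample_ids = ["siteA_Q1", "siteA_Q3", "siteB_Q1", "siteB_Q3"]
dist = [
    [0.0, 0.8, 0.5, 0.6],
    [0.8, 0.0, 0.7, 0.5],
    [0.5, 0.7, 0.0, 0.2],
    [0.6, 0.5, 0.2, 0.0],
]
pairs = {"siteA": ("siteA_Q1", "siteA_Q3"), "siteB": ("siteB_Q1", "siteB_Q3")}
group_of = {"siteA": "north", "siteB": "south"}

distances = paired_distances(dist, sample_ids, pairs)

# Collect the paired distances by group and summarize each group.
by_group = {}
for subject, distance in distances.items():
    by_group.setdefault(group_of[subject], []).append(distance)
medians = {group: median(values) for group, values in by_group.items()}
```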

Region_1-3

Region_2-4

These plots appear to support greater distances among northern samples between the 1st and 3rd quarters than among southern samples over the same timeframe: North_South-1_3

North_South-2_4

This comparison was performed using the gemelli ctf distance matrix (not sure if this is appropriate)

North_South-2_4-gemelli

Mixed effects models

linear-mixed-effects-by-region

linear-mixed-effects-by-salinity

phylo-salinity_significance

Regional Analysis:

New England

NE_subject_biplot

state_subject_ordination

accuracy_results

Results:

Bar-plot images of sites broken up by regions here

Network plots of sites broken up by regions here

Picocyanobacteria

Bar plot

Gemelli

Gemelli is designed for sparse compositional omics datasets. All of its methods are unsupervised and aim to describe sample/subject variation and the biological features that separate them.