Results
NERRs eDNA
Metabarcoding analysis for NERRs sites sampled quarterly. Metadata for all sites are available here, including SWMP data when available (downloaded from https://cdmo.baruch.sc.edu).
OTU Abundance
Interactive barplots of relative abundance can be found here. Non-interactive plots broken up by region can be found here.
Core diversity metrics
PCoA is performed on distance matrices for the metrics below (seems to better handle missing data than PCA does). More complete descriptions here: https://docs.onecodex.com/en/articles/4150649-beta-diversity
- Alpha diversity
- Rarefaction
- Shannon’s diversity index (a quantitative measure of community richness)
- Faith’s Phylogenetic Diversity (a qualitative measure of community richness that incorporates phylogenetic relationships between the features) Kruskal-Wallis
- Observed OTUs (a qualitative measure of community richness) Kruskal-Wallis
- Evenness (or Pielou’s Evenness; a measure of community evenness)
- Beta diversity
- Jaccard distance (a qualitative measure of community dissimilarity. Qualitative - presence / absence - percentage of taxa not found in both samples) jaccard emperor
- Bray-Curtis distance (a quantitative measure of community dissimilarity. Takes into consideration abundance and presence absence) bray curtis emperor
- Unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features. Percentage of phylogenetic branch length not found in both samples) unweighted unifrac emperor
- Weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features. Similar to Bray-Curtis but takes into consideration phylogenetic relationships) weighted unifrac emperor
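The non-phylogenetic metrics in the list above can be illustrated with a minimal sketch. This is not the QIIME 2 implementation: the function names are hypothetical, each sample is assumed to be a plain list of feature counts, and Shannon's index is computed in log base 2 as an assumption. Faith's PD and the UniFrac distances are left out because they need a phylogenetic tree.

```python
import math


def shannon(counts, base=2.0):
    """Shannon diversity of one sample's feature counts (log base is an assumption)."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)


def observed_features(counts):
    """Number of features (OTUs/ASVs) with at least one read."""
    return sum(1 for c in counts if c > 0)


def pielou_evenness(counts):
    """Shannon diversity divided by its maximum, log(number of observed features)."""
    richness = observed_features(counts)
    if richness < 2:
        return 0.0  # convention chosen for this sketch; evenness is undefined here
    return shannon(counts, base=math.e) / math.log(richness)


def jaccard_distance(a, b):
    """Presence/absence: fraction of observed features not shared by both samples."""
    present_a = {i for i, c in enumerate(a) if c > 0}
    present_b = {i for i, c in enumerate(b) if c > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Abundance-based: sum of absolute differences over the sum of all counts."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


# Two toy samples over the same four features (made-up counts).
sample_a = [10, 10, 0, 0]
sample_b = [10, 0, 10, 0]
print(shannon(sample_a), pielou_evenness(sample_a))
print(jaccard_distance(sample_a, sample_b), bray_curtis(sample_a, sample_b))
```

The two toy samples share one of three observed features, so the Jaccard distance reflects only presence/absence, while Bray-Curtis also responds to how many reads each feature has.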
Unifrac PCoA performed on Unweighted UniFrac distance matrix
Samples colored by minimum salinity from SWMP collected data within X days of eDNA sample collection. This is an interactive plot that can be found here
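For readers unfamiliar with PCoA, the following is a minimal sketch of classical PCoA (double-centering the squared distance matrix, then eigendecomposition), assuming NumPy is available. The function name `pcoa` and the toy distance matrix are illustrative only; the plots on this page were produced by QIIME 2, not by this code.

```python
import numpy as np


def pcoa(dist, n_axes=2):
    """Classical PCoA of a symmetric distance matrix; returns coordinates and
    the proportion of (positive) eigenvalue mass on each returned axis."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Gower's double-centering of the squared distances.
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Non-Euclidean distances can give negative eigenvalues; this sketch
    # clips them to zero, which is one of several possible conventions.
    kept = np.clip(eigvals[:n_axes], 0.0, None)
    coords = eigvecs[:, :n_axes] * np.sqrt(kept)
    explained = kept / eigvals[eigvals > 0].sum()
    return coords, explained


# Toy example: three samples lying on a line at positions 0, 1 and 3.
toy = [[0, 1, 3],
       [1, 0, 2],
       [3, 2, 0]]
coords, explained = pcoa(toy)
print(coords[:, 0], explained)
```

Because the toy distances are exactly one-dimensional, the first axis recovers them and carries essentially all of the variation.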
Classify samples from ASVs/OTUs (removed taxa with high 18s copy number)
Qiime Sample Classification predicts a categorical sample metadata column using a supervised learning classifier. It splits the input data into training and test sets. The training set is used to train and test the estimator using a stratified k-fold cross-validation scheme, with optional steps for automated feature extraction and hyperparameter optimization. The test set validates the classification accuracy of the optimized estimator, and classification results are output for the test set. For more details on the learning algorithms, see http://scikit-learn.org/stable/supervised_learning.html
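We used the default random forest estimators: ‘RandomForestClassifier’ for categorical columns (Region, NERR, Site) and ‘RandomForestRegressor’ for continuous columns (Sal_Min, pH_Min). A random forest builds a collection of decision trees, each trained on a different random subset of the data (bagging) and a random subset of features. The final prediction combines the individual trees: a vote across trees for classification, and the average of their predictions for regression.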
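The stratified splitting described above can be sketched in plain Python. This is only an illustration of the idea (keep each class represented in the same proportion in every split); the function names, the 20% test fraction, and the number of folds are assumptions, and the actual analysis relies on the QIIME 2 plugin rather than this code.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Hold out roughly test_fraction of the samples from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(labels, indices, k=5, seed=0):
    """Deal the given sample indices into k folds, class by class, so each
    fold receives a near-equal share of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index in indices:
        by_class[labels[index]].append(index)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


# Toy metadata column: 10 samples from each of three regions.
regions = ["NE"] * 10 + ["SE"] * 10 + ["N-Pacific"] * 10
train_idx, test_idx = stratified_split(regions, test_fraction=0.2)
cv_folds = stratified_folds(regions, train_idx, k=4)
print(len(train_idx), len(test_idx), [len(fold) for fold in cv_folds])
```

Each cross-validation round would train on three of the four folds and score on the fourth; the held-out test set is only touched once, after tuning.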
18s copy number in Bacillariophyta, Ciliophora, and Dinophyceae can be orders of magnitude higher than in other taxa, resulting in extreme read counts for these taxa that can swamp the signal from other groups. One strategy to account for 18s copy number is to multiply read counts by a coefficient derived from the ratio between 18s copy number and biomass (as performed by Martin et al. 2022, DOI 10.3897/mbmg.6.85794).
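The coefficient-based correction can be sketched as follows. The coefficient values below are placeholders invented for illustration, not the values published by Martin et al. 2022, and the function and variable names are hypothetical.

```python
def correct_copy_number(read_counts, taxon_of, coefficients, default=1.0):
    """Multiply each feature's read count by the coefficient for its taxon.
    Taxa without a coefficient are left unchanged (default of 1.0)."""
    corrected = {}
    for feature, count in read_counts.items():
        taxon = taxon_of.get(feature)
        corrected[feature] = count * coefficients.get(taxon, default)
    return corrected


def relative_abundance(counts):
    """Convert counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {feature: value / total for feature, value in counts.items()}


# PLACEHOLDER coefficients, chosen only to make the example readable.
example_coefficients = {
    "Bacillariophyta": 0.1,
    "Ciliophora": 0.05,
    "Dinophyceae": 0.01,
}

# Made-up read counts for three features in one sample.
reads = {"asv1": 9000, "asv2": 500, "asv3": 50}
taxa = {"asv1": "Dinophyceae", "asv2": "Bacillariophyta", "asv3": "Copepoda"}

corrected = correct_copy_number(reads, taxa, example_coefficients)
before = relative_abundance(reads)
after = relative_abundance(corrected)
print(before["asv1"], after["asv1"])
```

In this toy sample the dinoflagellate feature dominates the raw reads but accounts for a much smaller share after correction, which is the effect the strategy aims for.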
PCoA on sample dissimilarity matrices: bray curtis, jaccard
Phylogenetic PCoAs: unweighted unifrac, weighted unifrac
Heatmaps of the ASVs that predict category: Region, NERR, Site
Heatmaps of the OTUs that predict category: Region, NERR, Site
Heatmaps for each site: SF, GTM
Unfiltered heatmaps for each site: GTM ASVs, GTM OTUs Quarter
Heatmaps of the OTUs that predict continuous data: Sal_Min-Heatmap, pH_Min-Heatmap
Accuracy Predictions: Sal_Min-accuracy, pH_Min-accuracy
Predicting including diatoms? OTUs Regions, ASVs Regions. ASVs Salinity
RPCA:
CTF:
GEMELLI:
Phylogenetic RPCA and CTF.
“In order to account for the correlation among samples from the same subject we will employ compositional tensor factorization (CTF). CTF builds on the ability to account for compositionality and sparsity using the robust center log-ratio transform … but restructures and factors the data as a tensor.”
“… robust principal-component analysis (RPCA) addresses sparsity and compositionality; compositional tensor factorization (CTF) addresses sparsity, compositionality, and repeated measure study designs; and UniFrac incorporates phylogenetic information. Here we introduce a strategy of incorporating phylogenetic information into RPCA and CTF. The resulting methods, phylo-RPCA, and phylo-CTF, provide substantial improvements over state-of-the-art methods in terms of discriminatory power of underlying clustering …” (https://pmc.ncbi.nlm.nih.gov/articles/PMC9238373/)
For a tutorial on CTF with Qiime’s Gemelli plugin, see here
The qurro interactive plots are used to explore the log-fold change in abundance of the features loading on each axis of the PCoA. The features can be plotted for groups of samples (grouped by a metadata column) or along a continuous variable (e.g., Salinity).
Explore the phylogenetic tree of ASVs alongside the rpca ordination: all-sites, SE, NE, N-Pacific, Pacific-Island
Explore the rpca feature loadings with qurro here: all-sites, SE, NE, N-Pacific, Pacific-Island
Explore the features loading in CTF with Qurro: NE, SE, N-Pacific, Pacific-Island, NO-island
Longitudinal Volatility
Interactive line plots assess how volatile a dependent variable (ASV or taxonomic group) is over a continuous, independent variable (e.g., time) in one or more groups. Select which ASV or taxa to plot on the y-axis to examine how variance in diversity and other metadata changes across time (set with the state-column parameter) in groups of samples and in individual subjects (set with the individual-id-column parameter)
Longitudinal volatility: ASVs, genus, family
Regional ASV Volatility SE, N-Pacific, PacIsland, NE
Regional Genus Volatility SE, N-Pacific, PacIsland, NE
State Subject Volatility Ordination
Identify features that are predictive of a numeric metadata column, state_column (e.g., time), and plot their relative frequencies across states using interactive feature volatility plots. A supervised learning regressor is used to identify important features and assess their ability to predict sample states. state_column will typically be a measure of time, but any numeric metadata column can be used.
With diatoms: SE, N-Pacific, PacIsland, NE
Accounts for the correlation among samples from the same subject (site within NERR). Points represent sites (subjects) rather than individual samples. rf-state_subject_ordination
Here too, points represent each site (subject) rather than each sample, to look for groupings by salinity or other features. state-subject-ordination
beta-group-significance: Group samples by a metadata column to determine whether they are significantly different from one another using a permutation-based statistical test. At the national scale,
beta-permanova: salinity, region, NERR, Quarter, Site
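As a rough illustration of a permutation-based group test on a distance matrix, the sketch below computes a PERMANOVA-style pseudo-F statistic and a permutation p-value in plain Python. It assumes the test behind the links above is PERMANOVA; the function names, the toy data, and the number of permutations are illustrative, and this is not the code that produced the results.

```python
import random
from itertools import combinations


def pseudo_f(dist, groups):
    """Pseudo-F: among-group versus within-group sums of squared distances.
    Assumes at least two groups and some within-group variation."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for label in labels:
        members = [i for i, g in enumerate(groups) if g == label]
        pair_sum = sum(dist[i][j] ** 2 for i, j in combinations(members, 2))
        ss_within += pair_sum / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(labels) - 1)) / (ss_within / (n - len(labels)))


def permanova(dist, groups, permutations=999, seed=0):
    """Shuffle group labels to build a null distribution of pseudo-F."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data: eight samples on a line, four per made-up group.
positions = [0.0, 0.5, 1.0, 2.0, 10.0, 10.5, 11.0, 12.0]
toy_groups = ["low"] * 4 + ["high"] * 4
toy_dist = [[abs(a - b) for b in positions] for a in positions]
f_stat, p_value = permanova(toy_dist, toy_groups)
print(f_stat, p_value)
```

With only eight samples there are few distinct label arrangements, so the smallest attainable p-value is limited; larger designs such as the national-scale comparison do not have that constraint.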
Longitudinal pairwise distance
The pairwise-distances visualizer also assesses changes between paired samples from two different “states”, but instead of taking a metadata column or artifact as input, it operates on a distance matrix to assess the distance between “pre” and “post” sample pairs, and tests whether these paired differences are significantly different between different groups, as specified by the group-column parameter. (Qiime doc) For our data, this will test whether the effect of season differs between regions. We expect northern climates to have a greater seasonal effect. Each comparison was performed using the unweighted unifrac distance matrix.
These plots appear to support greater distances among northern samples between the 1st and 3rd quarters, compared to southern samples over that same timeframe. North_South-1_3
This comparison was performed using the gemelli ctf distance matrix (not sure if this is appropriate)
Mixed effects models
linear-mixed-effects-by-region
linear-mixed-effects-by-salinity
Regional Analysis:
New England
Results:
Bar-plot images of sites broken up by regions here
Network plots of sites broken up by regions here
Picocyanobacteria
Gemelli
For sparse compositional omics datasets. All these methods are unsupervised and aim to describe sample/subject variation and the biological features that separate them.
- Robust Aitchison PCA (RPCA)
- Compositional Tensor Factorization (CTF)
The preprocessing transform for both RPCA and CTF is the robust centered log-ratio transform (rclr), which accounts for sparse data (i.e. many missing/zero values).
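A minimal sketch of the robust centered log-ratio idea for a single sample is given below: only the non-zero counts are log-transformed and centered on their own mean, and zeros are kept as missing values instead of being replaced with pseudocounts. The function name `rclr` and the use of `None` for missing entries are choices made for this illustration; gemelli's own implementation, and the step that later handles the missing entries, are not shown here.

```python
import math


def rclr(counts):
    """Robust centered log-ratio of one sample: log of each non-zero count
    minus the mean log of the non-zero counts; zeros stay missing (None)."""
    logs = [math.log(c) for c in counts if c > 0]
    if not logs:
        return [None] * len(counts)
    center = sum(logs) / len(logs)
    return [math.log(c) - center if c > 0 else None for c in counts]


# Made-up counts for four features; the second feature was not observed.
sample = [1, 0, 100, 10]
transformed = rclr(sample)
print(transformed)
```

The observed values are centered on zero within the sample, so the transform depends only on ratios between observed features and not on sequencing depth.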