Abstract
Colorectal cancer is the third leading cause of cancer-related deaths in men and in women, and the second most common cause of cancer deaths when men and women are combined ACS (2012). There are an estimated 104,610 new cases of colon cancer and 43,340 new cases of rectal cancer. The lifetime risk of developing colorectal is 4.4% and 4.1% for men and for women, respectively, and it suggests a slightly higher risk for men ACS (2012). This project aims to explore the effects of colorectal cancer on the microbiome of the gastrointestinal (GI) tract by testing for the differences in microbial sequences in carcinoma and healthy tissues. PathSeq, a non-host RNA analyzing computation tool, was used to identify microbial differences in cancerous and healthy samples. When samples from both cancerous and healthy tissue were run through PathSeq, it output lists of bacteria that were found. These outputs were compared with the aim to find significant differences in levels of species of bacteria. We found that both Sphingomonas echinoides and the genus Propionibacterium have considerable differences between cancer and healthy samples.The digestive system is made up of the gastrointestinal (GI) tract, liver, pancreas and gallbladder. The GI tract is a series of hollow organs joined in a long, twisting tube, and it aids with the intake and digestion of food to extract and absorb energy and nutrients and expel the remaining waste. The human body contains trillions of microbes, with some 4000 different strains of bacteria having diverse roles in maintenance of immune health metabolism Malla et al. (2019). Studies show colorectal cancer disrupts the microbiome within the GI tract. For example, Castellarin et al. found that there was an abundance of Fusobacterium nucleatum, an invasive, adherent, and pro-inflammatory anaerobic bacterium, in patients with colorectal cancer compared to others Castellarin et al. (2012). This leads us to wonder if there are other changes in the microbiome environment of colorectal carcinoma tissue. Based on the results from previous papers, we hypothesize that colorectal cancer heavily influences the microbiome environment within the GI tract, and the Pathseq analysis pipeline is used as the tool to test our hypothesis.
To analyze the potential change in the microbiome due to colorectal cancer, we used data obtained from a study on the change in Fusobacterium nucleatum due to colorectal cancer Castellarin et al. (2012). This study isolated RNA from carcinoma and adjacent healthy tissue and then sequenced the RNA. When the RNA is isolated from the colon, the samples contain the host RNA but also the RNA from actively transcribing microorganisms on the tissue. Detection of the microorganisms could not be achieved if DNA samples were used. From this, we have potentially 13 patients to conduct analysis on, each with a carcinoma and healthy sample. These samples are 75 bp paired end sequences.
Instead of using the same methods as in the study, we used PathSeq to obtain our results. PathSeq allows for RNA analysis of the non-host portion of the dataset. Through sequential subtractive steps, PathSeq removes all alignments to a human reference sequence; therefore, only the pathogens are left present in the dataset. Additionally, PathSeq has a present set of sequencing data from bacteria. In its analysis, PathSeq compared the previous unmapped reads to the bacteria genomes. Finally, PathSeq outputs data that shows which bacteria genomes were matched from the sequences inputted.
Human reference, microbe reference, and taxonomy files are required to run the PathSeq pipeline. PathSeq provides an updated version of these files, which were used for this report. For the human RNA sequence files, FASTQ files were obtained from the National Center for Biotechnology Information Sequence Read Archive (NCBI SRA) website and converted to BAM file format for further use in the Pathseq pipeline. Slurm files were created using the template below to obtain PathSeq output for each patient’s carcinoma and healthy samples.
Figure 1: Example of Pathseq Pipeline Configuration
Each microbiome is specific to each human and PathSeq does not output information for species that have no mapped reads. Therefore, the PathSeq from each patient, and cancerous versus healthy tissue within the same patient, outputs a different list of bacteria found. Rows of data with the species name and zero values for mapped reads were added to each output to account for the different lists. This gave us output with the same length and list of bacteria to conduct analysis.
# Loading in packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)
library(pracma)
# Loading in the Pathseq results
p1_healthy <- read.delim("86.pathseq.txt")
p1_carcinoma <- read.delim("98.pathseq.txt")
p2_healthy <- read.delim("87.pathseq.txt")
p2_carcinoma <- read.delim("99.pathseq.txt")
p3_healthy <- read.delim("88.pathseq.txt")
p3_carcinoma <- read.delim("100.pathseq.txt")
p4_healthy <- read.delim("89.pathseq.txt")
p4_carcinoma <- read.delim("101.pathseq.txt")
p5_healthy <- read.delim("90.pathseq.txt")
p5_carcinoma <- read.delim("102.pathseq.txt")
p6_healthy <- read.delim("91.pathseq.txt")
p6_carcinoma <- read.delim("103.pathseq.txt")
# Filtering data for specific columns
p_list <- list(p1_healthy,p1_carcinoma,p2_healthy,p2_carcinoma,p3_healthy,p3_carcinoma,p4_healthy,p4_carcinoma,p5_healthy,p5_carcinoma,p6_healthy,p6_carcinoma)
for(i in 1:length(p_list)){
p_list[[i]] <- p_list[[i]] %>% filter(type == "species") %>% select(1,4,5,8)
}
# Adding zero values to all samples
for(i in 1:length(p_list)){
for(j in 1:length(p_list)){
for(k in 1:nrow(p_list[[j]])){
if (!(p_list[[j]][k,1] %in% (p_list[[i]][,1]))){
val <- p_list[[j]][k,]
val[,4] <- 0
p_list[[i]] <- rbind(p_list[[i]],val)
} else{
}
}
}
}
# Sort by Tax_ID
for(i in 1:length(p_list)){
p_list[[i]]<-p_list[[i]][order(p_list[[i]][,1]),]
}
# Creating Dataframes
p1 <- cbind(p_list[[1]],p_list[[2]][,4])
p2 <- cbind(p_list[[3]],p_list[[4]][,4])
p3 <- cbind(p_list[[5]],p_list[[6]][,4])
p4 <- cbind(p_list[[7]],p_list[[8]][,4])
p5 <- cbind(p_list[[9]],p_list[[10]][,4])
p6 <- cbind(p_list[[11]],p_list[[12]][,4])
colnames(p1) <- c("tax_id","Name","Kingdom","Healthy_Reads","Cancerous_Reads")
colnames(p2) <- c("tax_id","Name","Kingdom","Healthy_Reads","Cancerous_Reads")
colnames(p3) <- c("tax_id","Name","Kingdom","Healthy_Reads","Cancerous_Reads")
colnames(p4) <- c("tax_id","Name","Kingdom","Healthy_Reads","Cancerous_Reads")
colnames(p5) <- c("tax_id","Name","Kingdom","Healthy_Reads","Cancerous_Reads")
colnames(p6) <- c("tax_id","Name","Kingdom","Healthy_Reads","Cancerous_Reads")
The data frames were then sorted to form data frames containing only the taxonomy ID, Kingdom, Number of Healthy Reads, and Number of Cancerous Reads.
The PathSeq pipeline run within the GATK module is utilized for illustrating the microbiome environment when provided with sequencing samples of a host organism. The trial run of the pipeline included completing The Broad Institute’s PathSeq tutorial, which assessed the E Coli K12 content of a human 3M paired-end 151-bp sequencing sample. The results from the tutorial provided 18950 reads that mapped specifically to the strain reference.
The PathSeq template, outlined in the methods section, took nearly a full day to complete. Due to this unanticipated run time, only six patient samples, both cancerous and healthy, were run. The pipeline is set to output two separate files. The first of the two files is a .bam file, which provides all the identified non-host reads that aligned up with the microbiome reference file. Each aligned read is tagged and matched to a taxonomy ID. The other file provided in the output of the PathSeq pipeline is a .txt file, a table providing a summary of the sample’s microbial composition. There are several columns of features, including information on the number of matching reads, as well as a “score”, which indicates the amount of evidence that the taxon is present in the sequence sample.
The matching reads for each bacteria was normalized by finding the percentage of total reads in that given sample. For each patient, the difference in percentage of reads between the cancerous sample and healthy sample was found for each species. The mean difference, standard deviation, and absolute difference in percentage of reads for each species were calculated across patients. In Table 1, we can see these three values for the bacteria that showed above a 0.12% difference between cancerous and healthy tissue.
# Calculating Difference, Percent Difference and Normalizing the data
# P1
p1$Healthy_Reads_Percent <- 100 * p1$Healthy_Reads/sum(p1$Healthy_Reads)
p1$Cancerous_Reads_Percent <- 100 * p1$Cancerous_Reads/sum(p1$Cancerous_Reads)
p1$Difference_between_Percents <- (p1$Cancerous_Reads_Percent - p1$Healthy_Reads_Percent)
p1$Abs_Difference_between_Percents <- abs(p1$Difference_between_Percents)
# P2
p2$Healthy_Reads_Percent <- 100 * p2$Healthy_Reads/sum(p2$Healthy_Reads)
p2$Cancerous_Reads_Percent <- 100 * p2$Cancerous_Reads/sum(p2$Cancerous_Reads)
p2$Difference_between_Percents <- (p2$Cancerous_Reads_Percent - p2$Healthy_Reads_Percent)
p2$Abs_Difference_between_Percents <- abs(p2$Difference_between_Percents)
# P3
p3$Healthy_Reads_Percent <- 100 * p3$Healthy_Reads/sum(p3$Healthy_Reads)
p3$Cancerous_Reads_Percent <- 100 * p3$Cancerous_Reads/sum(p3$Cancerous_Reads)
p3$Difference_between_Percents <- (p3$Cancerous_Reads_Percent - p3$Healthy_Reads_Percent)
p3$Abs_Difference_between_Percents <- abs(p3$Difference_between_Percents)
# P4
p4$Healthy_Reads_Percent <- 100 * p4$Healthy_Reads/sum(p4$Healthy_Reads)
p4$Cancerous_Reads_Percent <- 100 * p4$Cancerous_Reads/sum(p4$Cancerous_Reads)
p4$Difference_between_Percents <- (p4$Cancerous_Reads_Percent - p4$Healthy_Reads_Percent)
p4$Abs_Difference_between_Percents <- abs(p4$Difference_between_Percents)
# P5
p5$Healthy_Reads_Percent <- 100 * p5$Healthy_Reads/sum(p5$Healthy_Reads)
p5$Cancerous_Reads_Percent <- 100 * p5$Cancerous_Reads/sum(p5$Cancerous_Reads)
p5$Difference_between_Percents <- (p5$Cancerous_Reads_Percent - p5$Healthy_Reads_Percent)
p5$Abs_Difference_between_Percents <- abs(p5$Difference_between_Percents)
# P6
p6$Healthy_Reads_Percent <- 100 * p6$Healthy_Reads/sum(p6$Healthy_Reads)
p6$Cancerous_Reads_Percent <- 100 * p6$Cancerous_Reads/sum(p6$Cancerous_Reads)
p6$Difference_between_Percents <- (p6$Cancerous_Reads_Percent - p6$Healthy_Reads_Percent)
p6$Abs_Difference_between_Percents <- abs(p6$Difference_between_Percents)
# Calculating mean and standard deviation of difference between percents
results <- p_list[[1]][,1:2]
for(i in 1:nrow(p1)){
results[i,3] <- mean(p1[i,8],p2[i,8],p3[i,8],p4[i,8],p5[i,8],p6[i,8])
results[i,4] <- sd(c(p1[i,8],p2[i,8],p3[i,8],p4[i,8],p5[i,8],p6[i,8]))
}
colnames(results) <- c("tax_id","Name","mean_DBP","sd_DBP")
results$abs_DBP <- abs(results$mean_DBP)
for(i in 1:length(results)){
results<-results[order(-results$abs_DBP),]
}
results_plot <- results %>% filter(abs_DBP > 0.12) %>% select(2)
head(results,18)
This shows that Sphingomonas echinoides had about 1.21 ± 0.74 absolute change in percentage of reads due to cancer. This absolute difference is significantly higher than any other change in a singular species and the sole bacteria above 1%. In this table, the only other species with a negative change in the mean difference in percentage of reads was also from the Sphingomonas genus. The negative change is because the percentage of reads for Sphingomonas was less in the cancerous samples compared to the healthy. The decrease in read percentage from healthy tissue to carcinoma tissue can be seen in the R code below.
# Setting up data for plotting
for (i in 1:nrow(results_plot)){
results_plot[i,2] <-p1[which(p1$Name == results_plot[i,1]),6]
results_plot[i,3] <-p1[which(p1$Name == results_plot[i,1]),7]
results_plot[i,4] <-p2[which(p2$Name == results_plot[i,1]),6]
results_plot[i,5] <-p2[which(p2$Name == results_plot[i,1]),7]
results_plot[i,6] <-p3[which(p3$Name == results_plot[i,1]),6]
results_plot[i,7] <-p3[which(p3$Name == results_plot[i,1]),7]
results_plot[i,8] <-p4[which(p4$Name == results_plot[i,1]),6]
results_plot[i,9] <-p4[which(p4$Name == results_plot[i,1]),7]
results_plot[i,10] <-p5[which(p5$Name == results_plot[i,1]),6]
results_plot[i,11] <-p5[which(p5$Name == results_plot[i,1]),7]
results_plot[i,12] <-p6[which(p6$Name == results_plot[i,1]),6]
results_plot[i,13] <-p6[which(p6$Name == results_plot[i,1]),7]
}
colnames(results_plot) <- c("Name","p1_Healthy","p1_Cancerous","p2_Healthy","p2_Cancerous","p3_Healthy","p3_Cancerous","p4_Healthy","p4_Cancerous","p5_Healthy","p5_Cancerous","p6_Healthy","p6_Cancerous")
results_plot_melt <- melt(results_plot)
## Using Name as id variables
results_plot
# Plotting the data
library(RColorBrewer)
nb.cols <- 18
mycolors <- colorRampPalette(brewer.pal(8, "Spectral"))(nb.cols)
colnames(results_plot_melt) <- c("Name","Samples","Percentages_of_Reads")
ggplot(results_plot_melt,aes(x = Samples,y = Percentages_of_Reads,fill=Name)) + geom_bar(stat="identity", show.legend = NA) + theme(legend.key.size = unit(0.5,"line"), axis.text.x = element_text(angle=90)) + scale_fill_manual(values = mycolors)
This plot was developed using the read percentages of the bacteria from Table 1 for each patient’s cancerous and healthy samples. The other bacteria shown in Figure 1 all had positive mean differences, showing increased abundance in carcinoma tissue. This set of bacteria mostly belonged to the genus Propionibacterium, all between 0.16 and 0.20%.
In the paper from which we obtained this data, Castellarin et al. concluded that Fusobacterium nucleatum is a highly abundant bacterium in patients with gastrointestinal cancer but not in the healthy tissue of the same patient. Though our analysis found that Fusobacterium nucleatum was more actively present in the patient’s cancerous tissue than their healthy tissue, it only increased by an average less than 0.01%. This is considerably less than the data pertaining to the Sphingomonas and Propionibacterium genuses, for example.
The expected timeline to start analysis was delayed due to the semester being interrupted with the global COVID-19 pandemic. After performing the initial tutorial protocol on one patient with both the cancerous and normal tumor tissue, all thirteen patients, with both healthy and cancerous sequencing samples, were to be tested against the protocol depending on the time constraints of the protocol itself and the remainder of the semester. Given the results of this initial tutorial, we expected to analyze the non-host portion of the sequenced data for bacterial expression like Fusobacterium, utilizing a microbial reference file to label any recognized non-host portions of the sequences. We expected to see elevated levels of expression for other bacteria and explore the impacts of these bacteria within the GI tract. However, running the PathSeq pipeline is incredibly resource intensive, which was greatly exposed by Rivanna, the university’s high performance computing facility. Attempts at repeating the same job requests as the initial pipeline success led to multiple occurrences of ‘timed-out’ jobs. Given the high amounts of memory allocations being requested for these jobs, the queuing time was also lengthy and became an unexpected impedance on productivity. Attempts were made to introduce multithreaded processing in order to improve the computational efficiency, and in the end, it proved largely successful. At the time of reporting, the pipeline and analysis was able to be completed for six subjects.
The PathSeq pipeline provides the results according to the microbe reference list, indicating the number of corresponding mapped reads found on the sample sequence. These mapped reads, when normalized for the size of the sequence strand, can be compared between healthy and cancer stands within patients. Utilizing RStudio, these calculations were performed and mean differences in microbe abundance were found. As mentioned, the gram-negative soil bacterium Sphingomonas echinoides provided the greatest difference between healthy and cancerous samples, with a substantial decrease in relative abundance when the pipeline is run on cancerous samples.
A limited amount of published research and literature on Sphingomonas echinoides makes for limited ability to assume or determine the reasoning as to why this specific bacteria shows high levels of differentiation between sequencing samples from a healthy GI tract and a cancerous GI tract. Elementary level educated assumptions lead the group to believe that the bacteria’s ideal environment for proliferation is found in a healthy GI tract, whereas environmental conditions in a cancerous GI tract make for the opposite. However, the results from this research do not indicate any reasoning, and it must be established that this is only a hypothesis.
As determined through the research, the microbiome environment in the GI tract is influenced by the presence of colorectal cancer. Knowledge of these differences, such as with Sphingomonas echinoides and the genus Propionibacterium, can provide valuable insights for clinicians in predicting or detecting cancer growth. Given the limited sample size presented in this study, which was decreased further given the current events and resource limitations, it is suggested that this exploration continue to further validate these results. Regardless, these results provide statistical significance, or at the least, valuable insights, for researchers and clinicians to engage with. Further, a more detailed understanding of the physiological properties of Sphingomonas echinoides could lead to valuable insights on the functions, implications, and impact of colorectal cancer in the gastrointestinal tract.
Castellarin, Mauro, René L. Warren, J. Douglas Freeman, Lisa Dreolini, Martin Krzywinski, Jaclyn Strauss, Rebecca Barnes, et al. 2012. “Fusobacterium Nucleatum Infection Is Prevalent in Human Colorectal Carcinoma.” Genome Research 22 (2): 299–306. https://doi.org/10.1101/gr.126516.111.
Malla, Muneer Ahmad, Anamika Dubey, Ashwani Kumar, Shweta Yadav, Abeer Hashem, and Elsayed Fathi Abd_Allah. 2019. “Exploring the Human Microbiome: The Potential Future Role of Next-Generation Sequencing in Disease Diagnosis and Treatment.” Frontiers in Immunology 9. https://doi.org/10.3389/fimmu.2018.02868.