WGCNA分析专栏3-表型关联及可视化与重要基因识别


经过前面两个专栏的讲解,我们从数据的预处理,到数据模块构建,洋洋洒洒拆解了大概1万字左右的教程,在这个过程中,我自己也学到了不少的东西,感谢好友们的盛情让我学习到很多东西!今天争取把这个教程后面剩下的内容结了。


1. 模块与表型值的关联

老样子,先导入我们前面构建好的数据

rm(list = ls())
# Display the current working directory
getwd();
# If necessary, change the path below to the directory where the data files are stored. 
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = ".";
setwd(workingDir); 
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Load the expression and trait data saved in the first part
lnames = load(file = "../result/1.Deodunum-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "../result/6.QinchuanDeodunum-02-networkConstruction-auto.RData");
lnames

简单提一句,**.RData**是我们保存R数据的一种比较方便的方法,它不同于保存于csv、TXT、xlsx等格式文件,并且还可以同时保存多个数据,不懂可以翻看前两个专栏的教程,大概是下面这样的:

save(data1,.....,datn,file="filename.Rdata")

1.1 量化模块与特征关联

该部分,我们想确定与所测量的表型性状有显著关联的模块,从而确定这个模块内高度相似共表达的基因参与什么生物学问题,并在这个模块内确定处于主导作用的基因。 由于我们已经构建了每个模块信息(eigengene),我们只需将eigengene与外部性状相关联,并寻找最显著的关联。

# Define numbers of genes and samples
nGenes = ncol(datExpr); # 输出参与计算的基因数量
nSamples = nrow(datExpr); # 输出参与计算的样本数量
# Recalculate MEs with color labels
MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs = orderMEs(MEs0)
moduleTraitCor = cor(MEs, datTraits, use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);

这里,截几个图,我就不用解释结果了:

image-20210718192050044

image-20210718192214720

module和样本之间的关联(基于样本基因表达)

image-20210718192234653

module和性状的关联

image-20210718192323897

module与性状相关性显著性检验。在这里,我们可以看出month与turquoise模块之间的r=0.97(p=0.001),其他结果类似。一会儿咱们可视化后会更加清晰,其他的就不仔细看了。

image-20210718192600330

话不多说,直接可视化:

sizeGrWindow(16,9)
# Will display correlations and their p-values
textMatrix =  paste(signif(moduleTraitCor, 2), "\n(",
                           signif(moduleTraitPvalue, 1), ")", sep = "");
# 将module与trait的相关性和显著性构建成一个字符串
dim(textMatrix) = dim(moduleTraitCor)
# Display the correlation values within a heatmap plot
tiff(filename = "../result/16.tiff",
     width = 8, 
     height = 8,
     units = 'in',
     res = 600,
     pointsize = 5)
par(mar = c(4, 15, 3, 1));
labeledHeatmap(Matrix = moduleTraitCor,
               xLabels = names(datTraits),
               yLabels = names(MEs),
               ySymbols = names(MEs),
               colorLabels = FALSE,
               colors = blueWhiteRed(50),
               textMatrix = textMatrix,
               setStdMargins = FALSE,
               cex.text = 0.7,
               zlim = c(-1,1),
               main = paste("Module-trait relationships"))
dev.off()
?par

可视化结果如下:

image-20210718193732048

1.2 基因与性状及重要模块间的关系:基因显著性和模块内基因关系

这句话翻译感觉有点鸟语,英语差的就看我的翻译,还行的就自己翻译去(Gene relationship to trait and important modules: Gene Significance and Module Membership)

这这里,我先给大家解释两个术语,基因显著性(Gene significance,GS)与成员之间相关性(module membership,MM)。

GS: associations of individual genes with our trait of interest (month) by defining Gene Significance GS as (the absolute value of) the correlation between the gene and the trait. 英语差的就看我的翻译,还行的就自己翻译去。即感兴趣性状与单个基因之间相关性的绝对值定义为GS,是否通俗明了?

MM: For each module, we also define a quantitative measure of module membership MM as the correlation of the module eigengene and the gene expression profile. 即基因表达与module主成分之间的相关性

说了这么多,不如上代码来得快:

# Define variable weight containing the weight column of datTrait
InterestedModule = datTraits$Month # 选择自己感兴趣的表型,这里我们选择了Month
InterestedModule = as.data.frame(InterestedModule);
names(InterestedModule) = "InterestedModule"
# names (colors) of the modules
modNames = substring(names(MEs), 3) #返回MEs第三个字符往后的字符串

geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));

names(geneModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");

geneTraitSignificance = as.data.frame(cor(datExpr, InterestedModule, use = "p"));
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));

names(geneTraitSignificance) = paste("GS.", names(InterestedModule), sep="");
names(GSPvalue) = paste("p.GS.", names(InterestedModule), sep="");

通过上述代码,我们构建出包含MM的数据框,如下:

image-20210718200223275

1.3 模块内部进一步分析:依据高GS和MM挖掘重要基因

这里可以挖掘到一些重要的基因,具体后面说。这里我们看到这组数据里面,与Month呈正强相关的模块是tarquoise模块,那么我们直接对其进行可视化:

module = 'turquoise'
column = match(module, modNames);
moduleGenes = moduleColors==module;

sizeGrWindow(7, 7);
tiff(filename = "../result/Fig.4E(magenta).tiff",
     width = 6, 
     height = 6,
     units = 'in',
     res = 600)
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
                   abs(geneTraitSignificance[moduleGenes, 1]),
                   xlab = paste("Module Membership in", module, "module"),
                   ylab = "Gene significance for body weight",
                   main = paste("Module membership vs. gene significance\n"),
                   cex.main = 1.2, 
                   cex.lab = 1.2, 
                   cex.axis = 1.2, 
                   col = "magenta",
                   abline = TRUE,
                   lmFnc = lm)
garbage <- dev.off() # 这一行我记得当初总是报错,最后由dev.off()换成这样的代码就好了

结果如图:

image-20210718201101972

从图片我们可以看出,GS和MM是高度相关的,呈线性关系。

1.4 输出结果

还记得我们数据准备的时候做了一个ID的转换吗,现在用的着了,直接读取。

y=read.xlsx("../data/final_expdat.xlsx")
names(y)
annot = y[,c(1,8)]
dim(annot)
names(annot)

annot的数据如下:

image-20210718205207402

然后进行整合并输出结果:

geneInfo0 = data.frame(gene_id = probes,
                      geneSymbol = annot$SYMBOL[probes2annot],
                      LocusLinkID = annot$SYMBOL[probes2annot],
                      moduleColor = moduleColors,
                      geneTraitSignificance,
                      GSPvalue)
# Order modules by their significance for RFI
modOrder = order(-abs(cor(MEs, InterestedModule, use = "p")));
# Add module membership information in the chosen order
for (mod in 1:ncol(geneModuleMembership))
{
  oldNames = names(geneInfo0)
  geneInfo0 = data.frame(geneInfo0, geneModuleMembership[, modOrder[mod]], 
                         MMPvalue[, modOrder[mod]]);
  names(geneInfo0) = c(oldNames, paste("MM.", modNames[modOrder[mod]], sep=""),
                       paste("p.MM.", modNames[modOrder[mod]], sep=""))
}
# Order the genes in the geneInfo variable first by module color, then by geneTraitSignificance
geneOrder = order(geneInfo0$moduleColor, -abs(geneInfo0$GS.InterestedModule));
geneInfo = geneInfo0[geneOrder, ]

head(geneInfo)

image-20210718210143265

上面图片第5第6列是我们感兴趣的模块,在上面1.2节中有指定是month

然后我会自己习惯性的保存点其他东西:

write.xlsx(moduleTraitCor,file ="../result/17.moduleTraitCor.xlsx" )
write.xlsx(moduleTraitPvalue,file ="../result/17.moduleTraitPvalue.xlsx" )

2. 重要基因识别

直接上代码

library(openxlsx)
library(tidyverse)
rm(list = ls())
geneInfo=read.xlsx("result/17.geneInfo_Month.xlsx")
y=as.data.frame(table(geneInfo$moduleColor)) #统计每个模块中的基因数量
core_gene_pink=geneInfo %>% 
  filter(abs(geneInfo$GS.InterestedModule)>=0.4 & 
           abs(geneInfo$MM.red) >= 0.9 & moduleColor == "red")  %>% 
  select(1:6,MM.red,p.MM.red) #筛选关键基因

head(core_gene_pink)
write.xlsx(core_gene_pink,file = "result/27.core_gene(month).xlsx")

当然,还可以根据连接度来筛选关键基因,这里需要用到模块发构建拓扑结构时产生的Tom矩阵,具体怎么构建,见前述WGCNA分析专栏2-网络构建与模块识别代码如下:

lnames = load(file = "result/1.Deodunum-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "result/4.QinchuanDeodunumTOM-block.1.RData");
lnames

HubGenes <- as.data.frame(chooseTopHubInEachModule(datExpr,moduleColors))
colnames(HubGenes)[1]="Hubba"
HubGenes=data.frame(Module=as.character(rownames(HubGenes)),HubGenes=HubGenes$Hubba)
names(HubGenes)
annot=read.csv("../data/annot.csv")
names(annot)

HubGenes=merge(HubGenes,annot,by.x = 'HubGenes',by.y="ID")
write.xlsx(HubGenes,file = "../result/28.HubGenes_of_each_module.xlsx"

最后,画一张GS图

########终极版GS柱状图
rm(list = ls())
library(openxlsx)
library(tidyverse)
geneInfo = read.xlsx("result/17.geneInfo_SCC.xlsx")
##############
GeneSignificance=abs(geneInfo$GS.InterestedModule)
# Next module significance is defined as average gene significance.
tiff(filename = "result/GeneSignificance(SCC).tiff",
     width = 8, 
     height = 4,
     units = 'in',
     res = 600)
par(mfrow = c(1,1))
plotModuleSignificance(GeneSignificance,
                       geneInfo$moduleColor,
                       main = "Gene significance across modules in SCC\n"
                       )
dev.off()
?plotModuleSignificance

image-20210807092648123

至此,WGCNA的分析的工作完成,进一步的分析就仁者见仁了。


文章作者: chaoyuny
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 chaoyuny !
  目录