基本上翻译自教程：https://choishingwan.github.io/PRS-Tutorial/plink/

背景

使用流行的遗传分析工具plink计算PRS-尽管plink不是专用的PRS软件，但是可以使用plink执行使用C + T标准方法计算PRS所需的每个步骤，并且可以执行此多步骤过程,是学习计算PRS的过程的好方法（通常由PRS软件自动执行）。

需要的数据

来自前面质控步骤的数据。见笔记1，2。

Update Effect Size 更新效应值？

当效应大小与疾病风险相关，因此以OR值而不是BETA（对于连续性状）给出时，则PRS计算为OR的乘积。为了简化此计算，我们通常采用OR的自然对数，以便可以使用简单的求和来计算PRS（此后可以进行反变换）。我们可以使用R获得转换后的摘要统计量：

#于以awk执行的值四舍五入，可能会获得不太准确的结果。 因此，我们建议在R中执行转换，或者允许PRS软件直接执行转换。
install.packages('R.utils') #运行报错，需要安装这个包。
library(data.table)
dat <- fread("Height.QC.gz")
fwrite(dat[,OR:=log(OR)], "Height.QC.Transformed", sep="\t")

Clumping 聚集

连锁不平衡对应于整个基因组的遗传变异的基因型之间的相关性，这使得鉴定因果独立遗传变异的贡献变得极具挑战性。近似捕获正确水平的因果信号的一种方法是执行成簇，仅保留弱相关的SNP，但优先保留与研究表型最相关的SNP。可以在plink中使用以下命令执行聚集：

plink \
    --bfile EUR.QC \
    --clump-p1 1 \ #SNP的P值阈值将被包含为索引SNP。 选择1以使所有SNP都包括在内
    --clump-r2 0.1 \ #去除r2>0.1的
    --clump-kb 250 \ #250k内的SNP被视为聚集
    --clump Height.QC.transformed \ #包含P值信息的基础数据（摘要统计）文件
    --clump-snp-field SNP \  #指定列SNP包含SNP ID
    --clump-field P \ #指定列P包含P值信息
    --out EUR

--clump计算的r2值基于最大似然单倍型频率估计

这将生成EUR.clumped，其中包含执行聚集后的索引SNP。我们可以通过执行以下命令来提取索引SNP ID：
awk 'NR!=1{print $3}' EUR.clumped > EUR.valid.snp #$ 3，因为第三列包含SNP ID
如果目标数据很小（例如N <500），则可以使用1000个基因组计划样本进行LD计算。确保使用最能反映基本样本的总体。

Generate PRS

plink提供了方便的功能--score和--q-score-range用于计算多基因得分。

我们需要的3个文件：
- 1. 基础数据 Height.QC.Transformed
- 2.包含SNP ID及其对应的P值的文件（$ 1，因为SNP ID位于第一列； $ 8，因为P值位于第八列）
awk '{print $1,$8}' Height.QC.Transformed > SNP.pvalue
- 3.包含用于在PRS中包含SNP的不同P值阈值的文件。出于说明目的，此处计算与一些阈值相对应的PRS：

#echo 阈值名称 下限 上限 
#阈值边界是包括在内的。 例如，对于0.05阈值，我们包括P值从0到0.05的所有SNP，包括P值等于0.05的任何SNP。
echo "0.001 0 0.001" > range_list
echo "0.05 0 0.05" >> range_list
echo "0.1 0 0.1" >> range_list
echo "0.2 0 0.2" >> range_list
echo "0.3 0 0.3" >> range_list
echo "0.4 0 0.4" >> range_list
echo "0.5 0 0.5" >> range_list
#计算命令：
plink \
    --bfile EUR.QC \
    --score Height.QC.Transformed 1 4 11 header \
    --q-score-range range_list SNP.pvalue \
    --extract EUR.valid.snp \
    --out EUR

遇到报错，Error: Line 606351 of --score file has fewer tokens than expected.大概前面哪个步骤忘记做了导致的，我手动vi删除几十次解决的。。。

将会产生7个文件：
- EUR.0.5.profile
- EUR.0.4.profile
- EUR.0.3.profile
- EUR.0.2.profile
- EUR.0.1.profile
- EUR.0.05.profile
- EUR.0.001.profile

PRS计算公式：

Si代表SNP；Gi,j代表j样本中样品中观察到的等位基因数目；P是样本为几倍体，人一般为2；N为PRS中包含的样本数量；Mj样本j中观察到的非缺失SNP的数量;如果样品缺少SNP i的基因型，则人群次要等位基因频率乘以倍性代替Gi,j。

人口分层统计

人口结构是GWAS中混淆的主要来源，通常通过将主要成分（PC）作为协变量纳入来解决。我们可以将PC纳入我们的PRS分析中，以解决人口分层问题。

# First, we need to perform prunning
plink \
    --bfile EUR.QC \
    --indep-pairwise 200 50 0.25 \
    --out EUR
# Then we calculate the first 6 PCs
plink \
    --bfile EUR.QC \
    --extract EUR.prune.in \
    --pca 6 \
    --out EUR

选择适当数量的PC的一种方法是，使用不同数量的PC对所研究的表型执行GWAS。然后，可以对GWAS摘要统计数据集执行LDSC分析，并且使用了使LDSC截距最接近1的PC数量的GWAS应该对应于对其种群结构进行了最精确控制的PCAS。

这里的PC已存储在EUR.eigenvec文件中，可用作回归模型中的协变量以说明总体分层。

如果基础样品和目标样品是从不同的全球人群中收集的，则PRS分析的结果可能会产生偏差（请参见本文的第3.4节）。

Finding the "best-fit" PRS

p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)
# Read in the phenotype file 
phenotype <- read.table("EUR.height", header=T)
# Read in the PCs
pcs <- read.table("EUR.eigenvec", header=F)
# The default output from plink does not include a header
# To make things simple, we will add the appropriate headers
# (1:6 because there are 6 PCs)
colnames(pcs) <- c("FID", "IID", paste0("PC",1:6)) 
# Read in the covariates (here, it is sex)
covariate <- read.table("EUR.covariate", header=T)
# Now merge the files
pheno <- merge(merge(phenotype, covariate, by=c("FID", "IID")), pcs, by=c("FID","IID"))
# We can then calculate the null model (model with PRS) using a linear regression 
# (as height is quantitative)
null.model <- lm(Height~., data=pheno[,!colnames(pheno)%in%c("FID","IID")])
# And the R2 of the null model is 
null.r2 <- summary(null.model)$r.squared
prs.result <- NULL
for(i in p.threshold){
    # Go through each p-value threshold
    prs <- read.table(paste0("EUR.",i,".profile"), header=T)
    # Merge the prs with the phenotype matrix
    # We only want the FID, IID and PRS from the PRS file, therefore we only select the 
    # relevant columns
    pheno.prs <- merge(pheno, prs[,c("FID","IID", "SCORE")], by=c("FID", "IID"))
    # Now perform a linear regression on Height with PRS and the covariates
    # ignoring the FID and IID from our model
    model <- lm(Height~., data=pheno.prs[,!colnames(pheno.prs)%in%c("FID","IID")])
    # model R2 is obtained as 
    model.r2 <- summary(model)$r.squared
    # R2 of PRS is simply calculated as the model R2 minus the null R2
    prs.r2 <- model.r2-null.r2
    # We can also obtain the coeffcient and p-value of association of PRS as follow
    prs.coef <- summary(model)$coeff["SCORE",]
    prs.beta <- as.numeric(prs.coef[1])
    prs.se <- as.numeric(prs.coef[2])
    prs.p <- as.numeric(prs.coef[4])
    # We can then store the results
    prs.result <- rbind(prs.result, data.frame(Threshold=i, R2=prs.r2, P=prs.p, BETA=prs.beta,SE=prs.se))
}
# Best result is:
prs.result[which.max(prs.result$R2),]

结果：

Threshold	R2	P	BETA	SE
0.2	0.04112693	2.696232e-09	44076.31	7270.299

一	二	三	四	五	六	日
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

文章

[翻译]PRS多基因评分学习笔记3-plink

背景

需要的数据

Update Effect Size 更新效应值？

Clumping 聚集

Generate PRS

人口分层统计

Finding the "best-fit" PRS

关于 “[翻译]PRS多基因评分学习笔记3-plink” 的 2 个意见

发表回复取消回复

背景

需要的数据

Update Effect Size 更新效应值？

Clumping 聚集

Generate PRS

人口分层统计

Finding the "best-fit" PRS

关于 “[翻译]PRS多基因评分学习笔记3-plink” 的 2 个意见

发表回复 取消回复

发表回复取消回复