--- title: "Package Ravages (RAre Variant Analysis and GEnetic Simulation), Simulations" author: "Herve Perdry and Ozvan Bocher" date: "`r Sys.Date()`" output: pdf_document vignette: > %\VignetteIndexEntry{"Ravages_Simulations"} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r message=FALSE, warning=FALSE, echo = FALSE} library("knitr") require("Ravages") ``` ## Introduction Ravages was developed to simulate genetic data and to perform rare variant association tests (burden tests and the variance-component test SKAT) on different types of phenotypes (Bocher et al., 2019, doi:10.1002/gepi.22210, Bocher et al., 2020, doi:10.1038/s41431-020-00792-8) at a genome-wide scale. Ravages relies on the package Gaston. Most functions are written in C++ thanks to the packages Rcpp, RcppParallel and RcppEigen. Functions of Ravages use bed.matrix to manipulate genetic data as in the package Gaston (see documentation of this package for more details). In this vignette, we illustrate how to perform genetic simulations and how to compute power of rare variant association tests. See the main vignette for more details about rare variant association tests. To learn more about all options of the functions, the reader is advised to look at the manual pages. We developed two main simulations procedures, one based on allelic frequencies and genetic relative risks (GRR), and the other one based on haplotypes. All the functions can be used to simulate more than two groups of individuals. ### Global parameters of Ravages Functions in Ravages are parallelised either using RcppParallel when functions are implemented in C++ or using parallel otherwise. When the argument *cores* is present in the function, it can be directly changed to fix the number of cores to use. Otherwise, the number of threads used by multithreaded functions can be modified through RcppParallel function **setThreadOptions()**. It is advised to try several values for the number of threads, as using too many threads might be counterproductive due to an important overhead. The default value set by RcppParallel is generally too high. Information on progression is provided if *verbose = T* for multiple functions in Ravages. ## Simulations based on allelic frequencies and GRR The main function to use to simulate data based on this method is **rbm.GRR()** which is detailed below. ### Calculation of frequencies in each group of individuals The first step to simulate genetic data is to compute genotypic frequencies in each group of individuals based on frequencies in the general population and on genetic relative risk (GRR) values. GRR correspond to the increased risk of a disease for a given genotype compared to a reference genotype, here the homozygous genotype for the reference allele. More precisely, the GRR associated to the heterozygous genotype in the group $c$ can be calculated as follow: $$GRR_{Aa} = \frac{P(Y=c | Aa)}{P(Y=c | AA)}$$ With Y the phenotype (1 for the controls, and c going from 2 to C for the subgroups of cases), $A$ the reference allele, and $a$ the alternate allele. The frequency of each genotype in each group of cases $c$ can be calculated using Bayes theorem: $$P(Aa | Y=c) = \frac{P(Y=c | Aa) P(Aa)}{\sum_{Geno=AA, Aa, aa} P(Y=c | Geno) P(Geno)} = \frac{GRR_{Aa} \times P(Aa)}{P(AA)+GRR_{Aa} \times P(Aa)+GRR_{aa} \times P(aa)}$$ P(AA), P(Aa) and P(aa) corresponding to the genotypic probabilities in the general population. The three genotypic frequencies can then be calculated in the controls group using the rule of total probability: $$P(Geno | Y=1)=P(Geno) - \sum{_{c=2}^{c=C} P(Geno | Y=c) P(Y=c)}$$ The function **genotypic.freq()** performs these calculations to obtain the three genotypic frequencies in the different groups of individuals. To do so, the user needs to give *P(Y=c)*, the prevalence of each group of cases (argument *prev*), and the GRR values. GRR values need to be in a matrix form with one row per cases group and one column per variant. If there is no supposed link between the GRR associated to the heterozygous genotype and the GRR associated to the homozygous alternative genotype (general model of the disease, $genetic.model="general"$), the user needs to specify two GRR matrices: one for $GRR_{Aa}$ (argument *GRR.het*) and one for $GRR_{aa}$ (argument *GRR.homo.alt*). If *genetic.model="recessive"*, *"multiplicative"* or *"dominant"*, only one GRR matrix is needed (*GRR.het*). The function **genotypic.freq()** will return a list with three matrices, one for each genotype containing the genotypic frequencies, with one row per group of individuals and one column per variant. To help the user with the construction of the GRR matrix for **genotypic.freq()**, we implemented the function **GRR.matrix()**.To use this function, the user needs to specify how the GRR should be calculated (argument *GRR*): * the same GRR is given to all the variants (*GRR="constant"*), its value being specified to *GRR.value*; * the GRR is computed using the same formula as in SKAT: $w=-0.4*|log_{10}(MAF)|$ (*GRR="SKAT"*); * the GRR is computed using another function depending on MAF in the general population (*GRR="variable"*), this function being specified to *GRR.function*. In the last two situations, a file containing the MAF in the general population with at least a column "maf" and a column "gene" should be given to the argument *genes.maf*. Two such files are available in Ravages: the file *Kryukov* containing MAF simulated under the demographic model of Kryukov and the file *GnomADgenes* containing MAF from the NFE population in GnomAD. As these files contain MAF for multiple genes, the user needs to specify which gene to choose to simulate the data with the argument *select.gene*. If this argument is empty, only the first gene will be kept in the simulation procedure. Finally, the multiplicative factor of the GRR between each group of cases compared to the first group of cases needs to be specified to the argument *GRR.multiplicative.factor* (number of values = number of cases groups - 1). The function **GRR.matrix()** will return a GRR matrix in the appropriate format for the function **genotypic.freq()**. Examples of these two functions are shown below: ```{r} # GRR calculated using the same formula as in SKAT, # with values in the second group of cases being twice the values from the first one GRR.del <- GRR.matrix(GRR = "SKAT", genes.maf = Kryukov, n.case.groups = 2, GRR.multiplicative.factor=2, select.gene = "R1") GRR.del[,1:5] # Calculation of genotype frequencies in the two groups of cases and the controls group # The previous GRR matrix is used with a multiplicative model of the disease # The prevalence in each group of cases is 0.001 geno.freq.groups <- genotypic.freq(genes.maf = Kryukov, select.gene="R1", GRR.het = GRR.del, prev = c(0.001, 0.001), genetic.model = "multiplicative") str(geno.freq.groups) # frequencies of the homozygous alternative genotype in the different groups geno.freq.groups$freq.homo.alt[,1:5] ``` It is also possible to calculate the MAF in each group of individuals as follow: ```{r} #MAF calculation for the first five variants geno.freq.groups$freq.homo.alt[,1:5] + 0.5*geno.freq.groups$freq.het[,1:5] ``` ### Simulation of genotypes In addition to compute the genotypic frequencies in each group of individuals, it is possible to directly simulate genotypes for a group of controls and more than two groups of cases. This can be done using the function **rbm.GRR()** which relies on the function **genotypic.freq()** explained previously. The arguments *genes.maf*, *select.gene*, *prev* and *genetic.model* are the same as in **genotypic.freq()**. In **rbm.GRR()**, the parameter *max.maf.causal* corresponds to the maximum maf used to define a rare variants, i.e. among which causal variants will be sampled. The proportion of causal variants (among those rare variants) and the proportion of protective variants (among causal variants) simulated in the genomic region should be specified to *p.causal* and *p.protect* respectively. The argument *GRR.matrix.del* should contain a matrix with GRR values as if all the variants were deleterious. This matrix can be obtained using the function **GRR.matrix()** previously explained. If *genetic.model="general"*, two GRR matrices need to be given as a list to the argument *GRR.matrix.del* (one for the heterozygous genotype and the other for the homozygous alternative genotype). If the user wants to simulate protective variants in addition to deleterious variants, a similar argument to *GRR.matrix* should be given to *GRR.matrix.pro* with GRR values as if all variants were protective. If the argument *GRR.matrix.pro* is empty and *p.protect*>0, *GRR.matrix.pro* will correspond to 1/*GRR.matrix*. These deleterious and protective GRR values will then be assigned to the sampled deleterious and protective variants in the simulations, the non-causal variants having GRR values of 1. The size of the different groups of individuals should be a vector specified to *size*, and the user has to choose whether the causal variants will be the same between the different groups of cases with the argument *same.variant*. Finally, the number of simulations need to be specified via *replicates*. The function **rbm.GRR()** will return a bed matrix with the group of each individual in the field *@ped$pheno*, the first one being considered by default as the controls group, and the replicate number corresponding to the genomic region in the field *@snps$genomic.region*. The example below shows how to simulate a group of 1 000 controls and two groups of 500 cases with 50% of deleterious variants having GRR values from the previous example. The deleterious variants are different between the two groups of cases and 5 genomic regions are simulated. ```{r} x <- rbm.GRR(genes.maf = Kryukov, size = c(1000, 500, 500), replicates = 5, prev = c(0.001, 0.001), GRR.matrix.del = GRR.del, p.causal = 0.5, p.protect = 0, same.variant = FALSE, genetic.model = "multiplicative", select.gene = "R1") x table(x@ped$pheno) table(x@snps$genomic.region) ``` ## Simulations based on haplotypes **Ravages** also offers the possibility to perform genetic simulations based on real haplotypes data to mimic linkage desequlibrium pattern observed in these data. Two options are available to do so: genetic data are simulated using either haplotypes and their frequencies in different groups of individuals (function **rbm.haplos.freqs()**), or using haplotypes and a liability model on their burdens (function **rbm.haplos.thresholds()**). For the first situation, **rbm.haplos.freqs()** requires a matrix of haplotypes (argument *haplos* with one row per haplotype and one column per variant), and a matrix with each haplotype frequency in each simulated group of individuals (argument *freqs* with one row per haplotype and one column per group of individuals). Observed haplotypes are then sampled in each group of individuals in respect to their frequencies in each one of them. For the second situation, **rbm.haplos.thresholds()** requires a matrix of haplotypes (argument *haplos* as well), and a number of other arguments to draw the liability model. The idea of these simulations is to draw a liability distribution of the burden computed on causal variants for each haplotype, to compute the probability of each haplotype in each group of individuals based on this distribution, and finally to sample haplotypes in each group based on those probabilities. The parameter *weights* indicates how to compute the weights into burdens computation: either all variants have the same weights (*weights="constant"*), or higher weights are given to rare variants using the same formula as in SKAT : $w=-0.4*|log_{10}(MAF)|$). The parameter *max.maf.causal* corresponds to the maf threshold used to define a rare variants, i.e. all monomorphic variants and variants with a MAF upper this threshold will have a weight of 0. In addition, the proportion of causal variants among those variants needs to be given to *p.causal*, and the proportion of protective variants among causal variants needs to be given to *p.protect*. Causal variants will be sampled among variants with a positive weight, and haplotypes' burden will be computed on these variants. The parameter *h2* corresponds to the phenotypic variance explained by the gene: haplotypes burdens will be adjusted on this value; and *prev* corresponds to the prevalence of each group of individuals and will be used to compute the liability threshold for each group of individuals (liabilities of individuals in group $i$ will fall between the quantile 1-$prev_{i}$ and +Inf). This quantile is computed using a standard normal distribution if *normal.approx*=TRUE (applicable for low *h2* value and pretty large *prev* values), otherwise, quantiles will be computed from a sample of $10^6$ observed burdens. If one wants to simulate a group of unselected controls, *prev* needs to be set at 1, regardless of the other arguments: haplotypes are sampled uniformly in the whole liability distribution. The arguments *p.causal*, *p.protect*, *h2* and *prev* are vectors of same length as the number of groups. If they are of size 1, their value will be repeated. As for the previous functions, the number of simulations to perform should be given to *replicates*. As computing haplotypes' probabilities for a set of causal variants can be lengthy, it is possible to run several multiple replicates on a given set. Indeed, even if the same causal variants are taken, different haplotypes (with same probabilities) will be sampled in each individual for each replicate. To ensure a minimal variability, we yet recommend to resample the set of causal variants several times. For example, for 1000 replicates, we recommend to resample causal variants 20 times, and therefore to perform 50 replicates by set of causal variants. The number of replicates to perform by set of causal variants should be given to *rep.by.causal*. As for the genetic simulations based on GRR, the sizes of each group should be given to *size*, and the number of simulations to *replicates*. As before, the replicate number will be in the slot *@snps$genomic.region* and the group of individuals in *@ped$pheno*, both being factors. If **rbm.haplos.freqs()** is used, phenotype values will correspond to the colnames of *freqs*. Two examples these functions are presented below to illustrate these functions. The first one simulates 5 groups of individuals, with haplotype frequencies from the five european populations for the LCT gene. The second one simulates one group of 100 controls and two groups of 50 cases based on the liability model with different genetic effects. ```{r} #Load LCT dataset for haplotype matrix data(LCT.haplotypes) #Simulation based on haplotypes frequencies #for the variants in the LCT gene in the EUR population LCT.gene.hap <- LCT.hap[which(LCT.sample$super.population=="EUR"), which(LCT.snps$pos>=136545410 & LCT.snps$pos<=136594750)] #Individuals from EUR LCT.sample.EUR <- subset(LCT.sample, super.population=="EUR") #Matrix of haplotype frequencies LCT.freqs <- sapply(unique(LCT.sample.EUR$population), function(z) ifelse(LCT.sample.EUR$population==z, 1/table(LCT.sample.EUR$population)[z], 0)) #Simulation of genetic data for five groups of 50 individuals x <- rbm.haplos.freqs(haplos=LCT.gene.hap, freqs=LCT.freqs, size=rep(50,5), replicates=5) #Simulation of 100 controls, and two groups of 50 cases with 30 causal variants #and with the second group having half h2 and twice the prevalence #compared to the first one #5 replicates are performed and causal variants are sampled once x <- rbm.haplos.thresholds(haplos=LCT.gene.hap, p.causal = 0.3, h2=c(0.01,0.01,0.02), prev=c(1,0.01,0.005), size=c(100, 50, 50), replicates=5, rep.by.causal = 5) ``` ## Power calculation Power of the RVAT (burden tests and SKAT) can be directly computed using **Ravages**. To do so, the functions *rbm.GRR.power()* and *rbm.haplos.power()* which rely on simulations based on GRR and on haplotypes respectively can be used. Those two functions use the same arguments as the simulation functions previously presented. Power calculations are available for *CAST*, *WSS* and *SKAT*. In addition, using *rbm.GRR.power()*, theoretical power can be rapidly computed for CAST by using **power.type="theoretical"** and **RVAT="CAST"**. To compute power of RVAT, only rare variants define by the argument **maf.filter** will be kept after the simulations to mimic real data analysis where only rare variants are considered. By default, **maf.filter** correspond to **max.maf.causal**, the MAF threshold used to pick causal variants. To disable this filter, **maf.filter** can be set at 0.5. Below is an example of CAST power based on theoretical calculations or simulations with haplotypes. ```{r} #CAST theoretical power GRR.del <- GRR.matrix(GRR = "SKAT", genes.maf = Kryukov, n.case.groups = 2, GRR.multiplicative.factor=2, select.gene = "R1") rbm.GRR.power(genes.maf = Kryukov, size = c(1000, 500, 500), prev = c(0.001, 0.001), GRR.matrix.del = GRR.del, p.causal = 0.5, p.protect = 0, select.gene="R1", same.variant = FALSE, genetic.model = "multiplicative", power.type="theoretical") #CAST power based on simulations on haplotypes (20 replicates) rbm.haplos.power(haplos=LCT.gene.hap, freqs=LCT.freqs, size=rep(50,5), replicates=20, rep.by.causal = 10, RVAT = "CAST", cores = 1) ```