GenomicSEM: A Novel Method for Modeling Multivariate Genetic

93 Slides8.27 MB

GenomicSEM: A Novel Method for Modeling Multivariate Genetic Architecture Presented by: Andrew D. Grotzinger Paper: Grotzinger, A. D., Rhemtulla, M., de Vlaming, R., Ritchie, S. J., Mallard, T. T., Hill, W. D, Ip, H. F., McIntosh, A. M., Deary, I. J., Koellinger, P. D., Harden, K. P., Nivard, M. G., & Tucker-Drob, E. M. (in press). Genomic SEM provides insights into the multivariate genetic architecture of complex traits. Nature Human Behaviour. Link to paper: rdcu.be/bvn7t

Depression Schizophrenia Traits are highly polygenic, so not simply a matter of identifying 5 overlapping genes

Estimates genetic correlations between samples with varying degrees of sample overlap using publicly available data

Test Statistic To estimate SNP Heritability: Regress GWAS test statistic against LD Scores for all SNPs (not just significant ones) To estimate Genetic Correlation: Regress product of GWAS test statistics for two different phenotypes against LD Scores

Pervasive (Statistical) Pleiotropy Necessitates Methods for Analyzing Joint Genetic Architecture

Background Genome-wide methods are clearly suggestive of both high polygenicity and pervasive pleiotropy Genetic correlations as data to be modeled, not simply results by themselves What data-generating process gave rise to the correlations?

Nivard He says hoi Tucker-Drob Genomic SEM

Our solution: GenomicSEM Apply structural equation model to estimated genetic covariance matrices Moves past family-based methods by allowing user to examine traits that could not be measured in the same sample Genomic SEM provides flexible framework for estimating limitless number of structural equation models using multivariate genetic data from GWAS summary statistics Can be applied to sum stats with varying and unknown degrees of overlap

Short Primer on Structural Equation Modeling (SEM)

Imagine we knew the generating causal process 1 x .40 y .40 x uy y 1 uy .84 x (0,1) , uy (0,.84)

Imagine we knew the generating causal process 1 x .40 y 1 .84 uy .60 z 1 uz y .40 x uy x (0,1) , uy (0,.84) z .60 y uz uz (0,.64) .64

Imagine we knew the generating causal process 1 x .40 y 1 Implied covariance matrix in the population .84 uy 1.00 .60 cov(x,y,z)pop z 1 uz y .40 x uy x (0,1) , uy (0,.84) z .60 y uz uz (0,.64) .64 .40 1.00 .24 .60 1.00

In practice, we only observe the sample data, and we propose a model covariance matrix in population observed covariance matrix in a sample .94 1.00 .33 1.02 .27 .62 1.02 .40 1.00 .24 .60 1.00

For the proposed model, estimate parameters from the data, and evaluate model fit to the data .94 σ2 x x bxy cov(x,y,z)sample y 1 .33 1.02 σ2uy uy .27 byz z 1 uz σ2uz 6 unique elements in the covariance matrix being modeled 5 free model parameters 1 degree of freedom (df) .62 1.02

For the proposed model, estimate parameters from the data, and evaluate model fit to the data .94 .94 x .35 cov(x,y,z)sample y 1 .33 1.02 .90 uy .27 .62 1.02 .61 z 1 uz .94 .63 cov(x,y,z)implied .33 1.03 .20 .63 1.00

The model that we fit may include some variables for which we do not observe data σ2F ( 1 for scaling) F is unobserved. Parameters are estimated from, and fit is evaluated relative to, the sample covariance matrix for y1-yk. F λ1 λ2 λ3 λ4 λ5 y1 y2 y3 y4 y5 uy1 uy2 uy3 uy4 uy5 σ2u1 σ2u2 σ2u3 σ2u4 σ2u5

The model that we fit may include some variables for which we do not observe data σ2F1 b12 F1 ( λ11 for 1 ing l a sc λ12 F2 ) li sca λ13 λ14 for 1 λ21( λ22 λ15 ng 1 uF1 σ2uF1 ) λ23 λ24 λ25 y1 y2 y3 y4 y5 z1 z2 z3 z4 z5 uy1 uy2 uy3 uy4 uy5 uz1 uz2 uz3 uz4 uz5 σ2uy4 σ2uy5 σ2uz4 σ2uz5 σ2uy1 σ2uy2 σ2uy3 σ2uz1 σ2uz2 σ2uz3

Genomic SEM uses these principles to fit structural equation models to genetic covariance matrices derived from GWAS summary statistics using 2 Stage Estimation Stage 1: Estimate Genetic Covariance Matrix and associated matrix of standard errors and their codependencies We use LD Score Regression, but any method for estimating this matrix (e.g. GREML) and its sampling distribution can be used Stage 2: Fit a Structural Equation Model to the Matrices from Stage 1

Fitting Structural Equation Models to GWAS-Derived Genetic Covariance Matrices

Start with GWAS Summary Statistics for the Phenotypes of Interest No need for raw data No need to conduct a primary GWAS yourself: Download them online! sumstats for over 3700 phenotypes have been helpfully indexed at http://atlas.ctglab.nl/ sumstats for over 4000 UK Biobank phenotypes are downloadable at http://www.nealelab.is/uk-biobank

Stage 1 Estimation: Multivariable LDSC Create a genetic covariance matrix, S: an “atlas of genetic correlations” Diagonal elements are (heritabilities) Off-diagonal elements are coheritabilities

Stage 1 Estimation: Multivariable LDSC Also produced is a second matrix, V, of squared standard errors and the dependencies between estimation errors Diagonal elements are squared standard errors of genetic variances and covariances Off-diagonal elements are dependencies between estimation errors used to directly model dependencies that occur due to sample overlap from contributing GWASs

Stage 2 Estimation: Specify the SEM Example: Genetic multiple regression EAg b1 SCZg b2 BIPg u SCZ BIP .11 .30 EA ρBIP,SCZ S .67 SCZg b1 b2 EAg 1 BIPg (df 0, model parameters are a simply a transformation of the matrix) uEA

RESULTS EAg -.163 SCZg .406 BIPg u

Example 2: Model Comparisons for Neuroticism 12 neuroticism traits from round 1 of Neale Lab UKB GWAS Goal use model fit indices to compare common factor, two-factor, and three-factor model

Model 1: Common Factor Model .2 7 ( .0 1 ) .2 6 ( .0 1 ) .2 9 ( .0 1 ) M ood M is e ry g g 1 u .2 2 ( .0 1 ) N e rv o u sg W o rry u Irr u Feel u F ed -up .0 3 ( .0 1 ) .0 5 ( .0 1 ) .0 4 (.0 1 ) .0 4 ( .0 1 ) u u Em bg 1 .2 4 ( .0 1 ) L o n e ly N erv e sg 1 uEm .0 5 ( .0 1 ) .0 5 ( .0 1 ) b G u iltg g 1 1 1 u Tense W o rry .0 5 ( .0 1 ) 0 5 (.0 1 ) b Tenseg g 1 N e rv o u s .2 0 ( .0 1 ) .2 4 ( .0 1 ) .2 8 ( .0 1 ) 1 M is M ood .0 3 ( .0 1 ) F e d -u p G 1 1 .2 8 (.0 1 ) .2 6 ( .0 1 ) .2 7 ( .0 1 ) F e e lG Irrg 1 u .2 5 ( .0 1 ) u uN erves .0 3 ( .0 1 ) u L o n e ly .0 4 ( .0 1 ) G u ilt .0 3 ( .0 1 ) S ta n d a r d iz ed 1 N .8 5 ( .0 3 ) .7 5 ( .0 4 ) .8 7 ( .0 2 ) M ood g 1 uM M is e ry g ood .2 5 (.0 1 ) Irrg u F e e lG M is u Irr .2 8 (.0 2 ) .4 3 ( .0 2 ) .7 3 ( .0 3 ) Feel .3 5 ( .0 3 ) .6 9 (.0 3 ) F e d -u p G N e rv o u sg u F e d -u p .3 3 (.0 2 ) u W o rry g Tenseg 1 N e rv o u s .4 6 ( .0 3 ) u 1 W o rry .3 9 ( .0 2 ) .7 1 (.0 3 ) .8 0 ( .0 3 ) .7 9 ( .0 3 ) .8 0 ( .0 3 ) 1 u .7 8 ( .0 3 ) .8 1 (.0 2 ) 1 1 1 .8 1 ( .0 3 ) g u Ten se .3 6 ( .0 3 ) Em bg 1 uEm L o n e ly g G u iltg 1 1 u N erv e s u L o n ely u .3 7 (.0 3 ) .5 0 (.0 4 ) N e rv e sg 1 b .5 2 (.0 3 ) G u ilt .3 7 ( .0 3 )

Model 2

Model 3

Comparison of Model Fit Indices BUT Model 3 fits the best for CFI, SRMR and AIC Model 2 fits better than Model 1 Model 1 does OK .

Incorporating Genetic Covariance Structure into Multivariate GWAS Discovery

Example: the p factor as a GWAS target

1 pG .86 (.06) .81 (.06) SCZg BIPg .53 (.08) .29 (.09) .46 (.04) MDD PTSDg ANXg g 1 uSCZ .26 (.11) 1 uBIP .35 (.11) 1 1 uMDD uPTSD .79 (.07) .91 (.44) 1 uANX .71 (.36)

Add SNP Effects to the “Atlas” Genetic Covariances from LDSC Betas from GWAS sumstats scaled to covariances using MAFs

GWAS of a Latent Factor 1 1 rs4552973 -.045 (.008) SCZg 1 uSCZ .26 (.11) BIPg 1 uBIP .35 (.11) .998 (.049) pG .53 (.08) .86 (.06) .81 (.06) up .29 (.09) .46 (.04) MDDg 1 uMDD .79 (.07) PTSDg 1 ANXg 1 uPTSD uANX .91 (.44) .71 (.36)

128 lead SNPs 27 unique loci not previously identified in any of the five univariate GWA studies ( ) 41 previously significant in a univariate study, but not for p-factor ( ) 1 significant Q estimate ( ) SNP *

GWAS by subtraction 2pq SNP βNon-cog βCog 0 1 1 Cog Non-cog Cog–EA Non-cog–EA Cog–CP Cognitive performance 0 Educational attainment 0 0

Even if you are not interested in genetics: Can now examine systems of relationships between a wide array of (rare) traits that could not be measured in the same sample

Genetics of Early Onset Schizophrenia Genetics of Late Onset Schizophrenia Genetics of Early Onset Anorexia Nervosa Genetics of Late Onset Anorexia Nervosa

Genetics of Early Onset Schizophrenia Genetics of Late Onset Schizophrenia Genetics of Early Onset Anorexia Nervosa Genetics of Late Onset Anorexia Nervosa

Genetics of Early Onset Schizophrenia Genetics of Late Onset Schizophrenia Genetics of Early Onset Anorexia Nervosa Genetics of Late Onset Anorexia Nervosa

Genetics of Early Onset Schizophrenia Genetics of Late Onset Schizophrenia Genetics of Early Onset Anorexia Nervosa Genetics of Late Onset Anorexia Nervosa

Practical

Practical outline I. II. III. IV. Initial considerations Estimating common factor models Estimating user specified model Estimating multivariate GWAS in Genomic SEM

I. Initial Considerations

Start with GWAS Summary Statistics for the Phenotypes of Interest No need for raw data No need to conduct a primary GWAS yourself: Download them online! Example of the top of a summary statistics file

Where to get summary statistics List lots of resources on the Genomic SEM Wiki: https://github.com/MichelNivard/GenomicSEM/wiki/2.Important-resources-and-key-information

Things to know before getting started 1. Be sure you are using summary statistics calculated within a single ethnic population Example: PTSD on PGC web-site 2. Be sure to use LD scores that match the ethnic population in sum stats 3. Typically advisable to only include summary statistics from a GWAS with N 10,000

Things to know before getting started 4. GenomicSEM allows for varying and unknown degrees of sample overlap The user does not need to know the specific levels of overlap 5. Multivariate GWAS in Genomic SEM uses listwise deletion If certain summary statistics have low genomic coverage this will affect the number of SNPs available for all included traits 6. Make sure you are not using a pruned list of summary statistics (e.g., the top 5,000 hits)

Things to know before getting started 7. Both the munge and sumstats functions in GenomicSEM use sample size to perform necessary conversions. Sample size from summary statistics file or provided by the user. In order to produce accurate results, this should be the total sample size for all included traits. Be wary of: a. Summary statistics that report the effective samples b. Publicly available summary statistics that exclude certain cohorts (e.g., 23andMe).

II. Estimating Common Factor Models in Genomic SEM

Three Primary Steps 1. Munge the summary statistics (munge) 2. Run LD-Score Regression to obtain the genetic covariance and sampling covariance matrices (ldsc) 3. Run the model (commonfactor) Munge: convert raw data from one form to another

Lab Using GWAS sumstats for: Schizophrenia (Pardiñas et al., 2018); N 105,318 Bipolar Disorder (Sklar et al., 2011); N 16,731 Major Depressive Disorder (Wray et al., 2018); N 173,005

Step 1: munge example code (done for you)

Example Munge .log file for bipolar disorder

Step 2: ldsc example code (done for you)

Step 2: ldsc example code Populated with ld scores from the same ancestry

Set working directory and load in data! Will likely print 24 warnings about replacing previous imports: OK TO IGNORE

Step 3: commonfactor example code

Pfactor results Parameter being estimated Estimates and SE for model applied to genetic covariance matrix Estimates and SE for model applied to genetic correlation matrix

III. Estimate a UserSpecified Model

Three Primary Steps 1. Munge the summary statistics (munge) 2. Run LD-Score Regression to obtain the genetic covariance and sampling covariance matrices (ldsc) 3. Specify and run the model (usermodel) These two steps mirror that for models without SNP effects and need not be run again for the same traits

How to specify a model We use the lavaan formula language, slightly extended: Regression: A B (Co)variance: A A; A B Factor: F1 A B C D Fix a parameter: A 1*B (the covariance between A and B is 1) Name a parameter: A a*B (the covariance between A and B parameter label a) Allows you to use model constraints for this parameter: a .001

Lets make that a bit more specific Model1 - “ A B B C” Model2 - “ A B A C B C” C B A B A C

Lets make that a bit more specific Model3 - “ F1 NA*A B C F1 1*F1” 1 F1 A B C

Lets make that a bit more specific Model3 - “ F1 1*A B C” F1 1 A B C

Lab Used GWAS sumstats for: Schizophrenia (Pardiñas et al., 2018); N 105,318 Bipolar Disorder (Sklar et al., 2011); N 16,731 Major Depressive Disorder (Wray et al., 2018); N 173,005 Educational Attainment (Lee et al., 2019); N 766,035 Insomnia (Jansen et al., 2019); N 386,533

My preregistration 1 F1 SCZ BIP INSOM MDD EA

Specify Arguments

YourModel results Parameter being estimated Estimates and SE for model applied to genetic covariance matrix Estimates and SE for model applied to genetic correlation matrix matrix Fully standardized estimates

YourModel modelfit chisq: The model chi-square, reflecting index of exact fit to observed data, with lower values indicating better fit. df and p chisq: The degrees of freedom and p-value for the model chi-square. AIC: Akaike Information Criterion. Can be used to compare models regardless of whether they are nested. CFI: Comparative Fit Index. Higher better. .90 acceptable fit; .95 good model fit SRMR: Standardized Room Mean Square Residual. Lower better. .10 acceptable fit; .05 good fit

Delete Input for MY.model and run your own!

PRACTICAL: You Take Control As away of preregistering them, write your model down on paper Remember five variable names are: SCZ, BIP, MDD, EA, INSOM

IV. Multivariate GWAS in Genomic SEM

Four Primary Steps 1. Munge the summary statistics (munge) 2. Run LD-Score Regression to obtain the genetic covariance and sampling covariance matrices (ldsc) 3. Prepare the summary statistics for multivariate GWAS (sumstats) 4. Run the multivariate GWAS (commonfactorGWAS; userGWAS) These two steps mirror that for models without SNP effects and need not be run again for the same traits

Lab Using GWAS sumstats for: Schizophrenia (Pardiñas et al., 2018); N 105,318 Bipolar Disorder (Sklar et al., 2011); N 16,731 Major Depressive Disorder (Wray et al., 2018); N 173,005 Pre-subset summary statistics downloaded online to 100 HapMap3 SNPs Not necessary (inadvisable) in practice; pragmatic just for workshop

Step 3: sumstats example code

Example sumstats .log file

Behind the scenes GenomicSEM GWAS functions automatically combine output from Steps 2 and 3 Creates as many covariance matrices as there are SNPs across traits Step 3: Run sumstats GWAS functions combine the two Step 2: Run ldsc

Step 4a: commonfactorGWAS example code To save memory, saves only the effect of the SNP on the common factor

First five rows of the output

Estimates of SNP level heterogeneity (QSNP) Asks to what extent the effect of the SNP operates through the common factor distributed test statistic, indexing fit of the common pathways model against independent pathways model 1 𝜎 SNP 2 SNPm,F u 1 F 𝜎 SNP FG SNPm λV5 λV3 FG SNPm λV1 λV1 λV2 2 λV4 λV5 λV2 SNPm,V1 λV3 λV4 SNPm,V2 SNPm,V3 SNPm,V4 V1 V2 V3 V4 V5 g g g g g 11 1 1 1 1 uV uV uV 1 2 𝑒 V 12 𝑒V 2 2 3 𝑒 V 32 uV 4 𝑒V 42 uV 𝑒5 V 52 V1 V2 g 1 uV 1 𝑒 V 12 SNPm,V5 V3 V4 V5 g g g g 1 1 uV 2 𝑒 V 22 uV 3 𝑒V 3 2 1 1 uV uV 4 𝑒 V 42 5 𝑒V 52

Troubleshooting

Step 4b: userGWAS example code

Step 4b: userGWAS example code

If there’s time play around with some anthropometric traits Note that you do not need to include all variables in the model

Variable Names BMI Body Mass Index WHR Waist Hip Ratio Waist Waist Circumference Hip Hip circumference CO childhood obesity Height Height BL Birth Length BW Birth Weight IHC Infant Head Circumference

Final Notes Parallel processing for both userGWAS and commonfactorGWAS is available Parallel is the same as serial processing, except that it takes an additional cores argument specifying how many cores to use Ideal run-time scenario: split jobs across computing nodes on a cluster and run in-parallel All runs are independent of one another!

Overview Genomic SEM is ready for use today! Ask questions on our google forum https://groups.google.com/forum/#!forum/genomic-sem-users Lots can be done using existing, openly available GWAS summary statistics Models are flexible and up to the user Use Genomic SEM to derive sumstats for novel phenotypes for use in PGS analyses

Resources See paper at: rdcu.be/bvn7t See github at: https://github.com/MichelNivard/GenomicSEM See tutorials at: https://github.com/MichelNivard/GenomicSEM/wi ki

Acknowledgements Elliot M. Tucker-Drob, Michel G. Nivard, Mijke Rhemtulla NIH grants R01HD083613, R01AG054628, R21HD081437, R24HD042849 Jacobs Foundation Royal Netherlands Academy of Science Professor Award PAH/6635 ZonMw grants 531003014, 849200011 European Union Seventh Framework Program (FP7/20072013) ACTION Project MRC grant MR/K026992/1 AgeUK Disconnected Mind Project

Back to top button