Skip to content

BayesPRSModel

BayesPRSModel

A base class for Bayesian PRS models. This class defines the basic structure and methods that are common to most Bayesian PRS models. Specifically, this class provides methods and interfaces for initialization, harmonization, prediction, and fitting of Bayesian PRS models.

The class is generic is designed to be inherited and extended by specific Bayesian PRS models, such as LDPred and VIPRS.

Attributes:

Name Type Description
gdl

A GWADataLoader object containing harmonized GWAS summary statistics and Linkage-Disequilibrium (LD) matrices.

float_precision

The precision of the floating point variables. Options are: 'float32' or 'float64'.

shapes

A dictionary where keys are chromosomes and values are the shapes of the variant arrays (e.g. the number of variants per chromosome).

n_per_snp

A dictionary where keys are chromosomes and values are the sample sizes per variant.

std_beta

A dictionary of the standardized marginal effect sizes from GWAS.

validation_std_beta

A dictionary of the standardized marginal effect sizes from an independent validation set.

_sample_size

The maximum per-SNP sample size.

pip

The posterior inclusion probability.

post_mean_beta

The posterior mean for the effect sizes.

post_var_beta

The posterior variance for the effect sizes.

Source code in viprs/model/BayesPRSModel.py
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
class BayesPRSModel:
    """
    A base class for Bayesian PRS models. This class defines the basic structure and methods
    that are common to most Bayesian PRS models. Specifically, this class provides methods and interfaces
    for initialization, harmonization, prediction, and fitting of Bayesian PRS models.

    The class is generic is designed to be inherited and extended by
    specific Bayesian PRS models, such as `LDPred` and `VIPRS`.

    :ivar gdl: A GWADataLoader object containing harmonized GWAS summary statistics and
    Linkage-Disequilibrium (LD) matrices.
    :ivar float_precision: The precision of the floating point variables. Options are: 'float32' or 'float64'.
    :ivar shapes: A dictionary where keys are chromosomes and values are the shapes of the variant arrays
    (e.g. the number of variants per chromosome).
    :ivar n_per_snp: A dictionary where keys are chromosomes and values are the sample sizes per variant.
    :ivar std_beta: A dictionary of the standardized marginal effect sizes from GWAS.
    :ivar validation_std_beta: A dictionary of the standardized marginal effect sizes
    from an independent validation set.
    :ivar _sample_size: The maximum per-SNP sample size.
    :ivar pip: The posterior inclusion probability.
    :ivar post_mean_beta: The posterior mean for the effect sizes.
    :ivar post_var_beta: The posterior variance for the effect sizes.
    """

    def __init__(self,
                 gdl: GWADataLoader,
                 float_precision='float32'):
        """
        Initialize the Bayesian PRS model.
        :param gdl: An instance of `GWADataLoader`. Must contain either GWAS summary statistics
        or genotype data.
        :param float_precision: The precision for the floating point numbers.
        """

        # ------------------- Sanity checks -------------------

        assert isinstance(gdl, GWADataLoader), "The `gdl` object must be an instance of GWASDataLoader."

        assert gdl.genotype is not None or (gdl.ld is not None and gdl.sumstats_table is not None), (
            "The GWADataLoader object must contain either genotype data or summary statistics and LD matrices."
        )

        # -----------------------------------------------------

        self.gdl = gdl
        self.float_precision = float_precision
        self.float_eps = np.finfo(self.float_precision).eps
        self.shapes = self.gdl.shapes.copy()

        # Placeholder for sample size per SNP:
        self.n_per_snp = None
        # Placeholder for standardized marginal betas:
        self.std_beta = None

        # Placeholder for standardized marginal betas from an independent validation set:
        self.validation_std_beta = None

        if gdl.sumstats_table is not None:
            # Initialize the input data arrays:
            self.initialize_input_data_arrays()

            # Determine the overall sample size:
            self._sample_size = dict_max(self.n_per_snp)

        # Placeholder for inferred model parameters:
        self.pip = None  # Posterior inclusion probability
        self.post_mean_beta = None  # The posterior mean for the effect sizes
        self.post_var_beta = None  # The posterior variance for the effect sizes

    @property
    def chromosomes(self):
        """
        :return: The list of chromosomes that are included in the BayesPRSModel
        """
        return sorted(list(self.shapes.keys()))

    @property
    def m(self) -> int:
        """

        !!! seealso "See Also"
            * [n_snps][viprs.model.BayesPRSModel.BayesPRSModel.n_snps]

        :return: The number of variants in the model.
        """
        return self.gdl.m

    @property
    def n(self) -> int:
        """
        :return: The number of samples in the model. If not available, average the per-SNP
        sample sizes.
        """
        return self._sample_size

    @property
    def n_snps(self) -> int:
        """
        !!! seealso "See Also"
            * [m][viprs.model.BayesPRSModel.BayesPRSModel.m]

        :return: The number of SNPs in the model.
        """
        return self.m

    def initialize_input_data_arrays(self):
        """
        Initialize the input data arrays for the Bayesian PRS model.
        The input data for summary statistics-based PRS models typically include the following:
            * The sample size per variant (n_per_snp)
            * The standardized marginal betas (std_beta)
            * LD matrices (LD)

        This convenience method initializes the first two inputs, primarily the sample size per variant
        and the standardized marginal betas.
        """

        logger.debug("> Initializing the input data arrays (marginal statistics).")

        try:
            self.n_per_snp = {c: ss.n_per_snp
                              for c, ss in self.gdl.sumstats_table.items()}
            self.std_beta = {c: ss.get_snp_pseudo_corr().astype(self.float_precision)
                             for c, ss in self.gdl.sumstats_table.items()}
        except AttributeError:
            # If not provided, use the overall sample size:
            self.n_per_snp = {c: np.repeat(self.gdl.n, c_size)
                              for c, c_size in self.shapes.items()}

        self.validation_std_beta = None

    def set_validation_sumstats(self):
        """
        Set the validation summary statistics.
        TODO: Allow the user to set the validation sumstats as a property of the model.
        """
        raise NotImplementedError

    def split_gwas_sumstats(self,
                            prop_train=0.8,
                            seed=None,
                            **kwargs):
        """
        Split the GWAS summary statistics into training and validation sets, using the
        PUMAS procedure outlined in Zhao et al. (2021).

        :param prop_train: The proportion of samples to include in the training set.
        :param seed: The random seed for reproducibility.
        :param kwargs: Additional keyword arguments to pass to the `sumstats_train_test_split` function.
        """

        from magenpy.utils.model_utils import sumstats_train_test_split

        logger.debug("> Splitting the GWAS summary statistics into training and validation sets. "
                     f"Training proportion: {prop_train}")

        split_sumstats = sumstats_train_test_split(self.gdl,
                                                   prop_train=prop_train,
                                                   seed=seed,
                                                   **kwargs)

        self.std_beta = {
            c: split_sumstats[c]['train_beta'].astype(self.float_precision)
            for c in self.chromosomes
        }

        self.n_per_snp = {
            c: self.n_per_snp[c]*prop_train
            for c in self.chromosomes
        }

        self.validation_std_beta = {
            c: split_sumstats[c]['test_beta'].astype(self.float_precision)
            for c in self.chromosomes
        }

    def fit(self, *args, **kwargs):
        """
        A genetic method to fit the Bayesian PRS model. This method should be implemented by the
        specific Bayesian PRS model.
        :raises NotImplementedError: If the method is not implemented in the child class.
        """
        raise NotImplementedError

    def get_proportion_causal(self):
        """
        A generic method to get an estimate of the proportion of causal variants.
        :raises NotImplementedError: If the method is not implemented in the child class.
        """
        raise NotImplementedError

    def get_heritability(self):
        """
        A generic method to get an estimate of the heritability, or proportion of variance explained by SNPs.
        :raises NotImplementedError: If the method is not implemented in the child class.
        """
        raise NotImplementedError

    def get_pip(self):
        """
        :return: The posterior inclusion probability for each variant in the model.
        """
        return self.pip

    def get_posterior_mean_beta(self):
        """
        :return: The posterior mean of the effect sizes (BETA) for each variant in the model.
        """
        return self.post_mean_beta

    def get_posterior_variance_beta(self):
        """
        :return: The posterior variance of the effect sizes (BETA) for each variant in the model.
        """
        return self.post_var_beta

    def predict(self, test_gdl=None):
        """
        Given the inferred effect sizes, predict the phenotype for the training samples in
        the GWADataLoader object or new test samples. If `test_gdl` is not provided, genotypes
        from training samples will be used (if available).

        :param test_gdl: A GWADataLoader object containing genotype data for new test samples.
        :raises ValueError: If the posterior means for BETA are not set. AssertionError if the GWADataLoader object
        does not contain genotype data.
        """

        if self.post_mean_beta is None:
            raise ValueError("The posterior means for BETA are not set. Call `.fit()` first.")

        if test_gdl is None:
            assert self.gdl.genotype is not None, "The GWADataLoader object must contain genotype data."
            test_gdl = self.gdl
            post_mean_beta = self.post_mean_beta
        else:
            _, post_mean_beta, _ = self.harmonize_data(gdl=test_gdl)

        return test_gdl.predict(post_mean_beta)

    def harmonize_data(self, gdl=None, parameter_table=None):
        """
        Harmonize the inferred effect sizes with a new GWADataLoader object. This method is useful
        when the user wants to predict on new samples or when the effect sizes are inferred from a
        different set of samples. The method aligns the effect sizes with the SNP table in the
        GWADataLoader object.

        :param gdl: An instance of `GWADataLoader` object.
        :param parameter_table: A `pandas` DataFrame of variant effect sizes.

        :return: A tuple of the harmonized posterior inclusion probability, posterior mean for the effect sizes,
        and posterior variance for the effect sizes.

        """

        if gdl is None and parameter_table is None:
            return

        if gdl is None:
            gdl = self.gdl

        if parameter_table is None:
            parameter_table = self.to_table(per_chromosome=True)
        else:
            parameter_table = {c: parameter_table.loc[parameter_table['CHR'] == c, ]
                               for c in parameter_table['CHR'].unique()}

        snp_tables = gdl.to_snp_table(col_subset=['SNP', 'A1', 'A2'],
                                      per_chromosome=True)

        pip = {}
        post_mean_beta = {}
        post_var_beta = {}

        common_chroms = sorted(list(set(snp_tables.keys()).intersection(set(parameter_table.keys()))))

        from magenpy.utils.model_utils import merge_snp_tables

        for c in common_chroms:

            try:
                post_mean_cols = expand_column_names('BETA', self.post_mean_beta[c].shape)
                pip_cols = expand_column_names('PIP', self.post_mean_beta[c].shape)
                post_var_cols = expand_column_names('VAR_BETA', self.post_mean_beta[c].shape)

            except (TypeError, KeyError):
                pip_cols = [col for col in parameter_table[c].columns if 'PIP' in col]
                post_var_cols = [col for col in parameter_table[c].columns if 'VAR_BETA' in col]
                post_mean_cols = [col for col in parameter_table[c].columns
                                  if 'BETA' in col and col not in post_var_cols]

            # Merge the effect table with the GDL SNP table:
            c_df = merge_snp_tables(snp_tables[c], parameter_table[c], how='left',
                                    signed_statistics=post_mean_cols)

            if len(c_df) < len(snp_tables[c]):
                raise ValueError("The parameter table could not aligned with the reference SNP table. This may due to "
                                 "conflicts/errors in use of reference vs. alternative alleles.")

            # Obtain the values for the posterior mean:
            c_df[post_mean_cols] = c_df[post_mean_cols].fillna(0.)
            post_mean_beta[c] = c_df[post_mean_cols].values

            # Obtain the values for the posterior inclusion probability:
            if len(set(pip_cols).intersection(set(c_df.columns))) > 0:
                c_df[pip_cols] = c_df[pip_cols].fillna(0.)
                pip[c] = c_df[pip_cols].values

            # Obtain the values for the posterior variance:
            if len(set(post_var_cols).intersection(set(c_df.columns))) > 0:
                c_df[post_var_cols] = c_df[post_var_cols].fillna(0.)
                post_var_beta[c] = c_df[post_var_cols].values

        if len(pip) < 1:
            pip = None

        if len(post_var_beta) < 1:
            post_var_beta = None

        return pip, post_mean_beta, post_var_beta

    def to_table(self, col_subset=('CHR', 'SNP', 'POS', 'A1', 'A2'), per_chromosome=False):
        """
        Output the posterior estimates for the effect sizes to a pandas dataframe.
        :param col_subset: The subset of columns to include in the tables (in addition to the effect sizes).
        :param per_chromosome: If True, return a separate table for each chromosome.

        :return: A pandas Dataframe with the posterior estimates for the effect sizes.
        """

        if self.post_mean_beta is None:
            raise Exception("The posterior means for BETA are not set. Call `.fit()` first.")

        tables = self.gdl.to_snp_table(col_subset=col_subset, per_chromosome=True)

        for c in self.chromosomes:

            cols_to_add = []

            mean_beta_df = pd.DataFrame(self.post_mean_beta[c],
                                        columns=expand_column_names('BETA', self.post_mean_beta[c].shape),
                                        index=tables[c].index)
            cols_to_add.append(mean_beta_df)

            if self.pip is not None:
                pip_df = pd.DataFrame(self.pip[c],
                                      columns=expand_column_names('PIP', self.pip[c].shape),
                                      index=tables[c].index)
                cols_to_add.append(pip_df)

            if self.post_var_beta is not None:
                var_beta_df = pd.DataFrame(self.post_var_beta[c],
                                           columns=expand_column_names('VAR_BETA', self.post_var_beta[c].shape),
                                           index=tables[c].index)
                cols_to_add.append(var_beta_df)

            tables[c] = pd.concat([tables[c]] + cols_to_add, axis=1)

        if per_chromosome:
            return tables
        else:
            return pd.concat([tables[c] for c in self.chromosomes])

    def pseudo_validate(self, test_gdl=None):
        """
        Evaluate the prediction accuracy of the inferred PRS using external GWAS summary statistics.

        :param test_gdl: A `GWADataLoader` object with the external GWAS summary statistics and LD matrix information.

        :return: The pseudo-R^2 metric.
        """

        from ..eval.pseudo_metrics import pseudo_r2, _streamlined_pseudo_r2
        from ..utils.compute_utils import dict_concat

        assert self.post_mean_beta is not None, "The posterior means for BETA are not set. Call `.fit()` first."
        assert self.validation_std_beta is not None or test_gdl is not None, (
            "Must provide a GWADataLoader object with validation sumstats or initialize the standardized "
            "betas inside the model."
        )

        if test_gdl is not None:
            return pseudo_r2(test_gdl, self.to_table(per_chromosome=False))
        else:

            # Check if q is an attribute of the model:
            if hasattr(self, 'q'):
                ldw_prs = {c: self.q[c] + self.post_mean_beta[c] for c in self.shapes}
            else:
                # Compute LD-weighted PRS weights first:
                ldw_prs = {}
                for c in self.shapes:
                    ldw_prs[c] = self.gdl.ld[c].dot(self.post_mean_beta[c])

            return _streamlined_pseudo_r2(
                dict_concat(self.validation_std_beta),
                dict_concat(self.post_mean_beta),
                dict_concat(ldw_prs)
            )

    def set_model_parameters(self, parameter_table):
        """
        Parses a pandas dataframe with model parameters and assigns them 
        to the corresponding class attributes. 

        For example: 
            * Columns with `BETA`, will be assigned to `self.post_mean_beta`.
            * Columns with `PIP` will be assigned to `self.pip`.
            * Columns with `VAR_BETA`, will be assigned to `self.post_var_beta`.

        :param parameter_table: A pandas table or dataframe.
        """

        self.pip, self.post_mean_beta, self.post_var_beta = self.harmonize_data(parameter_table=parameter_table)

    def read_inferred_parameters(self, f_names, sep=r"\s+"):
        """
        Read a file with the inferred parameters.
        :param f_names: A path (or list of paths) to the file with the effect sizes.
        :param sep: The delimiter for the file(s).
        """

        if isinstance(f_names, str):
            f_names = [f_names]

        param_table = []

        for f_name in f_names:
            param_table.append(pd.read_csv(f_name, sep=sep))

        if len(param_table) > 0:
            param_table = pd.concat(param_table)
            self.set_model_parameters(param_table)
        else:
            raise FileNotFoundError

    def write_inferred_parameters(self, f_name, per_chromosome=False, sep="\t"):
        """
        A convenience method to write the inferred posterior for the effect sizes to file.

        TODO:
            * Support outputting scoring files compatible with PGS catalog format:
            https://www.pgscatalog.org/downloads/#dl_scoring_files

        :param f_name: The filename (or directory) where to write the effect sizes
        :param per_chromosome: If True, write a file for each chromosome separately.
        :param sep: The delimiter for the file (tab by default).
        """

        tables = self.to_table(per_chromosome=per_chromosome)

        if '.fit' not in f_name:
            ext = '.fit'
        else:
            ext = ''

        if per_chromosome:
            for c, tab in tables.items():
                try:
                    tab.to_csv(osp.join(f_name, f'chr_{c}.fit'), sep=sep, index=False)
                except Exception as e:
                    raise e
        else:
            try:
                tables.to_csv(f_name + ext, sep=sep, index=False)
            except Exception as e:
                raise e

chromosomes property

Returns:

Type Description

The list of chromosomes that are included in the BayesPRSModel

m property

See Also

Returns:

Type Description
int

The number of variants in the model.

n property

Returns:

Type Description
int

The number of samples in the model. If not available, average the per-SNP sample sizes.

n_snps property

See Also

Returns:

Type Description
int

The number of SNPs in the model.

__init__(gdl, float_precision='float32')

Initialize the Bayesian PRS model.

Parameters:

Name Type Description Default
gdl GWADataLoader

An instance of GWADataLoader. Must contain either GWAS summary statistics or genotype data.

required
float_precision

The precision for the floating point numbers.

'float32'
Source code in viprs/model/BayesPRSModel.py
def __init__(self,
             gdl: GWADataLoader,
             float_precision='float32'):
    """
    Initialize the Bayesian PRS model.
    :param gdl: An instance of `GWADataLoader`. Must contain either GWAS summary statistics
    or genotype data.
    :param float_precision: The precision for the floating point numbers.
    """

    # ------------------- Sanity checks -------------------

    assert isinstance(gdl, GWADataLoader), "The `gdl` object must be an instance of GWASDataLoader."

    assert gdl.genotype is not None or (gdl.ld is not None and gdl.sumstats_table is not None), (
        "The GWADataLoader object must contain either genotype data or summary statistics and LD matrices."
    )

    # -----------------------------------------------------

    self.gdl = gdl
    self.float_precision = float_precision
    self.float_eps = np.finfo(self.float_precision).eps
    self.shapes = self.gdl.shapes.copy()

    # Placeholder for sample size per SNP:
    self.n_per_snp = None
    # Placeholder for standardized marginal betas:
    self.std_beta = None

    # Placeholder for standardized marginal betas from an independent validation set:
    self.validation_std_beta = None

    if gdl.sumstats_table is not None:
        # Initialize the input data arrays:
        self.initialize_input_data_arrays()

        # Determine the overall sample size:
        self._sample_size = dict_max(self.n_per_snp)

    # Placeholder for inferred model parameters:
    self.pip = None  # Posterior inclusion probability
    self.post_mean_beta = None  # The posterior mean for the effect sizes
    self.post_var_beta = None  # The posterior variance for the effect sizes

fit(*args, **kwargs)

A genetic method to fit the Bayesian PRS model. This method should be implemented by the specific Bayesian PRS model.

Raises:

Type Description
NotImplementedError

If the method is not implemented in the child class.

Source code in viprs/model/BayesPRSModel.py
def fit(self, *args, **kwargs):
    """
    A genetic method to fit the Bayesian PRS model. This method should be implemented by the
    specific Bayesian PRS model.
    :raises NotImplementedError: If the method is not implemented in the child class.
    """
    raise NotImplementedError

get_heritability()

A generic method to get an estimate of the heritability, or proportion of variance explained by SNPs.

Raises:

Type Description
NotImplementedError

If the method is not implemented in the child class.

Source code in viprs/model/BayesPRSModel.py
def get_heritability(self):
    """
    A generic method to get an estimate of the heritability, or proportion of variance explained by SNPs.
    :raises NotImplementedError: If the method is not implemented in the child class.
    """
    raise NotImplementedError

get_pip()

Returns:

Type Description

The posterior inclusion probability for each variant in the model.

Source code in viprs/model/BayesPRSModel.py
def get_pip(self):
    """
    :return: The posterior inclusion probability for each variant in the model.
    """
    return self.pip

get_posterior_mean_beta()

Returns:

Type Description

The posterior mean of the effect sizes (BETA) for each variant in the model.

Source code in viprs/model/BayesPRSModel.py
def get_posterior_mean_beta(self):
    """
    :return: The posterior mean of the effect sizes (BETA) for each variant in the model.
    """
    return self.post_mean_beta

get_posterior_variance_beta()

Returns:

Type Description

The posterior variance of the effect sizes (BETA) for each variant in the model.

Source code in viprs/model/BayesPRSModel.py
def get_posterior_variance_beta(self):
    """
    :return: The posterior variance of the effect sizes (BETA) for each variant in the model.
    """
    return self.post_var_beta

get_proportion_causal()

A generic method to get an estimate of the proportion of causal variants.

Raises:

Type Description
NotImplementedError

If the method is not implemented in the child class.

Source code in viprs/model/BayesPRSModel.py
def get_proportion_causal(self):
    """
    A generic method to get an estimate of the proportion of causal variants.
    :raises NotImplementedError: If the method is not implemented in the child class.
    """
    raise NotImplementedError

harmonize_data(gdl=None, parameter_table=None)

Harmonize the inferred effect sizes with a new GWADataLoader object. This method is useful when the user wants to predict on new samples or when the effect sizes are inferred from a different set of samples. The method aligns the effect sizes with the SNP table in the GWADataLoader object.

Parameters:

Name Type Description Default
gdl

An instance of GWADataLoader object.

None
parameter_table

A pandas DataFrame of variant effect sizes.

None

Returns:

Type Description

A tuple of the harmonized posterior inclusion probability, posterior mean for the effect sizes, and posterior variance for the effect sizes.

Source code in viprs/model/BayesPRSModel.py
def harmonize_data(self, gdl=None, parameter_table=None):
    """
    Harmonize the inferred effect sizes with a new GWADataLoader object. This method is useful
    when the user wants to predict on new samples or when the effect sizes are inferred from a
    different set of samples. The method aligns the effect sizes with the SNP table in the
    GWADataLoader object.

    :param gdl: An instance of `GWADataLoader` object.
    :param parameter_table: A `pandas` DataFrame of variant effect sizes.

    :return: A tuple of the harmonized posterior inclusion probability, posterior mean for the effect sizes,
    and posterior variance for the effect sizes.

    """

    if gdl is None and parameter_table is None:
        return

    if gdl is None:
        gdl = self.gdl

    if parameter_table is None:
        parameter_table = self.to_table(per_chromosome=True)
    else:
        parameter_table = {c: parameter_table.loc[parameter_table['CHR'] == c, ]
                           for c in parameter_table['CHR'].unique()}

    snp_tables = gdl.to_snp_table(col_subset=['SNP', 'A1', 'A2'],
                                  per_chromosome=True)

    pip = {}
    post_mean_beta = {}
    post_var_beta = {}

    common_chroms = sorted(list(set(snp_tables.keys()).intersection(set(parameter_table.keys()))))

    from magenpy.utils.model_utils import merge_snp_tables

    for c in common_chroms:

        try:
            post_mean_cols = expand_column_names('BETA', self.post_mean_beta[c].shape)
            pip_cols = expand_column_names('PIP', self.post_mean_beta[c].shape)
            post_var_cols = expand_column_names('VAR_BETA', self.post_mean_beta[c].shape)

        except (TypeError, KeyError):
            pip_cols = [col for col in parameter_table[c].columns if 'PIP' in col]
            post_var_cols = [col for col in parameter_table[c].columns if 'VAR_BETA' in col]
            post_mean_cols = [col for col in parameter_table[c].columns
                              if 'BETA' in col and col not in post_var_cols]

        # Merge the effect table with the GDL SNP table:
        c_df = merge_snp_tables(snp_tables[c], parameter_table[c], how='left',
                                signed_statistics=post_mean_cols)

        if len(c_df) < len(snp_tables[c]):
            raise ValueError("The parameter table could not aligned with the reference SNP table. This may due to "
                             "conflicts/errors in use of reference vs. alternative alleles.")

        # Obtain the values for the posterior mean:
        c_df[post_mean_cols] = c_df[post_mean_cols].fillna(0.)
        post_mean_beta[c] = c_df[post_mean_cols].values

        # Obtain the values for the posterior inclusion probability:
        if len(set(pip_cols).intersection(set(c_df.columns))) > 0:
            c_df[pip_cols] = c_df[pip_cols].fillna(0.)
            pip[c] = c_df[pip_cols].values

        # Obtain the values for the posterior variance:
        if len(set(post_var_cols).intersection(set(c_df.columns))) > 0:
            c_df[post_var_cols] = c_df[post_var_cols].fillna(0.)
            post_var_beta[c] = c_df[post_var_cols].values

    if len(pip) < 1:
        pip = None

    if len(post_var_beta) < 1:
        post_var_beta = None

    return pip, post_mean_beta, post_var_beta

initialize_input_data_arrays()

Initialize the input data arrays for the Bayesian PRS model. The input data for summary statistics-based PRS models typically include the following: * The sample size per variant (n_per_snp) * The standardized marginal betas (std_beta) * LD matrices (LD)

This convenience method initializes the first two inputs, primarily the sample size per variant and the standardized marginal betas.

Source code in viprs/model/BayesPRSModel.py
def initialize_input_data_arrays(self):
    """
    Initialize the input data arrays for the Bayesian PRS model.
    The input data for summary statistics-based PRS models typically include the following:
        * The sample size per variant (n_per_snp)
        * The standardized marginal betas (std_beta)
        * LD matrices (LD)

    This convenience method initializes the first two inputs, primarily the sample size per variant
    and the standardized marginal betas.
    """

    logger.debug("> Initializing the input data arrays (marginal statistics).")

    try:
        self.n_per_snp = {c: ss.n_per_snp
                          for c, ss in self.gdl.sumstats_table.items()}
        self.std_beta = {c: ss.get_snp_pseudo_corr().astype(self.float_precision)
                         for c, ss in self.gdl.sumstats_table.items()}
    except AttributeError:
        # If not provided, use the overall sample size:
        self.n_per_snp = {c: np.repeat(self.gdl.n, c_size)
                          for c, c_size in self.shapes.items()}

    self.validation_std_beta = None

predict(test_gdl=None)

Given the inferred effect sizes, predict the phenotype for the training samples in the GWADataLoader object or new test samples. If test_gdl is not provided, genotypes from training samples will be used (if available).

Parameters:

Name Type Description Default
test_gdl

A GWADataLoader object containing genotype data for new test samples.

None

Raises:

Type Description
ValueError

If the posterior means for BETA are not set. AssertionError if the GWADataLoader object does not contain genotype data.

Source code in viprs/model/BayesPRSModel.py
def predict(self, test_gdl=None):
    """
    Given the inferred effect sizes, predict the phenotype for the training samples in
    the GWADataLoader object or new test samples. If `test_gdl` is not provided, genotypes
    from training samples will be used (if available).

    :param test_gdl: A GWADataLoader object containing genotype data for new test samples.
    :raises ValueError: If the posterior means for BETA are not set. AssertionError if the GWADataLoader object
    does not contain genotype data.
    """

    if self.post_mean_beta is None:
        raise ValueError("The posterior means for BETA are not set. Call `.fit()` first.")

    if test_gdl is None:
        assert self.gdl.genotype is not None, "The GWADataLoader object must contain genotype data."
        test_gdl = self.gdl
        post_mean_beta = self.post_mean_beta
    else:
        _, post_mean_beta, _ = self.harmonize_data(gdl=test_gdl)

    return test_gdl.predict(post_mean_beta)

pseudo_validate(test_gdl=None)

Evaluate the prediction accuracy of the inferred PRS using external GWAS summary statistics.

Parameters:

Name Type Description Default
test_gdl

A GWADataLoader object with the external GWAS summary statistics and LD matrix information.

None

Returns:

Type Description

The pseudo-R^2 metric.

Source code in viprs/model/BayesPRSModel.py
def pseudo_validate(self, test_gdl=None):
    """
    Evaluate the prediction accuracy of the inferred PRS using external GWAS summary statistics.

    :param test_gdl: A `GWADataLoader` object with the external GWAS summary statistics and LD matrix information.

    :return: The pseudo-R^2 metric.
    """

    from ..eval.pseudo_metrics import pseudo_r2, _streamlined_pseudo_r2
    from ..utils.compute_utils import dict_concat

    assert self.post_mean_beta is not None, "The posterior means for BETA are not set. Call `.fit()` first."
    assert self.validation_std_beta is not None or test_gdl is not None, (
        "Must provide a GWADataLoader object with validation sumstats or initialize the standardized "
        "betas inside the model."
    )

    if test_gdl is not None:
        return pseudo_r2(test_gdl, self.to_table(per_chromosome=False))
    else:

        # Check if q is an attribute of the model:
        if hasattr(self, 'q'):
            ldw_prs = {c: self.q[c] + self.post_mean_beta[c] for c in self.shapes}
        else:
            # Compute LD-weighted PRS weights first:
            ldw_prs = {}
            for c in self.shapes:
                ldw_prs[c] = self.gdl.ld[c].dot(self.post_mean_beta[c])

        return _streamlined_pseudo_r2(
            dict_concat(self.validation_std_beta),
            dict_concat(self.post_mean_beta),
            dict_concat(ldw_prs)
        )

read_inferred_parameters(f_names, sep='\\s+')

Read a file with the inferred parameters.

Parameters:

Name Type Description Default
f_names

A path (or list of paths) to the file with the effect sizes.

required
sep

The delimiter for the file(s).

'\\s+'
Source code in viprs/model/BayesPRSModel.py
def read_inferred_parameters(self, f_names, sep=r"\s+"):
    """
    Read a file with the inferred parameters.
    :param f_names: A path (or list of paths) to the file with the effect sizes.
    :param sep: The delimiter for the file(s).
    """

    if isinstance(f_names, str):
        f_names = [f_names]

    param_table = []

    for f_name in f_names:
        param_table.append(pd.read_csv(f_name, sep=sep))

    if len(param_table) > 0:
        param_table = pd.concat(param_table)
        self.set_model_parameters(param_table)
    else:
        raise FileNotFoundError

set_model_parameters(parameter_table)

Parses a pandas dataframe with model parameters and assigns them to the corresponding class attributes.

For example: * Columns with BETA, will be assigned to self.post_mean_beta. * Columns with PIP will be assigned to self.pip. * Columns with VAR_BETA, will be assigned to self.post_var_beta.

Parameters:

Name Type Description Default
parameter_table

A pandas table or dataframe.

required
Source code in viprs/model/BayesPRSModel.py
def set_model_parameters(self, parameter_table):
    """
    Parses a pandas dataframe with model parameters and assigns them 
    to the corresponding class attributes. 

    For example: 
        * Columns with `BETA`, will be assigned to `self.post_mean_beta`.
        * Columns with `PIP` will be assigned to `self.pip`.
        * Columns with `VAR_BETA`, will be assigned to `self.post_var_beta`.

    :param parameter_table: A pandas table or dataframe.
    """

    self.pip, self.post_mean_beta, self.post_var_beta = self.harmonize_data(parameter_table=parameter_table)

set_validation_sumstats()

Set the validation summary statistics. TODO: Allow the user to set the validation sumstats as a property of the model.

Source code in viprs/model/BayesPRSModel.py
def set_validation_sumstats(self):
    """
    Set the validation summary statistics.
    TODO: Allow the user to set the validation sumstats as a property of the model.
    """
    raise NotImplementedError

split_gwas_sumstats(prop_train=0.8, seed=None, **kwargs)

Split the GWAS summary statistics into training and validation sets, using the PUMAS procedure outlined in Zhao et al. (2021).

Parameters:

Name Type Description Default
prop_train

The proportion of samples to include in the training set.

0.8
seed

The random seed for reproducibility.

None
kwargs

Additional keyword arguments to pass to the sumstats_train_test_split function.

{}
Source code in viprs/model/BayesPRSModel.py
def split_gwas_sumstats(self,
                        prop_train=0.8,
                        seed=None,
                        **kwargs):
    """
    Split the GWAS summary statistics into training and validation sets, using the
    PUMAS procedure outlined in Zhao et al. (2021).

    :param prop_train: The proportion of samples to include in the training set.
    :param seed: The random seed for reproducibility.
    :param kwargs: Additional keyword arguments to pass to the `sumstats_train_test_split` function.
    """

    from magenpy.utils.model_utils import sumstats_train_test_split

    logger.debug("> Splitting the GWAS summary statistics into training and validation sets. "
                 f"Training proportion: {prop_train}")

    split_sumstats = sumstats_train_test_split(self.gdl,
                                               prop_train=prop_train,
                                               seed=seed,
                                               **kwargs)

    self.std_beta = {
        c: split_sumstats[c]['train_beta'].astype(self.float_precision)
        for c in self.chromosomes
    }

    self.n_per_snp = {
        c: self.n_per_snp[c]*prop_train
        for c in self.chromosomes
    }

    self.validation_std_beta = {
        c: split_sumstats[c]['test_beta'].astype(self.float_precision)
        for c in self.chromosomes
    }

to_table(col_subset=('CHR', 'SNP', 'POS', 'A1', 'A2'), per_chromosome=False)

Output the posterior estimates for the effect sizes to a pandas dataframe.

Parameters:

Name Type Description Default
col_subset

The subset of columns to include in the tables (in addition to the effect sizes).

('CHR', 'SNP', 'POS', 'A1', 'A2')
per_chromosome

If True, return a separate table for each chromosome.

False

Returns:

Type Description

A pandas Dataframe with the posterior estimates for the effect sizes.

Source code in viprs/model/BayesPRSModel.py
def to_table(self, col_subset=('CHR', 'SNP', 'POS', 'A1', 'A2'), per_chromosome=False):
    """
    Output the posterior estimates for the effect sizes to a pandas dataframe.
    :param col_subset: The subset of columns to include in the tables (in addition to the effect sizes).
    :param per_chromosome: If True, return a separate table for each chromosome.

    :return: A pandas Dataframe with the posterior estimates for the effect sizes.
    """

    if self.post_mean_beta is None:
        raise Exception("The posterior means for BETA are not set. Call `.fit()` first.")

    tables = self.gdl.to_snp_table(col_subset=col_subset, per_chromosome=True)

    for c in self.chromosomes:

        cols_to_add = []

        mean_beta_df = pd.DataFrame(self.post_mean_beta[c],
                                    columns=expand_column_names('BETA', self.post_mean_beta[c].shape),
                                    index=tables[c].index)
        cols_to_add.append(mean_beta_df)

        if self.pip is not None:
            pip_df = pd.DataFrame(self.pip[c],
                                  columns=expand_column_names('PIP', self.pip[c].shape),
                                  index=tables[c].index)
            cols_to_add.append(pip_df)

        if self.post_var_beta is not None:
            var_beta_df = pd.DataFrame(self.post_var_beta[c],
                                       columns=expand_column_names('VAR_BETA', self.post_var_beta[c].shape),
                                       index=tables[c].index)
            cols_to_add.append(var_beta_df)

        tables[c] = pd.concat([tables[c]] + cols_to_add, axis=1)

    if per_chromosome:
        return tables
    else:
        return pd.concat([tables[c] for c in self.chromosomes])

write_inferred_parameters(f_name, per_chromosome=False, sep='\t')

A convenience method to write the inferred posterior for the effect sizes to file.

TODO: * Support outputting scoring files compatible with PGS catalog format: https://www.pgscatalog.org/downloads/#dl_scoring_files

Parameters:

Name Type Description Default
f_name

The filename (or directory) where to write the effect sizes

required
per_chromosome

If True, write a file for each chromosome separately.

False
sep

The delimiter for the file (tab by default).

'\t'
Source code in viprs/model/BayesPRSModel.py
def write_inferred_parameters(self, f_name, per_chromosome=False, sep="\t"):
    """
    A convenience method to write the inferred posterior for the effect sizes to file.

    TODO:
        * Support outputting scoring files compatible with PGS catalog format:
        https://www.pgscatalog.org/downloads/#dl_scoring_files

    :param f_name: The filename (or directory) where to write the effect sizes
    :param per_chromosome: If True, write a file for each chromosome separately.
    :param sep: The delimiter for the file (tab by default).
    """

    tables = self.to_table(per_chromosome=per_chromosome)

    if '.fit' not in f_name:
        ext = '.fit'
    else:
        ext = ''

    if per_chromosome:
        for c, tab in tables.items():
            try:
                tab.to_csv(osp.join(f_name, f'chr_{c}.fit'), sep=sep, index=False)
            except Exception as e:
                raise e
    else:
        try:
            tables.to_csv(f_name + ext, sep=sep, index=False)
        except Exception as e:
            raise e