Skip to content

GenotypeMatrix

GenotypeMatrix

Bases: object

A class to represent a genotype matrix. The genotype matrix is a matrix of where the rows represent samples and the columns represent genetic variants. In general, genotype matrices are assumed to reside on disk and this class provides a convenient interface to interact with and perform computations on the genotype matrix.

Currently, we assume that the genotype matrix is stored using plink's BED file format, with associated tables for the samples (i.e. FAM file) and genetic variants (i.e. BIM file). Classes that inherit from this generic class support various backends to access and performing computations on this genotype data.

See Also

* [xarrayGenotypeMatrix][magenpy.GenotypeMatrix.xarrayGenotypeMatrix]
* [plinkBEDGenotypeMatrix][magenpy.GenotypeMatrix.plinkBEDGenotypeMatrix]

Attributes:

Name Type Description
sample_table Union[DataFrame, SampleTable, None]

A table containing information about the samples in the genotype matrix (initially read from the FAM file).

snp_table Union[DataFrame, None]

A table containing information about the genetic variants in the genotype matrix (initially read from the BIM file).

bed_file

The path to the plink BED file containing the genotype matrix.

_genome_build

The genome build or assembly under which the SNP coordinates are defined.

temp_dir

The directory where temporary files will be stored (if needed).

threads

The number of threads to use for parallel computations.

Source code in magenpy/GenotypeMatrix.py
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
class GenotypeMatrix(object):
    """
    A class to represent a genotype matrix. The genotype matrix is a matrix of
    where the rows represent samples and the columns represent genetic variants.
    In general, genotype matrices are assumed to reside on disk and this class
    provides a convenient interface to interact with and perform computations
    on the genotype matrix.

    Currently, we assume that the genotype matrix is stored using plink's BED
    file format, with associated tables for the samples (i.e. FAM file) and genetic
    variants (i.e. BIM file). Classes that inherit from this generic class support
    various backends to access and performing computations on this genotype data.

    !!! seealso "See Also"
            * [xarrayGenotypeMatrix][magenpy.GenotypeMatrix.xarrayGenotypeMatrix]
            * [plinkBEDGenotypeMatrix][magenpy.GenotypeMatrix.plinkBEDGenotypeMatrix]

    :ivar sample_table: A table containing information about the samples in the genotype matrix
    (initially read from the FAM file).
    :ivar snp_table: A table containing information about the genetic variants in the genotype matrix
    (initially read from the BIM file).
    :ivar bed_file: The path to the plink BED file containing the genotype matrix.
    :ivar _genome_build: The genome build or assembly under which the SNP coordinates are defined.
    :ivar temp_dir: The directory where temporary files will be stored (if needed).
    :ivar threads: The number of threads to use for parallel computations.

    """

    def __init__(self,
                 sample_table: Union[pd.DataFrame, SampleTable, None] = None,
                 snp_table: Union[pd.DataFrame, None] = None,
                 temp_dir: str = 'temp',
                 bed_file: str = None,
                 genome_build=None,
                 threads=1,
                 **kwargs):
        """
        Initialize a GenotypeMatrix object.

        :param sample_table: A table containing information about the samples in the genotype matrix.
        :param snp_table: A table containing information about the genetic variants in the genotype matrix.
        :param temp_dir: The directory where temporary files will be stored (if needed).
        :param bed_file: The path to the plink BED file containing the genotype matrix.
        :param genome_build: The genome build or assembly under which the SNP coordinates are defined.
        :param threads: The number of threads to use for parallel computations.
        :param kwargs: Additional keyword arguments.
        """

        self.sample_table: Union[pd.DataFrame, SampleTable, None] = None
        self.snp_table: Union[pd.DataFrame, None] = snp_table

        if sample_table is not None:
            self.set_sample_table(sample_table)

        if snp_table is not None and 'original_index' not in self.snp_table.columns:
            self.snp_table['original_index'] = np.arange(len(self.snp_table))

        from .utils.system_utils import makedir

        makedir(temp_dir)

        temp_dir_prefix = 'gmat_'

        if self.chromosome is not None:
            temp_dir_prefix += f'chr{self.chromosome}_'

        self.temp_dir = tempfile.TemporaryDirectory(dir=temp_dir, prefix=temp_dir_prefix)

        self.bed_file = bed_file
        self._genome_build = genome_build

        self.threads = threads

    @classmethod
    def from_file(cls, file_path, temp_dir='temp', **kwargs):
        """
        Initialize a genotype matrix object by passing a file path + other keyword arguments.
        :param file_path: The path to the plink BED file.
        :type file_path: str
        :param temp_dir: The directory where temporary files will be stored.
        :type temp_dir: str
        :param kwargs: Additional keyword arguments.
        """
        raise NotImplementedError

    @property
    def shape(self):
        """
        :return: The shape of the genotype matrix. Rows correspond to the
        number of samples and columns to the number of SNPs.
        """
        return self.n, self.m

    @property
    def n(self):
        """
        !!! seealso "See Also"
            * [sample_size][magenpy.GenotypeMatrix.GenotypeMatrix.sample_size]

        :return: The sample size or number of individuals in the genotype matrix.
        """
        return self.sample_table.n

    @property
    def sample_size(self):
        """
        !!! seealso "See Also"
            * [n][magenpy.GenotypeMatrix.GenotypeMatrix.n]

        :return: The sample size or number of individuals in the genotype matrix.
        """
        return self.n

    @property
    def samples(self):
        """
        :return: An array of sample IDs in the genotype matrix.
        """
        return self.sample_table.iid

    @property
    def m(self):
        """

        !!! seealso "See Also"
            * [n_snps][magenpy.GenotypeMatrix.GenotypeMatrix.n_snps]

        :return: The number of variants in the genotype matrix.
        """
        if self.snp_table is not None:
            return len(self.snp_table)

    @property
    def n_snps(self):
        """
        !!! seealso "See Also"
            * [m][magenpy.GenotypeMatrix.GenotypeMatrix.m]

        :return: The number of variants in the genotype matrix.
        """
        return self.m

    @property
    def genome_build(self):
        """
        :return: The genome build or assembly under which the SNP coordinates are defined.
        """
        return self._genome_build

    @property
    def chromosome(self):
        """
        ..note::
        This is a convenience method that assumes that the genotype matrix contains variants
        from a single chromosome. If there are multiple chromosomes, the method will return `None`.

        :return: The chromosome associated with the variants in the genotype matrix.
        """
        chrom = self.chromosomes
        if chrom is not None and len(chrom) == 1:
            return chrom[0]

    @property
    def chromosomes(self):
        """
        :return: The unique set of chromosomes comprising the genotype matrix.
        """
        chrom = self.get_snp_attribute('CHR')
        if chrom is not None:
            return np.unique(chrom)

    @property
    def snps(self):
        """
        :return: The SNP rsIDs for variants in the genotype matrix.
        """
        return self.get_snp_attribute('SNP')

    @property
    def bp_pos(self):
        """
        :return: The basepair position for the genetic variants in the genotype matrix.
        """
        return self.get_snp_attribute('POS')

    @property
    def cm_pos(self):
        """
        :return: The position of genetic variants in the genotype matrix in units of Centi Morgan.
        :raises KeyError: If the genetic distance is not set in the genotype file.
        """
        cm = self.get_snp_attribute('cM')
        if len(set(cm)) == 1:
            raise KeyError("Genetic distance in centi Morgan (cM) is not "
                           "set in the genotype file!")
        return cm

    @property
    def a1(self):
        """
        !!! seealso "See Also"
            * [alt_allele][magenpy.GenotypeMatrix.GenotypeMatrix.alt_allele]
            * [effect_allele][magenpy.GenotypeMatrix.GenotypeMatrix.effect_allele]

        :return: The effect allele `A1` for each genetic variant.

        """
        return self.get_snp_attribute('A1')

    @property
    def a2(self):
        """

        !!! seealso "See Also"
            * [ref_allele][magenpy.GenotypeMatrix.GenotypeMatrix.ref_allele]

        :return: The reference allele `A2` for each genetic variant.

        """
        return self.get_snp_attribute('A2')

    @property
    def ref_allele(self):
        """

        !!! seealso "See Also"
            * [a2][magenpy.GenotypeMatrix.GenotypeMatrix.a2]

        :return: The reference allele `A2` for each genetic variant.
        """
        return self.a2

    @property
    def alt_allele(self):
        """
        !!! seealso "See Also"
            * [effect_allele][magenpy.GenotypeMatrix.GenotypeMatrix.effect_allele]
            * [a1][magenpy.GenotypeMatrix.GenotypeMatrix.a1]

        :return: The effect allele `A1` for each genetic variant.

        """
        return self.a1

    @property
    def effect_allele(self):
        """

        !!! seealso "See Also"
            * [alt_allele][magenpy.GenotypeMatrix.GenotypeMatrix.alt_allele]
            * [a1][magenpy.GenotypeMatrix.GenotypeMatrix.a1]

        :return: The effect allele `A1` for each genetic variant.

        """
        return self.a1

    @property
    def n_per_snp(self):
        """
        :return: Sample size per genetic variant (accounting for potential missing values).
        """
        n = self.get_snp_attribute('N')
        if n is not None:
            return n
        else:
            self.compute_sample_size_per_snp()
            return self.get_snp_attribute('N')

    @property
    def maf(self):
        """
        :return: The minor allele frequency (MAF) of each variant in the genotype matrix.
        """
        maf = self.get_snp_attribute('MAF')
        if maf is not None:
            return maf
        else:
            self.compute_allele_frequency()
            return self.get_snp_attribute('MAF')

    @property
    def maf_var(self):
        """
        :return: The variance in minor allele frequency (MAF) of each variant in the genotype matrix.
        """
        return 2. * self.maf * (1. - self.maf)

    def estimate_memory_allocation(self, dtype=np.float32):
        """
        :return: An estimate of the memory allocation for the genotype matrix in megabytes.
        """
        return self.n * self.m * np.dtype(dtype).itemsize / 1024 ** 2

    def get_snp_table(self, col_subset=None):
        """
        A convenience method to extract SNP-related information from the genotype matrix.
        :param col_subset: A list of columns to extract from the SNP table.

        :return: A `pandas` DataFrame with the requested columns.
        """

        if col_subset is None:
            return self.snp_table.copy()
        else:
            present_cols = list(set(col_subset).intersection(set(self.snp_table.columns)))
            non_present_cols = list(set(col_subset) - set(present_cols))

            if len(present_cols) > 0:
                table = self.snp_table[present_cols].copy()
            else:
                table = pd.DataFrame({c: [] for c in non_present_cols})

            for col in non_present_cols:

                if col == 'MAF':
                    table['MAF'] = self.maf
                elif col == 'MAF_VAR':
                    table['MAF_VAR'] = self.maf_var
                elif col == 'N':
                    table['N'] = self.n_per_snp
                else:
                    raise KeyError(f"Column '{col}' is not available in the SNP table!")

            return table[list(col_subset)]

    def get_snp_attribute(self, attr):
        """

        :param attr: The name of the attribute to extract from the SNP table.
        :return: The values of a specific attribute for each variant in the genotype matrix.
        """
        if self.snp_table is not None and attr in self.snp_table.columns:
            return self.snp_table[attr].values

    def compute_ld(self,
                   estimator,
                   output_dir,
                   dtype='int16',
                   compressor_name='zstd',
                   compression_level=7,
                   compute_spectral_properties=False,
                   **ld_kwargs):
        """

        Compute the Linkage-Disequilibrium (LD) or SNP-by-SNP correlation matrix
        for the variants defined in the genotype matrix.

        :param estimator: The estimator for the LD matrix. We currently support
        4 different estimators: `sample`, `windowed`, `shrinkage`, and `block`.
        :param output_dir: The output directory where the Zarr array containing the
        entries of the LD matrix will be stored.
        :param dtype: The data type for the entries of the LD matrix (supported data types are float32, float64
        and integer quantized data types int8 and int16).
        :param compressor_name: The name of the compressor to use for the Zarr array.
        :param compression_level: The compression level for the Zarr array (1-9)
        :param ld_kwargs: keyword arguments for the various LD estimators. Consult
        the implementations of `WindowedLD`, `ShrinkageLD`, and `BlockLD` for details.
        :param compute_spectral_properties: If True, compute and store information about the eigenvalues of
        the LD matrix.
        """

        from .stats.ld.estimator import SampleLD, WindowedLD, ShrinkageLD, BlockLD

        if estimator == 'sample':
            ld_est = SampleLD(self)
        elif estimator == 'windowed':
            ld_est = WindowedLD(self, **ld_kwargs)
        elif estimator == 'shrinkage':
            ld_est = ShrinkageLD(self, **ld_kwargs)
        elif estimator == 'block':
            ld_est = BlockLD(self, **ld_kwargs)
        else:
            raise KeyError(f"LD estimator {estimator} is not recognized!")

        return ld_est.compute(output_dir,
                              dtype=dtype,
                              compressor_name=compressor_name,
                              compression_level=compression_level,
                              compute_spectral_properties=compute_spectral_properties)

    def set_sample_table(self, sample_table):
        """
        A convenience method set the sample table for the genotype matrix.
        This may be useful for syncing sample tables across different Genotype matrices
        corresponding to different chromosomes or genomic regions.

        :param sample_table: An instance of SampleTable or a pandas dataframe containing
        information about the samples in the genotype matrix.

        """

        if isinstance(sample_table, SampleTable):
            self.sample_table = sample_table
        elif isinstance(sample_table, pd.DataFrame):
            self.sample_table = SampleTable(sample_table)
        else:
            raise ValueError("The sample table is invalid! "
                             "Has to be either an instance of "
                             "SampleTable or pandas DataFrame.")

    def filter_snps(self, extract_snps=None, extract_file=None):
        """
        Filter variants from the genotype matrix. User must specify
        either a list of variants to extract or the path to a plink-style file
        with the list of variants to extract.

        :param extract_snps: A list (or array) of SNP IDs to keep in the genotype matrix.
        :param extract_file: The path to a file with the list of variants to extract.
        """

        assert extract_snps is not None or extract_file is not None

        if extract_snps is None:
            from .parsers.misc_parsers import read_snp_filter_file
            extract_snps = read_snp_filter_file(extract_file)

        self.snp_table = self.snp_table.merge(pd.DataFrame({'SNP': extract_snps}))

    def filter_by_allele_frequency(self, min_maf=None, min_mac=1):
        """
        Filter variants by minimum minor allele frequency or allele count cutoffs.

        :param min_maf: Minimum minor allele frequency
        :param min_mac: Minimum minor allele count (1 by default)
        """

        if min_mac or min_maf:

            maf = self.maf
            n = self.n_per_snp

            keep_flag = None

            if min_mac:
                mac = (2*maf*n).astype(np.int64)
                keep_flag = (mac >= min_mac) & ((2*n - mac) >= min_mac)

            if min_maf:

                maf_cond = (maf >= min_maf) & (1. - maf >= min_maf)
                if keep_flag is not None:
                    keep_flag = keep_flag & maf_cond
                else:
                    keep_flag = maf_cond

            if keep_flag is not None:
                self.filter_snps(extract_snps=self.snps[keep_flag])

    def drop_duplicated_snps(self):
        """
        A convenience method to drop variants with duplicated SNP rsIDs.
        """

        u_snps, counts = np.unique(self.snps, return_counts=True)
        if len(u_snps) < self.n_snps:
            # Keep only SNPs which occur once in the sequence:
            self.filter_snps(u_snps[counts == 1])

    def filter_samples(self, keep_samples=None, keep_file=None):
        """
        Filter samples from the genotype matrix. User must specify
        either a list of samples to keep or the path to a plink-style file
        with the list of samples to keep.

        :param keep_samples: A list (or array) of sample IDs to keep in the genotype matrix.
        :param keep_file: The path to a file with the list of samples to keep.
        """

        self.sample_table.filter_samples(keep_samples=keep_samples, keep_file=keep_file)

        # IMPORTANT: After filtering samples, update SNP attributes that depend on the
        # samples, such as MAF and N:
        if 'N' in self.snp_table:
            self.compute_sample_size_per_snp()
        if 'MAF' in self.snp_table:
            self.compute_allele_frequency()

    def score(self, beta, standardize_genotype=False):
        """
        Perform linear scoring, i.e. multiply the genotype matrix by the vector of effect sizes, `beta`.

        :param beta: A vector of effect sizes for each variant in the genotype matrix.
        :param standardize_genotype: If True, standardized the genotype matrix when computing the score.
        """
        raise NotImplementedError

    def perform_gwas(self, **gwa_kwargs):
        """
        Perform genome-wide association testing of all variants against the phenotype.

        :param gwa_kwargs: Keyword arguments to pass to the GWA functions. Consult `stats.gwa.utils`
        for relevant keyword arguments for each backend.

        :raises NotImplementedError: If the method is not implemented in the subclass.
        """
        raise NotImplementedError

    def compute_allele_frequency(self):
        """
        Compute the allele frequency of each variant or SNP in the genotype matrix.

        :raises NotImplementedError: If the method is not implemented in the subclass.
        """
        raise NotImplementedError

    def compute_sample_size_per_snp(self):
        """
        Compute the sample size for each variant in the genotype matrix, accounting for
        potential missing values.

        :raises NotImplementedError: If the method is not implemented in the subclass.
        """
        raise NotImplementedError

    def split_by_chromosome(self):
        """
        Split the genotype matrix by chromosome, so that we would
        have a separate `GenotypeMatrix` objects for each chromosome.
        This method returns a dictionary where the key is the chromosome number
        and the value is an object of `GenotypeMatrix` for that chromosome.

        :return: A dictionary of `GenotypeMatrix` objects, one for each chromosome.
        """

        chromosome = self.chromosome

        if chromosome:
            return {chromosome: self}
        else:
            chrom_tables = self.snp_table.groupby('CHR')

            return {
                c: self.__class__(sample_table=self.sample_table,
                                  snp_table=chrom_tables.get_group(c),
                                  bed_file=self.bed_file,
                                  temp_dir=self.temp_dir.name,
                                  genome_build=self.genome_build,
                                  threads=self.threads)
                for c in chrom_tables.groups
            }

    def split_by_variants(self, variant_group_dict):
        """
        Split the genotype matrix by variants into separate `GenotypeMatrix` objects
        based on the groups defined in `variant_group_dict`. The dictionary should have
        the group name as the key and the list of SNP rsIDs in that group as the value.

        :param variant_group_dict: A dictionary where the key is the group name and the value
        is a list of SNP rsIDs to group together.

        :return: A dictionary of `GenotypeMatrix` objects, one for each group.
        """

        if isinstance(variant_group_dict, dict):

            variant_group_dict = pd.concat([
                pd.DataFrame({'group': group, 'SNP': snps})
                for group, snps in variant_group_dict.items()
            ])
        elif isinstance(variant_group_dict, pd.DataFrame):
            assert 'SNP' in variant_group_dict.columns and 'group' in variant_group_dict.columns
        else:
            raise ValueError("The variant group dictionary is invalid!")

        grouped_table = self.snp_table.merge(variant_group_dict, on='SNP').groupby('group')

        return {
            group: self.__class__(sample_table=self.sample_table,
                                  snp_table=grouped_table.get_group(group).drop(columns='group'),
                                  bed_file=self.bed_file,
                                  temp_dir=self.temp_dir.name,
                                  genome_build=self.genome_build,
                                  threads=self.threads)
            for group in grouped_table.groups
        }

    def cleanup(self):
        """
        Clean up all temporary files and directories
        """

        try:
            self.temp_dir.cleanup()
        except FileNotFoundError:
            pass

a1 property

Returns:

Type Description

The effect allele A1 for each genetic variant.

a2 property

See Also

Returns:

Type Description

The reference allele A2 for each genetic variant.

alt_allele property

See Also

Returns:

Type Description

The effect allele A1 for each genetic variant.

bp_pos property

Returns:

Type Description

The basepair position for the genetic variants in the genotype matrix.

chromosome property

..note:: This is a convenience method that assumes that the genotype matrix contains variants from a single chromosome. If there are multiple chromosomes, the method will return None.

Returns:

Type Description

The chromosome associated with the variants in the genotype matrix.

chromosomes property

Returns:

Type Description

The unique set of chromosomes comprising the genotype matrix.

cm_pos property

Returns:

Type Description

The position of genetic variants in the genotype matrix in units of Centi Morgan.

Raises:

Type Description
KeyError

If the genetic distance is not set in the genotype file.

effect_allele property

See Also

Returns:

Type Description

The effect allele A1 for each genetic variant.

genome_build property

Returns:

Type Description

The genome build or assembly under which the SNP coordinates are defined.

m property

See Also

Returns:

Type Description

The number of variants in the genotype matrix.

maf property

Returns:

Type Description

The minor allele frequency (MAF) of each variant in the genotype matrix.

maf_var property

Returns:

Type Description

The variance in minor allele frequency (MAF) of each variant in the genotype matrix.

n property

See Also

Returns:

Type Description

The sample size or number of individuals in the genotype matrix.

n_per_snp property

Returns:

Type Description

Sample size per genetic variant (accounting for potential missing values).

n_snps property

See Also

Returns:

Type Description

The number of variants in the genotype matrix.

ref_allele property

See Also

Returns:

Type Description

The reference allele A2 for each genetic variant.

sample_size property

See Also

Returns:

Type Description

The sample size or number of individuals in the genotype matrix.

samples property

Returns:

Type Description

An array of sample IDs in the genotype matrix.

shape property

Returns:

Type Description

The shape of the genotype matrix. Rows correspond to the number of samples and columns to the number of SNPs.

snps property

Returns:

Type Description

The SNP rsIDs for variants in the genotype matrix.

__init__(sample_table=None, snp_table=None, temp_dir='temp', bed_file=None, genome_build=None, threads=1, **kwargs)

Initialize a GenotypeMatrix object.

Parameters:

Name Type Description Default
sample_table Union[DataFrame, SampleTable, None]

A table containing information about the samples in the genotype matrix.

None
snp_table Union[DataFrame, None]

A table containing information about the genetic variants in the genotype matrix.

None
temp_dir str

The directory where temporary files will be stored (if needed).

'temp'
bed_file str

The path to the plink BED file containing the genotype matrix.

None
genome_build

The genome build or assembly under which the SNP coordinates are defined.

None
threads

The number of threads to use for parallel computations.

1
kwargs

Additional keyword arguments.

{}
Source code in magenpy/GenotypeMatrix.py
def __init__(self,
             sample_table: Union[pd.DataFrame, SampleTable, None] = None,
             snp_table: Union[pd.DataFrame, None] = None,
             temp_dir: str = 'temp',
             bed_file: str = None,
             genome_build=None,
             threads=1,
             **kwargs):
    """
    Initialize a GenotypeMatrix object.

    :param sample_table: A table containing information about the samples in the genotype matrix.
    :param snp_table: A table containing information about the genetic variants in the genotype matrix.
    :param temp_dir: The directory where temporary files will be stored (if needed).
    :param bed_file: The path to the plink BED file containing the genotype matrix.
    :param genome_build: The genome build or assembly under which the SNP coordinates are defined.
    :param threads: The number of threads to use for parallel computations.
    :param kwargs: Additional keyword arguments.
    """

    self.sample_table: Union[pd.DataFrame, SampleTable, None] = None
    self.snp_table: Union[pd.DataFrame, None] = snp_table

    if sample_table is not None:
        self.set_sample_table(sample_table)

    if snp_table is not None and 'original_index' not in self.snp_table.columns:
        self.snp_table['original_index'] = np.arange(len(self.snp_table))

    from .utils.system_utils import makedir

    makedir(temp_dir)

    temp_dir_prefix = 'gmat_'

    if self.chromosome is not None:
        temp_dir_prefix += f'chr{self.chromosome}_'

    self.temp_dir = tempfile.TemporaryDirectory(dir=temp_dir, prefix=temp_dir_prefix)

    self.bed_file = bed_file
    self._genome_build = genome_build

    self.threads = threads

cleanup()

Clean up all temporary files and directories

Source code in magenpy/GenotypeMatrix.py
def cleanup(self):
    """
    Clean up all temporary files and directories
    """

    try:
        self.temp_dir.cleanup()
    except FileNotFoundError:
        pass

compute_allele_frequency()

Compute the allele frequency of each variant or SNP in the genotype matrix.

Raises:

Type Description
NotImplementedError

If the method is not implemented in the subclass.

Source code in magenpy/GenotypeMatrix.py
def compute_allele_frequency(self):
    """
    Compute the allele frequency of each variant or SNP in the genotype matrix.

    :raises NotImplementedError: If the method is not implemented in the subclass.
    """
    raise NotImplementedError

compute_ld(estimator, output_dir, dtype='int16', compressor_name='zstd', compression_level=7, compute_spectral_properties=False, **ld_kwargs)

Compute the Linkage-Disequilibrium (LD) or SNP-by-SNP correlation matrix for the variants defined in the genotype matrix.

Parameters:

Name Type Description Default
estimator

The estimator for the LD matrix. We currently support 4 different estimators: sample, windowed, shrinkage, and block.

required
output_dir

The output directory where the Zarr array containing the entries of the LD matrix will be stored.

required
dtype

The data type for the entries of the LD matrix (supported data types are float32, float64 and integer quantized data types int8 and int16).

'int16'
compressor_name

The name of the compressor to use for the Zarr array.

'zstd'
compression_level

The compression level for the Zarr array (1-9)

7
ld_kwargs

keyword arguments for the various LD estimators. Consult the implementations of WindowedLD, ShrinkageLD, and BlockLD for details.

{}
compute_spectral_properties

If True, compute and store information about the eigenvalues of the LD matrix.

False
Source code in magenpy/GenotypeMatrix.py
def compute_ld(self,
               estimator,
               output_dir,
               dtype='int16',
               compressor_name='zstd',
               compression_level=7,
               compute_spectral_properties=False,
               **ld_kwargs):
    """

    Compute the Linkage-Disequilibrium (LD) or SNP-by-SNP correlation matrix
    for the variants defined in the genotype matrix.

    :param estimator: The estimator for the LD matrix. We currently support
    4 different estimators: `sample`, `windowed`, `shrinkage`, and `block`.
    :param output_dir: The output directory where the Zarr array containing the
    entries of the LD matrix will be stored.
    :param dtype: The data type for the entries of the LD matrix (supported data types are float32, float64
    and integer quantized data types int8 and int16).
    :param compressor_name: The name of the compressor to use for the Zarr array.
    :param compression_level: The compression level for the Zarr array (1-9)
    :param ld_kwargs: keyword arguments for the various LD estimators. Consult
    the implementations of `WindowedLD`, `ShrinkageLD`, and `BlockLD` for details.
    :param compute_spectral_properties: If True, compute and store information about the eigenvalues of
    the LD matrix.
    """

    from .stats.ld.estimator import SampleLD, WindowedLD, ShrinkageLD, BlockLD

    if estimator == 'sample':
        ld_est = SampleLD(self)
    elif estimator == 'windowed':
        ld_est = WindowedLD(self, **ld_kwargs)
    elif estimator == 'shrinkage':
        ld_est = ShrinkageLD(self, **ld_kwargs)
    elif estimator == 'block':
        ld_est = BlockLD(self, **ld_kwargs)
    else:
        raise KeyError(f"LD estimator {estimator} is not recognized!")

    return ld_est.compute(output_dir,
                          dtype=dtype,
                          compressor_name=compressor_name,
                          compression_level=compression_level,
                          compute_spectral_properties=compute_spectral_properties)

compute_sample_size_per_snp()

Compute the sample size for each variant in the genotype matrix, accounting for potential missing values.

Raises:

Type Description
NotImplementedError

If the method is not implemented in the subclass.

Source code in magenpy/GenotypeMatrix.py
def compute_sample_size_per_snp(self):
    """
    Compute the sample size for each variant in the genotype matrix, accounting for
    potential missing values.

    :raises NotImplementedError: If the method is not implemented in the subclass.
    """
    raise NotImplementedError

drop_duplicated_snps()

A convenience method to drop variants with duplicated SNP rsIDs.

Source code in magenpy/GenotypeMatrix.py
def drop_duplicated_snps(self):
    """
    A convenience method to drop variants with duplicated SNP rsIDs.
    """

    u_snps, counts = np.unique(self.snps, return_counts=True)
    if len(u_snps) < self.n_snps:
        # Keep only SNPs which occur once in the sequence:
        self.filter_snps(u_snps[counts == 1])

estimate_memory_allocation(dtype=np.float32)

Returns:

Type Description

An estimate of the memory allocation for the genotype matrix in megabytes.

Source code in magenpy/GenotypeMatrix.py
def estimate_memory_allocation(self, dtype=np.float32):
    """
    :return: An estimate of the memory allocation for the genotype matrix in megabytes.
    """
    return self.n * self.m * np.dtype(dtype).itemsize / 1024 ** 2

filter_by_allele_frequency(min_maf=None, min_mac=1)

Filter variants by minimum minor allele frequency or allele count cutoffs.

Parameters:

Name Type Description Default
min_maf

Minimum minor allele frequency

None
min_mac

Minimum minor allele count (1 by default)

1
Source code in magenpy/GenotypeMatrix.py
def filter_by_allele_frequency(self, min_maf=None, min_mac=1):
    """
    Filter variants by minimum minor allele frequency or allele count cutoffs.

    :param min_maf: Minimum minor allele frequency
    :param min_mac: Minimum minor allele count (1 by default)
    """

    if min_mac or min_maf:

        maf = self.maf
        n = self.n_per_snp

        keep_flag = None

        if min_mac:
            mac = (2*maf*n).astype(np.int64)
            keep_flag = (mac >= min_mac) & ((2*n - mac) >= min_mac)

        if min_maf:

            maf_cond = (maf >= min_maf) & (1. - maf >= min_maf)
            if keep_flag is not None:
                keep_flag = keep_flag & maf_cond
            else:
                keep_flag = maf_cond

        if keep_flag is not None:
            self.filter_snps(extract_snps=self.snps[keep_flag])

filter_samples(keep_samples=None, keep_file=None)

Filter samples from the genotype matrix. User must specify either a list of samples to keep or the path to a plink-style file with the list of samples to keep.

Parameters:

Name Type Description Default
keep_samples

A list (or array) of sample IDs to keep in the genotype matrix.

None
keep_file

The path to a file with the list of samples to keep.

None
Source code in magenpy/GenotypeMatrix.py
def filter_samples(self, keep_samples=None, keep_file=None):
    """
    Filter samples from the genotype matrix. User must specify
    either a list of samples to keep or the path to a plink-style file
    with the list of samples to keep.

    :param keep_samples: A list (or array) of sample IDs to keep in the genotype matrix.
    :param keep_file: The path to a file with the list of samples to keep.
    """

    self.sample_table.filter_samples(keep_samples=keep_samples, keep_file=keep_file)

    # IMPORTANT: After filtering samples, update SNP attributes that depend on the
    # samples, such as MAF and N:
    if 'N' in self.snp_table:
        self.compute_sample_size_per_snp()
    if 'MAF' in self.snp_table:
        self.compute_allele_frequency()

filter_snps(extract_snps=None, extract_file=None)

Filter variants from the genotype matrix. User must specify either a list of variants to extract or the path to a plink-style file with the list of variants to extract.

Parameters:

Name Type Description Default
extract_snps

A list (or array) of SNP IDs to keep in the genotype matrix.

None
extract_file

The path to a file with the list of variants to extract.

None
Source code in magenpy/GenotypeMatrix.py
def filter_snps(self, extract_snps=None, extract_file=None):
    """
    Filter variants from the genotype matrix. User must specify
    either a list of variants to extract or the path to a plink-style file
    with the list of variants to extract.

    :param extract_snps: A list (or array) of SNP IDs to keep in the genotype matrix.
    :param extract_file: The path to a file with the list of variants to extract.
    """

    assert extract_snps is not None or extract_file is not None

    if extract_snps is None:
        from .parsers.misc_parsers import read_snp_filter_file
        extract_snps = read_snp_filter_file(extract_file)

    self.snp_table = self.snp_table.merge(pd.DataFrame({'SNP': extract_snps}))

from_file(file_path, temp_dir='temp', **kwargs) classmethod

Initialize a genotype matrix object by passing a file path + other keyword arguments.

Parameters:

Name Type Description Default
file_path str

The path to the plink BED file.

required
temp_dir str

The directory where temporary files will be stored.

'temp'
kwargs

Additional keyword arguments.

{}
Source code in magenpy/GenotypeMatrix.py
@classmethod
def from_file(cls, file_path, temp_dir='temp', **kwargs):
    """
    Initialize a genotype matrix object by passing a file path + other keyword arguments.
    :param file_path: The path to the plink BED file.
    :type file_path: str
    :param temp_dir: The directory where temporary files will be stored.
    :type temp_dir: str
    :param kwargs: Additional keyword arguments.
    """
    raise NotImplementedError

get_snp_attribute(attr)

Parameters:

Name Type Description Default
attr

The name of the attribute to extract from the SNP table.

required

Returns:

Type Description

The values of a specific attribute for each variant in the genotype matrix.

Source code in magenpy/GenotypeMatrix.py
def get_snp_attribute(self, attr):
    """

    :param attr: The name of the attribute to extract from the SNP table.
    :return: The values of a specific attribute for each variant in the genotype matrix.
    """
    if self.snp_table is not None and attr in self.snp_table.columns:
        return self.snp_table[attr].values

get_snp_table(col_subset=None)

A convenience method to extract SNP-related information from the genotype matrix.

Parameters:

Name Type Description Default
col_subset

A list of columns to extract from the SNP table.

None

Returns:

Type Description

A pandas DataFrame with the requested columns.

Source code in magenpy/GenotypeMatrix.py
def get_snp_table(self, col_subset=None):
    """
    A convenience method to extract SNP-related information from the genotype matrix.
    :param col_subset: A list of columns to extract from the SNP table.

    :return: A `pandas` DataFrame with the requested columns.
    """

    if col_subset is None:
        return self.snp_table.copy()
    else:
        present_cols = list(set(col_subset).intersection(set(self.snp_table.columns)))
        non_present_cols = list(set(col_subset) - set(present_cols))

        if len(present_cols) > 0:
            table = self.snp_table[present_cols].copy()
        else:
            table = pd.DataFrame({c: [] for c in non_present_cols})

        for col in non_present_cols:

            if col == 'MAF':
                table['MAF'] = self.maf
            elif col == 'MAF_VAR':
                table['MAF_VAR'] = self.maf_var
            elif col == 'N':
                table['N'] = self.n_per_snp
            else:
                raise KeyError(f"Column '{col}' is not available in the SNP table!")

        return table[list(col_subset)]

perform_gwas(**gwa_kwargs)

Perform genome-wide association testing of all variants against the phenotype.

Parameters:

Name Type Description Default
gwa_kwargs

Keyword arguments to pass to the GWA functions. Consult stats.gwa.utils for relevant keyword arguments for each backend.

{}

Raises:

Type Description
NotImplementedError

If the method is not implemented in the subclass.

Source code in magenpy/GenotypeMatrix.py
def perform_gwas(self, **gwa_kwargs):
    """
    Perform genome-wide association testing of all variants against the phenotype.

    :param gwa_kwargs: Keyword arguments to pass to the GWA functions. Consult `stats.gwa.utils`
    for relevant keyword arguments for each backend.

    :raises NotImplementedError: If the method is not implemented in the subclass.
    """
    raise NotImplementedError

score(beta, standardize_genotype=False)

Perform linear scoring, i.e. multiply the genotype matrix by the vector of effect sizes, beta.

Parameters:

Name Type Description Default
beta

A vector of effect sizes for each variant in the genotype matrix.

required
standardize_genotype

If True, standardized the genotype matrix when computing the score.

False
Source code in magenpy/GenotypeMatrix.py
def score(self, beta, standardize_genotype=False):
    """
    Perform linear scoring, i.e. multiply the genotype matrix by the vector of effect sizes, `beta`.

    :param beta: A vector of effect sizes for each variant in the genotype matrix.
    :param standardize_genotype: If True, standardized the genotype matrix when computing the score.
    """
    raise NotImplementedError

set_sample_table(sample_table)

A convenience method set the sample table for the genotype matrix. This may be useful for syncing sample tables across different Genotype matrices corresponding to different chromosomes or genomic regions.

Parameters:

Name Type Description Default
sample_table

An instance of SampleTable or a pandas dataframe containing information about the samples in the genotype matrix.

required
Source code in magenpy/GenotypeMatrix.py
def set_sample_table(self, sample_table):
    """
    A convenience method set the sample table for the genotype matrix.
    This may be useful for syncing sample tables across different Genotype matrices
    corresponding to different chromosomes or genomic regions.

    :param sample_table: An instance of SampleTable or a pandas dataframe containing
    information about the samples in the genotype matrix.

    """

    if isinstance(sample_table, SampleTable):
        self.sample_table = sample_table
    elif isinstance(sample_table, pd.DataFrame):
        self.sample_table = SampleTable(sample_table)
    else:
        raise ValueError("The sample table is invalid! "
                         "Has to be either an instance of "
                         "SampleTable or pandas DataFrame.")

split_by_chromosome()

Split the genotype matrix by chromosome, so that we would have a separate GenotypeMatrix objects for each chromosome. This method returns a dictionary where the key is the chromosome number and the value is an object of GenotypeMatrix for that chromosome.

Returns:

Type Description

A dictionary of GenotypeMatrix objects, one for each chromosome.

Source code in magenpy/GenotypeMatrix.py
def split_by_chromosome(self):
    """
    Split the genotype matrix by chromosome, so that we would
    have a separate `GenotypeMatrix` objects for each chromosome.
    This method returns a dictionary where the key is the chromosome number
    and the value is an object of `GenotypeMatrix` for that chromosome.

    :return: A dictionary of `GenotypeMatrix` objects, one for each chromosome.
    """

    chromosome = self.chromosome

    if chromosome:
        return {chromosome: self}
    else:
        chrom_tables = self.snp_table.groupby('CHR')

        return {
            c: self.__class__(sample_table=self.sample_table,
                              snp_table=chrom_tables.get_group(c),
                              bed_file=self.bed_file,
                              temp_dir=self.temp_dir.name,
                              genome_build=self.genome_build,
                              threads=self.threads)
            for c in chrom_tables.groups
        }

split_by_variants(variant_group_dict)

Split the genotype matrix by variants into separate GenotypeMatrix objects based on the groups defined in variant_group_dict. The dictionary should have the group name as the key and the list of SNP rsIDs in that group as the value.

Parameters:

Name Type Description Default
variant_group_dict

A dictionary where the key is the group name and the value is a list of SNP rsIDs to group together.

required

Returns:

Type Description

A dictionary of GenotypeMatrix objects, one for each group.

Source code in magenpy/GenotypeMatrix.py
def split_by_variants(self, variant_group_dict):
    """
    Split the genotype matrix by variants into separate `GenotypeMatrix` objects
    based on the groups defined in `variant_group_dict`. The dictionary should have
    the group name as the key and the list of SNP rsIDs in that group as the value.

    :param variant_group_dict: A dictionary where the key is the group name and the value
    is a list of SNP rsIDs to group together.

    :return: A dictionary of `GenotypeMatrix` objects, one for each group.
    """

    if isinstance(variant_group_dict, dict):

        variant_group_dict = pd.concat([
            pd.DataFrame({'group': group, 'SNP': snps})
            for group, snps in variant_group_dict.items()
        ])
    elif isinstance(variant_group_dict, pd.DataFrame):
        assert 'SNP' in variant_group_dict.columns and 'group' in variant_group_dict.columns
    else:
        raise ValueError("The variant group dictionary is invalid!")

    grouped_table = self.snp_table.merge(variant_group_dict, on='SNP').groupby('group')

    return {
        group: self.__class__(sample_table=self.sample_table,
                              snp_table=grouped_table.get_group(group).drop(columns='group'),
                              bed_file=self.bed_file,
                              temp_dir=self.temp_dir.name,
                              genome_build=self.genome_build,
                              threads=self.threads)
        for group in grouped_table.groups
    }

plinkBEDGenotypeMatrix

Bases: GenotypeMatrix

A class that defines methods and interfaces for interacting with genotype matrices using plink2 software. This class provides a convenient interface to perform various computations on genotype matrices stored in the plink BED format.

This class inherits all the attributes of the GenotypeMatrix class.

Source code in magenpy/GenotypeMatrix.py
class plinkBEDGenotypeMatrix(GenotypeMatrix):
    """
    A class that defines methods and interfaces for interacting with genotype matrices
    using `plink2` software. This class provides a convenient interface to perform various
    computations on genotype matrices stored in the plink BED format.

    This class inherits all the attributes of the `GenotypeMatrix` class.
    """

    def __init__(self,
                 sample_table=None,
                 snp_table=None,
                 temp_dir='temp',
                 bed_file=None,
                 genome_build=None,
                 threads=1):
        """
        Initialize a `plinkBEDGenotypeMatrix` object.

        :param sample_table: A table containing information about the samples in the genotype matrix.
        :param snp_table: A table containing information about the genetic variants in the genotype matrix.
        :param temp_dir: The directory where temporary files will be stored (if needed).
        :param bed_file: The path to the plink BED file containing the genotype matrix.
        :param genome_build: The genome build or assembly under which the SNP coordinates are defined.
        :param threads: The number of threads to use for parallel computations.
        """

        super().__init__(sample_table=sample_table,
                         snp_table=snp_table,
                         temp_dir=temp_dir,
                         bed_file=bed_file,
                         genome_build=genome_build,
                         threads=threads)

        from .parsers.plink_parsers import parse_fam_file, parse_bim_file

        if self.bed_file is not None:
            self.bed_file = self.bed_file.replace('.bed', '')

        if self.sample_table is None and self.bed_file:
            self.sample_table = SampleTable(parse_fam_file(self.bed_file))

        if self.snp_table is None and self.bed_file:
            self.snp_table = parse_bim_file(self.bed_file)
            self.snp_table['original_index'] = np.arange(len(self.snp_table))

    @classmethod
    def from_file(cls, file_path, temp_dir='temp', **kwargs):
        """
        A convenience method to create a `plinkBEDGenotypeMatrix` object by
         providing a path to a PLINK BED file.

        :param file_path: The path to the plink BED file.
        :param temp_dir: The directory where temporary files will be stored.
        :param kwargs: Additional keyword arguments.
        """

        p_gt = cls(bed_file=file_path, temp_dir=temp_dir, **kwargs)

        return p_gt

    def score(self, beta, standardize_genotype=False):
        """
        Perform linear scoring on the genotype matrix. This function takes a vector (or matrix) of
        effect sizes and returns the matrix-vector or matrix-matrix product of the genotype matrix
        multiplied by the effect sizes.

        This can be used for polygenic score calculation or projecting the genotype matrix.

        :param beta: A vector or matrix of effect sizes for each variant in the genotype matrix.
        :param standardize_genotype: If True, standardize the genotype when computing the polygenic score.

        :return: The polygenic score (PGS) for each sample in the genotype matrix.
        """

        from .stats.score.utils import score_plink2

        # Create a temporary directory where we store intermediate results:
        tmp_score_dir = tempfile.TemporaryDirectory(dir=self.temp_dir.name, prefix='score_')

        return score_plink2(self, beta,
                            standardize_genotype=standardize_genotype,
                            temp_dir=tmp_score_dir.name)

    def perform_gwas(self, **gwa_kwargs):
        """
        Perform genome-wide association testing of all variants against the phenotype.
        This method calls specialized functions that, in turn, call `plink2` to perform
        the association testing.

        :return: A Summary statistics table containing the results of the association testing.
        """

        from .stats.gwa.utils import perform_gwa_plink2

        # Create a temporary directory where we store intermediate results:
        tmp_gwas_dir = tempfile.TemporaryDirectory(dir=self.temp_dir.name, prefix='gwas_')

        return perform_gwa_plink2(self, temp_dir=tmp_gwas_dir.name, **gwa_kwargs)

    def compute_allele_frequency(self):
        """
        Compute the allele frequency of each variant or SNP in the genotype matrix.
        This method calls specialized functions that, in turn, call `plink2` to compute
        allele frequency.
        """

        from .stats.variant.utils import compute_allele_frequency_plink2

        # Create a temporary directory where we store intermediate results:
        tmp_freq_dir = tempfile.TemporaryDirectory(dir=self.temp_dir.name, prefix='freq_')

        self.snp_table['MAF'] = compute_allele_frequency_plink2(self, temp_dir=tmp_freq_dir.name)

    def compute_sample_size_per_snp(self):
        """
        Compute the sample size for each variant in the genotype matrix, accounting for
        potential missing values.

        This method calls specialized functions that, in turn, call `plink2` to compute sample
        size per variant.
        """

        from .stats.variant.utils import compute_sample_size_per_snp_plink2

        # Create a temporary directory where we store intermediate results:
        tmp_miss_dir = tempfile.TemporaryDirectory(dir=self.temp_dir.name, prefix='miss_')

        self.snp_table['N'] = compute_sample_size_per_snp_plink2(self, temp_dir=tmp_miss_dir.name)

__init__(sample_table=None, snp_table=None, temp_dir='temp', bed_file=None, genome_build=None, threads=1)

Initialize a plinkBEDGenotypeMatrix object.

Parameters:

Name Type Description Default
sample_table

A table containing information about the samples in the genotype matrix.

None
snp_table

A table containing information about the genetic variants in the genotype matrix.

None
temp_dir

The directory where temporary files will be stored (if needed).

'temp'
bed_file

The path to the plink BED file containing the genotype matrix.

None
genome_build

The genome build or assembly under which the SNP coordinates are defined.

None
threads

The number of threads to use for parallel computations.

1
Source code in magenpy/GenotypeMatrix.py
def __init__(self,
             sample_table=None,
             snp_table=None,
             temp_dir='temp',
             bed_file=None,
             genome_build=None,
             threads=1):
    """
    Initialize a `plinkBEDGenotypeMatrix` object.

    :param sample_table: A table containing information about the samples in the genotype matrix.
    :param snp_table: A table containing information about the genetic variants in the genotype matrix.
    :param temp_dir: The directory where temporary files will be stored (if needed).
    :param bed_file: The path to the plink BED file containing the genotype matrix.
    :param genome_build: The genome build or assembly under which the SNP coordinates are defined.
    :param threads: The number of threads to use for parallel computations.
    """

    super().__init__(sample_table=sample_table,
                     snp_table=snp_table,
                     temp_dir=temp_dir,
                     bed_file=bed_file,
                     genome_build=genome_build,
                     threads=threads)

    from .parsers.plink_parsers import parse_fam_file, parse_bim_file

    if self.bed_file is not None:
        self.bed_file = self.bed_file.replace('.bed', '')

    if self.sample_table is None and self.bed_file:
        self.sample_table = SampleTable(parse_fam_file(self.bed_file))

    if self.snp_table is None and self.bed_file:
        self.snp_table = parse_bim_file(self.bed_file)
        self.snp_table['original_index'] = np.arange(len(self.snp_table))

compute_allele_frequency()

Compute the allele frequency of each variant or SNP in the genotype matrix. This method calls specialized functions that, in turn, call plink2 to compute allele frequency.

Source code in magenpy/GenotypeMatrix.py
def compute_allele_frequency(self):
    """
    Compute the allele frequency of each variant or SNP in the genotype matrix.
    This method calls specialized functions that, in turn, call `plink2` to compute
    allele frequency.
    """

    from .stats.variant.utils import compute_allele_frequency_plink2

    # Create a temporary directory where we store intermediate results:
    tmp_freq_dir = tempfile.TemporaryDirectory(dir=self.temp_dir.name, prefix='freq_')

    self.snp_table['MAF'] = compute_allele_frequency_plink2(self, temp_dir=tmp_freq_dir.name)

compute_sample_size_per_snp()

Compute the sample size for each variant in the genotype matrix, accounting for potential missing values.

This method calls specialized functions that, in turn, call plink2 to compute sample size per variant.

Source code in magenpy/GenotypeMatrix.py
def compute_sample_size_per_snp(self):
    """
    Compute the sample size for each variant in the genotype matrix, accounting for
    potential missing values.

    This method calls specialized functions that, in turn, call `plink2` to compute sample
    size per variant.
    """

    from .stats.variant.utils import compute_sample_size_per_snp_plink2

    # Create a temporary directory where we store intermediate results:
    tmp_miss_dir = tempfile.TemporaryDirectory(dir=self.temp_dir.name, prefix='miss_')

    self.snp_table['N'] = compute_sample_size_per_snp_plink2(self, temp_dir=tmp_miss_dir.name)

from_file(file_path, temp_dir='temp', **kwargs) classmethod

A convenience method to create a plinkBEDGenotypeMatrix object by providing a path to a PLINK BED file.

Parameters:

Name Type Description Default
file_path

The path to the plink BED file.

required
temp_dir

The directory where temporary files will be stored.

'temp'
kwargs

Additional keyword arguments.

{}
Source code in magenpy/GenotypeMatrix.py
@classmethod
def from_file(cls, file_path, temp_dir='temp', **kwargs):
    """
    A convenience method to create a `plinkBEDGenotypeMatrix` object by
     providing a path to a PLINK BED file.

    :param file_path: The path to the plink BED file.
    :param temp_dir: The directory where temporary files will be stored.
    :param kwargs: Additional keyword arguments.
    """

    p_gt = cls(bed_file=file_path, temp_dir=temp_dir, **kwargs)

    return p_gt

perform_gwas(**gwa_kwargs)

Perform genome-wide association testing of all variants against the phenotype. This method calls specialized functions that, in turn, call plink2 to perform the association testing.

Returns:

Type Description

A Summary statistics table containing the results of the association testing.

Source code in magenpy/GenotypeMatrix.py
def perform_gwas(self, **gwa_kwargs):
    """
    Perform genome-wide association testing of all variants against the phenotype.
    This method calls specialized functions that, in turn, call `plink2` to perform
    the association testing.

    :return: A Summary statistics table containing the results of the association testing.
    """

    from .stats.gwa.utils import perform_gwa_plink2

    # Create a temporary directory where we store intermediate results:
    tmp_gwas_dir = tempfile.TemporaryDirectory(dir=self.temp_dir.name, prefix='gwas_')

    return perform_gwa_plink2(self, temp_dir=tmp_gwas_dir.name, **gwa_kwargs)

score(beta, standardize_genotype=False)

Perform linear scoring on the genotype matrix. This function takes a vector (or matrix) of effect sizes and returns the matrix-vector or matrix-matrix product of the genotype matrix multiplied by the effect sizes.

This can be used for polygenic score calculation or projecting the genotype matrix.

Parameters:

Name Type Description Default
beta

A vector or matrix of effect sizes for each variant in the genotype matrix.

required
standardize_genotype

If True, standardize the genotype when computing the polygenic score.

False

Returns:

Type Description

The polygenic score (PGS) for each sample in the genotype matrix.

Source code in magenpy/GenotypeMatrix.py
def score(self, beta, standardize_genotype=False):
    """
    Perform linear scoring on the genotype matrix. This function takes a vector (or matrix) of
    effect sizes and returns the matrix-vector or matrix-matrix product of the genotype matrix
    multiplied by the effect sizes.

    This can be used for polygenic score calculation or projecting the genotype matrix.

    :param beta: A vector or matrix of effect sizes for each variant in the genotype matrix.
    :param standardize_genotype: If True, standardize the genotype when computing the polygenic score.

    :return: The polygenic score (PGS) for each sample in the genotype matrix.
    """

    from .stats.score.utils import score_plink2

    # Create a temporary directory where we store intermediate results:
    tmp_score_dir = tempfile.TemporaryDirectory(dir=self.temp_dir.name, prefix='score_')

    return score_plink2(self, beta,
                        standardize_genotype=standardize_genotype,
                        temp_dir=tmp_score_dir.name)

xarrayGenotypeMatrix

Bases: GenotypeMatrix

A class that defines methods and interfaces for interacting with genotype matrices using the xarray library. In particular, the class leverages functionality provided by the pandas-plink package to represent on-disk genotype matrices as chunked multidimensional arrays that can be queried and manipulated efficiently and in parallel.

This class inherits all the attributes of the GenotypeMatrix class.

Attributes:

Name Type Description
xr_mat

The xarray object representing the genotype matrix.

Source code in magenpy/GenotypeMatrix.py
class xarrayGenotypeMatrix(GenotypeMatrix):
    """
    A class that defines methods and interfaces for interacting with genotype matrices
    using the `xarray` library. In particular, the class leverages functionality provided by
    the `pandas-plink` package to represent on-disk genotype matrices as chunked multidimensional
    arrays that can be queried and manipulated efficiently and in parallel.

    This class inherits all the attributes of the `GenotypeMatrix` class.

    :ivar xr_mat: The `xarray` object representing the genotype matrix.

    """

    def __init__(self,
                 sample_table=None,
                 snp_table=None,
                 bed_file=None,
                 temp_dir='temp',
                 xr_mat=None,
                 genome_build=None,
                 threads=1):
        """
        Initialize an xarrayGenotypeMatrix object.

        :param sample_table: A table containing information about the samples in the genotype matrix.
        :param snp_table: A table containing information about the genetic variants in the genotype matrix.
        :param bed_file: The path to the plink BED file containing the genotype matrix.
        :param temp_dir: The directory where temporary files will be stored (if needed).
        :param xr_mat: The xarray object representing the genotype matrix.
        :param genome_build: The genome build or assembly under which the SNP coordinates are defined.
        :param threads: The number of threads to use for parallel computations.
        """

        super().__init__(sample_table=sample_table,
                         snp_table=snp_table,
                         temp_dir=temp_dir,
                         bed_file=bed_file,
                         genome_build=genome_build,
                         threads=threads)

        # xarray matrix object, as defined by pandas-plink:
        self.xr_mat = xr_mat

    @classmethod
    def from_file(cls, file_path, temp_dir='temp', **kwargs):
        """
        Create a GenotypeMatrix object using a PLINK BED file with the help
        of the data structures defined in `pandas_plink`. The genotype matrix
        will be represented implicitly in an `xarray` object, and we will use it
        to perform various computations. This method is a utility function to
        construct the genotype matrix object from a plink BED file.

        :param file_path: Path to the plink BED file.
        :param temp_dir: The directory where the temporary files will be stored.
        :param kwargs: Additional keyword arguments.
        """

        from pandas_plink import read_plink1_bin

        try:
            xr_gt = read_plink1_bin(file_path + ".bed", ref="a0", verbose=False)
        except ValueError:
            xr_gt = read_plink1_bin(file_path, ref="a0", verbose=False)

        # Set the sample table:
        sample_table = xr_gt.sample.coords.to_dataset().to_dataframe()
        sample_table.columns = ['FID', 'IID', 'fatherID', 'motherID', 'sex', 'phenotype']
        sample_table.reset_index(inplace=True, drop=True)
        sample_table = sample_table.astype({
            'FID': str,
            'IID': str,
            'fatherID': str,
            'motherID': str,
            'sex': float,
            'phenotype': float
        })

        sample_table['phenotype'] = sample_table['phenotype'].replace({-9.: np.nan})

        # Set the snp table:
        snp_table = xr_gt.variant.coords.to_dataset().to_dataframe()
        snp_table.columns = ['CHR', 'SNP', 'cM', 'POS', 'A1', 'A2']
        snp_table.reset_index(inplace=True, drop=True)
        snp_table = snp_table.astype({
            'CHR': int,
            'SNP': str,
            'cM': np.float32,
            'POS': np.int32,
            'A1': str,
            'A2': str
        })

        g_mat = cls(sample_table=SampleTable(sample_table),
                    snp_table=snp_table,
                    temp_dir=temp_dir,
                    bed_file=file_path,
                    xr_mat=xr_gt,
                    **kwargs)

        return g_mat

    def set_sample_table(self, sample_table):
        """
        A convenience method set the sample table for the genotype matrix.
        This is useful for cases when we need to sync the sample table across chromosomes.

        :param sample_table: An instance of SampleTable or a pandas dataframe containing
        information about the samples in the genotype matrix.
        """

        super().set_sample_table(sample_table)

        try:
            if self.n != self.xr_mat.shape[0]:
                self.xr_mat = self.xr_mat.sel(sample=self.samples)
        except AttributeError:
            pass

    def filter_snps(self, extract_snps=None, extract_file=None):
        """
        Filter variants from the genotype matrix. User must specify either a list of variants to
        extract or the path to a file with the list of variants to extract.

        :param extract_snps: A list or array of SNP rsIDs to keep in the genotype matrix.
        :param extract_file: The path to a file with the list of variants to extract.
        """

        super().filter_snps(extract_snps=extract_snps, extract_file=extract_file)

        from .utils.compute_utils import intersect_arrays

        idx = intersect_arrays(self.xr_mat.variant.coords['snp'].values, self.snps, return_index=True)

        self.xr_mat = self.xr_mat.isel(variant=idx)

    def filter_samples(self, keep_samples=None, keep_file=None):
        """
        Filter samples from the genotype matrix.
        User must specify either a list of samples to keep or the path to a file with the list of samples to keep.

        :param keep_samples: A list (or array) of sample IDs to keep in the genotype matrix.
        :param keep_file: The path to a file with the list of samples to keep.
        """

        super().filter_samples(keep_samples=keep_samples, keep_file=keep_file)
        self.xr_mat = self.xr_mat.sel(sample=self.samples)

    def to_numpy(self, dtype=np.int8):
        """
        Convert the genotype matrix to a numpy array.
        :param dtype: The data type of the numpy array. Default: Int8

        :return: A numpy array representation of the genotype matrix.
        """

        return self.xr_mat.data.astype(dtype).compute()

    def to_csr(self, dtype=np.int8):
        """
        Convert the genotype matrix to a scipy sparse CSR matrix.
        :param dtype: The data type of the scipy array. Default: Int8

        :return: A `scipy` sparse CSR matrix representation of the genotype matrix.
        """

        mat = self.to_numpy(dtype=dtype)

        from scipy.sparse import csr_matrix

        return csr_matrix(mat)

    def score(self, beta, standardize_genotype=False, skip_na=True):
        """
        Perform linear scoring on the genotype matrix.
        :param beta: A vector or matrix of effect sizes for each variant in the genotype matrix.
        :param standardize_genotype: If True, standardize the genotype when computing the polygenic score.
        :param skip_na: If True, skip missing values when computing the polygenic score.

        :return: The polygenic score(s) (PGS) for each sample in the genotype matrix.

        """

        import dask.array as da

        chunked_beta = da.from_array(beta, chunks=self.xr_mat.data.chunksize[1])

        if standardize_genotype:
            from .stats.transforms.genotype import standardize
            pgs = da.dot(standardize(self.xr_mat).data, chunked_beta).compute()
        else:
            if skip_na:
                pgs = da.dot(da.nan_to_num(self.xr_mat.data), chunked_beta).compute()
            else:
                pgs = da.dot(self.xr_mat.fillna(self.maf).data, chunked_beta).compute()

        return pgs

    def perform_gwas(self, **gwa_kwargs):
        """
        A convenience method that calls specialized utility functions that perform
        genome-wide association testing of all variants against the phenotype.

        :return: A Summary statistics table containing the results of the association testing.
        """

        from .stats.gwa.utils import perform_gwa_xarray
        return perform_gwa_xarray(self, **gwa_kwargs)

    def compute_allele_frequency(self):
        """
        A convenience method that calls specialized utility functions that
        compute the allele frequency of each variant or SNP in the genotype matrix.
        """
        self.snp_table['MAF'] = (self.xr_mat.sum(axis=0) / (2. * self.n_per_snp)).compute().values

    def compute_sample_size_per_snp(self):
        """
        A convenience method that calls specialized utility functions that compute
        the sample size for each variant in the genotype matrix, accounting for
        potential missing values.
        """
        self.snp_table['N'] = self.xr_mat.shape[0] - self.xr_mat.isnull().sum(axis=0).compute().values

    def split_by_chromosome(self):
        """
        Split the genotype matrix by chromosome.
        :return: A dictionary of `xarrayGenotypeMatrix` objects, one for each chromosome.
        """
        split = super().split_by_chromosome()

        for c, gt in split.items():
            gt.xr_mat = self.xr_mat
            if len(split) > 1:
                gt.filter_snps(extract_snps=gt.snps)

        return split

    def split_by_variants(self, variant_group_dict):
        """
        Split the genotype matrix by variants into separate `xarrayGenotypeMatrix` objects
        based on the groups defined in `variant_group_dict`. The dictionary should have
        the group name as the key and the list of SNP rsIDs in that group as the value.

        :param variant_group_dict: A dictionary where the key is the group name and the value
        is a list of SNP rsIDs to group together.

        :return: A dictionary of `xarrayGenotypeMatrix` objects, one for each group.
        """

        split = super().split_by_variants(variant_group_dict)

        for g, gt in split.items():
            gt.xr_mat = self.xr_mat.copy()
            gt.filter_snps(extract_snps=gt.snps)

        return split

__init__(sample_table=None, snp_table=None, bed_file=None, temp_dir='temp', xr_mat=None, genome_build=None, threads=1)

Initialize an xarrayGenotypeMatrix object.

Parameters:

Name Type Description Default
sample_table

A table containing information about the samples in the genotype matrix.

None
snp_table

A table containing information about the genetic variants in the genotype matrix.

None
bed_file

The path to the plink BED file containing the genotype matrix.

None
temp_dir

The directory where temporary files will be stored (if needed).

'temp'
xr_mat

The xarray object representing the genotype matrix.

None
genome_build

The genome build or assembly under which the SNP coordinates are defined.

None
threads

The number of threads to use for parallel computations.

1
Source code in magenpy/GenotypeMatrix.py
def __init__(self,
             sample_table=None,
             snp_table=None,
             bed_file=None,
             temp_dir='temp',
             xr_mat=None,
             genome_build=None,
             threads=1):
    """
    Initialize an xarrayGenotypeMatrix object.

    :param sample_table: A table containing information about the samples in the genotype matrix.
    :param snp_table: A table containing information about the genetic variants in the genotype matrix.
    :param bed_file: The path to the plink BED file containing the genotype matrix.
    :param temp_dir: The directory where temporary files will be stored (if needed).
    :param xr_mat: The xarray object representing the genotype matrix.
    :param genome_build: The genome build or assembly under which the SNP coordinates are defined.
    :param threads: The number of threads to use for parallel computations.
    """

    super().__init__(sample_table=sample_table,
                     snp_table=snp_table,
                     temp_dir=temp_dir,
                     bed_file=bed_file,
                     genome_build=genome_build,
                     threads=threads)

    # xarray matrix object, as defined by pandas-plink:
    self.xr_mat = xr_mat

compute_allele_frequency()

A convenience method that calls specialized utility functions that compute the allele frequency of each variant or SNP in the genotype matrix.

Source code in magenpy/GenotypeMatrix.py
def compute_allele_frequency(self):
    """
    A convenience method that calls specialized utility functions that
    compute the allele frequency of each variant or SNP in the genotype matrix.
    """
    self.snp_table['MAF'] = (self.xr_mat.sum(axis=0) / (2. * self.n_per_snp)).compute().values

compute_sample_size_per_snp()

A convenience method that calls specialized utility functions that compute the sample size for each variant in the genotype matrix, accounting for potential missing values.

Source code in magenpy/GenotypeMatrix.py
def compute_sample_size_per_snp(self):
    """
    A convenience method that calls specialized utility functions that compute
    the sample size for each variant in the genotype matrix, accounting for
    potential missing values.
    """
    self.snp_table['N'] = self.xr_mat.shape[0] - self.xr_mat.isnull().sum(axis=0).compute().values

filter_samples(keep_samples=None, keep_file=None)

Filter samples from the genotype matrix. User must specify either a list of samples to keep or the path to a file with the list of samples to keep.

Parameters:

Name Type Description Default
keep_samples

A list (or array) of sample IDs to keep in the genotype matrix.

None
keep_file

The path to a file with the list of samples to keep.

None
Source code in magenpy/GenotypeMatrix.py
def filter_samples(self, keep_samples=None, keep_file=None):
    """
    Filter samples from the genotype matrix.
    User must specify either a list of samples to keep or the path to a file with the list of samples to keep.

    :param keep_samples: A list (or array) of sample IDs to keep in the genotype matrix.
    :param keep_file: The path to a file with the list of samples to keep.
    """

    super().filter_samples(keep_samples=keep_samples, keep_file=keep_file)
    self.xr_mat = self.xr_mat.sel(sample=self.samples)

filter_snps(extract_snps=None, extract_file=None)

Filter variants from the genotype matrix. User must specify either a list of variants to extract or the path to a file with the list of variants to extract.

Parameters:

Name Type Description Default
extract_snps

A list or array of SNP rsIDs to keep in the genotype matrix.

None
extract_file

The path to a file with the list of variants to extract.

None
Source code in magenpy/GenotypeMatrix.py
def filter_snps(self, extract_snps=None, extract_file=None):
    """
    Filter variants from the genotype matrix. User must specify either a list of variants to
    extract or the path to a file with the list of variants to extract.

    :param extract_snps: A list or array of SNP rsIDs to keep in the genotype matrix.
    :param extract_file: The path to a file with the list of variants to extract.
    """

    super().filter_snps(extract_snps=extract_snps, extract_file=extract_file)

    from .utils.compute_utils import intersect_arrays

    idx = intersect_arrays(self.xr_mat.variant.coords['snp'].values, self.snps, return_index=True)

    self.xr_mat = self.xr_mat.isel(variant=idx)

from_file(file_path, temp_dir='temp', **kwargs) classmethod

Create a GenotypeMatrix object using a PLINK BED file with the help of the data structures defined in pandas_plink. The genotype matrix will be represented implicitly in an xarray object, and we will use it to perform various computations. This method is a utility function to construct the genotype matrix object from a plink BED file.

Parameters:

Name Type Description Default
file_path

Path to the plink BED file.

required
temp_dir

The directory where the temporary files will be stored.

'temp'
kwargs

Additional keyword arguments.

{}
Source code in magenpy/GenotypeMatrix.py
@classmethod
def from_file(cls, file_path, temp_dir='temp', **kwargs):
    """
    Create a GenotypeMatrix object using a PLINK BED file with the help
    of the data structures defined in `pandas_plink`. The genotype matrix
    will be represented implicitly in an `xarray` object, and we will use it
    to perform various computations. This method is a utility function to
    construct the genotype matrix object from a plink BED file.

    :param file_path: Path to the plink BED file.
    :param temp_dir: The directory where the temporary files will be stored.
    :param kwargs: Additional keyword arguments.
    """

    from pandas_plink import read_plink1_bin

    try:
        xr_gt = read_plink1_bin(file_path + ".bed", ref="a0", verbose=False)
    except ValueError:
        xr_gt = read_plink1_bin(file_path, ref="a0", verbose=False)

    # Set the sample table:
    sample_table = xr_gt.sample.coords.to_dataset().to_dataframe()
    sample_table.columns = ['FID', 'IID', 'fatherID', 'motherID', 'sex', 'phenotype']
    sample_table.reset_index(inplace=True, drop=True)
    sample_table = sample_table.astype({
        'FID': str,
        'IID': str,
        'fatherID': str,
        'motherID': str,
        'sex': float,
        'phenotype': float
    })

    sample_table['phenotype'] = sample_table['phenotype'].replace({-9.: np.nan})

    # Set the snp table:
    snp_table = xr_gt.variant.coords.to_dataset().to_dataframe()
    snp_table.columns = ['CHR', 'SNP', 'cM', 'POS', 'A1', 'A2']
    snp_table.reset_index(inplace=True, drop=True)
    snp_table = snp_table.astype({
        'CHR': int,
        'SNP': str,
        'cM': np.float32,
        'POS': np.int32,
        'A1': str,
        'A2': str
    })

    g_mat = cls(sample_table=SampleTable(sample_table),
                snp_table=snp_table,
                temp_dir=temp_dir,
                bed_file=file_path,
                xr_mat=xr_gt,
                **kwargs)

    return g_mat

perform_gwas(**gwa_kwargs)

A convenience method that calls specialized utility functions that perform genome-wide association testing of all variants against the phenotype.

Returns:

Type Description

A Summary statistics table containing the results of the association testing.

Source code in magenpy/GenotypeMatrix.py
def perform_gwas(self, **gwa_kwargs):
    """
    A convenience method that calls specialized utility functions that perform
    genome-wide association testing of all variants against the phenotype.

    :return: A Summary statistics table containing the results of the association testing.
    """

    from .stats.gwa.utils import perform_gwa_xarray
    return perform_gwa_xarray(self, **gwa_kwargs)

score(beta, standardize_genotype=False, skip_na=True)

Perform linear scoring on the genotype matrix.

Parameters:

Name Type Description Default
beta

A vector or matrix of effect sizes for each variant in the genotype matrix.

required
standardize_genotype

If True, standardize the genotype when computing the polygenic score.

False
skip_na

If True, skip missing values when computing the polygenic score.

True

Returns:

Type Description

The polygenic score(s) (PGS) for each sample in the genotype matrix.

Source code in magenpy/GenotypeMatrix.py
def score(self, beta, standardize_genotype=False, skip_na=True):
    """
    Perform linear scoring on the genotype matrix.
    :param beta: A vector or matrix of effect sizes for each variant in the genotype matrix.
    :param standardize_genotype: If True, standardize the genotype when computing the polygenic score.
    :param skip_na: If True, skip missing values when computing the polygenic score.

    :return: The polygenic score(s) (PGS) for each sample in the genotype matrix.

    """

    import dask.array as da

    chunked_beta = da.from_array(beta, chunks=self.xr_mat.data.chunksize[1])

    if standardize_genotype:
        from .stats.transforms.genotype import standardize
        pgs = da.dot(standardize(self.xr_mat).data, chunked_beta).compute()
    else:
        if skip_na:
            pgs = da.dot(da.nan_to_num(self.xr_mat.data), chunked_beta).compute()
        else:
            pgs = da.dot(self.xr_mat.fillna(self.maf).data, chunked_beta).compute()

    return pgs

set_sample_table(sample_table)

A convenience method set the sample table for the genotype matrix. This is useful for cases when we need to sync the sample table across chromosomes.

Parameters:

Name Type Description Default
sample_table

An instance of SampleTable or a pandas dataframe containing information about the samples in the genotype matrix.

required
Source code in magenpy/GenotypeMatrix.py
def set_sample_table(self, sample_table):
    """
    A convenience method set the sample table for the genotype matrix.
    This is useful for cases when we need to sync the sample table across chromosomes.

    :param sample_table: An instance of SampleTable or a pandas dataframe containing
    information about the samples in the genotype matrix.
    """

    super().set_sample_table(sample_table)

    try:
        if self.n != self.xr_mat.shape[0]:
            self.xr_mat = self.xr_mat.sel(sample=self.samples)
    except AttributeError:
        pass

split_by_chromosome()

Split the genotype matrix by chromosome.

Returns:

Type Description

A dictionary of xarrayGenotypeMatrix objects, one for each chromosome.

Source code in magenpy/GenotypeMatrix.py
def split_by_chromosome(self):
    """
    Split the genotype matrix by chromosome.
    :return: A dictionary of `xarrayGenotypeMatrix` objects, one for each chromosome.
    """
    split = super().split_by_chromosome()

    for c, gt in split.items():
        gt.xr_mat = self.xr_mat
        if len(split) > 1:
            gt.filter_snps(extract_snps=gt.snps)

    return split

split_by_variants(variant_group_dict)

Split the genotype matrix by variants into separate xarrayGenotypeMatrix objects based on the groups defined in variant_group_dict. The dictionary should have the group name as the key and the list of SNP rsIDs in that group as the value.

Parameters:

Name Type Description Default
variant_group_dict

A dictionary where the key is the group name and the value is a list of SNP rsIDs to group together.

required

Returns:

Type Description

A dictionary of xarrayGenotypeMatrix objects, one for each group.

Source code in magenpy/GenotypeMatrix.py
def split_by_variants(self, variant_group_dict):
    """
    Split the genotype matrix by variants into separate `xarrayGenotypeMatrix` objects
    based on the groups defined in `variant_group_dict`. The dictionary should have
    the group name as the key and the list of SNP rsIDs in that group as the value.

    :param variant_group_dict: A dictionary where the key is the group name and the value
    is a list of SNP rsIDs to group together.

    :return: A dictionary of `xarrayGenotypeMatrix` objects, one for each group.
    """

    split = super().split_by_variants(variant_group_dict)

    for g, gt in split.items():
        gt.xr_mat = self.xr_mat.copy()
        gt.filter_snps(extract_snps=gt.snps)

    return split

to_csr(dtype=np.int8)

Convert the genotype matrix to a scipy sparse CSR matrix.

Parameters:

Name Type Description Default
dtype

The data type of the scipy array. Default: Int8

int8

Returns:

Type Description

A scipy sparse CSR matrix representation of the genotype matrix.

Source code in magenpy/GenotypeMatrix.py
def to_csr(self, dtype=np.int8):
    """
    Convert the genotype matrix to a scipy sparse CSR matrix.
    :param dtype: The data type of the scipy array. Default: Int8

    :return: A `scipy` sparse CSR matrix representation of the genotype matrix.
    """

    mat = self.to_numpy(dtype=dtype)

    from scipy.sparse import csr_matrix

    return csr_matrix(mat)

to_numpy(dtype=np.int8)

Convert the genotype matrix to a numpy array.

Parameters:

Name Type Description Default
dtype

The data type of the numpy array. Default: Int8

int8

Returns:

Type Description

A numpy array representation of the genotype matrix.

Source code in magenpy/GenotypeMatrix.py
def to_numpy(self, dtype=np.int8):
    """
    Convert the genotype matrix to a numpy array.
    :param dtype: The data type of the numpy array. Default: Int8

    :return: A numpy array representation of the genotype matrix.
    """

    return self.xr_mat.data.astype(dtype).compute()