Construct a Centered SNP Design Matrix

Construct a Centered SNP Design Matrix

Description

Build a centered SNP design matrix \(W_{ij} = X_{ij} - 2 p_j\) from an allele-dosage matrix \(X\) (entries 0, 1, 2). The output is suitable for direct use with run_bayesr and run_bayesa, which are agnostic to whether the design matrix encodes biallelic SNPs or multi-allelic haplotypes.

Usage

construct_snp_matrix(X, ref_freq = NULL)

Arguments

X Numeric or integer matrix of allele dosages (typically 0, 1, or 2) with dimensions n x p_snp. Coerced to double internally.
ref_freq Optional numeric vector of length ncol(X) giving the reference allele frequencies \(p_j\) from the training set. If NULL (default) the frequencies are computed from X itself (training-set use).

Details

Training / test workflow. Always center the test set with training allele frequencies, never with frequencies recomputed from the test set itself:

  1. Training: call construct_snp_matrix(X_train) (leave ref_freq = NULL). The returned $freq contains the per-SNP reference frequencies \(p_j = mean(X_{train,j}) / 2\).

  2. Test: call construct_snp_matrix(X_test, ref_freq = train$freq) so the columns of W_test are aligned with W_train on the same baseline frequencies.

For phased multi-allelic haplotype data, use construct_wah_matrix instead. For other marker representations, build the design matrix manually and pass it to the Bayesian fitters.

Value

A list with elements:

W
Numeric matrix n x p_snp of centered dosages \(W_{ij} = X_{ij} - 2 p_j\).
freq
Numeric vector of length p_snp with the allele frequencies used for centering.
n, p
Number of individuals and SNPs.

See Also

construct_wah_matrix, run_bayesr, run_bayesa

Examples

library("masbayes")

set.seed(1)
n_ind <- 30
n_snp <- 50
maf   <- runif(n_snp, 0.1, 0.5)

# Simulate dosage matrix (0/1/2) for training and test sets
X_train <- matrix(rbinom(n_ind * n_snp, 2, prob = maf),
                  nrow = n_ind)
X_test  <- matrix(rbinom(10    * n_snp, 2, prob = maf),
                  nrow = 10)

# Training: compute frequencies from X_train
train   <- construct_snp_matrix(X_train)
W_train <- train$W

# Test: reuse training frequencies (do NOT recompute from X_test)
test    <- construct_snp_matrix(X_test, ref_freq = train$freq)
W_test  <- test$W

stopifnot(ncol(W_train) == ncol(W_test))

# The returned W can be passed directly to run_bayesr() or run_bayesa()
# together with the response y_train.