library("masbayes")
set.seed(1)
n_ind <- 30
n_block <- 10
n_allele_per_block <- 3
# Simulate a phased haplotype matrix (n x 2*blocks).
# Each pair of columns holds the two phased copies of one block.
hap <- matrix(
sample.int(n_allele_per_block, n_ind * n_block * 2, replace = TRUE),
nrow = n_ind
)
colnames(hap) <- paste0("hap_", rep(seq_len(n_block), each = 2))
# Allele frequency table for the training set
freq <- data.frame(
haplotype = paste0("hap_",
rep(seq_len(n_block), each = n_allele_per_block)),
allele = rep(seq_len(n_allele_per_block), n_block),
freq = rep(c(0.5, 0.3, 0.2), n_block)
)
# Training set
train <- construct_wah_matrix(hap, colnames(hap), freq)
W_train <- train$W_ah
# Test set: reuse the training allele structure
hap_te <- matrix(
sample.int(n_allele_per_block, 10 * n_block * 2, replace = TRUE),
nrow = 10
)
colnames(hap_te) <- colnames(hap)
test <- construct_wah_matrix(hap_te, colnames(hap_te),
reference_structure = train)
W_test <- test$W_ah
stopifnot(ncol(W_train) == ncol(W_test))Construct the W_ah Design Matrix from Phased Haplotype Genotypes
Construct the W_ah Design Matrix from Phased Haplotype Genotypes
Description
Build the multi-allelic design matrix \(W_{\alpha h}\) (Da, 2015) from phased haplotype calls. The matrix is the input expected by run_bayesr and run_bayesa.
Usage
construct_wah_matrix(
hap_matrix,
colnames,
allele_freq_filtered = NULL,
reference_structure = NULL,
drop_baseline = TRUE
)
Arguments
hap_matrix
|
Integer matrix of phased haplotype allele codes, with dimensions n x 2*blocks. Coerced to integer storage internally.
|
colnames
|
Column names of hap_matrix (length 2*blocks).
|
allele_freq_filtered
|
A data.frame with columns haplotype, allele, and freq. Required for the training set; pass NULL for the test set.
|
reference_structure
|
The full return value of a prior training-set call. Required for the test set; pass NULL for the training set.
|
drop_baseline
|
If TRUE (default) the most-frequent allele per block is dropped to ensure full rank.
|
Details
The haplotype matrix has dimensions n x 2*blocks: each individual contributes two phased copies per block. Column names encode block identity (e.g. hap_1_1 and hap_1_1_1 are the two copies of block 1).
For each block, the most-frequent allele (the baseline, dropped when drop_baseline = TRUE) is removed for identifiability; the remaining alleles become columns of \(W_{\alpha h}\) encoded with the Da (2015) rule. A frequency-weighted projection enforces the sum-to-zero constraint \(E[W_k] = 0\).
Cross-validation / train-test split workflow:
-
Build the matrix on the training set: pass
allele_freq_filtered = <freq_table>andreference_structure = NULL. -
Build the matrix on the test set with the training structure: pass the entire training-set return value as
reference_structureand leaveallele_freq_filtered = NULL. This guarantees the test matrix has identical columns and the same baseline alleles as the training matrix.
Value
A list with three elements:
-
W_ah -
Numeric matrix
n x pwith named columns<block>_allele<k>. -
allele_info -
Data frame describing each retained column (
allele_id, block, allele, freq). -
dropped_alleles -
Data frame of baseline alleles that were removed; empty if
drop_baseline = FALSE.
References
Da, Y. (2015). Multi-allelic haplotype model based on genetic partition for genomic prediction and variance component estimation using SNP markers. BMC Genetics, 16(1), 144. doi:10.1186/s12863-015-0301-1
See Also
run_bayesr, run_bayesa