Construct the W_ah Design Matrix from Phased Haplotype Genotypes

Description

Build the multi-allelic design matrix \(W_{\alpha h}\) (Da, 2015) from phased haplotype calls. The matrix is the input expected by run_bayesr and run_bayesa.

Usage

construct_wah_matrix(
  hap_matrix,
  colnames,
  allele_freq_filtered = NULL,
  reference_structure = NULL,
  drop_baseline = TRUE
)

Arguments

`hap_matrix`	Integer matrix of phased haplotype allele codes, with dimensions `n x 2*blocks`. Coerced to integer storage internally.
`colnames`	Column names of `hap_matrix` (length `2*blocks`).
`allele_freq_filtered`	A `data.frame` with columns `haplotype`, `allele`, and `freq`. Required for the training set; pass `NULL` for the test set.
`reference_structure`	The full return value of a prior training-set call. Required for the test set; pass `NULL` for the training set.
`drop_baseline`	If `TRUE` (default) the most-frequent allele per block is dropped to ensure full rank.

Details

The haplotype matrix has dimensions n x 2*blocks: each individual contributes two phased copies per block. Column names encode block identity (e.g. hap_1_1 and hap_1_1_1 are the two copies of block 1).

For each block, the most-frequent allele (the baseline, dropped when drop_baseline = TRUE) is removed for identifiability; the remaining alleles become columns of \(W_{\alpha h}\) encoded with the Da (2015) rule. A frequency-weighted projection enforces the sum-to-zero constraint \(E[W_k] = 0\).

Cross-validation / train-test split workflow:

Build the matrix on the training set: pass allele_freq_filtered = <freq_table> and reference_structure = NULL.
Build the matrix on the test set with the training structure: pass the entire training-set return value as reference_structure and leave allele_freq_filtered = NULL. This guarantees the test matrix has identical columns and the same baseline alleles as the training matrix.

Value

A list with three elements:

W_ah: Numeric matrix n x p with named columns <block>_allele<k>.
allele_info: Data frame describing each retained column (allele_id, block, allele, freq).
dropped_alleles: Data frame of baseline alleles that were removed; empty if drop_baseline = FALSE.

References

Da, Y. (2015). Multi-allelic haplotype model based on genetic partition for genomic prediction and variance component estimation using SNP markers. BMC Genetics, 16(1), 144. doi:10.1186/s12863-015-0301-1

Examples

library("masbayes")

set.seed(1)
n_ind              <- 30
n_block            <- 10
n_allele_per_block <- 3

# Simulate a phased haplotype matrix (n x 2*blocks).
# Each pair of columns holds the two phased copies of one block.
hap <- matrix(
  sample.int(n_allele_per_block, n_ind * n_block * 2, replace = TRUE),
  nrow = n_ind
)
colnames(hap) <- paste0("hap_", rep(seq_len(n_block), each = 2))

# Allele frequency table for the training set
freq <- data.frame(
  haplotype = paste0("hap_",
                     rep(seq_len(n_block), each = n_allele_per_block)),
  allele    = rep(seq_len(n_allele_per_block), n_block),
  freq      = rep(c(0.5, 0.3, 0.2), n_block)
)

# Training set
train   <- construct_wah_matrix(hap, colnames(hap), freq)
W_train <- train$W_ah

# Test set: reuse the training allele structure
hap_te <- matrix(
  sample.int(n_allele_per_block, 10 * n_block * 2, replace = TRUE),
  nrow = 10
)
colnames(hap_te) <- colnames(hap)
test   <- construct_wah_matrix(hap_te, colnames(hap_te),
                               reference_structure = train)
W_test <- test$W_ah
stopifnot(ncol(W_train) == ncol(W_test))