Phenotype
Schema
Both packages consume phenotypes as a data frame with one row per individual and at least three column kinds:
| Column kind | Type | Required by | Description |
|---|---|---|---|
| Identifier | character | both | Joins to rownames(snp) / rownames(mh)
|
| Fixed-effect covariates | numeric or factor | both | e.g. sex, age, batch |
| Response(s) | numeric or integer 0/1 | both | One column per trait — continuous OR binary |
(Optional) tbv_*
|
numeric | none | True breeding values; only meaningful for simulation studies |
The demo dataset bundles both continuous and binary versions of two traits (SNP-architecture and MH-architecture). Standalone CSVs split them out:
| File | Columns |
|---|---|
pheno_continuous_G1.csv |
id, sex, y_cont_qtl_snp, y_cont_qtl_mh, tbv_qtl_snp, tbv_qtl_mh |
pheno_binary_G1.csv |
id, sex, y_bin_qtl_snp, y_bin_qtl_mh, tbv_qtl_snp, tbv_qtl_mh |
d$pheno carries all eight columns in one frame.
Example (from the demo dataset)
Continuous trait (pheno_continuous_G1.csv)
Columns: id, sex, y_cont_qtl_snp, y_cont_qtl_mh, tbv_qtl_snp, tbv_qtl_mh.
Download pheno_continuous_G1.csv (large) Download pheno_continuous_G1.csv (small)
Binary trait (pheno_binary_G1.csv)
Same schema but with the median-thresholded y_bin_* columns in place of the continuous ones. Prevalence is 0.5 by construction.
Download pheno_binary_G1.csv (large) Download pheno_binary_G1.csv (small)
Case/control balance
QTL ground truth
Ten QTL under each architecture (SNP and MH), with unit-length-normalised effects (\(\sum \beta_i^2 = 1\)). Use these for QTL recovery analysis only, not as model input. Four standalone .rds files mirror d$qtl with human-readable identifiers — SNP names and MH block ids — so downstream tools can join against d$map_snp / d$map_mh directly.
Causal SNPs
Causal MH blocks
d$qtl$mh_idx indexes columns of \(W_{ah}\) rather than block columns directly, because the causal “allele” within a multi-allelic block is itself a free parameter. The derived qtl_mh_blocks.rds reduces each \(W_{ah}\) column to its parent block id; if two QTL fall in the same block, the block id appears twice.
Download qtl_snp_ids.rds Download qtl_mh_blocks.rds Download effects_snp.rds Download effects_mh.rds
Loading
library(masreml)
d <- masreml::load_data("large")
# Continuous trait, SNP architecture
y_cont <- d$pheno$y_cont_qtl_snp
# Binary trait, MH architecture
y_bin <- d$pheno$y_bin_qtl_mh
# Or load the standalone CSVs directly:
pheno_c <- read.csv("demo-data/main/pheno_continuous_G1.csv")
pheno_b <- read.csv("demo-data/main/pheno_binary_G1.csv")Passing phenotypes to each package
masreml::masreml() takes the response as a named numeric vector y, the fixed-effects design matrix X explicitly (build it with stats::model.matrix()), and markers as a named list (snp_add, snp_dom, mh_add). There is no formula argument and no direct pedigree argument — for pedigree-based prediction, precompute \(\mathbf{A}\) via build_A_ped() and pass it through the G slot.
# Continuous trait, SNP-additive GBLUP with sex as fixed effect.
y <- d$pheno$y_cont_qtl_snp
names(y) <- d$pheno$id
X_fixed <- stats::model.matrix(~ sex, data = d$pheno)
fit <- masreml::masreml(
y = y,
X = X_fixed,
markers = list(snp_add = d$snp),
trait = "continuous"
)
# masbayes: BayesA on the same trait. The wrapper takes the
# pre-standardised design matrix `w` plus colSums(w^2) as inputs;
# sigma2_g and sigma2_e_init are required hyperparameters.
W <- masbayes::construct_snp_matrix(d$snp, encoding = "vanRaden")
fit <- masbayes::run_bayesa(
w = W,
y = y,
wtw_diag = colSums(W^2),
X = X_fixed,
marker_type = "snp",
nu = 4.5,
sigma2_g = var(y) * 0.5,
sigma2_e_init = var(y) * 0.5,
method = "mcmc"
)
# Binary trait via threshold model in masreml (Laplace approximation).
y_bin <- d$pheno$y_bin_qtl_snp
names(y_bin) <- d$pheno$id
fit_bin <- masreml::masreml(
y = y_bin,
X = X_fixed,
markers = list(snp_add = d$snp),
trait = "binary",
link = "logit"
)Common pitfalls
Factor levels must be consistent across train and test. If your fixed-effect covariate is a factor (e.g. sex), make sure both train and test splits see all levels before calling stats::model.matrix() — otherwise the design-matrix column count diverges and prediction breaks.
Binary phenotypes go through a threshold model, not regression on 0/1. Pass trait = "binary" (with link = "logit" or "probit") in masreml::masreml() so the liability-scale variance is identified. Naive Gaussian fitting of binary data (trait = "continuous") produces unbiased EBVs in some regimes but distorts heritability estimates.
tbv_* columns are ground truth, not input. They are only useful for simulation: comparing \(\widehat{u}\) against tbv is how the Cross-validation tutorial computes accuracy. Never include tbv_* as a covariate.
📐 Related theory: Mixed Model & BLUP Family · Continuous and Binary Trait
🛠 Used in: Genomic Prediction · Binary trait analysis · Cross-validation