Phenotype

Schema

Both packages consume phenotypes as a data frame with one row per individual and at least three column kinds:

Column kind	Type	Required by	Description
Identifier	character	both	Joins to `rownames(snp)` / `rownames(mh)`
Fixed-effect covariates	numeric or factor	both	e.g. `sex`, `age`, batch
Response(s)	numeric or integer 0/1	both	One column per trait — continuous OR binary
(Optional) `tbv_*`	numeric	none	True breeding values; only meaningful for simulation studies

The demo dataset bundles both continuous and binary versions of two traits (SNP-architecture and MH-architecture). Standalone CSVs split them out:

File	Columns
`pheno_continuous_G1.csv`	`id, sex, y_cont_qtl_snp, y_cont_qtl_mh, tbv_qtl_snp, tbv_qtl_mh`
`pheno_binary_G1.csv`	`id, sex, y_bin_qtl_snp, y_bin_qtl_mh, tbv_qtl_snp, tbv_qtl_mh`

d$pheno carries all eight columns in one frame.

Example (from the demo dataset)

Continuous trait (`pheno_continuous_G1.csv`)

Columns: id, sex, y_cont_qtl_snp, y_cont_qtl_mh, tbv_qtl_snp, tbv_qtl_mh.

Download pheno_continuous_G1.csv (large) Download pheno_continuous_G1.csv (small)

Binary trait (`pheno_binary_G1.csv`)

Same schema but with the median-thresholded y_bin_* columns in place of the continuous ones. Prevalence is 0.5 by construction.

Download pheno_binary_G1.csv (large) Download pheno_binary_G1.csv (small)

Case/control balance

QTL ground truth

Ten QTL under each architecture (SNP and MH), with unit-length-normalised effects ($\sum \beta_i^2 = 1$). Use these for QTL recovery analysis only, not as model input. Four standalone .rds files mirror d$qtl with human-readable identifiers — SNP names and MH block ids — so downstream tools can join against d$map_snp / d$map_mh directly.

Causal SNPs

Causal MH blocks

d$qtl$mh_idx indexes columns of $W_{ah}$ rather than block columns directly, because the causal “allele” within a multi-allelic block is itself a free parameter. The derived qtl_mh_blocks.rds reduces each $W_{ah}$ column to its parent block id; if two QTL fall in the same block, the block id appears twice.

Download qtl_snp_ids.rds Download qtl_mh_blocks.rds Download effects_snp.rds Download effects_mh.rds

Loading

library(masreml)
d <- masreml::load_data("large")

# Continuous trait, SNP architecture
y_cont <- d$pheno$y_cont_qtl_snp

# Binary trait, MH architecture
y_bin  <- d$pheno$y_bin_qtl_mh

# Or load the standalone CSVs directly:
pheno_c <- read.csv("demo-data/main/pheno_continuous_G1.csv")
pheno_b <- read.csv("demo-data/main/pheno_binary_G1.csv")

Passing phenotypes to each package

masreml::masreml() takes the response as a named numeric vector y, the fixed-effects design matrix X explicitly (build it with stats::model.matrix()), and markers as a named list (snp_add, snp_dom, mh_add). There is no formula argument and no direct pedigree argument — for pedigree-based prediction, precompute $\mathbf{A}$ via build_A_ped() and pass it through the G slot.

# Continuous trait, SNP-additive GBLUP with sex as fixed effect.
y <- d$pheno$y_cont_qtl_snp
names(y) <- d$pheno$id

X_fixed <- stats::model.matrix(~ sex, data = d$pheno)

fit <- masreml::masreml(
  y       = y,
  X       = X_fixed,
  markers = list(snp_add = d$snp),
  trait   = "continuous"
)

# masbayes: BayesA on the same trait. The wrapper takes the
# pre-standardised design matrix `w` plus colSums(w^2) as inputs;
# sigma2_g and sigma2_e_init are required hyperparameters.
W   <- masbayes::construct_snp_matrix(d$snp, encoding = "vanRaden")
fit <- masbayes::run_bayesa(
  w             = W,
  y             = y,
  wtw_diag      = colSums(W^2),
  X             = X_fixed,
  marker_type   = "snp",
  nu            = 4.5,
  sigma2_g      = var(y) * 0.5,
  sigma2_e_init = var(y) * 0.5,
  method        = "mcmc"
)

# Binary trait via threshold model in masreml (Laplace approximation).
y_bin <- d$pheno$y_bin_qtl_snp
names(y_bin) <- d$pheno$id
fit_bin <- masreml::masreml(
  y       = y_bin,
  X       = X_fixed,
  markers = list(snp_add = d$snp),
  trait   = "binary",
  link    = "logit"
)

Common pitfalls

Warning

Row order must match the genotype matrix. masreml::masreml() aligns on names(y) against rownames(markers$snp_add) (and rownames of X). This is the most common source of silent bugs when ID conventions differ. Always verify before fitting:

stopifnot(identical(d$pheno$id, rownames(d$snp)))

Warning

Factor levels must be consistent across train and test. If your fixed-effect covariate is a factor (e.g. sex), make sure both train and test splits see all levels before calling stats::model.matrix() — otherwise the design-matrix column count diverges and prediction breaks.

Warning

Binary phenotypes go through a threshold model, not regression on 0/1. Pass trait = "binary" (with link = "logit" or "probit") in masreml::masreml() so the liability-scale variance is identified. Naive Gaussian fitting of binary data (trait = "continuous") produces unbiased EBVs in some regimes but distorts heritability estimates.

Tip

tbv_* columns are ground truth, not input. They are only useful for simulation: comparing $\widehat{u}$ against tbv is how the Cross-validation tutorial computes accuracy. Never include tbv_* as a covariate.

📐 Related theory: Mixed Model & BLUP Family · Continuous and Binary Trait

🛠 Used in: Genomic Prediction · Binary trait analysis · Cross-validation

Schema

Example (from the demo dataset)

Continuous trait (pheno_continuous_G1.csv)

Binary trait (pheno_binary_G1.csv)

Case/control balance

QTL ground truth

Causal SNPs

Causal MH blocks

Loading

Passing phenotypes to each package

Common pitfalls

Continuous trait (`pheno_continuous_G1.csv`)

Binary trait (`pheno_binary_G1.csv`)