SNP genotypes
Schema
Both packages consume SNP genotypes as an integer dosage matrix with one row per individual and one column per SNP.
| Field | Type | Description |
|---|---|---|
| Row dimension | \(n\) individuals |
rownames() must be set to the individual IDs |
| Column dimension | \(p\) SNPs |
colnames() set to SNP IDs (must match map_snp$SNP) |
| Cell values | integer 0/1/2
|
Count of the alternate allele |
| Missingness | NA |
Imputed upstream — neither package estimates a missingness mechanism |
The SNP matrix is paired with a SNP map (d$map_snp or snp_map.txt) with one row per SNP:
| Column | Type | Description |
|---|---|---|
SNP |
character | SNP identifier; joins to colnames(snp)
|
CHROM |
integer | Chromosome (1..5 in demo) |
POS |
integer | Physical position in base pairs |
Example (from the demo dataset)
Demo dimensions: 200 × 100 (main) and 100 × 50 (toy). Rownames are IND001..IND200; colnames are SNP001..SNP100.
First 8 individuals × first 10 SNPs of main/demo_data.rds:
The full \(200 \times 100\) matrix is also written as a wide CSV with an id column for plain-CSV consumers:
Download geno_G1.csv (large) Download geno_G1.csv (small)
The matching map rows for those SNPs:
snp_map.txt is the same data as d$map_snp, tab-delimited so it can be read with read.table() defaults:
Loading
load_data() is exported by both masreml and masbayes and returns the same byte-identical bundle:
# Either package's load_data() returns the canonical list.
library(masreml)
d <- masreml::load_data("large") # or "small"
str(d$snp) # int [1:200, 1:100] 1 2 2 ...
head(d$map_snp) # SNP / CHROM / POS
# Equivalent via masbayes:
d <- masbayes::load_data("large")
# Or load the standalone CSVs derived alongside the .rds:
geno <- read.csv("demo-data/main/geno_G1.csv", check.names = FALSE)
snp_map <- read.table("demo-data/main/snp_map.txt",
header = TRUE, sep = "\t")For your own SNP data — a CSV with an id column then one column per SNP — convert to a matrix once and pass it on:
geno_df <- read.csv("my_genotypes.csv", check.names = FALSE)
geno <- as.matrix(geno_df[, -1])
rownames(geno) <- geno_df$id
storage.mode(geno) <- "integer"build_G_snp() (masreml) and construct_snp_matrix() (masbayes) expect the matrix in this shape.
Building model inputs from a SNP matrix
Common pitfalls
Allele coding direction. Both packages assume the dosage counts the alternate allele (0 = homozygous reference, 2 = homozygous alternate). If your upstream pipeline reports reference-allele dosages, you can either flip with 2L - geno or rely on the per-marker centring that build_G_snp() / construct_snp_matrix(encoding = "vanRaden") applies — the resulting GRM is invariant to a uniform flip, but per-marker effect estimates change sign.
Missingness. NA cells must be imputed before fitting. The demo data has no missingness by construction. For real data, common choices are mean imputation per SNP (cheap, biased toward the population mean) or model-based imputation (Beagle / BEAGLE for biobanks). Whatever you choose, do it before splitting train/test so the imputed values do not leak between sets.
Minor allele frequency (MAF). Markers with very low MAF carry almost no information and inflate the GRM’s leading eigenvalues. A conservative pre-filter is MAF >= 0.05. The demo dataset’s MAF is uniformly drawn from [0.1, 0.5] so no filter is needed.
📐 Related theory: Mixed Model & BLUP Family · Biallelic & Multi-allelic · Bayesian Alphabet
🛠 Used in: Genomic Prediction · GWAS