Microhaplotypes
Schema
Microhaplotypes (MHs) are short genomic windows in which the phased combination of SNP alleles defines a multi-allelic locus. Both packages consume MHs in two interchangeable layouts:
On-disk file layout (maspipeline / MicrohapsSel format) — one tab-delimited file per chromosome at
mh_genotypes/hap_geno_<chr>. The header isID\thap_<i>_1\thap_<i>_1\thap_<i>_2\thap_<i>_2\t...(block<i>’s two haplotype-strand names repeated, one column per SNP within the block). Values are per-SNP allele codes1or2. Block coordinates live in a companionmicrohaplotype_coordinates.csv.In-memory compact layout (
d$mh) — an integer matrix \(n \times (2 \cdot n_{blocks})\). Two columns per block, one per strand. Each cell is a base-3 polynomial-encoded integer identifier of the block’s haplotype on that strand (sum(snp_calls * 3 ^ (seq_len(k) - 1))forkSNPs per block).attr(d$mh, "block_id")maps each column to its block label.
d$allele_freq carries the parallel vectors haplotype, allele, freq used by masbayes::construct_wah_matrix() when no reference population is supplied.
The companion MH coordinate map (d$map_mh / microhaplotype_coordinates.csv):
| Column | Type | Description |
|---|---|---|
block_id |
character | Block identifier, joins unique(attr(mh, "block_id"))
|
chr |
integer | Chromosome |
start_pos / end_pos
|
integer | bp coordinates of the spanned SNPs |
n_snps |
integer | SNPs per block (2 in the demo) |
Example (from the demo dataset)
d$mh is 200 × 100 (main; 50 blocks × 2 strands × 200 individuals) and 100 × 50 (toy). Each value is one of {4, 5, 7, 8} because each block has 2 SNPs and each SNP allele is in {1, 2} — the four possible encoded haplotype ids.
First 8 individuals × first 6 columns (= blocks 1–3) of d$mh:
Block-id attribute (first 6 entries):
The matching coordinate rows from d$map_mh:
The full d$map_mh is also written as a standalone CSV with the same schema as the maspipeline microhaplotype_coordinates.csv:
Download microhaplotype_coordinates.csv (large) Download microhaplotype_coordinates.csv (small)
Per-chromosome MH genotype files (MicrohapsSel format)
Alongside the canonical .rds, generate.R emits one MicrohapsSel-compatible TSV per chromosome at mh_genotypes/hap_geno_<chr>. Header layout: ID, then 4 columns per local block (hap_<i>_1, hap_<i>_1, hap_<i>_2, hap_<i>_2), values {1, 2} (per-SNP haplotype calls). Local block numbering restarts at 1 in each per-chr file. These files are derived from d$mh + d$map_mh (round-trip decode-then-re-encode verified) — the .rds remains the source of truth.
Preview of main/mh_genotypes/hap_geno_1 (first 10 individuals × first 3 local blocks = 13 columns):
Download per-chromosome files (5 chromosomes per size):
main chr 1 main chr 2 main chr 3 main chr 4 main chr 5 toy chr 1 toy chr 2 toy chr 3 toy chr 4 toy chr 5
Allele frequencies (d$allele_freq)
Long-form table consumed by masbayes::construct_wah_matrix() when no reference structure is supplied:
Loading
From the bundled demo
From the per-chromosome file layout (maspipeline output)
chr1 <- read.table("demo-data/main/mh_genotypes/hap_geno_1",
header = TRUE, sep = "\t", check.names = FALSE)
# First column is "ID"; subsequent columns are 4-per-block
# (hap_<i>_1, hap_<i>_1, hap_<i>_2, hap_<i>_2) with values 1 or 2.To convert the on-disk per-chr layout into the compact d$mh shape, collapse each block’s 4 columns into 2 base-3 polynomial-encoded ids:
# Per block: snp1_h1, snp2_h1, snp1_h2, snp2_h2 -> two encoded ids
encode_block <- function(s1_h1, s2_h1, s1_h2, s2_h2) {
cbind(
h1 = as.integer(s1_h1) + 3L * as.integer(s2_h1),
h2 = as.integer(s1_h2) + 3L * as.integer(s2_h2)
)
}Building model inputs from a MH matrix
# masreml: MH-based GRM. build_G_mh auto-detects the compact layout.
G_mh <- masreml::build_G_mh(mh_list = d$mh, ids = rownames(d$mh))
# masbayes: W_alpha-h design matrix for BayesA / BayesR on MH alleles.
wah <- masbayes::construct_wah_matrix(
hap_matrix = d$mh,
colnames = attr(d$mh, "block_id"),
allele_freq_filtered = d$allele_freq,
reference_structure = NULL,
drop_baseline = TRUE
)
str(wah) # W_ah is n x (sum(n_alleles_b - 1))Common pitfalls
block_id length must equal ncol(mh). The block-id vector is repeated twice per block (once per strand), so it has length \(2 \cdot n_{blocks}\), not \(n_{blocks}\). masbayes::construct_wah_matrix() checks this internally; masreml::build_G_mh() reads it from attr(mh, "block_id").
Allele-frequency reference matters for unrelated populations. If the prediction set is genetically distinct from the training set, the frequencies bundled in d$allele_freq (computed on the bundled population) are not representative. Pass an explicit reference_structure to construct_wah_matrix() so the encoding matches the population whose effects you want to estimate.
Partial blocks / missing strands. Real maspipeline output may emit NA for a block on an individual whose phasing failed. Neither package imputes these — drop or impute before fitting. The demo data has no missingness.
Why compact (d$mh) instead of the per-chr file format? The compact layout is 4× smaller in memory and exposes the same biology (\(2 \cdot n_{blocks}\) haplotype slots). All downstream MH-aware code in masreml / masbayes accepts both; the on-disk format only matters for pipeline-level interop.
📐 Related theory: Biallelic & Multi-allelic · Mixed Model & BLUP Family · Bayesian Alphabet
🛠 Used in: Genomic Prediction · GWABLUP