Microhaplotypes

Schema

Microhaplotypes (MHs) are short genomic windows in which the phased combination of SNP alleles defines a multi-allelic locus. Both packages consume MHs in two interchangeable layouts:

  1. On-disk file layout (maspipeline / MicrohapsSel format) — one tab-delimited file per chromosome at mh_genotypes/hap_geno_<chr>. The header is ID\thap_<i>_1\thap_<i>_1\thap_<i>_2\thap_<i>_2\t... (block <i>’s two haplotype-strand names repeated, one column per SNP within the block). Values are per-SNP allele codes 1 or 2. Block coordinates live in a companion microhaplotype_coordinates.csv.

  2. In-memory compact layout (d$mh) — an integer matrix \(n \times (2 \cdot n_{blocks})\). Two columns per block, one per strand. Each cell is a base-3 polynomial-encoded integer identifier of the block’s haplotype on that strand (sum(snp_calls * 3 ^ (seq_len(k) - 1)) for k SNPs per block). attr(d$mh, "block_id") maps each column to its block label.

d$allele_freq carries the parallel vectors haplotype, allele, freq used by masbayes::construct_wah_matrix() when no reference population is supplied.

The companion MH coordinate map (d$map_mh / microhaplotype_coordinates.csv):

Column Type Description
block_id character Block identifier, joins unique(attr(mh, "block_id"))
chr integer Chromosome
start_pos / end_pos integer bp coordinates of the spanned SNPs
n_snps integer SNPs per block (2 in the demo)

Example (from the demo dataset)

d$mh is 200 × 100 (main; 50 blocks × 2 strands × 200 individuals) and 100 × 50 (toy). Each value is one of {4, 5, 7, 8} because each block has 2 SNPs and each SNP allele is in {1, 2} — the four possible encoded haplotype ids.

First 8 individuals × first 6 columns (= blocks 1–3) of d$mh:

Block-id attribute (first 6 entries):

The matching coordinate rows from d$map_mh:

The full d$map_mh is also written as a standalone CSV with the same schema as the maspipeline microhaplotype_coordinates.csv:

Download microhaplotype_coordinates.csv (large)   Download microhaplotype_coordinates.csv (small)

Per-chromosome MH genotype files (MicrohapsSel format)

Alongside the canonical .rds, generate.R emits one MicrohapsSel-compatible TSV per chromosome at mh_genotypes/hap_geno_<chr>. Header layout: ID, then 4 columns per local block (hap_<i>_1, hap_<i>_1, hap_<i>_2, hap_<i>_2), values {1, 2} (per-SNP haplotype calls). Local block numbering restarts at 1 in each per-chr file. These files are derived from d$mh + d$map_mh (round-trip decode-then-re-encode verified) — the .rds remains the source of truth.

Preview of main/mh_genotypes/hap_geno_1 (first 10 individuals × first 3 local blocks = 13 columns):

Download per-chromosome files (5 chromosomes per size):

main chr 1 main chr 2 main chr 3 main chr 4 main chr 5   toy chr 1 toy chr 2 toy chr 3 toy chr 4 toy chr 5

Allele frequencies (d$allele_freq)

Long-form table consumed by masbayes::construct_wah_matrix() when no reference structure is supplied:

Loading

From the bundled demo

library(masreml)   # or masbayes
d <- masreml::load_data("large")

dim(d$mh)                       # 200 100
attr(d$mh, "block_id")[1:6]     # "block_1" "block_1" "block_2" "block_2" ...
head(d$map_mh)

From the per-chromosome file layout (maspipeline output)

chr1 <- read.table("demo-data/main/mh_genotypes/hap_geno_1",
                   header = TRUE, sep = "\t", check.names = FALSE)
# First column is "ID"; subsequent columns are 4-per-block
# (hap_<i>_1, hap_<i>_1, hap_<i>_2, hap_<i>_2) with values 1 or 2.

To convert the on-disk per-chr layout into the compact d$mh shape, collapse each block’s 4 columns into 2 base-3 polynomial-encoded ids:

# Per block: snp1_h1, snp2_h1, snp1_h2, snp2_h2 -> two encoded ids
encode_block <- function(s1_h1, s2_h1, s1_h2, s2_h2) {
  cbind(
    h1 = as.integer(s1_h1) + 3L * as.integer(s2_h1),
    h2 = as.integer(s1_h2) + 3L * as.integer(s2_h2)
  )
}

Building model inputs from a MH matrix

# masreml: MH-based GRM. build_G_mh auto-detects the compact layout.
G_mh <- masreml::build_G_mh(mh_list = d$mh, ids = rownames(d$mh))

# masbayes: W_alpha-h design matrix for BayesA / BayesR on MH alleles.
wah <- masbayes::construct_wah_matrix(
  hap_matrix             = d$mh,
  colnames               = attr(d$mh, "block_id"),
  allele_freq_filtered   = d$allele_freq,
  reference_structure    = NULL,
  drop_baseline          = TRUE
)
str(wah)            # W_ah is n x (sum(n_alleles_b - 1))

Common pitfalls

Warning

block_id length must equal ncol(mh). The block-id vector is repeated twice per block (once per strand), so it has length \(2 \cdot n_{blocks}\), not \(n_{blocks}\). masbayes::construct_wah_matrix() checks this internally; masreml::build_G_mh() reads it from attr(mh, "block_id").

Warning

Allele-frequency reference matters for unrelated populations. If the prediction set is genetically distinct from the training set, the frequencies bundled in d$allele_freq (computed on the bundled population) are not representative. Pass an explicit reference_structure to construct_wah_matrix() so the encoding matches the population whose effects you want to estimate.

Warning

Partial blocks / missing strands. Real maspipeline output may emit NA for a block on an individual whose phasing failed. Neither package imputes these — drop or impute before fitting. The demo data has no missingness.

Tip

Why compact (d$mh) instead of the per-chr file format? The compact layout is 4× smaller in memory and exposes the same biology (\(2 \cdot n_{blocks}\) haplotype slots). All downstream MH-aware code in masreml / masbayes accepts both; the on-disk format only matters for pipeline-level interop.

📐 Related theory: Biallelic & Multi-allelic · Mixed Model & BLUP Family · Bayesian Alphabet

🛠 Used in: Genomic Prediction · GWABLUP