Sovereign Infrastructure

Name: IONIS
Author: Greg Beam

Sovereign Infrastructure for Citizen Science

Building a Self-Hosted Research Lab for Radio Propagation Analysis

Greg Beam, KI7MT

KI7MT Sovereign AI Lab

February 26, 2026 Version 1.1 RELEASE

Abstract

This paper documents the design, implementation, and benchmarking of a self-hosted research computing infrastructure for HF radio propagation prediction. The system ingests 10.9 billion WSPR observations, 2.26 billion Reverse Beacon Network spots, and 552 million PSK Reporter spots, correlates them with solar indices from 10 NOAA/SIDC endpoints, and trains neural network models entirely on local hardware — 14.68 billion rows totaling 263 GiB on disk. We demonstrate that a carefully architected home lab can achieve 24.67 million rows per second ingestion, 60 GB/s analytical query throughput, and 30 tokens/second local LLM inference — performance sufficient for serious scientific research without cloud dependencies. The total hardware investment of approximately $28,000 provides capabilities that would cost $40,000–214,000 annually in cloud compute. This infrastructure enables the IONIS (Ionospheric Neural Inference System) project, which applies machine learning to one of the largest curated amateur radio propagation datasets we are aware of.

Keywords: Amateur radio HF propagation WSPR ClickHouse Self-hosted infrastructure Citizen science Machine learning

1. Introduction¶

1.1 Motivation¶

Amateur radio propagation prediction has traditionally relied on theoretical models like VOACAP and ICEPAC, which use simplified ionospheric physics (Davies, 1990) with monthly medians and smoothed parameters (Lane et al., 1993). These models miss real-world phenomena: Sporadic-E openings, gray-line enhancement, geomagnetic storm impacts, and intra-day variability.

The WSPR (Weak Signal Propagation Reporter) network, created by Joe Taylor (K1JT), provides a fundamentally different approach: empirical observation at global scale. Since 2008, WSPR has accumulated over 10 billion machine-decoded propagation observations with calibrated signal-to-noise ratios, precise timestamps, and geographic coordinates for both transmitter and receiver.

This dataset represents an unprecedented resource for data-driven propagation prediction. However, working with 10+ billion rows requires serious computing infrastructure—infrastructure that historically meant expensive cloud services or institutional computing clusters.

This paper demonstrates that a self-hosted "sovereign" infrastructure can process this data at scale, enabling individual researchers and small teams to conduct serious scientific work without external dependencies or recurring cloud costs.

1.2 Design Philosophy¶

The infrastructure follows three core principles:

Sovereignty: No cloud dependencies for core research capabilities. All data, models, and inference run on local hardware. This eliminates recurring costs, ensures data privacy, and enables operation during network outages.
Performance at Scale: The system must handle 10+ billion rows efficiently—not just storage, but analytical queries, feature engineering, and model training at interactive speeds.
Reproducibility: Every data pipeline, training run, and validation test is deterministic and auditable. Configuration lives in version control. Results are reproducible across hardware.

1.3 Contributions¶

This paper makes the following contributions:

Architecture documentation for a multi-node research computing cluster using commodity hardware
Benchmarking methodology with reproducible results for ingestion, query, and inference workloads
Cost analysis comparing self-hosted infrastructure to equivalent cloud services
Practical guidance for amateur radio operators and citizen scientists building similar systems

2. Hardware Architecture¶

2.1 System Overview¶

The infrastructure consists of three primary compute nodes connected by a dedicated 10 Gbps direct-attach network:

Node	Role	Primary Workload
Control Node (Threadripper 9975WX)	Ingestion, database, inference	ClickHouse, GPU inference
Sage Node (Mac Studio M3 Ultra)	Model training	PyTorch MPS backend
Forge Node (EPYC 7302P)	Backup, validation, VM host	Proxmox, replica services

2.2 Control Node: Threadripper 9975WX¶

The primary compute node handles data ingestion, analytical queries, and large language model inference.

Specifications:

Component	Details
CPU	AMD Threadripper PRO 9975WX (32C/64T, Zen 5, 5.3 GHz boost)
Memory	128 GB DDR5 (8-channel, ~384 GB/s bandwidth)
GPU	NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM)
Storage	Micron 7450 PRO 960 GB (OS) + 2× Samsung 9100 PRO 4 TB (data) + HighPoint Rocket 1504 with 4× Samsung 990 PRO 4 TB (archive)
Network	10 GbE (motherboard), Intel x710-DA4 (quad SFP+)
PSU	Seasonic Prime TX-1600 (1600W Titanium, ATX 3.1)
OS	Rocky Linux 9.7

Storage Layout:

The critical insight for high-throughput ingestion is I/O decoupling. Source files and database writes use separate NVMe devices:

Device	Model	Mount	Size	Purpose
nvme5n1	Micron 7450 PRO	`/` (boot/OS)	960 GB	Operating system
nvme0n1	Samsung SSD 9100 PRO	`/data/working`	3.6 TB	Source files, working data
nvme1n1	Samsung SSD 9100 PRO	`/var/lib/clickhouse`	3.6 TB	ClickHouse data
nvme2-6	Samsung 990 PRO × 4	ZFS pool	14.4 TB raw	Archive storage (via HighPoint Rocket 1504)

The Samsung 9100 PRO (PM9E1) enterprise drives handle the hot read/write paths. The critical insight for high-throughput ingestion is I/O decoupling — source files and database writes use separate NVMe devices, eliminating I/O contention and enabling the system to sustain 24.67 Mrps ingestion rates.

Archive Storage (HighPoint Rocket 1504 + 4× Samsung 990 PRO 4TB):

A ZFS pool on the HighPoint NVMe adapter provides compressed archival storage:

Dataset	Compression	Content
`wspr-data`	lz4	WSPR CSV archives (185 GB compressed)
`contest-logs`	zstd-9	CQ contest Cabrillo logs
`rbn-data`	lz4	Reverse Beacon Network archives
`pskr-data`	lz4	PSK Reporter live collection

2.3 Sage Node: Mac Studio M3 Ultra¶

The training node runs PyTorch model development using Apple's Metal Performance Shaders (MPS) backend.

Specifications:

Component	Details
SoC	Apple M3 Ultra (24-core CPU, 76-core GPU)
Memory	96 GB unified memory
Storage	2 TB internal SSD
Network	Thunderbolt 4 (10 Gbps to Control Node)
OS	macOS 15

The M3 Ultra's unified memory architecture eliminates CPU-GPU data transfer overhead, making it efficient for iterative model development. Training runs of 100 epochs on 38 million samples complete in approximately 4 hours.

2.4 Forge Node: EPYC 7302P¶

The backup and validation node runs Proxmox for virtualization and provides replica services.

Specifications:

Component	Details
Motherboard	ASRock Rack ROME8D-2T (dual-socket EPYC)
CPU	AMD EPYC 7302P (16C/32T, 3.0 GHz base)
Memory	128 GB DDR4-3200 ECC (8-channel)
GPU	NVIDIA RTX 5080 (16 GB VRAM)
Boot	Micron 7450 PRO (enterprise NVMe)
NVMe Pool	4× Samsung 970 EVO Plus 1TB (2× mirror, 1.76 TB)
SAS HBA	LSI 9300-8i
SSD Pool	8× Samsung 870 EVO 2TB (RAIDZ2, 10.8 TB)
Network	Intel x710-DA4, Mellanox dual-port 25 GbE
OS	Proxmox VE 9.1

2.5 Network Architecture¶

All inter-node communication uses direct-attach cables with jumbo frames (MTU 9000), eliminating switch latency:

Link	Cable Type	Subnet	Latency
9975WX ↔ M3 Ultra	Thunderbolt 4	Subnet A	0.19 ms
9975WX ↔ Proxmox	SFP+ AOC	Subnet B	0.12 ms
9975WX ↔ TrueNAS	SFP+ AOC	Subnet C	0.03 ms

This "DAC Network" (Direct Attach Copper) provides 10 Gbps point-to-point bandwidth with sub-millisecond latency. Training scripts on the M3 query ClickHouse on the 9975WX at <control-node>:8123 with negligible network overhead.

3. Software Stack¶

3.1 ClickHouse: Analytical Database¶

ClickHouse v26.1 serves as the primary data store and analytical engine. Key configuration:

Engine: MergeTree family with month-based partitioning
Compression: LZ4 for hot data, ZSTD for archives
Memory: 64 GB allocated to query cache
Threads: 32 background merge threads

The database stores 14.68 billion rows across 263 GiB on disk. Principal tables:

Table	Rows	Size	Description
`wspr.bronze`	10.96B	193.7 GiB	Raw WSPR observations (2008–2026)
`rbn.bronze`	2.26B	46.8 GiB	Reverse Beacon Network CW/RTTY/PSK31 spots
`pskr.bronze`	552M	7.6 GiB	PSK Reporter FT8/WSPR spots (live MQTT)
`contest.bronze`	234M	4.1 GiB	CQ contest SSB/RTTY QSOs
`solar.iri_lookup`	331M	3.7 GiB	Pre-computed IRI-2020 ionospheric parameters (Bilitza et al., 2017)
`wspr.signatures_v2_terrestrial`	93.6M	2.3 GiB	Aggregated WSPR training signatures
`rbn.signatures`	67.3M	1.3 GiB	Aggregated RBN training signatures
`solar.bronze`	77K	<1 MiB	Solar indices (1932–present)
`solar.dscovr`	56K	1.4 MiB	DSCOVR L1 solar wind (Bz, speed, density)

3.2 Ingestion Pipeline¶

Custom Go tools (ionis-apps) handle high-performance data ingestion:

Tool	Input	Protocol	Performance
`wspr-turbo`	.csv.gz	ch-go native	24.67 Mrps (16–24 workers)
`wspr-shredder`	.csv	ch-go native	21.81 Mrps
`rbn-ingest`	.zip (daily)	ch-go native	2.26B rows in ~20 min
`contest-ingest`	Cabrillo logs	ch-go native	234M rows from 120 log sets
`pskr-collector`	MQTT (live)	JSON stream	~300 spots/sec (~26M/day)
`solar-download`	HTTP	REST	10 NOAA/SIDC endpoints
`solar-backfill`	HTTP	GFZ Potsdam	SSN/SFI/Kp 1932–present
`dscovr-ingest`	HTTP	DSCOVR L1	Solar wind Bz/speed/density, 15-min cron

All ingestion tools use: - ch-go native protocol (not HTTP) for maximum throughput - LZ4 wire compression - Vectorized columnar parsing (no row structs) - Double-buffering with sync.Pool for zero-allocation hot paths - CGO_ENABLED=0 for static binaries

3.3 Training Pipeline¶

PyTorch models train on the M3 Ultra using:

Python 3.10 (UV-managed virtual environment)
PyTorch 2.x with MPS backend
Data loading: Streaming from ClickHouse over DAC network
Checkpointing: SafeTensors format with JSON metadata

Training scripts query ClickHouse directly:

import clickhouse_connect
client = clickhouse_connect.get_client(host='<control-node>', port=8123)
df = client.query_df("SELECT * FROM wspr.signatures_v2_terrestrial LIMIT 10000000")

3.4 Local LLM Inference¶

The RTX PRO 6000 Blackwell (96 GB VRAM) runs large language models locally via llama.cpp:

Model	Quantization	VRAM	Throughput
AstroSage-70B (sovereign)	Q5_K_M	50.4 GB	~30 tok/s
DeepSeek-R1-Distill-Llama-70B	FP8	72 GB	~15 tok/s
Llama-3.1-70B	Q5_K_M	~50 GB	~30 tok/s

llama.cpp with Q5_K_M quantization on Blackwell achieves double the throughput of SGLang with FP8. This enables: - Analysis of query results by domain-specialized LLMs - Multi-agent workflows for architecture review - No API costs or rate limits

4. Benchmark Methodology¶

4.1 Test Conditions¶

All benchmarks follow a consistent methodology:

Cold cache: echo 3 > /proc/sys/vm/drop_caches between runs
Isolated workload: No concurrent processes
Thermal stability: 10-minute warmup period
Multiple runs: Minimum 3 runs, median reported
Full dataset: 10.8B rows unless otherwise noted

4.2 Ingestion Benchmarks¶

wspr-turbo (streaming from compressed .gz archives):

Workers	Time	Throughput	I/O Rate	Memory
16	7m54s	22.77 Mrps	399 MB/s	45.3 GB
24	7m18s	24.67 Mrps	433 MB/s	71.5 GB

Key finding: "Compression is free"—decompression overhead is hidden by the ClickHouse merge wall at ~3.5 GB/s sustained writes. wspr-turbo (reading 185 GB compressed) outperforms wspr-shredder (reading 878 GB uncompressed) because both hit the same database bottleneck.

Cross-system comparison (wspr-turbo, 16 workers):

System	Time	Throughput	Speedup
Threadripper 9975WX	7m18s	24.67 Mrps	2.79×
Ryzen 9950X3D	20m13s	8.83 Mrps	baseline
EPYC 7302P	23m27s	7.63 Mrps	0.86×

The 8-channel DDR5 memory (384 GB/s bandwidth) of the 9975WX provides the critical advantage over 2-channel systems.

4.3 Query Benchmarks¶

Aggregation query (GROUP BY over 10.8B rows):

SELECT callsign, avg(snr) as avg_snr, count(*) as count
FROM wspr.bronze
GROUP BY callsign
ORDER BY count DESC
LIMIT 10;

Metric	Result
Time	3.050 sec
Rows processed	10.80 billion
Throughput	3.54 billion rows/s
Data scanned	183.57 GB
Scan rate	60.19 GB/s

This query scans the entire WSPR corpus and returns the top transmitters in 3 seconds—interactive speed for exploratory analysis.

4.4 Training Benchmarks¶

Model: IonisGate (207K parameters) Data: 38M samples (WSPR + DXpedition + Contest signatures) Device: M3 Ultra (MPS backend)

Metric	Result
Epochs	100
Time	4h 16m
Samples/sec	~2,500
Final RMSE	0.821σ
Final Pearson	+0.492

4.5 Inference Benchmarks¶

Model: AstroSage-70B (Q5_K_M, 49.9 GB) Device: RTX PRO 6000 Blackwell (96 GB VRAM) Runtime: llama.cpp

Metric	Result
Generation speed	29.7–30.8 tok/s
Prompt eval speed	177–1,100 tok/s
First token latency	68–90 ms
VRAM usage	50.4 GB
Idle power	78W
GPU temperature	45°C

5. Cost Analysis¶

5.1 Hardware Investment¶

Control Node (Threadripper 9975WX):

Component	Cost (USD)
AMD Threadripper PRO 9975WX	$5,000
NVIDIA RTX PRO 6000 (96 GB)	$9,000
ASUS Pro WS TRX50-SAGE	$1,200
Storage (3× NVMe)	$2,000
Intel x710-DA4 + cables	$1,000
Memory (128 GB DDR5 ECC)	$800
PSU, case, cooling	$800
Control Node subtotal	~$19,800

Other Nodes:

Component	Cost (USD)
Mac Studio M3 Ultra (96 GB)	$5,500
EPYC 7302P system (refurbished)	$2,500
Total	~$27,800

Note: The EPYC system was acquired refurbished with existing components. A minimal two-node configuration (9975WX + M3 Ultra) would cost approximately $25,300.

5.2 Cloud Equivalent¶

To replicate these capabilities using major cloud providers (pricing as of February 2026):

Workload 1: Analytical Database (ClickHouse equivalent)

The database workload requires 32+ vCPUs, 128+ GB RAM, and high-throughput NVMe storage for 263 GiB of compressed analytical data. This maps to a memory-optimized instance class.

Component	Specification	Monthly Cost
Compute	Memory-optimized, 32 vCPU, 256 GB RAM	$1,524
Storage	High-IOPS NVMe block storage (4 TB, 16K IOPS)	$540
Backup storage	Object storage (4 TB archive)	$92
Subtotal		$2,156/mo

Note: A managed columnar database service with equivalent compute (48 vCPU, always-on) runs approximately $8,500–9,000/month. Self-managed on a VM is significantly cheaper but requires operational overhead.

Workload 2: Model Training (GPU)

Training requires a single high-end GPU with 80–96 GB VRAM. Our production runs take ~4 hours, but experimentation (pre-flights, sweeps, failed runs) adds up. Conservative estimate: 200 GPU-hours/month across all experimentation.

Component	Specification	Monthly Cost
GPU compute	Single 80 GB GPU instance, on-demand	$1,376
	($6.88/hr × 200 hrs/mo)
Budget GPU providers	Boutique cloud, ~$2.50/GPU-hr	$500
Range		$500–1,376/mo

Note: Major cloud providers charge $6.88/hr for a single 80 GB GPU instance (on-demand). Boutique GPU cloud providers offer the same hardware at $1.50–3.00/hr but with less availability and no SLA.

Workload 3: LLM Inference (70B model, always-on)

Running a 70B parameter model locally requires ~50 GB VRAM with Q5_K_M quantization (or 72+ GB with FP8). Cloud equivalent: a multi-GPU inference instance running 24/7.

Component	Specification	Monthly Cost
GPU inference	4× 48 GB GPU instance, always-on	$7,555
	($10.49/hr × 720 hrs/mo)
Alternative: API	Commercial LLM API (~30 tok/s equivalent)	$500–2,000
Range		$500–7,555/mo

Note: Self-hosting an LLM in the cloud is extremely expensive because the instance must run continuously. Using a commercial LLM API instead reduces cost but introduces rate limits, data privacy concerns, and API dependency — exactly the constraints sovereign infrastructure eliminates.

Workload 4: Data Storage and Transfer

Component	Specification	Monthly Cost
Object storage	4 TB raw archives	$92
Data transfer (egress)	~500 GB/mo cross-region	$45
Subtotal		$137/mo

Total Cloud Cost Summary:

Scenario	Monthly	Annual	Notes
Full replication (self-managed DB, major GPU cloud, self-hosted LLM)	$11,224	$134,688	Closest to sovereign capability
Budget option (self-managed DB, boutique GPU, LLM API)	$3,293	$39,516	Compromises on LLM privacy/availability
Managed services (managed DB, major GPU, self-hosted LLM)	$17,818	$213,816	Zero operational overhead

5.3 Break-Even Analysis¶

Scenario	Self-Hosted Cost	Cloud Annual	Break-Even
Full replication	$27,800	$134,688	2.4 months
Budget cloud	$27,800	$39,516	8.4 months
Managed services	$27,800	$213,816	1.5 months

Even against the cheapest cloud option (boutique GPU providers, LLM API, self-managed database), the self-hosted infrastructure pays for itself in under 9 months. Against a full-capability replication with major cloud providers, break-even is under 3 months.

Ongoing self-hosted costs after hardware purchase:

Item	Monthly	Annual
Electricity (~800W avg draw, $0.10/kWh)	$58	$696
Internet (existing residential)	$0	$0
Hardware replacement (amortized)	$83	$1,000
Total operating cost	$141/mo	$1,696/yr

The self-hosted infrastructure operates at $1,696/year — 4.3% of the cheapest cloud equivalent and 1.3% of the full-replication cloud cost.

All figures verified by independent calculation. Cloud pricing sourced February 2026.

5.4 Hidden Benefits¶

Beyond direct cost savings:

No rate limits: Cloud GPU quotas don't constrain experimentation. We ran 15 model versions (V2–V27) in rapid succession — cloud GPU availability would have been a bottleneck.
Data sovereignty: All 14.68 billion rows stay on local NVMe. No data processing agreements, no compliance concerns, no egress charges for moving data between services.
Network independence: Research continues during internet outages. The PSKR validation pipeline (552M spots → 8.4M signatures → model scoring → analysis) ran entirely on local hardware in a single afternoon.
Iteration speed: No upload/download delays. A 10-epoch pre-flight runs in 25 seconds on local CUDA. Cloud round-trip (upload data, provision instance, run, download results) adds 15–30 minutes of overhead per iteration.
Privacy: LLM inference on local hardware means research discussions, code review, and experimental analysis never leave the premises.
Predictable costs: No surprise bills from a forgotten running instance or unexpected data transfer charges. The hardware cost is fixed and fully amortized.

6. Applications¶

6.1 IONIS: Ionospheric Neural Inference System¶

The primary application of this infrastructure is the IONIS project—a neural network that predicts HF propagation from WSPR observations and solar indices.

Training data (14.68 billion rows total, 263 GiB on disk):

Source	Volume	Signatures	Years
WSPR	10.96B spots	93.6M	2008–2026
Reverse Beacon Network	2.26B spots	67.3M	2009–2026
CQ Contests	234M QSOs	5.7M	2005–2025
DXpedition	3.9M paths	260K	2009–2025
PSK Reporter	552M+ (live MQTT)	8.4M	Feb 2026+
Solar/IRI	331M IRI + 77K solar + 56K DSCOVR	—	1932–2026

Model performance (V22-gamma, current production):

Pearson correlation: +0.492 (SNR prediction vs. actual)
RMSE: 0.821σ (normalized)
17/17 operator-grounded validation tests passing (with physics override)
97.86% recall on 8.4M independent PSK Reporter spots (raw model, no override)
9/11 TST-900 physics gate battery

6.2 PSKR Validation Pipeline: Infrastructure in Action¶

The most compelling demonstration of this infrastructure's capability came on 2026-02-26, when the PSKR validation pipeline scored the IONIS model against 8.4 million independent PSK Reporter observations:

Aggregation: 552M raw PSKR spots → 8.4M signatures via ClickHouse SQL aggregation + Python numpy (haversine distance, azimuth computation). Wall time: 31 seconds.
Scoring: 8.4M signatures through V22-gamma model inference on RTX PRO 6000 (CUDA). Two complete runs (with and without physics override). Wall time: ~100 minutes per run.
Cross-validation: Identical scoring on M3 Ultra (MPS backend) over the DAC network. Results matched to the last decimal — deterministic across platforms.
Analysis: 8 diagnostic queries against 16.8M scored results, returning in seconds via ClickHouse.

The entire pipeline — from raw MQTT spots to actionable failure analysis — ran on local hardware with zero cloud involvement. The result: discovery that the physics override layer was introducing 261,030 false negatives while fixing exactly zero misses, directly informing the next model architecture.

6.3 Real-Time Monitoring¶

The infrastructure supports continuous validation against live data:

PSK Reporter: ~26M HF spots/day via MQTT (live since 2026-02-10)
WSPR: ~3M spots/day
Solar indices: 15-minute updates from NOAA SWPC
DSCOVR L1: Solar wind Bz/speed/density, 15-minute cron

6.4 Multi-Agent Research Workflow¶

The local LLM capability enables multi-agent workflows where specialized AI agents analyze different aspects of the research:

Agent	Callsign	Role	Interface	Host
Dr. Watson	Training Agent	Training, code	IDE	Sage Node
Bob	Infrastructure Agent	Infrastructure, SQL, CUDA	IDE	Control Node
Patton	Failure Analyst	Failure analysis, skeptical review	Desktop App	macOS
Einstein	Physics Consultant	Physics theory, constraints	Browser	Cloud
Newton	AstroSage	Sovereign local inference	llama.cpp / Open WebUI	9975WX
Judge	KI7MT	Final authority, tiebreaker	Human	—

Agents coordinate via a ClickHouse message queue (messages.inbox) — the same database both coding agents already query. No external Slack, no email, no new infrastructure. A simple INSERT to send, SELECT to receive. This "Dialectical Engine" approach subjects every architectural decision to multi-agent falsification testing before implementation. See The Dialectical Engine (companion paper) for the full methodology.

7. Lessons Learned¶

7.1 Memory Bandwidth Matters¶

For analytical workloads, memory bandwidth dominates performance. The Threadripper 9975WX's 8-channel DDR5 (~384 GB/s) achieves 2.79× the throughput of dual-channel systems, despite similar clock speeds. Investment in memory channels provides better returns than faster individual DIMMs.

7.2 I/O Decoupling is Critical¶

Placing source files and database writes on separate NVMe devices eliminated the primary ingestion bottleneck. This simple architectural decision enabled a 30% throughput increase at zero additional cost.

7.3 Compression is (Often) Free¶

The ClickHouse merge wall at ~3.5 GB/s becomes the bottleneck before decompression overhead matters. Reading 185 GB compressed is faster than reading 878 GB uncompressed because both hit the same database constraint.

7.4 Unified Memory Changes the Game¶

The M3 Ultra's unified memory architecture makes it surprisingly competitive for model training. Eliminating CPU-GPU transfer overhead compensates for lower peak FLOPS compared to discrete GPUs.

7.5 Local LLMs Enable New Workflows¶

Running 70B models locally (at ~30 tok/s with Q5_K_M on Blackwell) transforms what's possible for individual researchers. Multi-agent review workflows, automated analysis of experimental results, and domain-specialized inference all become practical without API costs or rate limits.

7.6 Process Discipline Transfers from Manufacturing¶

The project applies CLCA (Closed Loop Corrective Actions) — the Ford 8D / semiconductor process control methodology — to every layer of the ML pipeline. Identify the defect, measure it, action plan, fix, verify. This manufacturing quality engineering discipline, adapted from the author's background in semiconductor metrology, prevented several classes of errors that pure software engineering approaches would have missed:

Containment actions: When the model couldn't close 10m paths at night, a physics override was deployed within hours as containment (D3). The PSKR validation battery then measured its collateral damage (D4), revealing 261K false negatives. The corrective action (D5) — synthetic negatives — addresses the root cause rather than patching symptoms.
Audit trail: Every training run is logged to ClickHouse with configuration, epoch metrics, and final results. Failed experiments are as rigorously documented as successes, creating a traceable lineage from V2 through V27.
One variable at a time: Each model version changes exactly one thing relative to its predecessor. This eliminates confounded variables and ensures that when something breaks, the root cause is identifiable.

8. Reproducibility¶

8.1 Source Code¶

All software is available in public repositories:

Repository	Language	Purpose
ionis-apps	Go	Ingestion tools
ionis-core	SQL/Autotools	Database schemas
ionis-cuda	C++/CUDA	Embedding engine
ionis-training	Python	Model training
ionis-devel	Markdown	Documentation

8.2 Configuration¶

System configuration is documented in: - CLAUDE.md: Project instructions and agent coordination - SOVEREIGN-STACK.md: Hardware specifications and benchmarks - planning/*.md: Architecture decisions and validation frameworks

8.3 Benchmark Scripts¶

All benchmark results can be reproduced using scripts in ionis-apps/benchmarks/ and queries in ionis-core/scripts/.

9. Model Development History¶

The IONIS project has iterated through 15 model versions (V2–V27), with the infrastructure supporting rapid experimentation. The failure record is as valuable as the successes:

Version	Key Change	Result	Lesson
V16	Contest-anchored training	Pearson +0.487, 84% live recall	Established physics constraints
V20	Clean config-driven rewrite	Pearson +0.488, RMSE 0.862σ	Proved V16 constraints are complete
V22-gamma	Solar depression + cross-products	Pearson +0.492, RMSE 0.821σ	Best physics model (current production)
V23	IRI-2020 ionospheric features	Record RMSE but physics gates failed	IRI creates shortcuts, bypasses learning
V24	Removed sun sidecar	Both sidecars died	Architecture constraints are load-bearing
V25	2D SFI sidecar	Optimizer killed SFI dimension	Scalar SFI is vestigial (clamp floor)
V26	Band-specific output heads	Learned bias, not physics	Multi-head is a dead end
V27	Physics-informed loss	Fine-tuning destroyed trunk	Never fine-tune from converged checkpoint

Eight architectural dead ends, all proven by experiment, all documented. The infrastructure's speed (10-epoch pre-flight in 25 seconds on CUDA, full training in 4 hours on MPS) made this iteration rate practical for a single researcher.

10. Conclusion¶

We have demonstrated that a carefully architected self-hosted infrastructure can support serious scientific research at scale. The key findings:

24.67 Mrps ingestion is achievable on commodity hardware with proper I/O architecture
60 GB/s query throughput enables interactive analysis of 10+ billion rows
Local 70B LLM inference at 30 tok/s eliminates cloud API dependencies
~$28,000 one-time investment replaces $39,500–$214,000/year in cloud costs (break-even in 2–9 months)
14.68 billion rows, 263 GiB stored and queryable on a single workstation

This infrastructure enables the IONIS project to work with one of the largest curated amateur radio propagation datasets we are aware of — 14.68 billion rows spanning 18 years of observations from five independent data sources — entirely on local hardware. The PSKR validation pipeline demonstrated the full capability: 552 million raw spots aggregated into 8.4 million signatures, scored against the production model, cross-validated on two platforms, and analyzed — all in a single afternoon, with zero cloud involvement.

For amateur radio operators, citizen scientists, and small research teams, sovereign infrastructure provides a path to serious data science without institutional resources or recurring cloud costs. The ionosphere has been writing data for 18 years. Now we have the infrastructure to read it.

Data Ethics and Privacy¶

This project exists because of the amateur radio community. Every observation in our dataset was generated by a ham radio operator somewhere in the world, running a beacon, making a contact, or submitting a log. We treat this data with appropriate care.

Non-Commercial, Open Source: There is no commercial motive behind this project. There is no monetization plan. All code, tools, and models are released under the GNU General Public License v3 (GPLv3). We redistribute only derived works (trained models, aggregated signatures) — never raw data.

Data Privacy: We have zero interest in personal data. We study propagation paths between grid squares, not the operators who generated them. Amateur radio callsigns appear in the bronze (raw ingest) layer only for grid square resolution. By the time data reaches the model-facing layer, all callsigns have been stripped. The model receives only: grid-pair, band, time of day, month, solar flux, geomagnetic index, distance, and bearing. No callsign, no name, no personal identifier.

Publicly Available Data: All source data is publicly available, provided by community organizations for public use:

WSPR beacons are transmitted over public radio frequencies specifically to be received and reported. Operators deliberately upload spots to WSPRNet's public database.
RBN spots are machine-decoded from public radio transmissions and provided as a public service.
Contest logs are submitted by operators who explicitly consent to public posting as a condition of participation (CQ WW Rules Section XIII).

Amateur radio is, by its nature, a public service. ITU Radio Regulations and national licensing authorities require operators to identify themselves during every transmission. There is no expectation of privacy for amateur radio callsigns.

Ethical Data Stewardship: Our ingestion tools are designed to be polite. We enforce rate limits, identify ourselves honestly in User-Agents, and respect volunteer-run servers. We archive data for long-term preservation, not just analysis.

Attribution: We rigorously acknowledge all data sources. When we publish results, we share findings with the data providers and the amateur radio community.

Acknowledgments¶

This project exists because of the amateur radio community. To every operator who has ever transmitted a WSPR beacon, worked a contest, appeared on the Reverse Beacon Network, or shown up on PSK Reporter: thank you.

Data Sources:

Joe Taylor, K1JT — Creator of WSPR, WSJT, WSJT-X, and MAP65. Nobel Laureate (Physics, 1993). The WSPR protocol that generates the observations we study is his contribution to amateur radio.
WSPRNet (wsprnet.org) — The community-operated database that collects and archives WSPR spot reports from operators worldwide.
Reverse Beacon Network (reversebeacon.net) — N4ZR, PY1NB, VE3NEA, and the global network of CW/RTTY skimmer operators who contribute 2.26 billion spots spanning 2009–2026.
PSK Reporter (pskreporter.info) — Created by Philip Gladstone, N1DQ. MQTT real-time feed by Tom Sevart, M0LTE.
World Wide Radio Operators Foundation (WWROF) — CQ WW and CQ WPX contest log archives.
American Radio Relay League (ARRL) — Contest log archives from contests.arrl.org.
GDXF Mega DXpeditions Honor Roll — Curated DXpedition catalog by Bernd, DF3CB.

Solar & Geomagnetic Data:

NOAA Space Weather Prediction Center (SWPC) — Real-time F10.7 solar flux, Kp index, GOES X-ray flux.
SIDC, Royal Observatory of Belgium — World Data Center for sunspot numbers (1749–present).
GFZ Potsdam (Helmholtz Centre) — Definitive Kp/Ap indices, composite SSN/SFI file (1932–present).
Penticton / NRC Canada — Dominion Radio Astrophysical Observatory, the world's primary 10.7 cm solar flux measurement station.
NOAA NCEI — DSCOVR L1 solar wind data (Bz, speed, density).

Software:

ClickHouse (ClickHouse Inc., originally Yandex) — Columnar database engine.
voacapl — Linux port of VOACAP by James Watson, HZ1JW.
PyTorch (Meta AI) — Machine learning framework.
llama.cpp (Georgi Gerganov) — Local LLM inference.

References¶

Bilitza, D., et al. (2017). International Reference Ionosphere 2016: From ionospheric climate to real-time weather predictions. Space Weather, 15(2), 418-429.
Davies, K. (1990). Ionospheric Radio. Peter Peregrinus Ltd.
Lane, G., et al. (1993). Voice of America Coverage Analysis Program (VOACAP). NTIA Report 93-299.

How to cite

Beam, G. (KI7MT). "Sovereign Infrastructure for Citizen Science: Building a Self-Hosted Research Lab for Radio Propagation Analysis." IONIS Project Technical Papers, Version 1.1, February 2026. Available at: ionis-ai.com/papers/sovereign-infrastructure

Appendix A: Bill of Materials¶

A.1 Control Node (Threadripper 9975WX) — ~$19,800¶

Component	Model	Qty	Cost	Notes
CPU	AMD Threadripper PRO 9975WX	1	$5,000	32C/64T, TRX50
GPU	NVIDIA RTX PRO 6000	1	$9,000	96 GB VRAM
Motherboard	ASUS Pro WS TRX50-SAGE	1	$1,200	7× PCIe 5.0
Memory	DDR5-5600 ECC 16GB	8	$800	8-channel config (128 GB total)
NVMe (OS)	Micron 7450 PRO 960GB	1	—	Enterprise boot drive
NVMe (Data)	Samsung SSD 9100 PRO 4TB	2	—	ClickHouse + Working (PM9E1)
NVMe Adapter	HighPoint Rocket 1504	1	—	PCIe Gen4 ×16, 4-port NVMe
NVMe (Archive)	Samsung 990 PRO 4TB	4	$2,000	ZFS archive pool
NIC	Intel x710-DA4	1	$1,000	Quad SFP+ (+ AOC cables)
PSU	Seasonic Prime TX-1600	1	—	1600W Titanium, ATX 3.1, PCIe 5.1
Case	Corsair 9000D	1	—	Full tower
Fans	Noctua NF-A12x25 120mm	16	—	Positive pressure config

A.2 Sage Node (Mac Studio M3 Ultra)¶

Component	Model	Notes
System	Mac Studio M3 Ultra	96 GB unified memory
Storage	2 TB SSD	Internal
Network	Thunderbolt 4	10 Gbps to Control Node

A.3 Networking¶

Component	Model	Qty
SFP+ AOC	FS.com 10G-AOC-3M	3
Thunderbolt 4 cable	Apple 3m	1

Appendix B: Benchmark Raw Data¶

B.1 wspr-turbo 24-Worker Run (2026-02-01)¶

Start time: 2026-02-01T14:23:17Z
End time:   2026-02-01T14:30:35Z
Duration:   7m18s

Files processed:  2,847
Total rows:       10,956,387,569
Throughput:       24.67 Mrps
I/O Rate:         433 MB/s (compressed)
Effective Rate:   1.88 GB/s (uncompressed)

Memory (sys):     71.5 GB
GC cycles:        432
Peak CPU:         100% (64 threads)
Temperature:      74°C

B.2 ClickHouse Aggregation Query (2026-02-01)¶

SELECT callsign, avg(snr), count(*)
FROM wspr.bronze
GROUP BY callsign
ORDER BY count() DESC
LIMIT 10

-- Execution stats:
-- Elapsed: 3.050 sec
-- Rows read: 10,956,387,569
-- Bytes read: 183.57 GiB
-- Rows/sec: 3.54 billion
-- Bytes/sec: 60.19 GB
-- Peak memory: 828.79 MiB