Sovereign Infrastructure
Sovereign Infrastructure for Citizen Science
Building a Self-Hosted Research Lab for Radio Propagation Analysis
KI7MT Sovereign AI Lab
Abstract
This paper documents the design, implementation, and benchmarking of a self-hosted research computing infrastructure for HF radio propagation prediction. The system ingests 10.9 billion WSPR observations, 2.26 billion Reverse Beacon Network spots, and 552 million PSK Reporter spots, correlates them with solar indices from 10 NOAA/SIDC endpoints, and trains neural network models entirely on local hardware — 14.68 billion rows totaling 263 GiB on disk. We demonstrate that a carefully architected home lab can achieve 24.67 million rows per second ingestion, 60 GB/s analytical query throughput, and 30 tokens/second local LLM inference — performance sufficient for serious scientific research without cloud dependencies. The total hardware investment of approximately $28,000 provides capabilities that would cost $40,000–214,000 annually in cloud compute. This infrastructure enables the IONIS (Ionospheric Neural Inference System) project, which applies machine learning to one of the largest curated amateur radio propagation datasets we are aware of.
1. Introduction¶
1.1 Motivation¶
Amateur radio propagation prediction has traditionally relied on theoretical models like VOACAP and ICEPAC, which use simplified ionospheric physics (Davies, 1990) with monthly medians and smoothed parameters (Lane et al., 1993). These models miss real-world phenomena: Sporadic-E openings, gray-line enhancement, geomagnetic storm impacts, and intra-day variability.
The WSPR (Weak Signal Propagation Reporter) network, created by Joe Taylor (K1JT), provides a fundamentally different approach: empirical observation at global scale. Since 2008, WSPR has accumulated over 10 billion machine-decoded propagation observations with calibrated signal-to-noise ratios, precise timestamps, and geographic coordinates for both transmitter and receiver.
This dataset represents an unprecedented resource for data-driven propagation prediction. However, working with 10+ billion rows requires serious computing infrastructure—infrastructure that historically meant expensive cloud services or institutional computing clusters.
This paper demonstrates that a self-hosted "sovereign" infrastructure can process this data at scale, enabling individual researchers and small teams to conduct serious scientific work without external dependencies or recurring cloud costs.
1.2 Design Philosophy¶
The infrastructure follows three core principles:
-
Sovereignty: No cloud dependencies for core research capabilities. All data, models, and inference run on local hardware. This eliminates recurring costs, ensures data privacy, and enables operation during network outages.
-
Performance at Scale: The system must handle 10+ billion rows efficiently—not just storage, but analytical queries, feature engineering, and model training at interactive speeds.
-
Reproducibility: Every data pipeline, training run, and validation test is deterministic and auditable. Configuration lives in version control. Results are reproducible across hardware.
1.3 Contributions¶
This paper makes the following contributions:
- Architecture documentation for a multi-node research computing cluster using commodity hardware
- Benchmarking methodology with reproducible results for ingestion, query, and inference workloads
- Cost analysis comparing self-hosted infrastructure to equivalent cloud services
- Practical guidance for amateur radio operators and citizen scientists building similar systems
2. Hardware Architecture¶
2.1 System Overview¶
The infrastructure consists of three primary compute nodes connected by a dedicated 10 Gbps direct-attach network:
| Node | Role | Primary Workload |
|---|---|---|
| Control Node (Threadripper 9975WX) | Ingestion, database, inference | ClickHouse, GPU inference |
| Sage Node (Mac Studio M3 Ultra) | Model training | PyTorch MPS backend |
| Forge Node (EPYC 7302P) | Backup, validation, VM host | Proxmox, replica services |
2.2 Control Node: Threadripper 9975WX¶
The primary compute node handles data ingestion, analytical queries, and large language model inference.
Specifications:
| Component | Details |
|---|---|
| CPU | AMD Threadripper PRO 9975WX (32C/64T, Zen 5, 5.3 GHz boost) |
| Memory | 128 GB DDR5 (8-channel, ~384 GB/s bandwidth) |
| GPU | NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM) |
| Storage | Micron 7450 PRO 960 GB (OS) + 2× Samsung 9100 PRO 4 TB (data) + HighPoint Rocket 1504 with 4× Samsung 990 PRO 4 TB (archive) |
| Network | 10 GbE (motherboard), Intel x710-DA4 (quad SFP+) |
| PSU | Seasonic Prime TX-1600 (1600W Titanium, ATX 3.1) |
| OS | Rocky Linux 9.7 |
Storage Layout:
The critical insight for high-throughput ingestion is I/O decoupling. Source files and database writes use separate NVMe devices:
| Device | Model | Mount | Size | Purpose |
|---|---|---|---|---|
| nvme5n1 | Micron 7450 PRO | / (boot/OS) |
960 GB | Operating system |
| nvme0n1 | Samsung SSD 9100 PRO | /data/working |
3.6 TB | Source files, working data |
| nvme1n1 | Samsung SSD 9100 PRO | /var/lib/clickhouse |
3.6 TB | ClickHouse data |
| nvme2-6 | Samsung 990 PRO × 4 | ZFS pool | 14.4 TB raw | Archive storage (via HighPoint Rocket 1504) |
The Samsung 9100 PRO (PM9E1) enterprise drives handle the hot read/write paths. The critical insight for high-throughput ingestion is I/O decoupling — source files and database writes use separate NVMe devices, eliminating I/O contention and enabling the system to sustain 24.67 Mrps ingestion rates.
Archive Storage (HighPoint Rocket 1504 + 4× Samsung 990 PRO 4TB):
A ZFS pool on the HighPoint NVMe adapter provides compressed archival storage:
| Dataset | Compression | Content |
|---|---|---|
wspr-data |
lz4 | WSPR CSV archives (185 GB compressed) |
contest-logs |
zstd-9 | CQ contest Cabrillo logs |
rbn-data |
lz4 | Reverse Beacon Network archives |
pskr-data |
lz4 | PSK Reporter live collection |
2.3 Sage Node: Mac Studio M3 Ultra¶
The training node runs PyTorch model development using Apple's Metal Performance Shaders (MPS) backend.
Specifications:
| Component | Details |
|---|---|
| SoC | Apple M3 Ultra (24-core CPU, 76-core GPU) |
| Memory | 96 GB unified memory |
| Storage | 2 TB internal SSD |
| Network | Thunderbolt 4 (10 Gbps to Control Node) |
| OS | macOS 15 |
The M3 Ultra's unified memory architecture eliminates CPU-GPU data transfer overhead, making it efficient for iterative model development. Training runs of 100 epochs on 38 million samples complete in approximately 4 hours.
2.4 Forge Node: EPYC 7302P¶
The backup and validation node runs Proxmox for virtualization and provides replica services.
Specifications:
| Component | Details |
|---|---|
| Motherboard | ASRock Rack ROME8D-2T (dual-socket EPYC) |
| CPU | AMD EPYC 7302P (16C/32T, 3.0 GHz base) |
| Memory | 128 GB DDR4-3200 ECC (8-channel) |
| GPU | NVIDIA RTX 5080 (16 GB VRAM) |
| Boot | Micron 7450 PRO (enterprise NVMe) |
| NVMe Pool | 4× Samsung 970 EVO Plus 1TB (2× mirror, 1.76 TB) |
| SAS HBA | LSI 9300-8i |
| SSD Pool | 8× Samsung 870 EVO 2TB (RAIDZ2, 10.8 TB) |
| Network | Intel x710-DA4, Mellanox dual-port 25 GbE |
| OS | Proxmox VE 9.1 |
2.5 Network Architecture¶
All inter-node communication uses direct-attach cables with jumbo frames (MTU 9000), eliminating switch latency:
| Link | Cable Type | Subnet | Latency |
|---|---|---|---|
| 9975WX ↔ M3 Ultra | Thunderbolt 4 | Subnet A | 0.19 ms |
| 9975WX ↔ Proxmox | SFP+ AOC | Subnet B | 0.12 ms |
| 9975WX ↔ TrueNAS | SFP+ AOC | Subnet C | 0.03 ms |
This "DAC Network" (Direct Attach Copper) provides 10 Gbps point-to-point bandwidth with sub-millisecond latency. Training scripts on the M3 query ClickHouse on the 9975WX at <control-node>:8123 with negligible network overhead.
3. Software Stack¶
3.1 ClickHouse: Analytical Database¶
ClickHouse v26.1 serves as the primary data store and analytical engine. Key configuration:
- Engine: MergeTree family with month-based partitioning
- Compression: LZ4 for hot data, ZSTD for archives
- Memory: 64 GB allocated to query cache
- Threads: 32 background merge threads
The database stores 14.68 billion rows across 263 GiB on disk. Principal tables:
| Table | Rows | Size | Description |
|---|---|---|---|
wspr.bronze |
10.96B | 193.7 GiB | Raw WSPR observations (2008–2026) |
rbn.bronze |
2.26B | 46.8 GiB | Reverse Beacon Network CW/RTTY/PSK31 spots |
pskr.bronze |
552M | 7.6 GiB | PSK Reporter FT8/WSPR spots (live MQTT) |
contest.bronze |
234M | 4.1 GiB | CQ contest SSB/RTTY QSOs |
solar.iri_lookup |
331M | 3.7 GiB | Pre-computed IRI-2020 ionospheric parameters (Bilitza et al., 2017) |
wspr.signatures_v2_terrestrial |
93.6M | 2.3 GiB | Aggregated WSPR training signatures |
rbn.signatures |
67.3M | 1.3 GiB | Aggregated RBN training signatures |
solar.bronze |
77K | <1 MiB | Solar indices (1932–present) |
solar.dscovr |
56K | 1.4 MiB | DSCOVR L1 solar wind (Bz, speed, density) |
3.2 Ingestion Pipeline¶
Custom Go tools (ionis-apps) handle high-performance data ingestion:
| Tool | Input | Protocol | Performance |
|---|---|---|---|
wspr-turbo |
.csv.gz | ch-go native | 24.67 Mrps (16–24 workers) |
wspr-shredder |
.csv | ch-go native | 21.81 Mrps |
rbn-ingest |
.zip (daily) | ch-go native | 2.26B rows in ~20 min |
contest-ingest |
Cabrillo logs | ch-go native | 234M rows from 120 log sets |
pskr-collector |
MQTT (live) | JSON stream | ~300 spots/sec (~26M/day) |
solar-download |
HTTP | REST | 10 NOAA/SIDC endpoints |
solar-backfill |
HTTP | GFZ Potsdam | SSN/SFI/Kp 1932–present |
dscovr-ingest |
HTTP | DSCOVR L1 | Solar wind Bz/speed/density, 15-min cron |
All ingestion tools use:
- ch-go native protocol (not HTTP) for maximum throughput
- LZ4 wire compression
- Vectorized columnar parsing (no row structs)
- Double-buffering with sync.Pool for zero-allocation hot paths
- CGO_ENABLED=0 for static binaries
3.3 Training Pipeline¶
PyTorch models train on the M3 Ultra using:
- Python 3.10 (UV-managed virtual environment)
- PyTorch 2.x with MPS backend
- Data loading: Streaming from ClickHouse over DAC network
- Checkpointing: SafeTensors format with JSON metadata
Training scripts query ClickHouse directly:
import clickhouse_connect
client = clickhouse_connect.get_client(host='<control-node>', port=8123)
df = client.query_df("SELECT * FROM wspr.signatures_v2_terrestrial LIMIT 10000000")
3.4 Local LLM Inference¶
The RTX PRO 6000 Blackwell (96 GB VRAM) runs large language models locally via llama.cpp:
| Model | Quantization | VRAM | Throughput |
|---|---|---|---|
| AstroSage-70B (sovereign) | Q5_K_M | 50.4 GB | ~30 tok/s |
| DeepSeek-R1-Distill-Llama-70B | FP8 | 72 GB | ~15 tok/s |
| Llama-3.1-70B | Q5_K_M | ~50 GB | ~30 tok/s |
llama.cpp with Q5_K_M quantization on Blackwell achieves double the throughput of SGLang with FP8. This enables: - Analysis of query results by domain-specialized LLMs - Multi-agent workflows for architecture review - No API costs or rate limits
4. Benchmark Methodology¶
4.1 Test Conditions¶
All benchmarks follow a consistent methodology:
- Cold cache:
echo 3 > /proc/sys/vm/drop_cachesbetween runs - Isolated workload: No concurrent processes
- Thermal stability: 10-minute warmup period
- Multiple runs: Minimum 3 runs, median reported
- Full dataset: 10.8B rows unless otherwise noted
4.2 Ingestion Benchmarks¶
wspr-turbo (streaming from compressed .gz archives):
| Workers | Time | Throughput | I/O Rate | Memory |
|---|---|---|---|---|
| 16 | 7m54s | 22.77 Mrps | 399 MB/s | 45.3 GB |
| 24 | 7m18s | 24.67 Mrps | 433 MB/s | 71.5 GB |
Key finding: "Compression is free"—decompression overhead is hidden by the ClickHouse merge wall at ~3.5 GB/s sustained writes. wspr-turbo (reading 185 GB compressed) outperforms wspr-shredder (reading 878 GB uncompressed) because both hit the same database bottleneck.
Cross-system comparison (wspr-turbo, 16 workers):
| System | Time | Throughput | Speedup |
|---|---|---|---|
| Threadripper 9975WX | 7m18s | 24.67 Mrps | 2.79× |
| Ryzen 9950X3D | 20m13s | 8.83 Mrps | baseline |
| EPYC 7302P | 23m27s | 7.63 Mrps | 0.86× |
The 8-channel DDR5 memory (384 GB/s bandwidth) of the 9975WX provides the critical advantage over 2-channel systems.
4.3 Query Benchmarks¶
Aggregation query (GROUP BY over 10.8B rows):
SELECT callsign, avg(snr) as avg_snr, count(*) as count
FROM wspr.bronze
GROUP BY callsign
ORDER BY count DESC
LIMIT 10;
| Metric | Result |
|---|---|
| Time | 3.050 sec |
| Rows processed | 10.80 billion |
| Throughput | 3.54 billion rows/s |
| Data scanned | 183.57 GB |
| Scan rate | 60.19 GB/s |
This query scans the entire WSPR corpus and returns the top transmitters in 3 seconds—interactive speed for exploratory analysis.
4.4 Training Benchmarks¶
Model: IonisGate (207K parameters) Data: 38M samples (WSPR + DXpedition + Contest signatures) Device: M3 Ultra (MPS backend)
| Metric | Result |
|---|---|
| Epochs | 100 |
| Time | 4h 16m |
| Samples/sec | ~2,500 |
| Final RMSE | 0.821σ |
| Final Pearson | +0.492 |
4.5 Inference Benchmarks¶
Model: AstroSage-70B (Q5_K_M, 49.9 GB) Device: RTX PRO 6000 Blackwell (96 GB VRAM) Runtime: llama.cpp
| Metric | Result |
|---|---|
| Generation speed | 29.7–30.8 tok/s |
| Prompt eval speed | 177–1,100 tok/s |
| First token latency | 68–90 ms |
| VRAM usage | 50.4 GB |
| Idle power | 78W |
| GPU temperature | 45°C |
5. Cost Analysis¶
5.1 Hardware Investment¶
Control Node (Threadripper 9975WX):
| Component | Cost (USD) |
|---|---|
| AMD Threadripper PRO 9975WX | $5,000 |
| NVIDIA RTX PRO 6000 (96 GB) | $9,000 |
| ASUS Pro WS TRX50-SAGE | $1,200 |
| Storage (3× NVMe) | $2,000 |
| Intel x710-DA4 + cables | $1,000 |
| Memory (128 GB DDR5 ECC) | $800 |
| PSU, case, cooling | $800 |
| Control Node subtotal | ~$19,800 |
Other Nodes:
| Component | Cost (USD) |
|---|---|
| Mac Studio M3 Ultra (96 GB) | $5,500 |
| EPYC 7302P system (refurbished) | $2,500 |
| Total | ~$27,800 |
Note: The EPYC system was acquired refurbished with existing components. A minimal two-node configuration (9975WX + M3 Ultra) would cost approximately $25,300.
5.2 Cloud Equivalent¶
To replicate these capabilities using major cloud providers (pricing as of February 2026):
Workload 1: Analytical Database (ClickHouse equivalent)
The database workload requires 32+ vCPUs, 128+ GB RAM, and high-throughput NVMe storage for 263 GiB of compressed analytical data. This maps to a memory-optimized instance class.
| Component | Specification | Monthly Cost |
|---|---|---|
| Compute | Memory-optimized, 32 vCPU, 256 GB RAM | $1,524 |
| Storage | High-IOPS NVMe block storage (4 TB, 16K IOPS) | $540 |
| Backup storage | Object storage (4 TB archive) | $92 |
| Subtotal | $2,156/mo |
Note: A managed columnar database service with equivalent compute (48 vCPU, always-on) runs approximately $8,500–9,000/month. Self-managed on a VM is significantly cheaper but requires operational overhead.
Workload 2: Model Training (GPU)
Training requires a single high-end GPU with 80–96 GB VRAM. Our production runs take ~4 hours, but experimentation (pre-flights, sweeps, failed runs) adds up. Conservative estimate: 200 GPU-hours/month across all experimentation.
| Component | Specification | Monthly Cost |
|---|---|---|
| GPU compute | Single 80 GB GPU instance, on-demand | $1,376 |
| ($6.88/hr × 200 hrs/mo) | ||
| Budget GPU providers | Boutique cloud, ~$2.50/GPU-hr | $500 |
| Range | $500–1,376/mo |
Note: Major cloud providers charge $6.88/hr for a single 80 GB GPU instance (on-demand). Boutique GPU cloud providers offer the same hardware at $1.50–3.00/hr but with less availability and no SLA.
Workload 3: LLM Inference (70B model, always-on)
Running a 70B parameter model locally requires ~50 GB VRAM with Q5_K_M quantization (or 72+ GB with FP8). Cloud equivalent: a multi-GPU inference instance running 24/7.
| Component | Specification | Monthly Cost |
|---|---|---|
| GPU inference | 4× 48 GB GPU instance, always-on | $7,555 |
| ($10.49/hr × 720 hrs/mo) | ||
| Alternative: API | Commercial LLM API (~30 tok/s equivalent) | $500–2,000 |
| Range | $500–7,555/mo |
Note: Self-hosting an LLM in the cloud is extremely expensive because the instance must run continuously. Using a commercial LLM API instead reduces cost but introduces rate limits, data privacy concerns, and API dependency — exactly the constraints sovereign infrastructure eliminates.
Workload 4: Data Storage and Transfer
| Component | Specification | Monthly Cost |
|---|---|---|
| Object storage | 4 TB raw archives | $92 |
| Data transfer (egress) | ~500 GB/mo cross-region | $45 |
| Subtotal | $137/mo |
Total Cloud Cost Summary:
| Scenario | Monthly | Annual | Notes |
|---|---|---|---|
| Full replication (self-managed DB, major GPU cloud, self-hosted LLM) | $11,224 | $134,688 | Closest to sovereign capability |
| Budget option (self-managed DB, boutique GPU, LLM API) | $3,293 | $39,516 | Compromises on LLM privacy/availability |
| Managed services (managed DB, major GPU, self-hosted LLM) | $17,818 | $213,816 | Zero operational overhead |
5.3 Break-Even Analysis¶
| Scenario | Self-Hosted Cost | Cloud Annual | Break-Even |
|---|---|---|---|
| Full replication | $27,800 | $134,688 | 2.4 months |
| Budget cloud | $27,800 | $39,516 | 8.4 months |
| Managed services | $27,800 | $213,816 | 1.5 months |
Even against the cheapest cloud option (boutique GPU providers, LLM API, self-managed database), the self-hosted infrastructure pays for itself in under 9 months. Against a full-capability replication with major cloud providers, break-even is under 3 months.
Ongoing self-hosted costs after hardware purchase:
| Item | Monthly | Annual |
|---|---|---|
| Electricity (~800W avg draw, $0.10/kWh) | $58 | $696 |
| Internet (existing residential) | $0 | $0 |
| Hardware replacement (amortized) | $83 | $1,000 |
| Total operating cost | $141/mo | $1,696/yr |
The self-hosted infrastructure operates at $1,696/year — 4.3% of the cheapest cloud equivalent and 1.3% of the full-replication cloud cost.
All figures verified by independent calculation. Cloud pricing sourced February 2026.
5.4 Hidden Benefits¶
Beyond direct cost savings:
- No rate limits: Cloud GPU quotas don't constrain experimentation. We ran 15 model versions (V2–V27) in rapid succession — cloud GPU availability would have been a bottleneck.
- Data sovereignty: All 14.68 billion rows stay on local NVMe. No data processing agreements, no compliance concerns, no egress charges for moving data between services.
- Network independence: Research continues during internet outages. The PSKR validation pipeline (552M spots → 8.4M signatures → model scoring → analysis) ran entirely on local hardware in a single afternoon.
- Iteration speed: No upload/download delays. A 10-epoch pre-flight runs in 25 seconds on local CUDA. Cloud round-trip (upload data, provision instance, run, download results) adds 15–30 minutes of overhead per iteration.
- Privacy: LLM inference on local hardware means research discussions, code review, and experimental analysis never leave the premises.
- Predictable costs: No surprise bills from a forgotten running instance or unexpected data transfer charges. The hardware cost is fixed and fully amortized.
6. Applications¶
6.1 IONIS: Ionospheric Neural Inference System¶
The primary application of this infrastructure is the IONIS project—a neural network that predicts HF propagation from WSPR observations and solar indices.
Training data (14.68 billion rows total, 263 GiB on disk):
| Source | Volume | Signatures | Years |
|---|---|---|---|
| WSPR | 10.96B spots | 93.6M | 2008–2026 |
| Reverse Beacon Network | 2.26B spots | 67.3M | 2009–2026 |
| CQ Contests | 234M QSOs | 5.7M | 2005–2025 |
| DXpedition | 3.9M paths | 260K | 2009–2025 |
| PSK Reporter | 552M+ (live MQTT) | 8.4M | Feb 2026+ |
| Solar/IRI | 331M IRI + 77K solar + 56K DSCOVR | — | 1932–2026 |
Model performance (V22-gamma, current production):
- Pearson correlation: +0.492 (SNR prediction vs. actual)
- RMSE: 0.821σ (normalized)
- 17/17 operator-grounded validation tests passing (with physics override)
- 97.86% recall on 8.4M independent PSK Reporter spots (raw model, no override)
- 9/11 TST-900 physics gate battery
6.2 PSKR Validation Pipeline: Infrastructure in Action¶
The most compelling demonstration of this infrastructure's capability came on 2026-02-26, when the PSKR validation pipeline scored the IONIS model against 8.4 million independent PSK Reporter observations:
- Aggregation: 552M raw PSKR spots → 8.4M signatures via ClickHouse SQL aggregation + Python numpy (haversine distance, azimuth computation). Wall time: 31 seconds.
- Scoring: 8.4M signatures through V22-gamma model inference on RTX PRO 6000 (CUDA). Two complete runs (with and without physics override). Wall time: ~100 minutes per run.
- Cross-validation: Identical scoring on M3 Ultra (MPS backend) over the DAC network. Results matched to the last decimal — deterministic across platforms.
- Analysis: 8 diagnostic queries against 16.8M scored results, returning in seconds via ClickHouse.
The entire pipeline — from raw MQTT spots to actionable failure analysis — ran on local hardware with zero cloud involvement. The result: discovery that the physics override layer was introducing 261,030 false negatives while fixing exactly zero misses, directly informing the next model architecture.
6.3 Real-Time Monitoring¶
The infrastructure supports continuous validation against live data:
- PSK Reporter: ~26M HF spots/day via MQTT (live since 2026-02-10)
- WSPR: ~3M spots/day
- Solar indices: 15-minute updates from NOAA SWPC
- DSCOVR L1: Solar wind Bz/speed/density, 15-minute cron
6.4 Multi-Agent Research Workflow¶
The local LLM capability enables multi-agent workflows where specialized AI agents analyze different aspects of the research:
| Agent | Callsign | Role | Interface | Host |
|---|---|---|---|---|
| Dr. Watson | Training Agent | Training, code | IDE | Sage Node |
| Bob | Infrastructure Agent | Infrastructure, SQL, CUDA | IDE | Control Node |
| Patton | Failure Analyst | Failure analysis, skeptical review | Desktop App | macOS |
| Einstein | Physics Consultant | Physics theory, constraints | Browser | Cloud |
| Newton | AstroSage | Sovereign local inference | llama.cpp / Open WebUI | 9975WX |
| Judge | KI7MT | Final authority, tiebreaker | Human | — |
Agents coordinate via a ClickHouse message queue (messages.inbox) — the same database both coding agents already query. No external Slack, no email, no new infrastructure. A simple INSERT to send, SELECT to receive. This "Dialectical Engine" approach subjects every architectural decision to multi-agent falsification testing before implementation. See The Dialectical Engine (companion paper) for the full methodology.
7. Lessons Learned¶
7.1 Memory Bandwidth Matters¶
For analytical workloads, memory bandwidth dominates performance. The Threadripper 9975WX's 8-channel DDR5 (~384 GB/s) achieves 2.79× the throughput of dual-channel systems, despite similar clock speeds. Investment in memory channels provides better returns than faster individual DIMMs.
7.2 I/O Decoupling is Critical¶
Placing source files and database writes on separate NVMe devices eliminated the primary ingestion bottleneck. This simple architectural decision enabled a 30% throughput increase at zero additional cost.
7.3 Compression is (Often) Free¶
The ClickHouse merge wall at ~3.5 GB/s becomes the bottleneck before decompression overhead matters. Reading 185 GB compressed is faster than reading 878 GB uncompressed because both hit the same database constraint.
7.4 Unified Memory Changes the Game¶
The M3 Ultra's unified memory architecture makes it surprisingly competitive for model training. Eliminating CPU-GPU transfer overhead compensates for lower peak FLOPS compared to discrete GPUs.
7.5 Local LLMs Enable New Workflows¶
Running 70B models locally (at ~30 tok/s with Q5_K_M on Blackwell) transforms what's possible for individual researchers. Multi-agent review workflows, automated analysis of experimental results, and domain-specialized inference all become practical without API costs or rate limits.
7.6 Process Discipline Transfers from Manufacturing¶
The project applies CLCA (Closed Loop Corrective Actions) — the Ford 8D / semiconductor process control methodology — to every layer of the ML pipeline. Identify the defect, measure it, action plan, fix, verify. This manufacturing quality engineering discipline, adapted from the author's background in semiconductor metrology, prevented several classes of errors that pure software engineering approaches would have missed:
- Containment actions: When the model couldn't close 10m paths at night, a physics override was deployed within hours as containment (D3). The PSKR validation battery then measured its collateral damage (D4), revealing 261K false negatives. The corrective action (D5) — synthetic negatives — addresses the root cause rather than patching symptoms.
- Audit trail: Every training run is logged to ClickHouse with configuration, epoch metrics, and final results. Failed experiments are as rigorously documented as successes, creating a traceable lineage from V2 through V27.
- One variable at a time: Each model version changes exactly one thing relative to its predecessor. This eliminates confounded variables and ensures that when something breaks, the root cause is identifiable.
8. Reproducibility¶
8.1 Source Code¶
All software is available in public repositories:
| Repository | Language | Purpose |
|---|---|---|
| ionis-apps | Go | Ingestion tools |
| ionis-core | SQL/Autotools | Database schemas |
| ionis-cuda | C++/CUDA | Embedding engine |
| ionis-training | Python | Model training |
| ionis-devel | Markdown | Documentation |
8.2 Configuration¶
System configuration is documented in:
- CLAUDE.md: Project instructions and agent coordination
- SOVEREIGN-STACK.md: Hardware specifications and benchmarks
- planning/*.md: Architecture decisions and validation frameworks
8.3 Benchmark Scripts¶
All benchmark results can be reproduced using scripts in ionis-apps/benchmarks/ and queries in ionis-core/scripts/.
9. Model Development History¶
The IONIS project has iterated through 15 model versions (V2–V27), with the infrastructure supporting rapid experimentation. The failure record is as valuable as the successes:
| Version | Key Change | Result | Lesson |
|---|---|---|---|
| V16 | Contest-anchored training | Pearson +0.487, 84% live recall | Established physics constraints |
| V20 | Clean config-driven rewrite | Pearson +0.488, RMSE 0.862σ | Proved V16 constraints are complete |
| V22-gamma | Solar depression + cross-products | Pearson +0.492, RMSE 0.821σ | Best physics model (current production) |
| V23 | IRI-2020 ionospheric features | Record RMSE but physics gates failed | IRI creates shortcuts, bypasses learning |
| V24 | Removed sun sidecar | Both sidecars died | Architecture constraints are load-bearing |
| V25 | 2D SFI sidecar | Optimizer killed SFI dimension | Scalar SFI is vestigial (clamp floor) |
| V26 | Band-specific output heads | Learned bias, not physics | Multi-head is a dead end |
| V27 | Physics-informed loss | Fine-tuning destroyed trunk | Never fine-tune from converged checkpoint |
Eight architectural dead ends, all proven by experiment, all documented. The infrastructure's speed (10-epoch pre-flight in 25 seconds on CUDA, full training in 4 hours on MPS) made this iteration rate practical for a single researcher.
10. Conclusion¶
We have demonstrated that a carefully architected self-hosted infrastructure can support serious scientific research at scale. The key findings:
- 24.67 Mrps ingestion is achievable on commodity hardware with proper I/O architecture
- 60 GB/s query throughput enables interactive analysis of 10+ billion rows
- Local 70B LLM inference at 30 tok/s eliminates cloud API dependencies
- ~$28,000 one-time investment replaces $39,500–$214,000/year in cloud costs (break-even in 2–9 months)
- 14.68 billion rows, 263 GiB stored and queryable on a single workstation
This infrastructure enables the IONIS project to work with one of the largest curated amateur radio propagation datasets we are aware of — 14.68 billion rows spanning 18 years of observations from five independent data sources — entirely on local hardware. The PSKR validation pipeline demonstrated the full capability: 552 million raw spots aggregated into 8.4 million signatures, scored against the production model, cross-validated on two platforms, and analyzed — all in a single afternoon, with zero cloud involvement.
For amateur radio operators, citizen scientists, and small research teams, sovereign infrastructure provides a path to serious data science without institutional resources or recurring cloud costs. The ionosphere has been writing data for 18 years. Now we have the infrastructure to read it.
Data Ethics and Privacy¶
This project exists because of the amateur radio community. Every observation in our dataset was generated by a ham radio operator somewhere in the world, running a beacon, making a contact, or submitting a log. We treat this data with appropriate care.
Non-Commercial, Open Source: There is no commercial motive behind this project. There is no monetization plan. All code, tools, and models are released under the GNU General Public License v3 (GPLv3). We redistribute only derived works (trained models, aggregated signatures) — never raw data.
Data Privacy: We have zero interest in personal data. We study propagation paths between grid squares, not the operators who generated them. Amateur radio callsigns appear in the bronze (raw ingest) layer only for grid square resolution. By the time data reaches the model-facing layer, all callsigns have been stripped. The model receives only: grid-pair, band, time of day, month, solar flux, geomagnetic index, distance, and bearing. No callsign, no name, no personal identifier.
Publicly Available Data: All source data is publicly available, provided by community organizations for public use:
- WSPR beacons are transmitted over public radio frequencies specifically to be received and reported. Operators deliberately upload spots to WSPRNet's public database.
- RBN spots are machine-decoded from public radio transmissions and provided as a public service.
- Contest logs are submitted by operators who explicitly consent to public posting as a condition of participation (CQ WW Rules Section XIII).
Amateur radio is, by its nature, a public service. ITU Radio Regulations and national licensing authorities require operators to identify themselves during every transmission. There is no expectation of privacy for amateur radio callsigns.
Ethical Data Stewardship: Our ingestion tools are designed to be polite. We enforce rate limits, identify ourselves honestly in User-Agents, and respect volunteer-run servers. We archive data for long-term preservation, not just analysis.
Attribution: We rigorously acknowledge all data sources. When we publish results, we share findings with the data providers and the amateur radio community.
Acknowledgments¶
This project exists because of the amateur radio community. To every operator who has ever transmitted a WSPR beacon, worked a contest, appeared on the Reverse Beacon Network, or shown up on PSK Reporter: thank you.
Data Sources:
- Joe Taylor, K1JT — Creator of WSPR, WSJT, WSJT-X, and MAP65. Nobel Laureate (Physics, 1993). The WSPR protocol that generates the observations we study is his contribution to amateur radio.
- WSPRNet (wsprnet.org) — The community-operated database that collects and archives WSPR spot reports from operators worldwide.
- Reverse Beacon Network (reversebeacon.net) — N4ZR, PY1NB, VE3NEA, and the global network of CW/RTTY skimmer operators who contribute 2.26 billion spots spanning 2009–2026.
- PSK Reporter (pskreporter.info) — Created by Philip Gladstone, N1DQ. MQTT real-time feed by Tom Sevart, M0LTE.
- World Wide Radio Operators Foundation (WWROF) — CQ WW and CQ WPX contest log archives.
- American Radio Relay League (ARRL) — Contest log archives from contests.arrl.org.
- GDXF Mega DXpeditions Honor Roll — Curated DXpedition catalog by Bernd, DF3CB.
Solar & Geomagnetic Data:
- NOAA Space Weather Prediction Center (SWPC) — Real-time F10.7 solar flux, Kp index, GOES X-ray flux.
- SIDC, Royal Observatory of Belgium — World Data Center for sunspot numbers (1749–present).
- GFZ Potsdam (Helmholtz Centre) — Definitive Kp/Ap indices, composite SSN/SFI file (1932–present).
- Penticton / NRC Canada — Dominion Radio Astrophysical Observatory, the world's primary 10.7 cm solar flux measurement station.
- NOAA NCEI — DSCOVR L1 solar wind data (Bz, speed, density).
Software:
- ClickHouse (ClickHouse Inc., originally Yandex) — Columnar database engine.
- voacapl — Linux port of VOACAP by James Watson, HZ1JW.
- PyTorch (Meta AI) — Machine learning framework.
- llama.cpp (Georgi Gerganov) — Local LLM inference.
References¶
-
Bilitza, D., et al. (2017). International Reference Ionosphere 2016: From ionospheric climate to real-time weather predictions. Space Weather, 15(2), 418-429.
-
Davies, K. (1990). Ionospheric Radio. Peter Peregrinus Ltd.
-
Lane, G., et al. (1993). Voice of America Coverage Analysis Program (VOACAP). NTIA Report 93-299.
How to cite
Beam, G. (KI7MT). "Sovereign Infrastructure for Citizen Science: Building a Self-Hosted Research Lab for Radio Propagation Analysis." IONIS Project Technical Papers, Version 1.1, February 2026. Available at: ionis-ai.com/papers/sovereign-infrastructure
Appendix A: Bill of Materials¶
A.1 Control Node (Threadripper 9975WX) — ~$19,800¶
| Component | Model | Qty | Cost | Notes |
|---|---|---|---|---|
| CPU | AMD Threadripper PRO 9975WX | 1 | $5,000 | 32C/64T, TRX50 |
| GPU | NVIDIA RTX PRO 6000 | 1 | $9,000 | 96 GB VRAM |
| Motherboard | ASUS Pro WS TRX50-SAGE | 1 | $1,200 | 7× PCIe 5.0 |
| Memory | DDR5-5600 ECC 16GB | 8 | $800 | 8-channel config (128 GB total) |
| NVMe (OS) | Micron 7450 PRO 960GB | 1 | — | Enterprise boot drive |
| NVMe (Data) | Samsung SSD 9100 PRO 4TB | 2 | — | ClickHouse + Working (PM9E1) |
| NVMe Adapter | HighPoint Rocket 1504 | 1 | — | PCIe Gen4 ×16, 4-port NVMe |
| NVMe (Archive) | Samsung 990 PRO 4TB | 4 | $2,000 | ZFS archive pool |
| NIC | Intel x710-DA4 | 1 | $1,000 | Quad SFP+ (+ AOC cables) |
| PSU | Seasonic Prime TX-1600 | 1 | — | 1600W Titanium, ATX 3.1, PCIe 5.1 |
| Case | Corsair 9000D | 1 | — | Full tower |
| Fans | Noctua NF-A12x25 120mm | 16 | — | Positive pressure config |
A.2 Sage Node (Mac Studio M3 Ultra)¶
| Component | Model | Notes |
|---|---|---|
| System | Mac Studio M3 Ultra | 96 GB unified memory |
| Storage | 2 TB SSD | Internal |
| Network | Thunderbolt 4 | 10 Gbps to Control Node |
A.3 Networking¶
| Component | Model | Qty |
|---|---|---|
| SFP+ AOC | FS.com 10G-AOC-3M | 3 |
| Thunderbolt 4 cable | Apple 3m | 1 |
Appendix B: Benchmark Raw Data¶
B.1 wspr-turbo 24-Worker Run (2026-02-01)¶
Start time: 2026-02-01T14:23:17Z
End time: 2026-02-01T14:30:35Z
Duration: 7m18s
Files processed: 2,847
Total rows: 10,956,387,569
Throughput: 24.67 Mrps
I/O Rate: 433 MB/s (compressed)
Effective Rate: 1.88 GB/s (uncompressed)
Memory (sys): 71.5 GB
GC cycles: 432
Peak CPU: 100% (64 threads)
Temperature: 74°C