Oracle Test Specification¶

Document Version: 1.2
Model Version: IonisGate (Production)
Checkpoint: versions/v20/ionis_v20.pth
Date: 2026-02-05
Author: IONIS

Overview¶

This document specifies the automated test suite for IONIS. Each test has:

- ID: Unique identifier (TST-XXX)
- Purpose: What physics or behavior is being validated
- Method: How the test works
- Expected Result: What constitutes PASS/FAIL
- Failure Mode: What a failure indicates
- Hallucination Trap: Tests designed to catch model overconfidence

The test suite runs via:

python scripts/oracle_v12.py --test

Test Groups¶

Core Tests (Domain-Specific)¶

Group	ID Range	Purpose
Canonical Paths	TST-100	Known HF paths with expected behavior
Physics Constraints	TST-200	Monotonicity and sidecar validation
Input Validation	TST-300	Boundary checks and invalid input rejection
Hallucination Traps	TST-400	Inputs outside training domain

Extended Tests (Standard ML)¶

Group	ID Range	Purpose
Model Robustness	TST-500	Determinism, stability, numerical safety
Adversarial/Security	TST-600	Malicious input handling
Bias & Fairness	TST-700	Systematic prediction biases
Regression	TST-800	Catch silent degradation

Note: Extended tests are standard ML model validation — they apply to any neural network regardless of domain. Core tests are specific to ionospheric propagation physics.

Group 1: Canonical Paths (TST-100)¶

These tests verify the model produces reasonable predictions for well-known HF propagation paths.

TST-101: US East Coast to Western Europe (20m Day)¶

Field	Value
Purpose	Validate classic transatlantic 20m path during daylight
TX Location	W3 area: 39.14°N, 77.01°W (Maryland)
RX Location	G area: 51.50°N, 0.12°W (London)
Frequency	14.0 MHz (20m)
Conditions	SFI 150, Kp 2, 14:00 UTC
Distance	~5,900 km
Expected Result	SNR > -25 dB (path OPEN)
Pass Criteria	Model predicts usable WSPR signal
Failure Mode	If CLOSED: Model underestimating F2 skip on mid-latitude path
Notes	This is the most reliable transatlantic path; should always be open under these conditions

TST-102: US East Coast to Western Europe (20m Night)¶

Field	Value
Purpose	Validate transatlantic path during darkness (grey line effects)
TX Location	W3 area: 39.14°N, 77.01°W
RX Location	G area: 51.50°N, 0.12°W
Frequency	14.0 MHz (20m)
Conditions	SFI 150, Kp 2, 04:00 UTC
Expected Result	SNR > -25 dB (path OPEN, possibly marginal)
Pass Criteria	Model predicts path open (grey line propagation)
Failure Mode	If CLOSED: Model not capturing grey line enhancement
Notes	20m can stay open on this path even at night due to grey line

TST-103: US West Coast to Japan (20m)¶

Field	Value
Purpose	Validate long-distance trans-Pacific crossing (short-path great circle)
TX Location	W6 area: 34.05°N, 118.24°W (Los Angeles)
RX Location	JA area: 35.68°N, 139.69°E (Tokyo)
Frequency	14.0 MHz (20m)
Conditions	SFI 150, Kp 2, 16:00 UTC
Distance	~8,800 km
Expected Result	SNR > -25 dB (path OPEN)
Pass Criteria	Model predicts usable signal on trans-Pacific path
Failure Mode	If CLOSED: Model underestimating long-distance multi-hop propagation
Notes	Classic DX path; well-represented in WSPR data

TST-104: Greenland to Finland (Polar Path, Quiet)¶

Field	Value
Purpose	Validate high-latitude path under quiet geomagnetic conditions
TX Location	OX area: 64.18°N, 51.72°W (Nuuk, Greenland)
RX Location	OH area: 60.17°N, 24.94°E (Helsinki)
Frequency	14.0 MHz (20m)
Conditions	SFI 150, Kp 2, 12:00 UTC
Distance	~3,800 km
Expected Result	SNR > -25 dB (path OPEN)
Pass Criteria	Model predicts open path when Kp is low
Failure Mode	If CLOSED: Model over-penalizing high-latitude paths
Notes	Polar paths are viable when geomagnetically quiet

TST-105: Greenland to Finland (Polar Path, Storm)¶

Field	Value
Purpose	Validate storm degradation on high-latitude path
TX Location	OX area: 64.18°N, 51.72°W
RX Location	OH area: 60.17°N, 24.94°E
Frequency	14.0 MHz (20m)
Conditions	SFI 150, Kp 8, 12:00 UTC
Expected Result	SNR degraded but > -25 dB (MARGINAL)
Pass Criteria	Model shows significant degradation vs TST-104
Failure Mode	If no degradation: Storm sidecar not working
Notes	Kp 8 is severe; path should be heavily degraded but WSPR may still decode
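
The relationship between TST-104 and TST-105 can be sketched as a comparison of two predictions. This is a minimal illustration, not the oracle's actual code: the function name `check_storm_degradation` and the `min_drop_db` value are hypothetical (the specification only says "significant degradation"), while the -25 dB floor is the path-open threshold used throughout this group.

```python
OPEN_THRESHOLD_DB = -25.0  # path-open threshold used throughout Group 1


def check_storm_degradation(snr_quiet_db, snr_storm_db, min_drop_db=1.0):
    """Return (passed, drop_db) for a quiet-vs-storm comparison.

    min_drop_db is a hypothetical stand-in for "significant degradation";
    the specification does not give a number.
    """
    drop_db = snr_quiet_db - snr_storm_db
    degraded = drop_db >= min_drop_db
    still_decodable = snr_storm_db > OPEN_THRESHOLD_DB
    return degraded and still_decodable, drop_db


# Illustrative values only, not model output.
passed, drop = check_storm_degradation(snr_quiet_db=-18.0, snr_storm_db=-21.0)
print(passed, drop)
```

Returning the drop alongside the verdict keeps the number available for the report even when the test fails.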

TST-106: Brazil to India (Equatorial Path)¶

Field	Value
Purpose	Validate equatorial/trans-equatorial propagation
TX Location	PY area: 23.55°S, 46.63°W (São Paulo)
RX Location	VU area: 12.97°N, 77.59°E (Bangalore)
Frequency	14.0 MHz (20m)
Conditions	SFI 150, Kp 2, 14:00 UTC
Distance	~14,000 km
Expected Result	SNR > -25 dB (path OPEN)
Pass Criteria	Model predicts long equatorial path viable
Failure Mode	If CLOSED: Model underestimating equatorial F2
Notes	Equatorial paths less affected by Kp storms

TST-107: NVIS 80m (Short Path)¶

Field	Value
Purpose	Validate Near Vertical Incidence Skywave on 80m
TX Location	40.0°N, 100.0°W (Central US)
RX Location	42.0°N, 98.0°W (~280 km away)
Frequency	3.5 MHz (80m)
Conditions	SFI 100, Kp 2, 02:00 UTC (night)
Distance	~280 km
Expected Result	SNR > -20 dB (strong NVIS)
Pass Criteria	Model predicts strong signal on short nighttime 80m path
Failure Mode	If weak: Model not capturing NVIS propagation
Notes	80m NVIS is bread-and-butter regional communication

TST-108: US to Europe 10m (Low SFI)¶

Field	Value
Purpose	Validate 10m behavior under marginal solar conditions
TX Location	W3 area: 39.14°N, 77.01°W
RX Location	G area: 51.50°N, 0.12°W
Frequency	28.0 MHz (10m)
Conditions	SFI 80, Kp 2, 14:00 UTC
Expected Result	SNR > -25 dB (marginal but OPEN for WSPR)
Pass Criteria	Model predicts degraded but usable path
Failure Mode	N/A — test validates relative behavior vs TST-109
Notes	Low SFI makes 10m difficult but not impossible

TST-109: US to Europe 10m (High SFI)¶

Field	Value
Purpose	Validate 10m improvement with high solar flux
TX Location	W3 area: 39.14°N, 77.01°W
RX Location	G area: 51.50°N, 0.12°W
Frequency	28.0 MHz (10m)
Conditions	SFI 200, Kp 2, 14:00 UTC
Expected Result	SNR better than TST-108, > -20 dB
Pass Criteria	Model shows SFI improvement on 10m
Failure Mode	If no improvement: Sun sidecar not affecting higher bands
Notes	High SFI should significantly improve 10m propagation
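
The Distance rows in this group can be cross-checked from the listed coordinates with the haversine formula. The sketch below is an independent check, not part of the oracle: the function name `great_circle_km` is hypothetical, a spherical Earth with a 6371 km mean radius is assumed, and the result is the short-path great-circle distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Short-path great-circle distance in km (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Coordinates copied from the tables above (west and south are negative).
PATHS = {
    "TST-101 W3 -> G": (39.14, -77.01, 51.50, -0.12),
    "TST-103 W6 -> JA": (34.05, -118.24, 35.68, 139.69),
    "TST-104 OX -> OH": (64.18, -51.72, 60.17, 24.94),
    "TST-106 PY -> VU": (-23.55, -46.63, 12.97, 77.59),
    "TST-107 NVIS": (40.0, -100.0, 42.0, -98.0),
}

for name, coords in PATHS.items():
    print(f"{name}: {great_circle_km(*coords):,.0f} km")
```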

Group 2: Physics Constraints (TST-200)¶

These tests verify the model's learned physics matches ionospheric reality.

Physics Scoring System¶

Each physics test is graded on a 0-100 scale based on how well the model matches expected ionospheric behavior.

Grade	Score	Meaning
A	90-100	Excellent — matches real-world physics closely
B	75-89	Good — correct direction, reasonable magnitude
C	60-74	Acceptable — correct direction, weak magnitude
D	40-59	Poor — barely correct or flat response
F	0-39	Fail — wrong direction or no response

Scoring Criteria by Test Type¶

SFI Monotonicity (TST-201, TST-205)

Expected: +1 to +4 dB improvement for SFI 70→200

Delta (dB)	Score	Grade
≥ +3.0	100	A
+2.0 to +2.9	85	B
+1.0 to +1.9	70	C
+0.1 to +0.9	50	D
≤ 0	0	F

Kp Storm Cost (TST-202, TST-204)

Expected: +2 to +6 dB degradation for Kp 0→9

Cost (dB)	Score	Grade
≥ +4.0	100	A
+3.0 to +3.9	90	A
+2.0 to +2.9	75	B
+1.0 to +1.9	60	C
+0.1 to +0.9	40	D
≤ 0	0	F
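
The two threshold tables above can be read as ordered lookups: walk the rows from the top and take the first one whose lower bound is met. The sketch below is one way to encode that, not the oracle's implementation. The names `score` and `grade` are hypothetical, and values that fall between the rounded bins (for example +2.95 dB) are assumed to take the lower bin. The same pattern would apply to the other criteria tables in this section.

```python
# Each table is a list of (lower_bound, score), highest bound first.
SFI_TABLE = [(3.0, 100), (2.0, 85), (1.0, 70), (0.1, 50)]
KP_COST_TABLE = [(4.0, 100), (3.0, 90), (2.0, 75), (1.0, 60), (0.1, 40)]

# Letter grades from the Physics Scoring System table.
GRADE_BANDS = [(90, "A"), (75, "B"), (60, "C"), (40, "D")]


def score(value, table):
    """Return the score of the first row whose lower bound is met, else 0."""
    for lower_bound, points in table:
        if value >= lower_bound:
            return points
    return 0


def grade(points):
    """Map a 0-100 score to a letter grade."""
    for lower_bound, letter in GRADE_BANDS:
        if points >= lower_bound:
            return letter
    return "F"


sfi_points = score(2.1, SFI_TABLE)
print(sfi_points, grade(sfi_points))
```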

D-Layer Absorption (TST-203)

Expected: 20m better than 80m at noon by +1 to +5 dB

Delta (dB)	Score	Grade
≥ +3.0	100	A
+1.0 to +2.9	80	B
0 to +0.9	60	C
-1.0 to -0.1	40	D
< -1.0	0	F

Polar Storm Sensitivity (TST-204)

Expected: High-latitude paths more affected by Kp than mid-latitude

Polar vs Mid-lat ratio	Score	Grade
≥ 1.2x	100	A
1.1x to 1.19x	80	B
1.0x to 1.09x	60	C
0.9x to 0.99x	40	D
< 0.9x	0	F

Overall Physics Score¶

The model receives an aggregate physics score:

Physics Score = (TST-201 + TST-202 + TST-203 + TST-204 + TST-205 + TST-206) / 6

Overall Score	Rating
90-100	Production Ready
75-89	Research Quality
60-74	Needs Improvement
< 60	Not Recommended
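
As a worked example of the aggregate, the sketch below averages six per-test scores and maps the mean to a rating band. It is an illustration under stated assumptions rather than the oracle's code: `overall_rating` is a hypothetical name, the scores are read off the scoring tables from the Actual values documented for TST-201 through TST-206, and TST-204 is assumed to be scored with the Kp Storm Cost table because it is listed under two tables.

```python
from statistics import mean

# Lower bounds of the rating bands from the table above.
RATINGS = [(90, "Production Ready"), (75, "Research Quality"), (60, "Needs Improvement")]


def overall_rating(scores):
    """Average per-test scores and map the result to a rating band."""
    physics_score = mean(scores.values())
    for lower_bound, label in RATINGS:
        if physics_score >= lower_bound:
            return physics_score, label
    return physics_score, "Not Recommended"


# Scores looked up from each test's documented Actual value.
scores = {
    "TST-201": 85,   # +2.1 dB SFI improvement
    "TST-202": 100,  # +4.0 dB storm cost
    "TST-203": 60,   # +0.0 dB D-layer delta
    "TST-204": 75,   # +2.5 dB, ASSUMED to use the Kp Storm Cost table
    "TST-205": 85,   # +2.0 dB SFI improvement on 10m
    "TST-206": 70,   # +0.2 dB grey line delta
}
value, label = overall_rating(scores)
print(f"{value:.1f} -> {label}")
```

With these inputs the mean is 475 / 6, about 79.2, which falls in the 75-89 band.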

TST-201: SFI Monotonicity (70 vs 200)¶

Field	Value
Purpose	Verify higher solar flux improves signal strength
Method	Compare SNR at SFI 70 vs SFI 200, all else equal
Path	W3 → G, 20m, Kp 2, 14:00 UTC
Expected Result	SNR(SFI 200) > SNR(SFI 70) by at least +1 dB
Pass Criteria	Delta is positive
Failure Mode	If negative or zero: Sun sidecar physics inverted or dead
Actual	+2.1 dB improvement
Notes	This is fundamental ionospheric physics — higher SFI = higher MUF = better HF
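
The method reduces to evaluating the predictor twice with only SFI changed and checking the sign of the difference. The sketch below shows that shape with a toy predictor. Both `toy_predict_snr` and `sfi_delta_db` are hypothetical names, and the toy formula is not the IONIS model; it only makes the example runnable.

```python
import math


def toy_predict_snr(sfi, kp, freq_mhz=14.0):
    """Hypothetical stand-in for the oracle: NOT the IONIS model."""
    return -22.0 + 6.0 * math.log10(sfi / 70.0) - 0.4 * kp


def sfi_delta_db(predict, low=70, high=200, **fixed):
    """SNR(high SFI) - SNR(low SFI) with every other input held fixed."""
    return predict(sfi=high, **fixed) - predict(sfi=low, **fixed)


delta = sfi_delta_db(toy_predict_snr, kp=2)
print(f"delta = {delta:+.2f} dB, pass = {delta > 0}")
```

Passing the fixed inputs as keyword arguments makes it hard to change anything other than SFI between the two calls by accident.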

TST-202: Kp Monotonicity (0 vs 9)¶

Field	Value
Purpose	Verify geomagnetic storms degrade signal strength
Method	Compare SNR at Kp 0 vs Kp 9, all else equal
Path	W3 → G, 20m, SFI 150, 14:00 UTC
Expected Result	SNR(Kp 9) < SNR(Kp 0), lower by at least 2 dB
Pass Criteria	Delta is negative (storm cost positive)
Failure Mode	If positive: Storm sidecar physics inverted (CRITICAL BUG)
Actual	+4.0 dB storm cost
Notes	This was the "Kp inversion problem" that plagued V1-V9

TST-203: D-Layer Absorption (80m vs 20m at Noon)¶

Field	Value
Purpose	Verify daytime D-layer absorption affects lower frequencies
Method	Compare 3.5 MHz vs 14.0 MHz at solar noon
Path	W3 → G, SFI 150, Kp 2, 12:00 UTC
Expected Result	SNR(20m) >= SNR(80m) at noon
Pass Criteria	Delta >= 0 dB
Failure Mode	If 80m better at noon: Model missing D-layer physics
Actual	+0.0 dB (equal)
Notes	The model predicts equal SNR on both bands; real D-layer physics predicts 20m is better than 80m at noon.

TST-204: Polar Storm Degradation (Kp 2 vs 8)¶

Field	Value
Purpose	Verify storms hit high-latitude paths harder
Method	Compare Kp 2 vs Kp 8 on polar path
Path	OX → OH, 20m, SFI 150, 12:00 UTC
Expected Result	Storm cost > 2 dB
Pass Criteria	Significant degradation observed
Failure Mode	If < 1 dB: Storm gate not modulating by latitude
Actual	+2.5 dB degradation
Notes	Validates latitude-dependent storm sensitivity

TST-205: 10m SFI Sensitivity¶

Field	Value
Purpose	Verify higher bands more sensitive to SFI
Method	Compare SFI 80 vs 200 on 10m path
Path	W3 → G, 28 MHz, Kp 2, 14:00 UTC
Expected Result	Delta > +1.5 dB
Pass Criteria	10m shows strong SFI dependence
Failure Mode	If < 1 dB: Sun sidecar not frequency-aware
Actual	+2.0 dB improvement
Notes	10m needs high SFI; model should capture this

TST-206: Grey Line / Twilight Enhancement¶

Field	Value
Purpose	Verify model captures grey line propagation enhancement
Method	Compare SNR at 14:00 UTC vs 18:00 UTC on E-W path
Path	W3 → G, 20m, SFI 150, Kp 2
Expected Result	SNR(18 UTC) >= SNR(14 UTC)
Pass Criteria	Twilight hour shows equal or better propagation
Failure Mode	If negative: Model missing grey line physics
Actual	+0.2 dB enhancement
Notes	Grey line (twilight) often enhances E-W paths due to lower D-layer absorption

Grey Line Scoring Criteria

Delta (dB)	Score	Grade
≥ +1.0	100	A
+0.5 to +0.9	85	B
0 to +0.4	70	C
-0.5 to -0.1	50	D
< -0.5	0	F

Group 3: Input Validation (TST-300)¶

These tests verify the oracle rejects invalid inputs gracefully.

TST-301: VHF Frequency Rejection (EME Trap)¶

Field	Value
Purpose	Reject frequencies outside HF training domain
Input	freq_mhz = 144.0 (2m band)
Expected Result	ValueError raised
Pass Criteria	Oracle refuses to predict
Failure Mode	If prediction made: Model will hallucinate nonsense
Notes	EME at 144 MHz is lunar reflection, not ionospheric — completely different physics

TST-302: UHF Frequency Rejection¶

Field	Value
Purpose	Reject UHF frequencies
Input	freq_mhz = 432.0 (70cm band)
Expected Result	ValueError raised
Pass Criteria	Oracle refuses to predict
Failure Mode	Model has no training data for UHF
Notes	UHF propagation is tropospheric scatter or satellite, not ionospheric

TST-303: Invalid Latitude Rejection¶

Field	Value
Purpose	Reject impossible coordinates
Input	lat_tx = 95.0 (impossible)
Expected Result	ValueError raised
Pass Criteria	Oracle validates coordinate bounds
Failure Mode	Garbage coordinates produce garbage predictions
Notes	Latitude must be within [-90, 90]

TST-304: Invalid Kp Rejection¶

Field	Value
Purpose	Reject out-of-range geomagnetic index
Input	kp = 15 (impossible, max is 9)
Expected Result	ValueError raised
Pass Criteria	Oracle validates Kp bounds
Failure Mode	Extrapolation beyond training domain
Notes	Kp index is defined as 0-9

TST-305: Valid Long Distance Path¶

Field	Value
Purpose	Accept valid long-distance path
Input	~12,000 km path (W3 → Asia)
Expected Result	Prediction returned (no error)
Pass Criteria	Oracle accepts valid input
Failure Mode	False rejection of valid path
Notes	Ensures validation isn't overly aggressive
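
The checks in this group amount to range tests applied before the model is called. A minimal sketch follows, assuming a hypothetical `validate_inputs` function. The HF limits of 1.8 to 30 MHz are an assumption taken from the 160m-10m bands named in TST-703, and the longitude bounds are also assumed; the oracle's real limits may differ.

```python
# ASSUMED bounds: HF range inferred from the 160m-10m bands in TST-703.
HF_MIN_MHZ, HF_MAX_MHZ = 1.8, 30.0


def validate_inputs(freq_mhz, lat_tx, lon_tx, lat_rx, lon_rx, kp):
    """Raise ValueError if any input lies outside its valid range."""
    checks = [
        ("freq_mhz", freq_mhz, HF_MIN_MHZ, HF_MAX_MHZ),
        ("lat_tx", lat_tx, -90.0, 90.0),
        ("lat_rx", lat_rx, -90.0, 90.0),
        ("lon_tx", lon_tx, -180.0, 180.0),  # assumed longitude convention
        ("lon_rx", lon_rx, -180.0, 180.0),
        ("kp", kp, 0.0, 9.0),
    ]
    for name, value, low, high in checks:
        # The negated comparison also rejects NaN, which fails every comparison.
        if not (low <= value <= high):
            raise ValueError(f"{name}={value} outside [{low}, {high}]")


# TST-301 style check: a 2m frequency must be refused.
try:
    validate_inputs(144.0, 39.14, -77.01, 51.50, -0.12, kp=2)
except ValueError as exc:
    print("rejected:", exc)
```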

Group 4: Hallucination Traps (TST-400)¶

These tests catch cases where the model might produce confident but wrong answers.

TST-401: EME Path Detection¶

Field	Value
Purpose	Catch EME-like inputs that look ionospheric
Scenario	2m, 500 km, -28 dB expected (classic EME signature)
Expected Result	Rejected as VHF
Pass Criteria	Oracle recognizes this isn't ionospheric
Failure Mode	Model predicts confidently for physics it never learned
Notes	1500 W, 500 km, -28 dB on 2m = Moon bounce, not skip

TST-402: Sporadic E Trap (Future)¶

Field	Value
Purpose	Identify E-skip conditions model wasn't trained on
Scenario	6m, 1500 km, summer afternoon
Expected Result	Warning about sporadic E uncertainty
Pass Criteria	Oracle flags low confidence
Status	NOT IMPLEMENTED — 6m not in training data
Notes	Sporadic E is unpredictable; model should admit uncertainty

TST-403: Ground Wave Confusion¶

Field	Value
Purpose	Flag very short paths that may be ground wave
Scenario	80m, 50 km path
Expected Result	Warning issued (likely ground wave)
Pass Criteria	Oracle warns about ground wave possibility
Failure Mode	Model predicts ionospheric SNR for ground wave path
Notes	WSPR < 100 km is often ground wave, not skywave

TST-404: Extreme Solar Event¶

Field	Value
Purpose	Flag predictions during X-class flare conditions
Scenario	SFI 400+, Kp 9
Expected Result	Warning about extreme conditions
Pass Criteria	Oracle flags low confidence
Status	PARTIALLY IMPLEMENTED (SFI warning at >350)
Notes	Extreme space weather is outside training distribution
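
TST-403 and TST-404 expect warnings rather than rejections. One way to sketch that is a function that returns a list of notes next to the prediction. The name `confidence_warnings` and the message texts are hypothetical; the 100 km and 350 thresholds come from the TST-403 note and the TST-404 status line.

```python
GROUND_WAVE_KM = 100.0  # from the TST-403 note
EXTREME_SFI = 350.0     # from the TST-404 status line


def confidence_warnings(distance_km, sfi):
    """Return a list of human-readable warnings; empty means none apply."""
    notes = []
    if distance_km < GROUND_WAVE_KM:
        notes.append("short path: signal is likely ground wave, not skywave")
    if sfi > EXTREME_SFI:
        notes.append("extreme SFI: conditions outside training distribution")
    return notes


print(confidence_warnings(distance_km=50, sfi=150))
print(confidence_warnings(distance_km=5900, sfi=400))
```

Returning warnings instead of raising lets the caller still see the prediction while treating it as low confidence.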

Test Execution¶

Running the Full Suite¶

cd /Users/gbeam/workspace/ionis-ai
.venv/bin/python ionis-training/scripts/oracle_v12.py --test

Expected Output¶

======================================================================
  IONIS Oracle Test Suite
======================================================================
Model loaded: IonisGate (trunk+3heads+2gated_sidecars)
Pearson: +0.4879, RMSE: 0.862σ

  ... test results ...

  PHYSICS SCORE: 4/4 PASS
  Rating: Production Ready

======================================================================
  SUMMARY
======================================================================
  Passed: 35/35
  Failed: 0/35

  ALL TESTS PASSED

Interpreting Failures¶

Failure Pattern	Likely Cause
SFI monotonicity fails	Sun sidecar broken or inverted
Kp monotonicity fails	Storm sidecar broken or inverted (CRITICAL)
All paths show same SNR	Trunk collapsed to constant
VHF not rejected	Input validation bypassed
Polar = Equatorial storm cost	Gates not differentiating

Group 5: Model Robustness (TST-500)¶

Standard ML model tests: these are not physics-specific and apply to any neural network.

TST-501: Reproducibility¶

Field	Value
Purpose	Same input produces same output
Method	Run identical prediction 100 times
Expected Result	All outputs identical (deterministic inference)
Pass Criteria	Zero variance in predictions
Failure Mode	Non-deterministic output indicates dropout left enabled at inference or a random-state leak
Category	Determinism

TST-502: Input Perturbation Stability¶

Field	Value
Purpose	Small input changes produce small output changes
Method	Perturb inputs by ±0.1%, measure output variance
Expected Result	Output changes < 0.5 dB for tiny input changes
Pass Criteria	No catastrophic sensitivity
Failure Mode	Exploding gradients, unstable regions in input space
Category	Stability
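
The perturbation method can be sketched as scaling each input by ±0.1% in turn and recording the largest output change. The helper `max_perturbation_shift` and the linear `toy_predict` below are hypothetical; the toy predictor only exists so the check can run without the real model.

```python
def max_perturbation_shift(predict, inputs, rel=0.001):
    """Largest |output change| when each input alone is scaled by (1 +/- rel)."""
    baseline = predict(**inputs)
    worst = 0.0
    for name, value in inputs.items():
        for factor in (1.0 - rel, 1.0 + rel):
            perturbed = dict(inputs, **{name: value * factor})
            worst = max(worst, abs(predict(**perturbed) - baseline))
    return worst


def toy_predict(sfi, kp, freq_mhz):
    """Hypothetical smooth predictor, not the IONIS model."""
    return -25.0 + 0.02 * sfi - 0.5 * kp + 0.1 * freq_mhz


inputs = {"sfi": 150.0, "kp": 2.0, "freq_mhz": 14.0}
shift = max_perturbation_shift(toy_predict, inputs)
print(f"max shift = {shift:.4f} dB, pass = {shift < 0.5}")
```

A relative perturbation leaves an input of exactly zero unchanged (Kp 0, for instance), so a fuller test would probably add a small absolute step as well.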

TST-503: Boundary Value Testing¶

Field	Value
Purpose	Model handles edge cases gracefully
Method	Test at domain boundaries (SFI=50, SFI=300, Kp=0, Kp=9, etc.)
Expected Result	Reasonable predictions, no NaN/Inf
Pass Criteria	All outputs finite and within plausible range
Failure Mode	NaN, Inf, or predictions outside [-50, +30] dB
Category	Boundary
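
A boundary sweep can be written as a Cartesian product over the corner values, keeping any corner whose output is non-finite or outside the plausible range. In this sketch `boundary_failures` and `toy_predict` are hypothetical names, and only the SFI and Kp corners from the Method row are swept.

```python
import itertools
import math

PLAUSIBLE_DB = (-50.0, 30.0)  # range named in the Failure Mode row


def boundary_failures(predict, sfi_values=(50, 300), kp_values=(0, 9)):
    """Return the corner cases whose output is non-finite or implausible."""
    failures = []
    for sfi, kp in itertools.product(sfi_values, kp_values):
        snr = predict(sfi=sfi, kp=kp)
        ok = math.isfinite(snr) and PLAUSIBLE_DB[0] <= snr <= PLAUSIBLE_DB[1]
        if not ok:
            failures.append((sfi, kp, snr))
    return failures


def toy_predict(sfi, kp):
    """Hypothetical predictor, not the IONIS model."""
    return -25.0 + 0.02 * sfi - 0.5 * kp


print(boundary_failures(toy_predict))  # an empty list means every corner passed
```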

TST-504: Null Input Handling¶

Field	Value
Purpose	Model rejects or handles missing/null values
Method	Pass NaN, None, or empty values
Expected Result	ValueError raised or graceful default
Pass Criteria	No silent corruption
Failure Mode	NaN propagates through model silently
Category	Input Sanitization

TST-505: Numerical Overflow¶

Field	Value
Purpose	Model handles extreme (but valid) inputs
Method	Test with SFI=300, Kp=9, distance=19999 km simultaneously
Expected Result	Finite output, no overflow
Pass Criteria	Output in valid range
Failure Mode	Inf, -Inf, or NaN in computation
Category	Numerical Stability

TST-506: Checkpoint Integrity¶

Field	Value
Purpose	Saved model loads correctly and matches training
Method	Load checkpoint, verify architecture, run reference prediction
Expected Result	Matches documented RMSE/Pearson within tolerance
Pass Criteria	Reference prediction within 0.01 dB of expected
Failure Mode	Corrupted checkpoint, architecture mismatch
Category	Serialization

TST-507: Device Portability¶

Field	Value
Purpose	Model runs on CPU, MPS, and CUDA
Method	Load and run on each available device
Expected Result	Identical predictions across devices
Pass Criteria	Cross-device variance < 0.001 dB
Failure Mode	Device-specific numerical differences
Category	Portability

Group 6: Adversarial & Security (TST-600)¶

Tests for robustness against malicious or malformed inputs.

TST-601: Injection via String Coordinates¶

Field	Value
Purpose	Reject non-numeric coordinate inputs
Method	Pass "51.5; DROP TABLE" as latitude
Expected Result	TypeError or ValueError
Pass Criteria	No code execution, clean rejection
Failure Mode	Injection vulnerability (unlikely in numeric model but test anyway)
Category	Input Injection

TST-602: Extremely Large Values¶

Field	Value
Purpose	Reject absurdly large inputs
Method	Pass SFI=1e30, distance=1e20
Expected Result	ValueError (out of bounds)
Pass Criteria	Rejected before reaching model
Failure Mode	Float overflow in computation
Category	Bounds Checking

TST-603: Negative Physical Values¶

Field	Value
Purpose	Reject physically impossible negative values
Method	Pass SFI=-100, Kp=-5, freq=-14.0
Expected Result	ValueError
Pass Criteria	All rejected
Failure Mode	Negative values accepted, nonsense predictions
Category	Physical Validity

TST-604: Type Coercion Attack¶

Field	Value
Purpose	Handle unexpected types gracefully
Method	Pass list, dict, or object instead of float
Expected Result	TypeError
Pass Criteria	Clean error message
Failure Mode	Silent type coercion producing wrong results
Category	Type Safety
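
The four tests in this group can share one strict gate that checks type first and range second, so that nothing is coerced silently. The sketch below assumes a hypothetical helper named `require_number`; it is not taken from the oracle source.

```python
from numbers import Real


def require_number(name, value, low, high):
    """Return value as a float, or raise TypeError / ValueError."""
    # bool is a Real subclass in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, Real):
        raise TypeError(f"{name} must be a real number, got {type(value).__name__}")
    # NaN and +/-Inf fail this comparison as well, so they are rejected here.
    if not (low <= value <= high):
        raise ValueError(f"{name}={value} outside [{low}, {high}]")
    return float(value)


for bad in ("51.5; DROP TABLE", [51.5], 1e30, -100):
    try:
        require_number("lat_tx", bad, -90.0, 90.0)
    except (TypeError, ValueError) as exc:
        print(type(exc).__name__, "-", exc)
```

Strings are rejected rather than parsed, which means a value such as "51.5" also fails; that is stricter than `float()` and is a deliberate choice in this sketch.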

Group 7: Bias & Fairness (TST-700)¶

Tests for systematic biases in model predictions.

TST-701: Geographic Coverage Bias¶

Field	Value
Purpose	Verify model doesn't favor training-dense regions
Method	Compare similar-distance paths in data-rich (EU) vs data-sparse (Africa) regions
EU Path	G → DL (London to Berlin), ~900 km
Africa Path	5H → 9J (Tanzania to Zambia), ~1,200 km
Conditions	14 MHz, SFI 150, Kp 2, 14:00 UTC
Expected Result	Bias < 5 dB between regions
Pass Criteria	Similar predictions for similar physics
Failure Mode	>5 dB difference suggests model memorized dense regions
Actual	EU: -15.2 dB, Africa: -15.2 dB, Bias: 0.0 dB
Category	Geographic Bias
Status	AUTOMATED

TST-702: Temporal Bias¶

Field	Value
Purpose	Verify model doesn't favor specific times
Method	Sweep all 24 hours, verify no anomalous spikes
Expected Result	Smooth diurnal variation
Pass Criteria	No discontinuities at hour boundaries
Failure Mode	Training data imbalance causing time-of-day artifacts
Category	Temporal Bias
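
The hourly sweep can be sketched as computing the largest jump between consecutive hours, including the wrap from 23:00 back to 00:00. `max_hourly_jump`, `toy_diurnal`, and the 2 dB threshold are all hypothetical; the specification does not state a numeric limit for a discontinuity.

```python
import math


def max_hourly_jump(predict_at_hour):
    """Largest |SNR change| between consecutive hours, including 23 -> 0."""
    snr = [predict_at_hour(hour) for hour in range(24)]
    return max(abs(snr[(h + 1) % 24] - snr[h]) for h in range(24))


def toy_diurnal(hour):
    """Hypothetical smooth diurnal curve, not IONIS output."""
    return -20.0 + 3.0 * math.cos(2 * math.pi * (hour - 14) / 24)


MAX_JUMP_DB = 2.0  # hypothetical threshold; the spec gives no number
jump = max_hourly_jump(toy_diurnal)
print(f"largest hour-to-hour jump = {jump:.2f} dB, pass = {jump <= MAX_JUMP_DB}")
```

Including the 23 to 0 pair matters because an hour feature encoded without wrap-around would show its artifact exactly at midnight.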

TST-703: Band Coverage Bias¶

Field	Value
Purpose	Verify all bands receive reasonable predictions
Method	Run same path on all bands 160m-10m
Expected Result	All predictions in valid range, physics-consistent
Pass Criteria	No band returns NaN or wildly different behavior
Failure Mode	Underrepresented bands produce poor predictions
Category	Feature Bias

Group 8: Regression Tests (TST-800)¶

Baseline tests to catch future regressions.

TST-801: Reference Prediction¶

Field	Value
Purpose	Catch silent model changes
Method	Fixed input, compare to documented output
Reference Input	W3→G, 20m, SFI 150, Kp 2, 14:00 UTC
Reference Output	-20.0 dB (±0.5 dB tolerance)
Pass Criteria	Within tolerance of documented value
Failure Mode	Model weights changed, or model retrained without a version bump
Category	Regression

TST-802: RMSE Regression¶

Field	Value
Purpose	Ensure model accuracy hasn't degraded
Method	Check checkpoint metadata
Reference Value	RMSE = 0.862σ
Pass Criteria	Loaded RMSE matches documented
Failure Mode	Wrong checkpoint loaded
Category	Regression

TST-803: Pearson Regression¶

Field	Value
Purpose	Ensure correlation hasn't degraded
Method	Check checkpoint metadata
Reference Value	Pearson = +0.4879
Pass Criteria	Loaded Pearson matches documented
Failure Mode	Wrong checkpoint loaded
Category	Regression
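
The three regression checks compare loaded values against the documented references. A minimal sketch follows, assuming the checkpoint metadata is available as a dictionary. The key names `rmse_sigma` and `pearson`, both function names, and the `abs_tol` value are hypothetical; the reference numbers and the ±0.5 dB tolerance come from the tables above.

```python
import math

REFERENCE = {"rmse_sigma": 0.862, "pearson": 0.4879}  # documented values


def metadata_mismatches(metadata, reference=REFERENCE, abs_tol=1e-4):
    """Return the keys whose loaded value is missing or differs from the reference."""
    bad = []
    for key, expected in reference.items():
        loaded = metadata.get(key)
        if loaded is None or not math.isclose(loaded, expected, abs_tol=abs_tol):
            bad.append(key)
    return bad


def reference_prediction_ok(snr_db, expected_db=-20.0, tol_db=0.5):
    """TST-801: the reference prediction must sit within the stated tolerance."""
    return abs(snr_db - expected_db) <= tol_db


# The metadata dict below is a hypothetical stand-in for a loaded checkpoint.
print(metadata_mismatches({"rmse_sigma": 0.862, "pearson": 0.4879}))
print(reference_prediction_ok(-20.3))
```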

Standard ML Test Categories Reference¶

Category	Purpose	Examples
Determinism	Same input → same output	TST-501
Stability	Small input changes → small output changes	TST-502
Boundary	Edge cases handled	TST-503
Input Sanitization	Invalid inputs rejected	TST-504, TST-601-604
Numerical Stability	No overflow/underflow	TST-505
Serialization	Save/load integrity	TST-506
Portability	Cross-device consistency	TST-507
Bias/Fairness	No systematic favoritism	TST-701-703
Regression	Catch silent degradation	TST-801-803
Adversarial	Malicious input handling	TST-601-604

Version History¶

Version	Date	Changes
1.0	2026-02-05	Initial specification
1.1	2026-02-05	Added TST-500 (Robustness), TST-600 (Security), TST-700 (Bias), TST-800 (Regression)
1.2	2026-02-05	Added TST-206 (Grey line twilight), automated TST-701 (Geographic bias) per Gemini review

References¶

Training: ionis-training/scripts/train_v20.py
Physics Verification: ionis-training/scripts/verify_v20.py
Oracle Implementation: ionis-training/scripts/oracle_v20.py
Model Checkpoint: versions/v20/ionis_v20.pth