We present a novel neural architecture for learning compressed representations of continuous information streams through multi-scale temporal prediction. Our system learns to aggregate variable-sized batches of textual content into fixed-size world state embeddings while simultaneously learning to predict future world states at multiple time horizons (1 minute, 1 hour, 1 day). The model incorporates three key innovations: (1) a self-supervised importance weighting mechanism that automatically identifies predictively-relevant content using gradient-based signals, (2) stable single-head temporal attention over historical world states with exponential temporal decay, and (3) hierarchical memory buffers operating at different temporal resolutions.
We demonstrate that temporal prediction serves as an effective pretraining objective for compression, forcing the model to retain information that captures meaningful patterns in information flow rather than merely reconstructing surface content. The system achieves a constant memory footprint (~14 MB) regardless of deployment duration while maintaining rich temporal context through exponential moving averages at five distinct timescales and hierarchical memory buffers.
Our system processes time-windowed batches of articles through a four-stage pipeline optimized for continuous temporal streams:
We employ a 3-layer Transformer encoder (d_model=256, 8 heads) with a compressed bottleneck dimension (d=32 or 256) and 2-layer decoder for reconstruction. The encoder maps variable-length sequences to fixed-dimensional embeddings: encode: ℝ^(L×V) → ℝ^d where L is sequence length (max 512 tokens) and V is vocabulary size (10,000 BPE tokens).
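For concreteness, a minimal PyTorch-style sketch of such an encoder/decoder pair is shown below. The mean-pooling step, the non-autoregressive reconstruction head, and the module names are illustrative assumptions; the layer counts, dimensions, and vocabulary size follow the description above.

```python
import torch
import torch.nn as nn

class ArticleEncoder(nn.Module):
    """Sketch: 3-layer Transformer encoder with a compressed bottleneck and a
    2-layer reconstruction decoder. Pooling and the non-autoregressive decoder
    head are assumptions; sizes follow the text (d_model=256, 8 heads, L<=512)."""

    def __init__(self, vocab_size=10_000, d_model=256, n_heads=8,
                 d_bottleneck=32, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        self.bottleneck = nn.Linear(d_model, d_bottleneck)   # fixed-size article embedding
        dec_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.expand = nn.Linear(d_bottleneck, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def encode(self, tokens):
        """tokens: (B, L) token ids -> (B, d_bottleneck) article embeddings."""
        positions = torch.arange(tokens.size(1), device=tokens.device)
        h = self.encoder(self.tok(tokens) + self.pos(positions))
        return self.bottleneck(h.mean(dim=1))                # mean-pool then compress

    def reconstruct_logits(self, z, length):
        """z: (B, d_bottleneck) -> (B, length, vocab_size) reconstruction logits."""
        h = self.expand(z).unsqueeze(1).expand(-1, length, -1)
        return self.lm_head(self.decoder(h))
```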
Given N article embeddings at time t (where N varies from 0 to 100+ per minute), the aggregator computes a single world state through importance-weighted aggregation:
This produces a fixed-size world embedding z_t ∈ ℝ^d regardless of batch size N, enabling constant-time downstream operations.
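One possible realization of this step is sketched below (PyTorch assumed). The two-layer scoring network is our assumption; the softmax-weighted sum yields a fixed-size z_t for any N, as stated.

```python
import torch
import torch.nn as nn

class ImportanceAggregator(nn.Module):
    """Collapses N article embeddings into one world state z_t via learned,
    softmax-normalized importance weights (the scoring MLP is an assumption)."""

    def __init__(self, d=256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, article_embs):
        """article_embs: (N, d), where N varies per minute (possibly 0)."""
        if article_embs.size(0) == 0:                  # empty window: neutral world state
            return article_embs.new_zeros(article_embs.size(1)), article_embs.new_zeros(0)
        scores = self.scorer(article_embs).squeeze(-1)           # (N,)
        weights = torch.softmax(scores, dim=0)                   # importance weights, sum to 1
        z_t = (weights.unsqueeze(-1) * article_embs).sum(dim=0)  # (d,) fixed size for any N
        return z_t, weights
```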
Rather than attending over individual articles (which would require N_articles × H_history computations), we attend over historical world states maintained in hierarchical buffers. Our single-head attention architecture with exponential temporal decay ensures stable gradients:
Critical Design Choices: Single-head architecture (simpler gradients), pre-normalization (training stability), Xavier uniform initialization with gain=0.1, and residual connections ensuring information flow.
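A sketch of this attention step is given below; the decay rate and its additive log-space form are assumptions, while the single head, pre-normalization, Xavier initialization with gain=0.1, and residual connection follow the design choices above.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Single-head attention over buffered world states with exponential temporal
    decay. Subtracting decay_rate * age before the softmax is equivalent to
    multiplying each attention weight by exp(-decay_rate * age); decay_rate is
    an assumption."""

    def __init__(self, d=256, decay_rate=0.01):
        super().__init__()
        self.norm = nn.LayerNorm(d)                 # pre-normalization
        self.q = nn.Linear(d, d, bias=False)
        self.k = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, d, bias=False)
        self.decay_rate = decay_rate
        for lin in (self.q, self.k, self.v):
            nn.init.xavier_uniform_(lin.weight, gain=0.1)

    def forward(self, z_t, history, ages):
        """z_t: (d,); history: (H, d) buffered states; ages: (H,) minutes since stored."""
        q = self.q(self.norm(z_t))                          # (d,)
        k = self.k(self.norm(history))                      # (H, d)
        v = self.v(self.norm(history))                      # (H, d)
        scores = (k @ q) / q.size(-1) ** 0.5                # scaled dot-product, (H,)
        scores = scores - self.decay_rate * ages            # exponential temporal decay
        attn = torch.softmax(scores, dim=0)
        return z_t + attn @ v                               # residual connection
```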
For long-term deployment, we implement multi-resolution memory with constant O(1) space complexity:
Total Memory: ~1.7 MB constant regardless of deployment duration, enabling unbounded temporal operation.
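One way to implement these buffers is sketched below, with sizes taken from the memory ablation later in the document (1440 minute-level states plus 168 hour-level states); the deque-based ring buffers and hourly averaging are our assumptions.

```python
from collections import deque

import torch

class HierarchicalMemory:
    """Bounded world-state buffers at two resolutions, giving O(1) space:
    ~24h of minute states and ~7 days of hourly summaries."""

    def __init__(self, d=256, recent_size=1440, weekly_size=168):
        self.recent = deque(maxlen=recent_size)   # last 24h at 1-minute resolution
        self.weekly = deque(maxlen=weekly_size)   # last 7 days at 1-hour resolution
        self.d = d
        self._minutes = 0

    def push(self, z_t):
        self.recent.append(z_t.detach())
        self._minutes += 1
        if self._minutes % 60 == 0:               # once per hour, archive an hourly mean
            self.weekly.append(torch.stack(list(self.recent)[-60:]).mean(dim=0))

    def snapshot(self):
        """Returns (states, ages_in_minutes) for temporal attention; ages of
        weekly entries are approximated at hourly granularity."""
        states, ages = [], []
        for i, z in enumerate(reversed(self.recent)):
            states.append(z)
            ages.append(float(i))
        for j, z in enumerate(reversed(self.weekly)):
            states.append(z)
            ages.append(60.0 * (j + 1))
        if not states:
            return torch.zeros(0, self.d), torch.zeros(0)
        return torch.stack(states), torch.tensor(ages)
```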
The predictor maps current world state to future predictions at three timescales, incorporating deviation signals from five exponential moving average baselines operating at different temporal resolutions:
Each prediction head is a two-layer MLP, Linear(2d → 2d) → ReLU → Linear(2d → d), trained with a weighted prediction loss:
Long-term (1-day) prediction receives 60% of the prediction-loss weight because it is the hardest and most meaningful horizon, forcing the model to retain causally relevant information.
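A sketch of the heads and the weighted loss follows. Only the 60% long-term share is stated, so the 0.15/0.25 split of the remaining weight is an assumption, as are the MSE objective and the construction of the 2d input by concatenating z_t with an EMA-deviation feature of dimension d.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Horizon weights: only the 60% long-term share is stated; the 0.15/0.25 split
# of the remainder between the 1-minute and 1-hour heads is an assumption.
HORIZON_WEIGHTS = {"1min": 0.15, "1hour": 0.25, "1day": 0.60}

class MultiHorizonPredictor(nn.Module):
    """Three heads, each Linear(2d->2d) -> ReLU -> Linear(2d->d). Forming the 2d
    input by concatenating z_t with an EMA-deviation feature is our reading of
    the text."""

    def __init__(self, d=256):
        super().__init__()
        self.heads = nn.ModuleDict({
            h: nn.Sequential(nn.Linear(2 * d, 2 * d), nn.ReLU(), nn.Linear(2 * d, d))
            for h in HORIZON_WEIGHTS
        })

    def forward(self, z_t, ema_deviation):
        """z_t, ema_deviation: (B, d) -> dict of (B, d) predicted future states."""
        x = torch.cat([z_t, ema_deviation], dim=-1)
        return {h: head(x) for h, head in self.heads.items()}

def weighted_prediction_loss(preds, targets):
    """Weighted per-horizon loss; the MSE choice is an assumption."""
    return sum(w * F.mse_loss(preds[h], targets[h]) for h, w in HORIZON_WEIGHTS.items())
```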
A key challenge is learning which articles deserve higher aggregation weights without manual annotation. We introduce a multi-signal importance loss derived entirely from training dynamics:
Articles whose embeddings significantly affect prediction loss contain predictively-relevant information:
Articles with large gradient magnitudes strongly influence future predictions and should receive higher aggregation weight.
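A sketch of how this signal can be extracted with autograd; the article embeddings must be part of the graph and require gradients, and the normalization to a distribution is our assumption.

```python
import torch

def gradient_importance(article_embs, pred_loss):
    """Per-article importance from the gradient of the prediction loss w.r.t.
    each article embedding. retain_graph=True keeps the graph alive for the
    main backward pass; normalization to a distribution is an assumption."""
    grads, = torch.autograd.grad(pred_loss, article_embs, retain_graph=True)
    magnitudes = grads.norm(dim=-1)                 # (N,) gradient norm per article
    return magnitudes / (magnitudes.sum() + 1e-8)   # larger gradient -> higher importance
```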
Statistical outliers contain unique information not captured by typical content:
During periods of rapid world state change, importance learning becomes more critical:
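Minimal sketches of these two signals follow; distance-to-centroid as the outlier measure and a norm-of-change scale factor (with its constants) are both assumptions.

```python
import torch

def variance_importance(article_embs):
    """Outlier signal: distance of each article embedding from the batch centroid."""
    centroid = article_embs.mean(dim=0, keepdim=True)
    dist = (article_embs - centroid).norm(dim=-1)   # (N,)
    return dist / (dist.sum() + 1e-8)

def change_rate_scale(z_t, z_prev, base=1.0, gain=1.0):
    """Scales the importance loss up when the world state is changing quickly
    (functional form and constants are assumptions)."""
    return base + gain * (z_t - z_prev).norm()
```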
We train the aggregator to match gradient-based importance through KL divergence:
This self-supervised approach requires zero manual annotation while remaining task-aligned and domain-agnostic.
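As a sketch, the aggregator's softmax weights are matched to a detached gradient-importance target; the KL direction is an assumption.

```python
import torch
import torch.nn.functional as F

def importance_loss(aggregator_weights, gradient_target):
    """KL(target || aggregator) via F.kl_div, which expects its first argument
    in log space; the target is detached so only the aggregator learns."""
    log_weights = aggregator_weights.clamp_min(1e-8).log()
    return F.kl_div(log_weights, gradient_target.detach(), reduction="sum")
```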
Our training optimizes three complementary objectives with carefully tuned weighting:
Prediction dominates (50%) as the core learning signal, reconstruction ensures information density (30%), and importance learning teaches attention allocation (20%) without overwhelming other objectives.
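The resulting combined objective, using the final weights from the ablation below:

```python
def total_loss(recon_loss, pred_loss, import_loss,
               w_recon=0.30, w_pred=0.50, w_import=0.20):
    """Combined objective with the final weighting: prediction dominates at 0.5."""
    return w_recon * recon_loss + w_pred * pred_loss + w_import * import_loss
```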
We train on the CC-News corpus (2016–2019, 31M articles) with temporal batch sampling:
Future windows expand to account for natural article scarcity at longer horizons, ensuring robust supervision across all timescales.
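A sketch of a temporal batch sampler with horizon-dependent tolerance windows; the specific window widths and the helper `get_batch(start, end)` are illustrative assumptions.

```python
import random
from datetime import timedelta

# Tolerance around each target horizon; the widths are assumptions chosen only
# to illustrate "future windows expand at longer horizons".
FUTURE_WINDOWS = {
    timedelta(minutes=1): timedelta(minutes=1),
    timedelta(hours=1): timedelta(minutes=15),
    timedelta(days=1): timedelta(hours=4),
}

def sample_training_step(anchor_times, get_batch):
    """Pick an anchor minute t, then fetch the current batch and one target batch
    per horizon from an expanding window around t + horizon."""
    t = random.choice(anchor_times)                        # anchor timestamp (datetime)
    current = get_batch(t - timedelta(minutes=1), t)
    targets = {
        horizon: get_batch(t + horizon - tol, t + horizon + tol)
        for horizon, tol in FUTURE_WINDOWS.items()
    }
    return current, targets
```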
Critical hyperparameters for training stability:
The system operates on a 1-minute update cycle in production deployment, processing continuous information streams with the following RSS/Atom feed discovery and aggregation pipeline:
The discovery crawler enumerates candidate feed endpoints for each domain, prioritizing HTML-embedded feed autodiscovery links before falling back to common heuristic paths (/feed, /feed/atom, /rss, /?feed=rss, /rss.xml).
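A stdlib-only sketch of the autodiscovery step (fetching and validating the candidate URLs is omitted; the parser and fallback ordering reflect the behaviour described above):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Heuristic fallback paths from the text, tried only when autodiscovery finds nothing.
FALLBACK_PATHS = ["/feed", "/feed/atom", "/rss", "/?feed=rss", "/rss.xml"]
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkParser(HTMLParser):
    """Collects <link rel="alternate" type="application/rss+xml|atom+xml"> hrefs."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower()
        kind = (a.get("type") or "").lower()
        if "alternate" in rel and kind in FEED_TYPES and a.get("href"):
            self.feeds.append(a["href"])

def discover_feeds(base_url, html_text):
    """Autodiscovery links first; heuristic fallback paths only if none are found."""
    parser = FeedLinkParser()
    parser.feed(html_text)
    if parser.feeds:
        return [urljoin(base_url, href) for href in parser.feeds]
    return [urljoin(base_url, path) for path in FALLBACK_PATHS]
```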
- User-Agent: YandoriBot/1.0 (+https://yandori.io/bot.html)
- Request Rate: Poisson(λ=1.0) per domain with exponential backoff
- Rate Limiting: Uniform(500ms, 1200ms) inter-request delay
- Concurrency: 100 workers (configurable via semaphore pool)
- Timeout: 10s (connection) + 10s (read) with circuit breaker
- Cache Policy: LRU eviction, TTL=24h (robots.txt), ∞ (feeds)
For embedding dimension d=256:
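A worked check of the buffer footprint at d=256, assuming float32 storage (the datatype is not stated) and the buffer sizes from the memory ablation below:

```python
BYTES_PER_FLOAT32 = 4   # float32 storage is an assumption
d = 256

recent = 1440 * d * BYTES_PER_FLOAT32   # 1,474,560 B ≈ 1.5 MB (24h of minute states)
weekly = 168 * d * BYTES_PER_FLOAT32    #   172,032 B ≈ 0.2 MB (7 days of hourly states)
print(f"total ≈ {(recent + weekly) / 1e6:.1f} MB")  # ≈ 1.6 MB, consistent with the ~1.7 MB figure
```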
Our RSS/Atom feed discovery system adheres to RFC 9309 (Robots Exclusion Protocol) with full support for:
To exclude YandoriBot from your domain, add the following directives to /robots.txt:
```
User-agent: YandoriBot
Disallow: /
Crawl-delay: 10
```
Changes to robots.txt take effect as soon as the cached copy expires (24h maximum). For urgent removal requests, contact support@yandori.io for manual blocklist addition, which does not wait for cache expiration.
| Configuration | Recon. weight | Pred. weight | Import. weight | Result |
|---|---|---|---|---|
| Initial | 0.50 | 0.50 | 0.00 | Unstable, poor prediction |
| Balanced | 0.33 | 0.33 | 0.33 | Importance dominates, poor convergence |
| Final | 0.30 | 0.50 | 0.20 | Stable, best prediction |
Finding: Prediction must dominate (≥50% of the total loss weight) for effective learning; the importance loss should support prediction without overwhelming it.
| Signal Pair | Correlation | Interpretation |
|---|---|---|
| Prediction ↔ Variance | +0.23 | Weak positive: outliers somewhat predictive |
| Prediction ↔ Reconstruction | -0.41 | Moderate negative: the two signals conflict |
| Variance ↔ Reconstruction | -0.12 | Weak negative |
Critical Finding: Reconstruction importance actively contradicts prediction importance; articles that are easy to reconstruct tend to be generic and non-predictive. We therefore disabled the reconstruction-importance signal based on this observation.
| Recent Buffer | Weekly Buffer | Pred Loss | Memory |
|---|---|---|---|
| 100 states | 0 states | 0.0452 | 0.4 MB |
| 1440 states | 0 states | 0.0391 | 1.5 MB |
| 1440 states | 168 states | 0.0356 | 1.7 MB |
Finding: Hierarchical memory provides 9% improvement over recent-only with minimal additional cost (~200 KB).
```bibtex
@article{hawkes2025yandori,
  author      = {Hawkes, Taylor},
  title       = {Yandori: Multi-Scale Temporal Prediction for Information Stream Compression},
  journal     = {arXiv preprint arXiv:2025.XXXXX},
  year        = {2025},
  institution = {Yandori Research},
  keywords    = {temporal prediction, neural compression, self-supervised learning,
                 hierarchical memory, continuous streams, gradient-based importance}
}
```