We present a novel neural architecture for learning compressed representations of continuous information streams through multi-scale temporal prediction. Our system learns to aggregate variable-sized batches of textual content into fixed-size world state embeddings while simultaneously learning to predict future world states at multiple time horizons (1 minute, 1 hour, 1 day). The model incorporates three key innovations: (1) a self-supervised importance weighting mechanism that automatically identifies predictively-relevant content using gradient-based signals, (2) stable single-head temporal attention over historical world states with exponential temporal decay, and (3) hierarchical memory buffers operating at different temporal resolutions.
We demonstrate that temporal prediction serves as an effective pretraining objective for compression, forcing the model to retain information that captures meaningful patterns in information flow rather than merely reconstructing surface content. The system achieves a constant memory footprint (~14 MB) regardless of deployment duration while maintaining rich temporal context through exponential moving averages at five distinct timescales and hierarchical memory buffers.
Our system processes time-windowed batches of articles through a four-stage pipeline optimized for continuous temporal streams:
We employ a 3-layer Transformer encoder (d_model=256, 8 heads) with a compressed bottleneck dimension (d=32 or 256) and 2-layer decoder for reconstruction. The encoder maps variable-length sequences to fixed-dimensional embeddings: encode: ℝ^(L×V) → ℝ^d where L is sequence length (max 512 tokens) and V is vocabulary size (10,000 BPE tokens).
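For concreteness, a minimal PyTorch-style sketch of such an encoder/decoder pair is shown below. The mean-pooling step, the non-autoregressive reconstruction head, and the module names are illustrative assumptions; the layer counts, dimensions, and vocabulary size follow the description above.

```python
import torch
import torch.nn as nn

class ArticleEncoder(nn.Module):
    """Sketch: 3-layer Transformer encoder with a compressed bottleneck and a
    2-layer reconstruction decoder. Pooling and the non-autoregressive decoder
    head are assumptions; sizes follow the text (d_model=256, 8 heads, L<=512)."""

    def __init__(self, vocab_size=10_000, d_model=256, n_heads=8,
                 d_bottleneck=32, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        self.bottleneck = nn.Linear(d_model, d_bottleneck)   # fixed-size article embedding
        dec_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.expand = nn.Linear(d_bottleneck, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def encode(self, tokens):
        """tokens: (B, L) token ids -> (B, d_bottleneck) article embeddings."""
        positions = torch.arange(tokens.size(1), device=tokens.device)
        h = self.encoder(self.tok(tokens) + self.pos(positions))
        return self.bottleneck(h.mean(dim=1))                # mean-pool then compress

    def reconstruct_logits(self, z, length):
        """z: (B, d_bottleneck) -> (B, length, vocab_size) reconstruction logits."""
        h = self.expand(z).unsqueeze(1).expand(-1, length, -1)
        return self.lm_head(self.decoder(h))
```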
Given N article embeddings at time t (where N varies from 0 to 100+ per minute), the aggregator computes a single world state through importance-weighted aggregation:
This produces a fixed-size world embedding z_t ∈ ℝ^d regardless of batch size N, enabling constant-time downstream operations.
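One possible realization of this step is sketched below (PyTorch assumed). The two-layer scoring network is our assumption; the softmax-weighted sum yields a fixed-size z_t for any N, as stated.

```python
import torch
import torch.nn as nn

class ImportanceAggregator(nn.Module):
    """Collapses N article embeddings into one world state z_t via learned,
    softmax-normalized importance weights (the scoring MLP is an assumption)."""

    def __init__(self, d=256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, article_embs):
        """article_embs: (N, d), where N varies per minute (possibly 0)."""
        if article_embs.size(0) == 0:                  # empty window: neutral world state
            return article_embs.new_zeros(article_embs.size(1)), article_embs.new_zeros(0)
        scores = self.scorer(article_embs).squeeze(-1)           # (N,)
        weights = torch.softmax(scores, dim=0)                   # importance weights, sum to 1
        z_t = (weights.unsqueeze(-1) * article_embs).sum(dim=0)  # (d,) fixed size for any N
        return z_t, weights
```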
Rather than attending over individual articles (which would require N_articles × H_history computations), we attend over historical world states maintained in hierarchical buffers. Our single-head attention architecture with exponential temporal decay ensures stable gradients:
Critical Design Choices: Single-head architecture (simpler gradients), pre-normalization (training stability), Xavier uniform initialization with gain=0.1, and residual connections ensuring information flow.
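A sketch of this attention step is given below; the decay rate and its additive log-space form are assumptions, while the single head, pre-normalization, Xavier initialization with gain=0.1, and residual connection follow the design choices above.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Single-head attention over buffered world states with exponential temporal
    decay. Subtracting decay_rate * age before the softmax is equivalent to
    multiplying each attention weight by exp(-decay_rate * age); decay_rate is
    an assumption."""

    def __init__(self, d=256, decay_rate=0.01):
        super().__init__()
        self.norm = nn.LayerNorm(d)                 # pre-normalization
        self.q = nn.Linear(d, d, bias=False)
        self.k = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, d, bias=False)
        self.decay_rate = decay_rate
        for lin in (self.q, self.k, self.v):
            nn.init.xavier_uniform_(lin.weight, gain=0.1)

    def forward(self, z_t, history, ages):
        """z_t: (d,); history: (H, d) buffered states; ages: (H,) minutes since stored."""
        q = self.q(self.norm(z_t))                          # (d,)
        k = self.k(self.norm(history))                      # (H, d)
        v = self.v(self.norm(history))                      # (H, d)
        scores = (k @ q) / q.size(-1) ** 0.5                # scaled dot-product, (H,)
        scores = scores - self.decay_rate * ages            # exponential temporal decay
        attn = torch.softmax(scores, dim=0)
        return z_t + attn @ v                               # residual connection
```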
For long-term deployment, we implement multi-resolution memory with constant O(1) space complexity:
Total Memory: ~1.7 MB constant regardless of deployment duration, enabling unbounded temporal operation.
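One way to implement these buffers is sketched below, with sizes taken from the memory ablation later in the document (1440 minute-level states plus 168 hour-level states); the deque-based ring buffers and hourly averaging are our assumptions.

```python
from collections import deque

import torch

class HierarchicalMemory:
    """Bounded world-state buffers at two resolutions, giving O(1) space:
    ~24h of minute states and ~7 days of hourly summaries."""

    def __init__(self, d=256, recent_size=1440, weekly_size=168):
        self.recent = deque(maxlen=recent_size)   # last 24h at 1-minute resolution
        self.weekly = deque(maxlen=weekly_size)   # last 7 days at 1-hour resolution
        self.d = d
        self._minutes = 0

    def push(self, z_t):
        self.recent.append(z_t.detach())
        self._minutes += 1
        if self._minutes % 60 == 0:               # once per hour, archive an hourly mean
            self.weekly.append(torch.stack(list(self.recent)[-60:]).mean(dim=0))

    def snapshot(self):
        """Returns (states, ages_in_minutes) for temporal attention; ages of
        weekly entries are approximated at hourly granularity."""
        states, ages = [], []
        for i, z in enumerate(reversed(self.recent)):
            states.append(z)
            ages.append(float(i))
        for j, z in enumerate(reversed(self.weekly)):
            states.append(z)
            ages.append(60.0 * (j + 1))
        if not states:
            return torch.zeros(0, self.d), torch.zeros(0)
        return torch.stack(states), torch.tensor(ages)
```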
The predictor maps current world state to future predictions at three timescales, incorporating deviation signals from five exponential moving average baselines operating at different temporal resolutions:
Each prediction head is a two-layer MLP, Linear(2d → 2d) → ReLU → Linear(2d → d), trained with a weighted prediction loss:
Long-term (1-day) prediction receives 60% of the prediction-loss weight because it is the hardest and most meaningful horizon, forcing the model to retain causally relevant information.
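A sketch of the heads and the weighted loss follows. Only the 60% long-term share is stated, so the 0.15/0.25 split of the remaining weight is an assumption, as are the MSE objective and the construction of the 2d input by concatenating z_t with an EMA-deviation feature of dimension d.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Horizon weights: only the 60% long-term share is stated; the 0.15/0.25 split
# of the remainder between the 1-minute and 1-hour heads is an assumption.
HORIZON_WEIGHTS = {"1min": 0.15, "1hour": 0.25, "1day": 0.60}

class MultiHorizonPredictor(nn.Module):
    """Three heads, each Linear(2d->2d) -> ReLU -> Linear(2d->d). Forming the 2d
    input by concatenating z_t with an EMA-deviation feature is our reading of
    the text."""

    def __init__(self, d=256):
        super().__init__()
        self.heads = nn.ModuleDict({
            h: nn.Sequential(nn.Linear(2 * d, 2 * d), nn.ReLU(), nn.Linear(2 * d, d))
            for h in HORIZON_WEIGHTS
        })

    def forward(self, z_t, ema_deviation):
        """z_t, ema_deviation: (B, d) -> dict of (B, d) predicted future states."""
        x = torch.cat([z_t, ema_deviation], dim=-1)
        return {h: head(x) for h, head in self.heads.items()}

def weighted_prediction_loss(preds, targets):
    """Weighted per-horizon loss; the MSE choice is an assumption."""
    return sum(w * F.mse_loss(preds[h], targets[h]) for h, w in HORIZON_WEIGHTS.items())
```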
A key challenge is learning which articles deserve higher aggregation weights without manual annotation. We introduce a multi-signal importance loss derived entirely from training dynamics:
Articles whose embeddings significantly affect prediction loss contain predictively-relevant information:
Articles with large gradient magnitudes strongly influence future predictions and should receive higher aggregation weight.
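A sketch of how this signal can be extracted with autograd; the article embeddings must be part of the graph and require gradients, and the normalization to a distribution is our assumption.

```python
import torch

def gradient_importance(article_embs, pred_loss):
    """Per-article importance from the gradient of the prediction loss w.r.t.
    each article embedding. retain_graph=True keeps the graph alive for the
    main backward pass; normalization to a distribution is an assumption."""
    grads, = torch.autograd.grad(pred_loss, article_embs, retain_graph=True)
    magnitudes = grads.norm(dim=-1)                 # (N,) gradient norm per article
    return magnitudes / (magnitudes.sum() + 1e-8)   # larger gradient -> higher importance
```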
Statistical outliers contain unique information not captured by typical content:
During periods of rapid world state change, importance learning becomes more critical:
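Minimal sketches of these two signals follow; distance-to-centroid as the outlier measure and a norm-of-change scale factor (with its constants) are both assumptions.

```python
import torch

def variance_importance(article_embs):
    """Outlier signal: distance of each article embedding from the batch centroid."""
    centroid = article_embs.mean(dim=0, keepdim=True)
    dist = (article_embs - centroid).norm(dim=-1)   # (N,)
    return dist / (dist.sum() + 1e-8)

def change_rate_scale(z_t, z_prev, base=1.0, gain=1.0):
    """Scales the importance loss up when the world state is changing quickly
    (functional form and constants are assumptions)."""
    return base + gain * (z_t - z_prev).norm()
```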
We train the aggregator to match gradient-based importance through KL divergence:
This self-supervised approach requires zero manual annotation while remaining task-aligned and domain-agnostic.
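As a sketch, the aggregator's softmax weights are matched to a detached gradient-importance target; the KL direction is an assumption.

```python
import torch
import torch.nn.functional as F

def importance_loss(aggregator_weights, gradient_target):
    """KL(target || aggregator) via F.kl_div, which expects its first argument
    in log space; the target is detached so only the aggregator learns."""
    log_weights = aggregator_weights.clamp_min(1e-8).log()
    return F.kl_div(log_weights, gradient_target.detach(), reduction="sum")
```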
Our training optimizes three complementary objectives with carefully tuned weighting:
Prediction dominates (50%) as the core learning signal, reconstruction ensures information density (30%), and importance learning teaches attention allocation (20%) without overwhelming other objectives.
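The resulting combined objective, using the final weights from the ablation below:

```python
def total_loss(recon_loss, pred_loss, import_loss,
               w_recon=0.30, w_pred=0.50, w_import=0.20):
    """Combined objective with the final weighting: prediction dominates at 0.5."""
    return w_recon * recon_loss + w_pred * pred_loss + w_import * import_loss
```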
We train on the CC-News corpus (2016–2019, 31M articles) with temporal batch sampling:
Future windows expand to account for natural article scarcity at longer horizons, ensuring robust supervision across all timescales.
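A sketch of a temporal batch sampler with horizon-dependent tolerance windows; the specific window widths and the helper `get_batch(start, end)` are illustrative assumptions.

```python
import random
from datetime import timedelta

# Tolerance around each target horizon; the widths are assumptions chosen only
# to illustrate "future windows expand at longer horizons".
FUTURE_WINDOWS = {
    timedelta(minutes=1): timedelta(minutes=1),
    timedelta(hours=1): timedelta(minutes=15),
    timedelta(days=1): timedelta(hours=4),
}

def sample_training_step(anchor_times, get_batch):
    """Pick an anchor minute t, then fetch the current batch and one target batch
    per horizon from an expanding window around t + horizon."""
    t = random.choice(anchor_times)                        # anchor timestamp (datetime)
    current = get_batch(t - timedelta(minutes=1), t)
    targets = {
        horizon: get_batch(t + horizon - tol, t + horizon + tol)
        for horizon, tol in FUTURE_WINDOWS.items()
    }
    return current, targets
```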
Critical hyperparameters for training stability:
The system operates on a 1-minute update cycle in production deployment, processing continuous information streams with the following RSS/Atom feed discovery and aggregation pipeline:
The discovery crawler enumerates candidate feed endpoints for each domain, prioritizing HTML-embedded feed autodiscovery links before falling back to common heuristic paths (/feed, /feed/atom, /rss, /?feed=rss, /rss.xml).
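A stdlib-only sketch of the autodiscovery step (fetching and validating the candidate URLs is omitted; the parser and fallback ordering reflect the behaviour described above):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Heuristic fallback paths from the text, tried only when autodiscovery finds nothing.
FALLBACK_PATHS = ["/feed", "/feed/atom", "/rss", "/?feed=rss", "/rss.xml"]
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkParser(HTMLParser):
    """Collects <link rel="alternate" type="application/rss+xml|atom+xml"> hrefs."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower()
        kind = (a.get("type") or "").lower()
        if "alternate" in rel and kind in FEED_TYPES and a.get("href"):
            self.feeds.append(a["href"])

def discover_feeds(base_url, html_text):
    """Autodiscovery links first; heuristic fallback paths only if none are found."""
    parser = FeedLinkParser()
    parser.feed(html_text)
    if parser.feeds:
        return [urljoin(base_url, href) for href in parser.feeds]
    return [urljoin(base_url, path) for path in FALLBACK_PATHS]
```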
- User-Agent: YandoriBot/1.0 (+https://yandori.io/bot.html)
- Request Rate: Poisson(λ=1.0) per domain with exponential backoff
- Rate Limiting: Uniform(500ms, 1200ms) inter-request delay
- Concurrency: 100 workers (configurable via semaphore pool)
- Timeout: 10s (connection) + 10s (read) with circuit breaker
- Cache Policy: LRU eviction, TTL=24h (robots.txt), ∞ (feeds)
For embedding dimension d=256:
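A worked check of the buffer footprint at d=256, assuming float32 storage (the datatype is not stated) and the buffer sizes from the memory ablation below:

```python
BYTES_PER_FLOAT32 = 4   # float32 storage is an assumption
d = 256

recent = 1440 * d * BYTES_PER_FLOAT32   # 1,474,560 B ≈ 1.5 MB (24h of minute states)
weekly = 168 * d * BYTES_PER_FLOAT32    #   172,032 B ≈ 0.2 MB (7 days of hourly states)
print(f"total ≈ {(recent + weekly) / 1e6:.1f} MB")  # ≈ 1.6 MB, consistent with the ~1.7 MB figure
```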
Our RSS/Atom feed discovery system adheres to RFC 9309 (Robots Exclusion Protocol) with full support for:
To exclude YandoriBot from your domain, add the following directives to /robots.txt:
```
User-agent: YandoriBot
Disallow: /
Crawl-delay: 10
```
Changes to robots.txt take effect as soon as the cached copy expires (24h maximum). For urgent removal requests, contact support@yandori.io for manual blocklist addition, which does not wait for cache expiration.
| Configuration | Recon. weight | Pred. weight | Import. weight | Result |
|---|---|---|---|---|
| Initial | 0.50 | 0.50 | 0.00 | Unstable, poor prediction |
| Balanced | 0.33 | 0.33 | 0.33 | Importance dominates, poor convergence |
| Final | 0.30 | 0.50 | 0.20 | Stable, best prediction |
Finding: Prediction must dominate (≥50% of the total loss weight) for effective learning; the importance loss should support prediction without overwhelming it.
| Signal Pair | Correlation | Interpretation |
|---|---|---|
| Prediction ↔ Variance | +0.23 | Weak positive: outliers somewhat predictive |
| Prediction ↔ Reconstruction | -0.41 | Moderate negative: the two signals conflict |
| Variance ↔ Reconstruction | -0.12 | Weak negative |
Critical Finding: Reconstruction importance actively contradicts prediction importance; articles that are easy to reconstruct tend to be generic and non-predictive. We therefore disabled the reconstruction-importance signal based on this observation.
| Recent Buffer | Weekly Buffer | Pred Loss | Memory |
|---|---|---|---|
| 100 states | 0 states | 0.0452 | 0.4 MB |
| 1440 states | 0 states | 0.0391 | 1.5 MB |
| 1440 states | 168 states | 0.0356 | 1.7 MB |
Finding: Hierarchical memory provides 9% improvement over recent-only with minimal additional cost (~200 KB).
```bibtex
@article{hawkes2025yandori,
  author      = {Hawkes, Taylor},
  title       = {Yandori: Multi-Scale Temporal Prediction for Information Stream Compression},
  journal     = {arXiv preprint arXiv:2025.XXXXX},
  year        = {2025},
  institution = {Yandori Research},
  keywords    = {temporal prediction, neural compression, self-supervised learning,
                 hierarchical memory, continuous streams, gradient-based importance}
}
```