YandoriBot Crawler Documentation

RSS/Atom Feed Discovery System

Overview

YandoriBot is a respectful RSS/Atom feed discovery crawler supporting the Yandori temporal prediction research system. Our crawler discovers and monitors syndication feeds to provide real-time content streams for multi-scale temporal prediction experiments.

The crawler enumerates candidate feed endpoints on each site, prioritizing HTML-embedded feed autodiscovery links and expanding to a small set of heuristic fallback paths only when autodiscovery finds nothing.

Crawler Specifications

User Agent

YandoriBot/1.0 (+https://yandori.io/bot.html)

Request Pattern

The crawler implements a respectful discovery pattern designed to minimize server load:

  • HTML Discovery: Maximum 3 feeds extracted from HTML (even if more exist)
  • Fallback Paths: Only attempted if NO feeds found in HTML
  • Paths Checked: /feed, /feed/atom, /rss, /?feed=rss, /rss.xml
  • Typical Total: 2-4 requests per domain
  • Redirects: Maximum 2 redirects followed
  • Duplicate Detection: Prevents fetching same feed multiple times
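
The request pattern above can be illustrated with a minimal Go sketch. This is not the production crawler: the example.com target, the function names, and the use of a plain http.Client are assumptions for exposition; the sketch only shows the fallback-path order, the 2-redirect cap, the 10s connection/read timeouts, and duplicate detection.

package main

import (
    "errors"
    "fmt"
    "net"
    "net/http"
    "time"
)

// fallbackPaths are only probed when HTML autodiscovery finds no feeds.
var fallbackPaths = []string{"/feed", "/feed/atom", "/rss", "/?feed=rss", "/rss.xml"}

// newCrawlClient builds an HTTP client with a 10s connection timeout, a 10s
// response-header timeout, and a hard cap of 2 followed redirects.
func newCrawlClient() *http.Client {
    return &http.Client{
        Transport: &http.Transport{
            DialContext:           (&net.Dialer{Timeout: 10 * time.Second}).DialContext,
            ResponseHeaderTimeout: 10 * time.Second,
        },
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            if len(via) >= 2 {
                return errors.New("stopped after 2 redirects")
            }
            return nil
        },
    }
}

func main() {
    client := newCrawlClient()
    seen := map[string]bool{} // duplicate detection: never fetch the same candidate twice
    for _, path := range fallbackPaths {
        url := "https://example.com" + path // hypothetical target domain
        if seen[url] {
            continue
        }
        seen[url] = true
        resp, err := client.Get(url)
        if err != nil {
            fmt.Println(url, "error:", err)
            continue
        }
        resp.Body.Close()
        fmt.Println(url, "->", resp.StatusCode)
    }
}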

Rate Limiting

  • Inter-request delay: Uniform(500ms, 1200ms) random delay
  • Request rate: Poisson(λ=1.0) per domain with exponential backoff
  • Concurrency: 100 workers (configurable via semaphore pool)
  • Timeout: 10s (connection) + 10s (read) with circuit breaker
  • Respects: robots.txt Crawl-delay directive
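
As a rough illustration of the rate-limiting knobs above, the Go sketch below uses a buffered channel as the 100-worker semaphore and a uniform 500-1200 ms sleep between requests; the names (politeDelay, maxWorkers) and the toy domain list are hypothetical, not taken from the crawler itself.

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

const maxWorkers = 100 // semaphore pool size (configurable)

// politeDelay sleeps for a uniformly distributed 500-1200 ms before a request.
func politeDelay() {
    ms := 500 + rand.Intn(701) // 500..1200 inclusive
    time.Sleep(time.Duration(ms) * time.Millisecond)
}

func main() {
    sem := make(chan struct{}, maxWorkers) // buffered channel used as a counting semaphore
    var wg sync.WaitGroup
    domains := []string{"a.example", "b.example", "c.example"} // placeholder work list
    for _, d := range domains {
        wg.Add(1)
        sem <- struct{}{} // acquire a worker slot (blocks when 100 are in flight)
        go func(domain string) {
            defer wg.Done()
            defer func() { <-sem }() // release the slot
            politeDelay()            // inter-request delay before touching the domain
            fmt.Println("would crawl", domain)
        }(d)
    }
    wg.Wait()
}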

Error Handling

Exponential backoff with jitter based on HTTP status codes:

  • 403/429 (Forbidden/Rate Limited): 72-hour backoff, stops immediately
  • 5xx (Server Error): 48-hour backoff
  • 404 (Not Found): 24-hour backoff
  • Success: Domain cached permanently, never re-scanned

Cache Behavior: Once a working feed is discovered for a domain, YandoriBot caches the result permanently and will never re-scan that domain for additional feeds. This minimizes server load for subsequent monitoring operations.
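
A hedged sketch of the status-to-backoff mapping might look like the following; the function name and the 10% jitter fraction are assumptions, since the text above only states that jitter is applied.

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// backoffFor maps an HTTP status code to how long the domain is put on hold.
func backoffFor(status int) time.Duration {
    var base time.Duration
    switch {
    case status == 403 || status == 429:
        base = 72 * time.Hour // forbidden / rate limited: stop immediately, longest backoff
    case status >= 500:
        base = 48 * time.Hour // server errors
    case status == 404:
        base = 24 * time.Hour // candidate path not found
    default:
        return 0 // success and other codes: no backoff, result is cached
    }
    // Add up to 10% jitter so retries across many domains do not align.
    jitter := time.Duration(rand.Int63n(int64(base) / 10))
    return base + jitter
}

func main() {
    for _, code := range []int{200, 403, 404, 429, 503} {
        fmt.Printf("HTTP %d -> backoff %v\n", code, backoffFor(code))
    }
}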

Discovery Process

YandoriBot follows a systematic discovery process optimized for efficiency:

  1. robots.txt Check: Fetches and respects robots.txt directives before any crawling activity
  2. HTML Parsing: Fetches homepage once, looks for RSS feed links in <link> tags
  3. Feed Validation: Validates discovered feeds by parsing XML structure
  4. Fallback Check (Conditional): Only tries common paths (/feed, /rss, etc.) if NO feeds found in HTML
  5. Permanent Caching: Stores successful discoveries, never re-scans domains with working feeds

Efficiency Optimizations:
  • Maximum 2-4 requests per domain (typically)
  • Random 500-1200ms delays between requests
  • Stops immediately on 403/429 errors
  • 24-72 hour backoff on errors
  • Never scans the same domain twice if feeds found
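
For illustration, step 2 (HTML parsing) could be sketched as below using the golang.org/x/net/html package; the package choice, the extractFeedLinks name, and the wiring of the three-feed limit are assumptions for exposition rather than the crawler's actual implementation.

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

// feedTypes lists the MIME types treated as syndication feeds.
var feedTypes = map[string]bool{
    "application/rss+xml":   true,
    "application/atom+xml":  true,
    "application/feed+json": true,
}

// extractFeedLinks walks the parsed document and returns at most `limit`
// feed URLs advertised via <link rel="alternate"> autodiscovery tags.
func extractFeedLinks(doc *html.Node, limit int) []string {
    var feeds []string
    var walk func(*html.Node)
    walk = func(n *html.Node) {
        if len(feeds) >= limit {
            return
        }
        if n.Type == html.ElementNode && n.Data == "link" {
            var rel, typ, href string
            for _, a := range n.Attr {
                switch strings.ToLower(a.Key) {
                case "rel":
                    rel = strings.ToLower(a.Val)
                case "type":
                    typ = strings.ToLower(a.Val)
                case "href":
                    href = a.Val
                }
            }
            if rel == "alternate" && feedTypes[typ] && href != "" {
                feeds = append(feeds, href)
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }
    walk(doc)
    return feeds
}

func main() {
    page := `<html><head>
<link rel="alternate" type="application/rss+xml" href="/feed">
<link rel="alternate" type="application/atom+xml" href="/feed/atom">
</head><body></body></html>`
    doc, err := html.Parse(strings.NewReader(page))
    if err != nil {
        panic(err)
    }
    fmt.Println(extractFeedLinks(doc, 3)) // prints: [/feed /feed/atom]
}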

Access Control

Method 1: robots.txt (Recommended)

Add this to your /robots.txt file:

User-agent: YandoriBot
Disallow: /

Method 2: Block Specific Paths

To block only feed discovery:

User-agent: YandoriBot
Disallow: /feed
Disallow: /rss
Disallow: /*.xml

Method 3: Server Configuration

Block by user agent in Apache (.htaccess):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Yandori [NC]
RewriteRule .* - [F,L]

Or in nginx:

if ($http_user_agent ~* "Yandori") {
    return 403;
}

Immediate Effect: YandoriBot checks robots.txt before every scan and will honor your directives on its next visit; because robots.txt responses are cached for performance, changes take effect within 24 hours at the latest.

Data Collection

What We Collect

  • RSS/Atom feed URLs and endpoint locations
  • Feed titles and descriptions (metadata)
  • Article titles, links, and publication timestamps
  • Article content for temporal prediction model training

What We Don't Collect

  • No personal information or user data
  • No tracking cookies or session data
  • No email addresses or contact information
  • No authentication credentials

Data Usage

Collected feed data is used exclusively for:

  • Training multi-scale temporal prediction models
  • Research on information stream compression
  • Evaluating self-supervised importance learning algorithms
  • Benchmarking hierarchical memory architectures

All data collection is limited to publicly available RSS/Atom feeds. We respect copyright and DMCA takedown requests. Data is not sold to third parties.

Technical Details

Implementation

Language:     Go 1.24 (compiled with PGO + LTO)
Database:     MySQL 8.0 (InnoDB with Adaptive Hash Index)
Protocols:    HTTP/1.1, HTTP/2 (with ALPN negotiation)
Feed Formats: RSS 2.0, Atom 1.0, JSON Feed 1.1

Compliance Standards

  • RFC 9309: Robots Exclusion Protocol
  • RFC 9110: HTTP Semantics
  • Prefix-tree parsing: Wildcard expansion and precedence resolution for robots.txt
  • Token bucket algorithm: Configurable burst capacity and refill rate
  • Floyd's algorithm: Cycle detection for redirect handling
  • XML schema validation: XSD validation before persistence
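
The token bucket mentioned above can be demonstrated with golang.org/x/time/rate, which implements that algorithm; the refill rate and burst values below are placeholders, not YandoriBot's configuration.

package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // Token bucket: refill one token per second, allow a burst of 3 requests.
    limiter := rate.NewLimiter(rate.Limit(1.0), 3)

    ctx := context.Background()
    start := time.Now()
    for i := 1; i <= 5; i++ {
        // Wait blocks until a token is available (or the context is cancelled).
        if err := limiter.Wait(ctx); err != nil {
            panic(err)
        }
        fmt.Printf("request %d at t=%v\n", i, time.Since(start).Round(100*time.Millisecond))
    }
}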

Cache Policy

  • robots.txt: LRU eviction with TTL=24h (cached for performance, refreshed daily)
  • Discovered feeds: Permanent cache (∞ TTL, never re-scanned)
  • Failed domains: Temporary backoff (24-72h depending on error type)
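
A speculative sketch of the three cache tiers, with a zero expiry standing in for the permanent (∞ TTL) entries; the cacheEntry type and its fields are invented for exposition and are not the crawler's schema.

package main

import (
    "fmt"
    "time"
)

// cacheEntry models one cached result; a zero expiresAt means "never expires".
type cacheEntry struct {
    value     string
    expiresAt time.Time
}

func (e cacheEntry) expired(now time.Time) bool {
    return !e.expiresAt.IsZero() && now.After(e.expiresAt)
}

func main() {
    now := time.Now()
    entries := map[string]cacheEntry{
        "robots.txt":    {"cached robots.txt body", now.Add(24 * time.Hour)}, // refreshed daily
        "working feed":  {"https://example.com/feed", time.Time{}},           // permanent, never re-scanned
        "failed domain": {"backoff marker", now.Add(72 * time.Hour)},         // 24-72 h depending on error
    }
    for name, e := range entries {
        fmt.Printf("%-13s expired now? %v\n", name, e.expired(now))
    }
}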

Contact & Support

If you have questions, concerns, or need to report abuse:

When Reporting Issues

Please include:

  • Your domain name
  • Date and time of bot activity (with timezone)
  • Server log excerpts showing our user agent
  • Nature of the concern

Response Time: We respond to all legitimate abuse reports within 24 hours and will add your domain to our blocklist immediately upon request. Removal requests take effect as soon as they are processed.