YandoriBot Crawler Documentation

RSS/Atom Feed Discovery System

Overview

YandoriBot is a respectful RSS/Atom feed discovery crawler supporting the Yandori temporal prediction research system. Our crawler discovers and monitors syndication feeds to provide real-time content streams for multi-scale temporal prediction experiments.

The crawler enumerates candidate feed endpoints on each site, prioritizing HTML-embedded feed autodiscovery links and expanding to a small set of heuristic fallback paths only when autodiscovery finds nothing.

Crawler Specifications

User Agent

YandoriBot/1.0 (+https://yandori.io/bot.html)

Request Pattern

The crawler implements a respectful discovery pattern designed to minimize server load:

  • HTML Discovery: Maximum 3 feeds extracted from HTML (even if more exist)
  • Fallback Paths: Only attempted if NO feeds found in HTML
  • Paths Checked: /feed, /feed/atom, /rss, /?feed=rss, /rss.xml
  • Typical Total: 2-4 requests per domain
  • Redirects: Maximum 2 redirects followed
  • Duplicate Detection: Prevents fetching same feed multiple times
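
The request pattern above can be illustrated with a minimal Go sketch. This is not the production crawler: the example.com target, the function names, and the use of a plain http.Client are assumptions for exposition; the sketch only shows the fallback-path order, the 2-redirect cap, the 10s connection/read timeouts, and duplicate detection.

package main

import (
    "errors"
    "fmt"
    "net"
    "net/http"
    "time"
)

// fallbackPaths are only probed when HTML autodiscovery finds no feeds.
var fallbackPaths = []string{"/feed", "/feed/atom", "/rss", "/?feed=rss", "/rss.xml"}

// newCrawlClient builds an HTTP client with a 10s connection timeout, a 10s
// response-header timeout, and a hard cap of 2 followed redirects.
func newCrawlClient() *http.Client {
    return &http.Client{
        Transport: &http.Transport{
            DialContext:           (&net.Dialer{Timeout: 10 * time.Second}).DialContext,
            ResponseHeaderTimeout: 10 * time.Second,
        },
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            if len(via) >= 2 {
                return errors.New("stopped after 2 redirects")
            }
            return nil
        },
    }
}

func main() {
    client := newCrawlClient()
    seen := map[string]bool{} // duplicate detection: never fetch the same candidate twice
    for _, path := range fallbackPaths {
        url := "https://example.com" + path // hypothetical target domain
        if seen[url] {
            continue
        }
        seen[url] = true
        resp, err := client.Get(url)
        if err != nil {
            fmt.Println(url, "error:", err)
            continue
        }
        resp.Body.Close()
        fmt.Println(url, "->", resp.StatusCode)
    }
}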

Rate Limiting

  • Inter-request delay: Uniform(500ms, 1200ms) random delay
  • Request rate: Poisson(λ=1.0) per domain with exponential backoff
  • Concurrency: 100 workers (configurable via semaphore pool)
  • Timeout: 10s (connection) + 10s (read) with circuit breaker
  • Respects: robots.txt Crawl-delay directive
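
As a rough illustration of the rate-limiting knobs above, the Go sketch below uses a buffered channel as the 100-worker semaphore and a uniform 500-1200 ms sleep between requests; the names (politeDelay, maxWorkers) and the toy domain list are hypothetical, not taken from the crawler itself.

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

const maxWorkers = 100 // semaphore pool size (configurable)

// politeDelay sleeps for a uniformly distributed 500-1200 ms before a request.
func politeDelay() {
    ms := 500 + rand.Intn(701) // 500..1200 inclusive
    time.Sleep(time.Duration(ms) * time.Millisecond)
}

func main() {
    sem := make(chan struct{}, maxWorkers) // buffered channel used as a counting semaphore
    var wg sync.WaitGroup
    domains := []string{"a.example", "b.example", "c.example"} // placeholder work list
    for _, d := range domains {
        wg.Add(1)
        sem <- struct{}{} // acquire a worker slot (blocks when 100 are in flight)
        go func(domain string) {
            defer wg.Done()
            defer func() { <-sem }() // release the slot
            politeDelay()            // inter-request delay before touching the domain
            fmt.Println("would crawl", domain)
        }(d)
    }
    wg.Wait()
}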

Error Handling

Exponential backoff with jitter based on HTTP status codes:

  • 403/429 (Forbidden/Rate Limited): 72-hour backoff, stops immediately
  • 5xx (Server Error): 48-hour backoff
  • 404 (Not Found): 24-hour backoff
  • Success: Domain cached permanently, never re-scanned

Cache Behavior: Once a working feed is discovered for a domain, YandoriBot caches the result permanently and will never re-scan that domain for additional feeds. This minimizes server load for subsequent monitoring operations.
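
A hedged sketch of the status-to-backoff mapping might look like the following; the function name and the 10% jitter fraction are assumptions, since the text above only states that jitter is applied.

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// backoffFor maps an HTTP status code to how long the domain is put on hold.
func backoffFor(status int) time.Duration {
    var base time.Duration
    switch {
    case status == 403 || status == 429:
        base = 72 * time.Hour // forbidden / rate limited: stop immediately, longest backoff
    case status >= 500:
        base = 48 * time.Hour // server errors
    case status == 404:
        base = 24 * time.Hour // candidate path not found
    default:
        return 0 // success and other codes: no backoff, result is cached
    }
    // Add up to 10% jitter so retries across many domains do not align.
    jitter := time.Duration(rand.Int63n(int64(base) / 10))
    return base + jitter
}

func main() {
    for _, code := range []int{200, 403, 404, 429, 503} {
        fmt.Printf("HTTP %d -> backoff %v\n", code, backoffFor(code))
    }
}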

Discovery Process

YandoriBot follows a systematic discovery process optimized for efficiency:

  1. robots.txt Check: Fetches and respects robots.txt directives before any crawling activity
  2. HTML Parsing: Fetches homepage once, looks for RSS feed links in <link> tags
  3. Feed Validation: Validates discovered feeds by parsing XML structure
  4. Fallback Check (Conditional): Only tries common paths (/feed, /rss, etc.) if NO feeds found in HTML
  5. Permanent Caching: Stores successful discoveries, never re-scans domains with working feeds

Efficiency Optimizations:
  • Maximum 2-4 requests per domain (typically)
  • Random 500-1200ms delays between requests
  • Stops immediately on 403/429 errors
  • 24-72 hour backoff on errors
  • Never scans the same domain twice if feeds found
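
For illustration, step 2 (HTML parsing) could be sketched as below using the golang.org/x/net/html package; the package choice, the extractFeedLinks name, and the wiring of the three-feed limit are assumptions for exposition rather than the crawler's actual implementation.

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

// feedTypes lists the MIME types treated as syndication feeds.
var feedTypes = map[string]bool{
    "application/rss+xml":   true,
    "application/atom+xml":  true,
    "application/feed+json": true,
}

// extractFeedLinks walks the parsed document and returns at most `limit`
// feed URLs advertised via <link rel="alternate"> autodiscovery tags.
func extractFeedLinks(doc *html.Node, limit int) []string {
    var feeds []string
    var walk func(*html.Node)
    walk = func(n *html.Node) {
        if len(feeds) >= limit {
            return
        }
        if n.Type == html.ElementNode && n.Data == "link" {
            var rel, typ, href string
            for _, a := range n.Attr {
                switch strings.ToLower(a.Key) {
                case "rel":
                    rel = strings.ToLower(a.Val)
                case "type":
                    typ = strings.ToLower(a.Val)
                case "href":
                    href = a.Val
                }
            }
            if rel == "alternate" && feedTypes[typ] && href != "" {
                feeds = append(feeds, href)
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }
    walk(doc)
    return feeds
}

func main() {
    page := `<html><head>
<link rel="alternate" type="application/rss+xml" href="/feed">
<link rel="alternate" type="application/atom+xml" href="/feed/atom">
</head><body></body></html>`
    doc, err := html.Parse(strings.NewReader(page))
    if err != nil {
        panic(err)
    }
    fmt.Println(extractFeedLinks(doc, 3)) // prints: [/feed /feed/atom]
}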

Access Control

Method 1: robots.txt (Recommended)

Add this to your /robots.txt file:

User-agent: YandoriBot
Disallow: /

Method 2: Block Specific Paths

To block only feed discovery:

User-agent: YandoriBot
Disallow: /feed
Disallow: /rss
Disallow: /*.xml

Method 3: Server Configuration

Block by user agent in Apache (.htaccess):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Yandori [NC]
RewriteRule .* - [F,L]

Or in nginx:

if ($http_user_agent ~* "Yandori") {
    return 403;
}

Immediate Effect: YandoriBot checks robots.txt before every scan and will honor your directives on its next visit; because robots.txt responses are cached for performance, changes take effect within 24 hours at the latest.

Data Collection

What We Collect

  • RSS/Atom feed URLs and endpoint locations
  • Feed titles and descriptions (metadata)
  • Article titles, links, and publication timestamps
  • Article content for temporal prediction model training

What We Don't Collect

  • No personal information or user data
  • No tracking cookies or session data
  • No email addresses or contact information
  • No authentication credentials

Data Usage

Collected feed data is used exclusively for:

  • Training multi-scale temporal prediction models
  • Research on information stream compression
  • Evaluating self-supervised importance learning algorithms
  • Benchmarking hierarchical memory architectures

All data collection is limited to publicly available RSS/Atom feeds. We respect copyright and DMCA takedown requests. Data is not sold to third parties.

Technical Details

Implementation

Language:     Go 1.24 (compiled with PGO + LTO)
Database:     MySQL 8.0 (InnoDB with Adaptive Hash Index)
Protocols:    HTTP/1.1, HTTP/2 (with ALPN negotiation)
Feed Formats: RSS 2.0, Atom 1.0, JSON Feed 1.1

Compliance Standards

  • RFC 9309: Robots Exclusion Protocol
  • RFC 9110: HTTP Semantics
  • Prefix-tree parsing: Wildcard expansion and precedence resolution for robots.txt
  • Token bucket algorithm: Configurable burst capacity and refill rate
  • Floyd's algorithm: Cycle detection for redirect handling
  • XML schema validation: XSD validation before persistence
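
The token bucket mentioned above can be demonstrated with golang.org/x/time/rate, which implements that algorithm; the refill rate and burst values below are placeholders, not YandoriBot's configuration.

package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // Token bucket: refill one token per second, allow a burst of 3 requests.
    limiter := rate.NewLimiter(rate.Limit(1.0), 3)

    ctx := context.Background()
    start := time.Now()
    for i := 1; i <= 5; i++ {
        // Wait blocks until a token is available (or the context is cancelled).
        if err := limiter.Wait(ctx); err != nil {
            panic(err)
        }
        fmt.Printf("request %d at t=%v\n", i, time.Since(start).Round(100*time.Millisecond))
    }
}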

Cache Policy

  • robots.txt: LRU eviction with TTL=24h (cached for performance, refreshed daily)
  • Discovered feeds: Permanent cache (∞ TTL, never re-scanned)
  • Failed domains: Temporary backoff (24-72h depending on error type)
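
A speculative sketch of the three cache tiers, with a zero expiry standing in for the permanent (∞ TTL) entries; the cacheEntry type and its fields are invented for exposition and are not the crawler's schema.

package main

import (
    "fmt"
    "time"
)

// cacheEntry models one cached result; a zero expiresAt means "never expires".
type cacheEntry struct {
    value     string
    expiresAt time.Time
}

func (e cacheEntry) expired(now time.Time) bool {
    return !e.expiresAt.IsZero() && now.After(e.expiresAt)
}

func main() {
    now := time.Now()
    entries := map[string]cacheEntry{
        "robots.txt":    {"cached robots.txt body", now.Add(24 * time.Hour)}, // refreshed daily
        "working feed":  {"https://example.com/feed", time.Time{}},           // permanent, never re-scanned
        "failed domain": {"backoff marker", now.Add(72 * time.Hour)},         // 24-72 h depending on error
    }
    for name, e := range entries {
        fmt.Printf("%-13s expired now? %v\n", name, e.expired(now))
    }
}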

Contact & Support

If you have questions, concerns, or need to report abuse:

When Reporting Issues

Please include:

  • Your domain name
  • Date and time of bot activity (with timezone)
  • Server log excerpts showing our user agent
  • Nature of the concern

Response Time: We respond to all legitimate abuse reports within 24 hours and will add your domain to our blocklist immediately upon request. Removal requests take effect as soon as they are processed.