Based on feedback from the HN community, I reworked the core system to explicitly model source attribution and article-to-article references, not just textual similarity.
What changed:
Clusters still start from full-text embeddings, but are now refined using explicit citations and source mentions inside articles.
Sites we don't actively crawl are added as nodes when referenced, so lineage isn't limited to the monitored set. This captures many large outlets that don't publish usable RSS feeds.
Attribution no longer relies purely on RSS timestamps: publish times are now checked against citation order and reference structure.
The flow view now represents an inferred derivation graph, not just a timeline of similar headlines.
You can now browse and search past stories to inspect earlier propagation patterns.
Click any story to see its source graph and attribution chain.
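To make the changes above concrete, here is a rough sketch of the idea, not the actual implementation. It assumes each article in a cluster carries a publish time and a list of cited URLs. The `Article` class and the function names `build_graph`, `timestamp_conflicts`, and `likely_origins` are made up for illustration, as are the example URLs.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass
class Article:
    """Hypothetical record for one node in a story's source graph."""
    url: str
    published: Optional[datetime]  # None when we never crawled the site
    cites: List[str] = field(default_factory=list)  # URLs referenced in the body
    crawled: bool = True


def build_graph(cluster: List[Article]) -> Dict[str, Article]:
    """Index a cluster by URL and add cited-but-uncrawled URLs as external nodes."""
    nodes = {a.url: a for a in cluster}
    for article in cluster:
        for cited in article.cites:
            if cited not in nodes:
                nodes[cited] = Article(url=cited, published=None, crawled=False)
    return nodes


def timestamp_conflicts(nodes: Dict[str, Article]) -> List[Tuple[str, str]]:
    """Return (citing, cited) pairs where the citing article claims an earlier
    publish time than the article it cites, so the timestamp is suspect."""
    conflicts = []
    for article in nodes.values():
        for cited in article.cites:
            target = nodes[cited]
            if (article.published is not None
                    and target.published is not None
                    and article.published < target.published):
                conflicts.append((article.url, cited))
    return conflicts


def likely_origins(nodes: Dict[str, Article]) -> List[str]:
    """Nodes that are cited by someone in the graph and cite nothing themselves."""
    cited = {c for a in nodes.values() for c in a.cites}
    return sorted(url for url in cited if not nodes[url].cites)


# Toy cluster: b cites a, a cites a wire report we do not crawl.
ts = datetime.fromisoformat
cluster = [
    Article("https://a.example/story", ts("2024-01-01T09:00"),
            ["https://wire.example/report"]),
    Article("https://b.example/story", ts("2024-01-01T08:00"),
            ["https://a.example/story"]),
]
nodes = build_graph(cluster)
print(timestamp_conflicts(nodes))
print(likely_origins(nodes))
```

In this toy cluster, b's RSS timestamp says it came first, but b cites a, so the pair is flagged as a conflict. The uncrawled wire report ends up as the likely origin because it is cited and cites nothing else in the graph.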
Context
The main criticism from the previous HN post was that similarity + RSS timestamps aren't sufficient to identify who actually broke a story, and that large sources were missing. Both were fair. This update addresses those issues by modeling explicit citation relationships and including referenced external sources as graph nodes.
Still English-only for now. I'm trying to get attribution right before expanding to other languages.