Attribution Confidence Scoring: Why Your Threat Intel Is Lying About How Sure It Is

T. Holt

Attribution is where intelligence operations go wrong most often. Not because analysts are careless. Because the systems they use to report attribution treat confidence as a binary: either you know who did it, or you don't. That gap between 'we have indicators' and 'we have proof' gets papered over in briefings, reports, and automated feeds until someone makes a decision based on a conclusion that was, at best, a well-reasoned guess.

This is an engineering problem as much as an analytical one.

The Confidence Theater Problem

Most threat intelligence platforms support confidence scores in their data models. STIX 2.1 defines an optional confidence property, scored 0 to 100, on its objects. MISP expresses confidence through taxonomy tags such as estimative-language and admiralty-scale. OpenCTI lets you attach reliability ratings to sources and individual objects.

Almost nobody uses them correctly.

What happens in practice: an analyst ingests a report from a commercial feed, sees an attribution claim linking a campaign to APT-X, and adds that to the knowledge graph with whatever the platform's default confidence value is. That default usually sits around 50 out of 100. Nobody changes it because changing it requires a decision, and decisions require time, and time is the one thing SOC teams never have.

The result is a knowledge graph where every attribution claim carries the same weight, regardless of whether it came from a government forensic team or a blog post someone found on Reddit.

What a Real Confidence Model Looks Like

Confidence scoring needs to be decomposed into its constituent parts, not treated as a single number an analyst picks from a dropdown.

Consider four dimensions that actually drive attribution reliability:

Source reliability. How often has this source been independently verified? A government CERT with a documented track record rates differently than an anonymous Telegram channel. Score this once per source, store it as metadata, apply it automatically at ingestion time.

Technical indicator quality. Infrastructure indicators (IPs, domains) decay fast and get reused by unrelated actors. TTPs, which sit at the top of the Pyramid of Pain, are harder for an actor to change and persist longer. Weight them accordingly. A shared C2 IP is weak evidence. A novel process injection technique matching a known actor's tooling is stronger.

Corroboration count. How many independent sources reach the same conclusion? Two reports from the same vendor feed do not count as two independent sources. You need to track the provenance chain to catch this. If three different feeds all cite the same original Mandiant report, you have one data point, not three.

Temporal decay. Attribution confidence should degrade over time unless actively refreshed. A campaign attribution that was solid two years ago is worth less today because threat actors retool, share infrastructure, and deliberately plant false flags to confuse exactly this kind of analysis.

Combine these into a single score, but expose all four components in your data model. The aggregate number is useful for dashboards. The components are what analysts need when a score looks wrong.
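One way to sketch the combination is a weighted average that keeps the four components on the record. The weights, the field names, and the saturation point of three independent sources are all assumptions made for illustration, not values from any platform. Corroboration is counted by distinct root source, so three feeds citing one report collapse into a single data point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceComponents:
    """The four dimensions, each normalised to the range 0.0-1.0."""
    source_reliability: float
    indicator_quality: float
    corroboration: float
    temporal_freshness: float


# Hypothetical weights; tune them against attributions you have verified.
WEIGHTS = {
    "source_reliability": 0.30,
    "indicator_quality": 0.30,
    "corroboration": 0.25,
    "temporal_freshness": 0.15,
}


def corroboration_score(root_sources: list[str], saturation: int = 3) -> float:
    """Score corroboration by distinct root sources, not by report count.

    Each entry names the original report a feed item traces back to.
    `saturation` (assumed) is the number of independent sources that
    earns full marks.
    """
    independent = len(set(root_sources))
    return min(independent, saturation) / saturation


def composite_score(c: ConfidenceComponents) -> int:
    """Weighted average of the components, scaled to 0-100."""
    total = sum(getattr(c, name) * weight for name, weight in WEIGHTS.items())
    return round(total * 100)


# Three feed items that all trace back to the same (hypothetical) report.
components = ConfidenceComponents(
    source_reliability=0.8,
    indicator_quality=0.4,
    corroboration=corroboration_score(["vendor-report-A"] * 3),
    temporal_freshness=0.9,
)
print(composite_score(components), components)
```

Printing the components next to the aggregate is the point: when the number looks wrong, the analyst can see that weak corroboration, not the source, is dragging it down.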

graph TD
    A[/Raw Intel Ingest/] --> B{Source Reliability Check}
    B --> C[Indicator Quality Scorer]
    C --> D[Corroboration Deduplicator]
    D --> E[Temporal Decay Engine]
    E --> F((Composite Confidence Score))
    F --> G[Attribution Record]

Building This Into the Pipeline

The scoring logic belongs in your enrichment layer, not in your UI. If analysts can see attribution claims before confidence scores are applied, they will anchor on the claim and rationalize the score afterward. Psychology works against you here.

Ingest first. Score before display. Surface the score prominently alongside any attribution claim, not buried in a metadata panel.

For the temporal decay piece, a scheduled job that runs nightly against your knowledge graph and decrements confidence scores based on age works fine at small scale. At larger scale, you want this as a stream processor running against your event log. Every time an attribution record goes 90 days without a corroborating update, the confidence drops. Analysts get notified when records they own fall below a threshold.
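A minimal sketch of the nightly variant, assuming records are plain dicts with hypothetical `base_confidence`, `confidence`, and `last_corroborated` fields. The 90-day window comes from the paragraph above; the ten-point penalty and the notification threshold of 40 are placeholders.

```python
from datetime import datetime, timedelta, timezone

DECAY_WINDOW = timedelta(days=90)   # window from the text
DECAY_PER_WINDOW = 10               # assumed: points lost per stale window
NOTIFY_BELOW = 40                   # assumed: alert threshold for record owners


def decayed_confidence(base_score: int, last_corroborated: datetime, now: datetime) -> int:
    """Subtract a fixed penalty for every full window without corroboration."""
    windows_elapsed = max(0, (now - last_corroborated) // DECAY_WINDOW)
    return max(0, base_score - windows_elapsed * DECAY_PER_WINDOW)


def run_decay_job(records: list[dict], now: datetime) -> list[str]:
    """Update each record in place; return IDs that just crossed the threshold."""
    needs_review = []
    for record in records:
        previous = record["confidence"]
        current = decayed_confidence(
            record["base_confidence"], record["last_corroborated"], now
        )
        record["confidence"] = current
        # Notify only on the crossing, not on every run below the threshold.
        if current < NOTIFY_BELOW <= previous:
            needs_review.append(record["id"])
    return needs_review


records = [{
    "id": "attr-001",
    "base_confidence": 55,
    "confidence": 45,
    "last_corroborated": datetime(2024, 1, 1, tzinfo=timezone.utc),
}]
print(run_decay_job(records, now=datetime(2024, 7, 1, tzinfo=timezone.utc)))
```

The job recomputes from the score at last corroboration rather than decrementing the current value, so running it twice in one night does not penalize a record twice.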

Store the scoring rationale, not just the score. When someone asks six months from now why an attribution shows 34% confidence, you want to replay the logic, not interrogate the analyst who set it.
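As a rough illustration, a rationale record can hold the inputs, the weights, and a model version next to the score, so the number can be recomputed later. Every field name here is hypothetical, and the weighted-average formula is an assumed scoring model.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScoringRationale:
    """Everything needed to reproduce a score without asking the analyst."""
    record_id: str
    scored_at: str                 # ISO 8601 timestamp
    model_version: str             # which scoring logic produced the score
    components: dict[str, float]   # each component in 0.0-1.0
    weights: dict[str, float]
    score: int


def compute_score(components: dict[str, float], weights: dict[str, float]) -> int:
    """Assumed scoring model: weighted average scaled to 0-100."""
    return round(100 * sum(components[name] * weights[name] for name in weights))


def replay(rationale_json: str) -> bool:
    """Recompute the score from the stored inputs and compare to the stored score."""
    stored = ScoringRationale(**json.loads(rationale_json))
    return compute_score(stored.components, stored.weights) == stored.score


weights = {"source": 0.30, "indicator": 0.30, "corroboration": 0.25, "freshness": 0.15}
components = {"source": 0.4, "indicator": 0.3, "corroboration": 0.4, "freshness": 0.2}
rationale = ScoringRationale(
    record_id="attr-001",
    scored_at="2024-01-01T00:00:00+00:00",
    model_version="v1",
    components=components,
    weights=weights,
    score=compute_score(components, weights),
)
serialized = json.dumps(asdict(rationale))  # store this next to the attribution
print(rationale.score, replay(serialized))
```

If `replay` returns False, either the stored inputs or the scoring logic changed after the fact, which is worth knowing before anyone defends the number in a briefing.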

The Briefing Problem

All of this falls apart if your reporting layer strips confidence out of the output.

Executive briefs routinely flatten "confidence: 38%" into "attributed to [Actor]." This is where bad decisions happen. Build templates that make confidence explicit and hard to remove. If a confidence score is below your operational threshold (define that threshold; 60% is a reasonable floor for most tactical decisions), the briefing template should flag the attribution as unconfirmed by default.

Someone has to own that threshold. Make it a config value in your pipeline, not a judgment call made differently by every analyst who touches the report.
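A small sketch of what that looks like in a reporting layer: the threshold is a single value and the rendering function never emits an attribution without its confidence. The function name and the wording of the output are illustrative; in a real pipeline the threshold would be read from configuration rather than hard-coded.

```python
# Assumed: the 60% floor suggested above, owned in one place.
OPERATIONAL_THRESHOLD = 60


def render_attribution(actor: str, confidence: int,
                       threshold: int = OPERATIONAL_THRESHOLD) -> str:
    """Render one attribution line; confidence is always part of the output."""
    if not 0 <= confidence <= 100:
        raise ValueError(f"confidence must be 0-100, got {confidence}")
    if confidence < threshold:
        # Below the floor, the claim is flagged as unconfirmed by default.
        return (
            f"UNCONFIRMED: possibly linked to {actor} "
            f"(confidence: {confidence}%, below the {threshold}% threshold)"
        )
    return f"Attributed to {actor} (confidence: {confidence}%)"


print(render_attribution("APT-X", 38))
print(render_attribution("APT-X", 75))
```

Because the confidence is baked into the rendered string, stripping it requires someone to edit the brief by hand, which is a much more visible act than omitting a metadata field.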

Attribution is hard. The actors you're tracking work to make it hard. But the amount of additional uncertainty we introduce through lazy confidence modeling is entirely self-inflicted, and it's fixable with deliberate engineering.