The Two Ways Your Moderation Will Fail

Every moderation system fails in one of two directions. It lets through content it should have caught, or it blocks content it should have allowed. Both failures are real, both have costs, and most teams only think hard about one of them until the other bites them.

Understanding this trade-off is the foundation of running a moderation system that actually works. If you treat it as a detail to revisit later, no amount of tuning, tooling, or staffing will fix the underlying problem. Get the framing right first, and the rest of the decisions become much clearer.

The Two Failure Modes

A false negative is content your system misses. Harassment that slips through. A threat that goes undetected. A slur that your users see because nothing caught it. This is what most people picture when they imagine moderation failing.

A false positive is content your system incorrectly flags. A legitimate post removed. A user wrongly suspended. A customer blocked from doing something routine because a word in their message matched a pattern it should not have.

Both failures have real costs. False negatives create harm, damage trust in the platform, and in some cases create legal exposure. False positives create friction, alienate good-faith users, and — if bad enough — make the product feel arbitrary and unusable.

The problem is that they pull in opposite directions. Making your system more aggressive reduces false negatives but increases false positives. Making it more permissive reduces false positives but increases false negatives. There is no configuration that eliminates both at once.
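The pull in opposite directions can be made concrete with a toy sweep over a single block threshold. The scores and labels below are entirely hypothetical; the point is only the shape of the trade-off, not the numbers.

```python
# Illustrative sketch (hypothetical scores and labels): moving one block
# threshold trades false negatives against false positives.

# Each item: (model score 0-100, actually_harmful)
samples = [
    (92, True), (78, True), (55, True), (41, True),    # harmful content
    (63, False), (48, False), (30, False), (12, False) # legitimate content
]

def failure_counts(threshold):
    """Count both failure modes if everything at or above threshold is blocked."""
    false_negatives = sum(1 for s, harmful in samples if harmful and s < threshold)
    false_positives = sum(1 for s, harmful in samples if not harmful and s >= threshold)
    return false_negatives, false_positives

for t in (40, 50, 60, 70):
    fn, fp = failure_counts(t)
    print(f"threshold={t}: false negatives={fn}, false positives={fp}")
```

At threshold 40 this toy set has zero false negatives and two false positives; at 70 it is the reverse. No threshold in between zeroes both columns, which is the whole point.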

This is not a problem you solve. It is a problem you manage.

Why Teams Get This Wrong

Most teams starting out in moderation make the same mistake: they treat it as a filter to be calibrated once and left alone. They build or adopt something that catches obvious content, declare it done, and move on.

Then one of three things happens. The false positives cause a product problem — users complaining, support tickets piling up, developers adding exceptions. Or the false negatives cause a trust problem — an incident, a screenshot, a story that should not have been possible on the platform. Or both happen at once, and the team ends up with a patchwork of rules pulled in opposite directions with no coherent model behind them.

The solution is not just better rules. It is deciding upfront which failure mode is more acceptable in your context, and designing the system around that decision explicitly.

Setting Your Threshold

There is no universal right threshold. The right answer depends on who your users are, what harm looks like on your platform, and what the cost of a mistake is in each direction.

A children's platform should run aggressive moderation and accept a higher false positive rate. The cost of letting harmful content through is too high, and the cost of occasionally removing a borderline post is manageable. Parents expect that. A developer tools company or a B2B SaaS has very different dynamics: the users are adults, the content is mostly professional, and false positives actively damage the product experience. The threshold belongs much higher. A gaming chat platform sits somewhere between those poles — the content norms are rougher, but the harm potential is real, and what counts as "normal" varies significantly by audience age and game type.

The threshold question is not "what score do I block at?" It is "what is the cost of a mistake in each direction for my specific users?" That is a business decision, and it needs to be made in advance — not calibrated reactively after you see what the system does in production. Teams that skip this step end up chasing thresholds in response to whoever complained most recently, which is not a moderation strategy.

Once you have answered the business question, the technical question follows naturally. Where you draw the line between auto-block, human review, and allow is a direct expression of how you have weighed false negatives against false positives.
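That expression can be as simple as two numbers in one place. A sketch, assuming a 0-100 aggregate score; the threshold values here are placeholders for the business decision, not recommendations.

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    REVIEW = "review"
    BLOCK = "block"

# Hypothetical values: these two numbers encode the business decision.
# A children's platform might review from 20 and block from 50; a B2B
# tool might review from 50 and block from 80.
REVIEW_THRESHOLD = 35
BLOCK_THRESHOLD = 60

def route(score: int) -> Action:
    """Map an aggregate score to a routing decision, not a verdict."""
    if score >= BLOCK_THRESHOLD:
        return Action.BLOCK
    if score >= REVIEW_THRESHOLD:
        return Action.REVIEW  # middle band: a human decides, not the system
    return Action.ALLOW
```

Keeping the two thresholds as named, reviewable configuration makes later recalibration a deliberate change rather than a scattered code edit.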

What Scores Actually Mean

In a weighted scoring system, a high score does not mean content is definitively harmful. It means the system has accumulated enough signal that the content warrants attention. A low score does not mean content is safe. It means the system did not find the patterns it was looking for.

This is why scores are routing decisions, not verdicts.

The middle band — roughly the 35 to 60 range, depending on your thresholds — is where this matters most. A score of 47 is not telling you the content is low-priority. It is telling you something genuinely uncertain happened. The content triggered signals without crossing a clear line. The right response is human review, not ignoring it. Teams that treat middle-band scores as "probably fine" are accumulating a quiet false negative problem they will discover the hard way.

It is also worth knowing where false negatives concentrate. They rarely come from obvious content — slurs, explicit threats, graphic language. Those get caught. The hard cases are obfuscation and evasion: deliberate misspellings, coded language, threats framed as hypotheticals, harassment structured to stay just below the line. A system that does well on obvious content and poorly on evasion will look fine in testing and underperform in production, because the users who intend harm are the ones who learn to avoid detection.
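One common countermeasure is normalising text before pattern matching, so that simple obfuscation collapses back to the form the patterns expect. A minimal sketch; real evasion is an arms race, and aggressive normalisation can itself create false positives by merging innocent words.

```python
import re
import unicodedata

# Illustrative character substitutions (leetspeak); a real map is larger
# and maintained against observed evasion, not hard-coded once.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})

def normalise(text: str) -> str:
    # Fold compatibility forms (e.g. fullwidth letters) towards ASCII
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode()
    text = text.lower().translate(LEET_MAP)
    # Strip separators used to break words up: "h a t e", "h.a.t.e"
    return re.sub(r"[\s\.\-_*]+", "", text)

print(normalise("h4t3"))  # → hate
```

Note what this does not catch: coded language, hypothetical framing, and harassment that stays below the line are semantic problems, not spelling problems, and need signals beyond normalisation.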

Middle-band scores are a signal, not a non-event. They mean a human should look at this, not that nothing happened.

Category Scores Are the Signal

Aggregate scores are useful for routing. Category scores are useful for understanding what actually happened and what to do about it.

Content that scores highly on self-harm signals needs a different response than content that scores highly on threats or slurs. The aggregate score might be the same. The operational response should not be. A self-harm signal might warrant a welfare-focused response or a referral. A credible threat might warrant account action or escalation to a trust and safety lead.

This is where moderation stops being a classification problem and starts being an operational one. Who reviews what? What action do reviewers take? What is the escalation path? How are decisions logged and fed back into the system?

When You Discover You Got It Wrong

Most teams find out their threshold was wrong only after launch. The question is which direction it was wrong in, because the recovery looks different.

If you set it too aggressively — too many false positives — you will see it in support queues, user complaints, and developers adding special-case exceptions to work around the system. The fix is raising the threshold or narrowing the review band. Painful, but recoverable. Users adjust relatively quickly when content that was getting blocked starts going through.

If you set it too permissively — too many false negatives — you might not see it clearly until an incident surfaces. A screenshot. A pattern that was always there and nobody caught. The fix is tightening the threshold. But here, the recovery is harder. Users and developers have built around the permissive behaviour. Some will push back on tighter moderation as if it is a new restriction, even though it is a correction. If the platform has developed a reputation for lax enforcement, tightening it is a multi-month exercise in expectation management, not just a configuration change.

The asymmetry is worth internalising: going from too-permissive to more-aggressive is the harder direction to travel. Starting tighter and loosening over time is almost always easier than starting loose and trying to tighten.

The Practical Implication

If you are building or buying a moderation system, the right questions go beyond "what does it detect?" They are:

  • What do scores mean, and what should I do at each level?
  • What categories does it distinguish, and are those categories operationally useful?
  • How do I route content to human review, and what do reviewers see?
  • How do I tune it for my context without breaking it for other content?
  • What happens when it is wrong, and how do I know when it is?
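One useful discipline is recording the answers in a single policy object, so that changing a threshold or an escalation path is a reviewed configuration change. The field names below are a hypothetical shape, not any particular product's schema.

```python
# Illustrative policy object capturing the decisions the questions above
# force. Every field name here is an assumption, not a real API.
MODERATION_POLICY = {
    "thresholds": {"review": 35, "block": 60},  # allow / review / block bands
    "categories": ["self_harm", "threat", "harassment", "slur"],
    "review": {
        "queue": "trust_and_safety",
        "show_reviewer": ["content", "category_scores", "user_history"],
    },
    "escalation": {"threat": "tns_lead", "self_harm": "welfare_team"},
    "audit": {"log_decisions": True, "feed_back_reviewer_labels": True},
}
```

A config like this also answers the last question on the list: logged decisions plus fed-back reviewer labels are how you find out the system is wrong before an incident does.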

These are harder questions than "does it catch slurs?" But they are the questions that determine whether your moderation actually works, or just looks like it does until something goes wrong.

They are also not questions you answer once. As your product grows and your user base shifts, the right threshold shifts with them. A platform that started with a small developer audience and grows into a consumer product will need to revisit decisions that made sense in year one. The moderation strategy that worked when everyone knew each other is not the same one you need when you have anonymous users at scale. Build the habit of reviewing the threshold as deliberately as you set it in the first place.

Aegis Core is designed around this model: weighted scoring by category, configurable thresholds, and a built-in human review queue for the content that automation should not decide on its own. If you are thinking through any of this for your platform, start with the public docs and the product overview.