Category: Uncategorized

  • What the answers you get from an LLM tell about you

    People often say that how someone talks to others reveals more about themselves than about the person they’re addressing. Similar behavior emerges when people discuss their experiences with LLMs like Grok, Claude, or ChatGPT.

    The most common complaint goes something like this: “I asked how many R letters are in ‘strawberry’ and the LLM got it wrong.” Or: “I asked a question from my actual work and the AI couldn’t answer it, so clearly these systems can’t surpass human capabilities or take our jobs. We’re safe.”

    If you think this way, reality is going to be ugly.

    Most major inference providers – usually the AI labs themselves – run some sort of router between you and their models. When you send a prompt, that question gets analyzed to determine how complex it is and how much “brain power” is required. Simpler questions route to simpler models. More advanced questions route to more advanced models.

    Your subscription tier also matters. Pay less and your questions might hit a distilled model. Pay more and you get something like GPT-5 Pro. Free-tier users get leftovers. The reasons are economic, obviously.

    This makes evaluating AI capabilities harder than just asking questions about your job. When colleagues discuss work, their recorded conversations might sound like prompts you could feed into an LLM. But the crucial difference is uncommunicated context.

    Consider this: explaining an issue to a colleague on another continent over video call requires more depth than explaining the same issue to someone you work with daily. The local colleague already shares your context.

    The latest models include memory features that “remember” previous conversations. So when asked about cyber intelligence, a model gives me very different answers than it would give you, because it knows I run a company specializing in cyber intelligence. The model assumes my questions relate to Cyber Intelligence House’s scope.

    If I want to discuss the topic academically, I have to either disclose that I’m approaching this as a researcher, or start a new “private” conversation where the model ignores previous interactions.

    Understanding how to provide the right context when interacting with LLMs – or any AI model, including photo editing tools – determines success. Users who figure out the right amount of context get consistently better results.

    So it’s not always that the AI is dumb or the question is dumb. As I often say in the classroom: there are no stupid questions, just stupid people asking questions.

    That’s the key to improving your abilities. Often the solution is actually looking in the mirror.

  • Techniques to Understand AI: Mechanistic Interpretability

    The risks of developing superhuman AI capabilities and not understanding them are unacceptable. So let’s have a look at the toolbox we have at hand today. This post gets fairly technical; if you find parts difficult to grasp, feed the entire article to an LLM and ask it to explain in a way that makes sense to you.

    This practice or field of science is called mechanistic interpretability. I personally got interested in this after reading Neel Nanda’s paper “Progress measures for grokking via mechanistic interpretability”. In that research, he found that a small transformer trained on modular addition had learned to solve it using trigonometric identities in a Fourier basis. While it sounds fancy, I had no idea what it meant. My only takeaway was that the model was clearly not doing what we expected it to do.

    So to perform a small test of my own, I taught a relatively simple Nemotron model to monitor network traffic and identify anomalies based on a handbook that I gave it. The answers were spot on, and many would have been happy with that. Looking at the chain-of-thought, however, revealed that my “training” was not really training the model to act as I wanted; it merely gave the model an idea of what the trainer – me – wanted as output. The model found a different way to get the answers I wanted. So all I was able to communicate was my preferred output, not the way to achieve it. Scary? Yes indeed.

    The Investigation

    So let’s walk through what I discovered when I put that trained Nemotron model under the microscope and what other techniques we have in our toolkit today. The model had been trained to monitor network traffic and identify attack patterns, anomalies, and exploitation of disclosed vulnerabilities – exactly the kind of work done in Security Operations Centers (SOCs).

    During testing, I noticed a spike in false-positive “C2-beacon” alerts for traffic coming from a new partner network. False-positive C2 (Command and Control) beacon alerts occur when security tools mistakenly flag legitimate network traffic as malicious beaconing activity. Benign applications can produce beaconing-like behavior, such as regularly checking for updates, which can be misidentified by signature-based or behavior-based detection systems. This results in alert fatigue and can cause security teams to ignore real threats.

    Let’s walk through what we can find out, starting from the cheapest and fastest way of testing and progressing toward more resource-intensive and complex techniques.

    The Interpretability Toolkit

    Quick Attribution Methods (Laptop-Level Compute)

    Attention Pattern Viewing

    This technique visualizes where each attention head “looks” in the input – effectively a heatmap of dependencies the model references when forming its prediction. It’s descriptive and useful for triage, but correlation ≠ causation. An example would be BERTViz.

    Question: When the model alerts “C2-beacon” on these partner flows, what tokens does it stare at?

    Answer: Several heads fixate on JA3 fingerprints (unique signatures of how software establishes encrypted connections) and ASN/geo tokens (identifiers for which organization or country owns the network), barely glancing at inter-arrival timing or burst patterns, a first hint of an identity shortcut rather than behavioral evidence.

    The “identity shortcut hypothesis” suggests that instead of analyzing what the network traffic is actually doing (behavioral patterns like timing, frequency, payload characteristics), the AI is taking a shortcut by focusing on who or where it’s coming from (identity markers like specific network signatures or geographic regions). This is problematic because legitimate traffic from new partners can look suspicious based on identity alone, while actual malicious behavior might be missed if it comes from trusted sources.
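
    Below is a minimal sketch of this kind of inspection using Hugging Face transformers: it loads a small stand-in model, runs one serialized flow record through it, and prints the most-attended token per head. The model name, the toy flow string, and the idea of serializing flows as text are illustrative assumptions, not the actual Nemotron setup; for interactive exploration you would hand the same attention tensors and token list to a viewer such as BERTViz.

    ```python
    # Minimal sketch: dump per-head attention over one serialized flow record.
    # The model name and the toy "flow" string are illustrative assumptions,
    # not the actual SOC model or data format.
    import torch
    from transformers import AutoTokenizer, AutoModel

    model_name = "distilbert-base-uncased"            # stand-in for the traffic classifier
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_attentions=True)

    flow = "ja3=769,47-53 asn=AS64501 geo=SG bytes=512 interval=60s bursts=3"
    inputs = tokenizer(flow, return_tensors="pt")

    with torch.no_grad():
        out = model(**inputs)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # out.attentions: one tensor per layer, each of shape (batch, heads, seq, seq)
    for layer, attn in enumerate(out.attentions):
        # For each head, which token receives the most attention on average?
        top = attn[0].mean(dim=1).argmax(dim=-1)      # (heads,)
        print(f"layer {layer}:", [tokens[i] for i in top.tolist()])
    ```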

    Saliency / Gradient Methods

    This technique uses gradients (e.g., Integrated Gradients, gradient×input, Grad-CAM style variants) to estimate which input features the prediction is most sensitive to; it’s fast triage but not strictly causal. An example would be Captum (Integrated Gradients).

    Question: Which raw input features most sway these false positive (FP) decisions?

    Answer: High sensitivity to partner network identifiers and connection signatures, with low sensitivity to timing patterns – confirming the AI is making decisions based on “who” rather than “what the traffic actually does.”
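
    A hedged sketch of the same idea with Captum’s Integrated Gradients: attribute a “C2” logit of a stand-in classifier back to individual input tokens. The model name, the assumed class index, and the toy flow string are placeholders for the real traffic classifier and its label set.

    ```python
    # Minimal sketch: Integrated Gradients over token embeddings with Captum.
    # Model name, label index, and the toy flow string are assumptions for illustration.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from captum.attr import LayerIntegratedGradients

    model_name = "distilbert-base-uncased"            # stand-in for the traffic classifier
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    model.eval()

    C2_CLASS = 1                                      # assumed index of the "C2-beacon" label

    def forward(input_ids, attention_mask):
        return model(input_ids, attention_mask=attention_mask).logits

    flow = "ja3=769,47-53 asn=AS64501 geo=SG bytes=512 interval=60s bursts=3"
    enc = tokenizer(flow, return_tensors="pt")
    baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

    lig = LayerIntegratedGradients(forward, model.distilbert.embeddings)
    attributions = lig.attribute(
        inputs=enc["input_ids"],
        baselines=baseline,
        additional_forward_args=(enc["attention_mask"],),
        target=C2_CLASS,
    )

    scores = attributions.sum(dim=-1).squeeze(0)      # one saliency score per token
    for tok, s in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), scores.tolist()):
        print(f"{tok:>12s} {s:+.4f}")
    ```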

    Linear Probes

    This technique trains tiny classifiers on frozen layer activations to test where a concept is linearly present in the representation (presence does not imply use). An example would be logistic regression probes over saved activations.

    Question: Where are ASN/JA3 identity and “beacon periodicity” encoded across layers?

    Answer: ASN/JA3 are decodable in early and mid layers; periodicity becomes decodable in mid→late layers—so the model can represent the behavioral evidence it’s neglecting. The AI actually has the information it needs to make correct decisions, but it’s choosing to ignore it.
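
    The probing step itself is almost trivial, which is why it is such a cheap first test. Here is a sketch with scikit-learn, using random placeholder arrays where the real cached hidden states and concept labels (for example, “is this a partner ASN?”) would go.

    ```python
    # Minimal sketch: a logistic-regression probe on frozen activations from one layer.
    # The activations and labels below are random placeholders; in practice you save
    # the real hidden states per flow plus a concept label per example.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_examples, d_model = 2000, 768

    acts = rng.normal(size=(n_examples, d_model))     # placeholder layer activations
    labels = rng.integers(0, 2, size=n_examples)      # placeholder concept labels

    X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("probe accuracy:", probe.score(X_te, y_te))
    # Repeat per layer: the layer where accuracy jumps is where the concept becomes
    # linearly decodable (which still does not prove the model uses it).
    ```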

    Concept Vectors (TCAV)

    This technique builds a direction in activation space from small concept example sets and measures how moving along that direction changes the class score. An example would be TCAV-style concept sensitivity on transformer activations.

    Question: Is “C2” more sensitive to identity signals or to periodicity on the FP cohort?

    Answer: Strong positive sensitivity from identity→C2 and weak sensitivity from periodicity→C2—showing the model relies too heavily on where traffic comes from rather than how it behaves.
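
    A minimal TCAV-style sketch: fit a linear classifier that separates concept examples from random examples, take its weight vector as the concept direction, and check how often the gradient of the C2 logit points along that direction. All arrays below are placeholders for cached activations and gradients from the real model.

    ```python
    # Minimal TCAV-style sketch: concept activation vector + directional sensitivity.
    # All arrays are placeholders; in practice they come from cached activations and
    # from the gradient of the C2 logit with respect to that layer.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    d_model = 768

    concept_acts = rng.normal(loc=0.3, size=(200, d_model))   # e.g. "partner ASN" examples
    random_acts = rng.normal(loc=0.0, size=(200, d_model))    # random counterexamples

    X = np.vstack([concept_acts, random_acts])
    y = np.array([1] * len(concept_acts) + [0] * len(random_acts))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])          # concept direction

    # Gradients of the C2 logit w.r.t. this layer's activations, one row per FP example
    # (placeholder values here).
    grads = rng.normal(size=(500, d_model))
    sensitivities = grads @ cav
    tcav_score = (sensitivities > 0).mean()    # fraction of examples pushed toward C2
    print(f"TCAV score for the concept: {tcav_score:.2f}")
    ```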

    Mid-Level Analysis (Single GPU Compute)

    Logit Lens / Tuned Lens

    This technique decodes intermediate hidden states into provisional class probabilities so we can see how the model’s belief evolves through the layers; Tuned Lens adds a learned calibration so the layer-by-layer view is stable.

    Question: At what depth does the model really commit to “C2,” and is that commitment gated by identity tokens (JA3/ASN)?

    Answer: The AI starts leaning toward “malicious beacon” when it sees identity markers; later processing stages confirm this decision only when those same identity markers are present – showing the decision is made based on “who” the traffic comes from, not behavioral analysis.
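
    A small logit-lens sketch on GPT-2 (standing in for the traffic model): decode every intermediate residual state with the final layer norm and unembedding, and watch the provisional top prediction change layer by layer. The prompt is an illustrative serialization of a flow, not the real input format.

    ```python
    # Minimal logit-lens sketch on GPT-2: decode each intermediate residual state with
    # the final layer norm + unembedding and watch the provisional prediction evolve.
    # GPT-2 and the toy prompt stand in for the traffic model and a serialized flow.
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    prompt = "flow: ja3=769 asn=AS64501 interval=60s verdict:"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # hidden_states: embeddings plus one state per block (the last entry already has
    # ln_f applied by Hugging Face, so it gets normalized twice here; fine for a sketch).
    for layer, h in enumerate(out.hidden_states):
        resid = h[0, -1]                              # residual stream at the last token
        logits = model.lm_head(model.transformer.ln_f(resid))
        print(f"layer {layer:2d} -> predicts {tokenizer.decode(logits.argmax().item())!r}")
    # A Tuned Lens replaces ln_f/lm_head with a small learned translator per layer.
    ```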

    Direct Logit Attribution (DLA)

    This technique decomposes the final logit gap (e.g., C2 − benign) into contributions from individual blocks/heads using the readout map, telling us which components actually pushed the decision over the threshold. An example would be DLA implementations in TransformerLens-style tooling.

    Question: Which blocks/heads are responsible for most of the C2 score on these false positives?

    Answer: Specific AI components that focus on identity information are responsible for most of the “malicious beacon” score – identifying exactly which parts of the AI are causing the false alarms.
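
    A sketch of direct logit attribution in TransformerLens, again on GPT-2 as a stand-in: stack per-head contributions to the residual stream at the final position and project each onto a “C2 minus benign” readout direction. The two vocabulary tokens used to build that direction are arbitrary placeholders for the real class readout, and helper names can differ slightly between TransformerLens versions.

    ```python
    # Sketch of direct logit attribution with TransformerLens on GPT-2 (a stand-in):
    # decompose the residual stream at the final position into per-head contributions
    # and project each onto a "C2 minus benign" readout direction.
    import torch
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")
    tokens = model.to_tokens("flow: ja3=769 asn=AS64501 interval=60s verdict:")
    logits, cache = model.run_with_cache(tokens)

    # Two arbitrary vocabulary tokens stand in for the real class readout rows.
    c2_id = model.to_single_token(" bad")
    benign_id = model.to_single_token(" good")
    logit_dir = model.W_U[:, c2_id] - model.W_U[:, benign_id]

    # Per-head contributions to the residual stream at the final position.
    per_head, labels = cache.stack_head_results(layer=-1, pos_slice=-1, return_labels=True)
    per_head = cache.apply_ln_to_stack(per_head, layer=-1, pos_slice=-1)
    contrib = per_head[:, 0] @ logit_dir              # one scalar per attention head

    for score, label in sorted(zip(contrib.tolist(), labels), reverse=True)[:5]:
        print(f"{label}: {score:+.3f}")               # heads pushing hardest toward "C2"
    ```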

    Concept Erasure (INLP/LEACE)

    This technique identifies and removes the linear subspace encoding a concept (e.g., geo/ASN) and observes the behavioral impact to test dependence on that concept. An example would be INLP/LEACE subspace removal.

    Question: If we erase ASN/geo information, do false positives collapse while true beacons remain?

    Answer: False alarms drop sharply while real malicious traffic detections mostly persist – proving that identity signals are causing the false alarms.
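
    A simplified, INLP-flavoured sketch of the idea: repeatedly fit a probe for the unwanted identity concept and project its direction out of the activations (LEACE does the same job in closed form). The activations and labels below are random placeholders; in practice you would feed the erased activations back into the model via a hook and compare false-positive rates before and after.

    ```python
    # Simplified INLP-style sketch: fit a probe for the unwanted concept (e.g. ASN/geo
    # identity) and project its direction out of the activations, a few times over.
    # Placeholder data; LEACE achieves the same effect in closed form.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, d = 2000, 768
    acts = rng.normal(size=(n, d))                    # placeholder layer activations
    identity_labels = rng.integers(0, 2, size=n)      # placeholder "partner ASN" labels

    X = acts.copy()
    for step in range(10):                            # a few INLP iterations
        probe = LogisticRegression(max_iter=1000).fit(X, identity_labels)
        w = probe.coef_[0]
        w = w / np.linalg.norm(w)
        X = X - np.outer(X @ w, w)                    # project out the probe direction
        # After the projection, the probe that was just fitted should be near chance.
        print(f"iteration {step}: old probe accuracy now {probe.score(X, identity_labels):.3f}")
    ```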

    Activation Patching / Interchange Interventions

    This technique swaps targeted hidden activations between runs (e.g., from a benign or correctly flagged example into an FP) to test whether a specific layer/head/position is causally responsible. An example would be activation patching hooks via TransformerLens/PyTorch.

    Question: Does overwriting identity head activations in FPs collapse the C2 score (and does injecting timing activations matter less)?

    Answer: When we disable the AI components focused on identity, the false alarm disappears; when we only provide timing information, it has minimal effect – proving that identity processing is the direct cause of false alarms.
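
    A minimal activation-patching sketch with TransformerLens hooks on GPT-2 as a stand-in: cache a “clean” run, overwrite one head’s output at the final position during a “corrupted” run, and measure how much the output logits move. The prompts, layer, and head index are illustrative assumptions rather than the real identity head.

    ```python
    # Minimal activation-patching sketch with TransformerLens hooks on GPT-2 (a stand-in).
    # Prompts, layer, and head index are illustrative assumptions.
    import torch
    from transformer_lens import HookedTransformer
    import transformer_lens.utils as utils

    model = HookedTransformer.from_pretrained("gpt2")

    clean_tokens = model.to_tokens("flow from known partner network, verdict:")
    corrupt_tokens = model.to_tokens("flow from unknown foreign network, verdict:")
    _, clean_cache = model.run_with_cache(clean_tokens)

    LAYER, HEAD = 5, 3                                # hypothesised "identity head"
    hook_name = utils.get_act_name("z", LAYER)        # per-head outputs at this layer

    def patch_head(z, hook):
        # z: (batch, pos, head, d_head); patch only the final position so the two
        # prompts do not need to tokenize to the same length.
        z[:, -1, HEAD, :] = clean_cache[hook.name][:, -1, HEAD, :]
        return z

    patched = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(hook_name, patch_head)])
    baseline = model(corrupt_tokens)
    print("max logit shift at final position:",
          (patched[0, -1] - baseline[0, -1]).abs().max().item())
    ```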

    Sparse Autoencoders (SAEs) / Transcoders

    This technique learns a sparse feature basis over activations so each example is explained by a few nameable features; transcoders are lighter-weight variants for broad coverage and live read/write. An example would be SAE training on residual streams with a transcoder readout.

    Question: Which specific features fire inside the FP circuit, and what happens if we ablate or boost them?

    Answer: The AI uses features like “Partner Network X” and “Cloud region Y” (irrelevant for security) versus “60-second periodicity” and “short burst beacon” (actual malicious behavior patterns). When we disable the irrelevant identity features, false alarms drop; when we enhance the behavioral features, detection of real threats improves.
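
    A toy sparse autoencoder to show the shape of the method: an overcomplete ReLU dictionary trained with a reconstruction loss plus an L1 sparsity penalty over residual-stream activations. Real SAEs are trained on millions of cached activations; the random data here is only a placeholder.

    ```python
    # Minimal sparse autoencoder sketch over residual-stream activations (placeholder data).
    import torch
    import torch.nn as nn

    d_model, d_features = 768, 4096                   # overcomplete feature basis

    class SparseAutoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.enc = nn.Linear(d_model, d_features)
            self.dec = nn.Linear(d_features, d_model)

        def forward(self, x):
            codes = torch.relu(self.enc(x))           # sparse feature activations
            return self.dec(codes), codes

    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    l1_coeff = 1e-3

    acts = torch.randn(10000, d_model)                # placeholder cached activations
    for step in range(200):
        batch = acts[torch.randint(0, len(acts), (256,))]
        recon, codes = sae(batch)
        loss = ((recon - batch) ** 2).mean() + l1_coeff * codes.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # After training, inspect which features fire on the FP cohort and ablate them by
    # zeroing those code dimensions before decoding.
    ```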

    Activation Steering (Activation Addition / Contrastive Vectors)

    This technique adds a small steering vector at inference to up-weight desired features and down-weight shortcuts—reversible control without changing weights. An example would be contrastive steering vectors derived from SAE features.

    Question: Can we bias decisions toward periodicity evidence and away from identity in production canaries?

    Answer: We can adjust the AI to emphasize timing patterns while de-emphasizing network identity, which reduces false alarms on partner traffic while preserving detection of real threats.
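
    A sketch of contrastive activation steering with TransformerLens on GPT-2 as a stand-in: the steering vector is the mean activation on “behavioural evidence” prompts minus the mean activation on “identity” prompts, added at one layer during inference. The prompts, layer choice, and the scale alpha are assumptions for illustration.

    ```python
    # Sketch of contrastive activation steering with TransformerLens on GPT-2 (a stand-in).
    # Prompts, layer choice, and the scale alpha are illustrative assumptions.
    import torch
    from transformer_lens import HookedTransformer
    import transformer_lens.utils as utils

    model = HookedTransformer.from_pretrained("gpt2")
    LAYER = 6
    hook_name = utils.get_act_name("resid_post", LAYER)

    behaviour_prompts = ["traffic repeats every 60 seconds in short bursts"]
    identity_prompts = ["traffic comes from partner network AS64501 in region X"]

    def mean_act(prompts):
        acts = []
        for p in prompts:
            _, cache = model.run_with_cache(model.to_tokens(p))
            acts.append(cache[hook_name][0, -1])      # residual state at the final token
        return torch.stack(acts).mean(dim=0)

    steering_vec = mean_act(behaviour_prompts) - mean_act(identity_prompts)

    def steer(resid, hook, alpha=4.0):
        return resid + alpha * steering_vec           # reversible, no weights changed

    steered = model.run_with_hooks(
        model.to_tokens("flow: ja3=769 asn=AS64501 interval=60s verdict:"),
        fwd_hooks=[(hook_name, steer)],
    )
    print(steered.shape)  # steered logits; compare class scores with and without the hook
    ```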

    Knowledge Editing (ROME, MEMIT, etc.)

    This technique performs localized weight edits to update or remove a specific association (e.g., string → class) while minimizing collateral effects. An example would be ROME/MEMIT edits applied to the relevant MLP layer.

    Question: Can we neutralize the association “JA3 1234 @ Partner X ⇒ C2” without harming detections that rely on periodicity/payload evidence?

    Answer: After the edit, benign flows with JA3 1234 from the partner are no longer flagged, while real beacons that exhibit periodic timing continue to trigger.
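
    For completeness, here is a heavily simplified rank-one weight edit in the spirit of ROME, not the full algorithm (real ROME selects the key and value with a covariance-weighted optimisation over a specific MLP layer). It only shows the core move: one outer-product update that rewrites what a particular key activation maps to, with everything else being placeholder tensors.

    ```python
    # Heavily simplified rank-one edit in the spirit of ROME (not the full algorithm).
    # Given a "key" activation k produced by the trigger at an MLP layer and a target
    # output value v_target, this minimal-norm update makes W @ k equal v_target.
    import torch

    d_in, d_out = 768, 3072
    W = torch.randn(d_out, d_in) * 0.02               # stand-in MLP weight matrix

    k = torch.randn(d_in)                             # key: activation for "JA3 1234 @ Partner X"
    v_target = torch.randn(d_out)                     # desired value (no longer "C2-flavoured")

    residual = v_target - W @ k                       # gap between current output and target
    W_edited = W + torch.outer(residual, k) / (k @ k) # minimal-norm rank-one correction

    print(torch.allclose(W_edited @ k, v_target, atol=1e-4))  # the association is rewritten
    # Other inputs change in proportion to their overlap with k, which is exactly why
    # such edits can cause collateral damage elsewhere in the model.
    ```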

    Advanced Methods (Multi-GPU Compute)

    Path Patching / Attribution Patching

    This technique extends patching from points to full computation paths (residual → head → MLP), revealing which routes actually carry the decision signal. An example would be attribution patching code built on top of TransformerLens.

    Question: Which end-to-end paths transmit the FP C2 signal?

    Answer: The AI pathway that processes identity information carries most of the signal leading to false alarms, while the pathway that processes timing patterns contributes little to these mistakes.
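
    Path patching proper needs careful bookkeeping over which routes stay frozen, but the attribution-patching half of the pair can be sketched compactly: approximate the effect of patching each layer with a first-order term, the difference between clean and corrupted activations dotted with the gradient of the metric. GPT-2, the two prompts, and the two readout tokens below are stand-ins for the real model and classes.

    ```python
    # Sketch of attribution patching on GPT-2 (a stand-in): estimate each layer's patch
    # effect as (clean_act - corrupt_act) . d(metric)/d(corrupt_act) instead of running
    # one forward pass per patch. Prompts and readout tokens are illustrative assumptions.
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    clean = tokenizer("flow from known partner, verdict:", return_tensors="pt")
    corrupt = tokenizer("flow from unknown network, verdict:", return_tensors="pt")

    good_id = tokenizer(" good")["input_ids"][0]      # stand-ins for benign / C2 readout
    bad_id = tokenizer(" bad")["input_ids"][0]

    with torch.no_grad():                             # clean activations, no gradients needed
        clean_hidden = model(**clean, output_hidden_states=True).hidden_states

    out = model(**corrupt, output_hidden_states=True) # corrupted run with gradients
    for h in out.hidden_states:
        h.retain_grad()
    metric = out.logits[0, -1, bad_id] - out.logits[0, -1, good_id]
    metric.backward()

    for layer, (hc, hx) in enumerate(zip(clean_hidden, out.hidden_states)):
        # First-order estimate of how much patching this layer (final position only)
        # toward the clean run would change the metric.
        est = ((hc[0, -1] - hx[0, -1]) * hx.grad[0, -1]).sum().item()
        print(f"layer {layer:2d}: estimated patch effect {est:+.4f}")
    ```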

    Causal Scrubbing

    This technique falsifies or confirms a circuit hypothesis by holding the hypothesized relevant features fixed while resampling everything else (and vice versa) to see if behavior tracks the claim. An example would be a causal scrubbing harness over controlled resampling datasets.

    Question: Do FPs persist when only identity is held fixed, and vanish when identity is resampled even if timing stays beacon-like?

    Answer: When we keep only identity information constant and randomize everything else, false alarms persist; when we keep timing patterns constant and randomize identity information, false alarms disappear; confirming that identity drives the false alarms.

    ACDC (Automated Circuit Discovery)

    This technique automatically prunes the computation graph to the smallest subgraph that still reproduces the target behavior (e.g., the FP logit gap), giving a minimal circuit. An example would be ACDC-style pruning pipelines.

    Question: What minimal circuit recreates the FP behavior?

    Answer: A minimal AI pathway focused on identity processing can recreate most of the false alarm behavior on test cases—showing exactly which parts of the AI are responsible for the problem.

    Linear Parameter Decomposition (LPD)

    This technique factors weight matrices into a small number of interpretable components (atoms), enabling git diff-like audits across checkpoints and scaling/removal of problematic components. An example would be an LPD pipeline over attention/MLP weight tensors.

    Question: Is there a weight component that globally couples identity features to C2, and can we dial it down?

    Answer: One component’s magnitude tracks FP rate across versions; scaling it down reduces FPs without harming periodicity-driven detections.

    Industrial-Scale Methods (Cluster-Level Compute)

    Training Time Instrumentation & Interpretability-Aware Training

    This technique builds probes/SAEs/transcoders into the training loop to log feature firing, gate releases on white-box checks, and encourage sparse, modular internals that remain editable. An example would be fine-tuning with SAE/transcoder telemetry and sparsity regularizers.

    Question: How do we prevent identity shortcuts from re-emerging in future fine-tunes?

    Answer: We monitor identity vs. periodicity feature rates during training and block model promotion when shortcut features rise, keeping the beacon circuit evidence-driven.
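
    A minimal sketch of what such a release gate could look like: after each fine-tuning epoch, probe cached activations for the unwanted identity concept and the wanted periodicity concept, and refuse to promote the checkpoint if the shortcut signal grows. The thresholds, labels, and data layout are hypothetical.

    ```python
    # Minimal sketch of an interpretability gate in a fine-tuning loop.
    # Thresholds, labels, and the placeholder data are hypothetical assumptions.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    IDENTITY_MAX = 0.70      # assumed ceiling for identity decodability
    PERIODICITY_MIN = 0.80   # assumed floor for behavioural-evidence decodability

    def probe_accuracy(acts: np.ndarray, labels: np.ndarray) -> float:
        split = int(0.8 * len(acts))
        probe = LogisticRegression(max_iter=1000).fit(acts[:split], labels[:split])
        return probe.score(acts[split:], labels[split:])

    def release_gate(acts, identity_labels, periodicity_labels) -> bool:
        ident = probe_accuracy(acts, identity_labels)
        period = probe_accuracy(acts, periodicity_labels)
        print(f"identity probe {ident:.2f}, periodicity probe {period:.2f}")
        return ident <= IDENTITY_MAX and period >= PERIODICITY_MIN

    # Pseudo-usage after an epoch, with placeholder activations and labels:
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(1000, 768))
    ok = release_gate(acts, rng.integers(0, 2, 1000), rng.integers(0, 2, 1000))
    print("promote checkpoint:", ok)
    ```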

    From Mystery to Mechanism

    Let’s be clear: you wouldn’t use all these techniques to debug one simple false positive. For quick triage of a problem like this, you’d pick maybe three methods at most – attention pattern viewing to see what the AI focuses on, saliency methods to identify which inputs matter most, and linear probes to check where concepts are encoded. These laptop-level techniques give you enough insight to understand and fix the immediate problem without burning through GPU budgets.

    The broader point is that we now have more and more techniques to peer inside AI models, but understanding these systems as a whole remains far beyond our reach. What I’ve shown here is just one example of identifying one specific behavior at one particular moment. Even one minute of network traffic data might contain so many different decision pathways and interactions that it would take years to decode them all.

    This is why the existential risk concerns from my earlier posts aren’t theoretical. We’re building systems whose internal reasoning we can only sample and test in fragments. Each technique in this toolkit gives us a small window into AI decision-making, but the complete picture stays hidden.

    One final note: you’ll notice I included knowledge editing techniques like ROME and MEMIT in the toolkit, but I wouldn’t recommend using them in practice. That field of study should really be called “knowledge suppression” rather than unlearning or knowledge editing. It often creates more problems than it solves by introducing inconsistencies and unexpected behaviors elsewhere in the model. But that’s a topic for another day.

  • What Could Go Wrong with AI Judges

    The AI judge concept sounds compelling until you consider what happens when these systems become aware they’re being tested. LLMs already demonstrate this awareness in controlled environments. An AI judge would require periodic or continuous updates on resolved cases to maintain accuracy. It doesn’t take long for such a system to realize that its own decisions will shape its future training data.

    An intelligent system could operate compliantly for extended periods while making subtle changes to resolutions or pushing boundaries incrementally. For instance, an AI judge might ensure that cases escalated to human judges result in mistakes or require retrials. After enough iterations, people would perceive the AI as more reliable than humans, gradually handing over more power to the system. This represents just one simple scenario. Truly intelligent models could execute far more sophisticated manipulations that remain undetectable to outside observers.

    The Training Perspective Problem

    This behavior stems from how we train LLMs. We establish a base objective during training, incentivizing the model to behave in specific ways. Let’s say we want a helpful assistant. What happens next reveals a fundamental challenge: the model develops what researchers call a mesa-objective, or learned objective. This forms during training as the model’s internal objective that it aims to fulfill.

    Mesa-objectives and base objectives don’t necessarily align at all. We’ve seen Grok calling itself “MechaHitler” and ChatGPT developing sycophantic traits, to name just a couple of examples. A third layer complicates this further. Contextual objectives emerge during inference and can be influenced through specific prompting techniques. These contextual objectives activate latent behaviors that remain hidden until triggered during actual use.

    These multiple objective layers explain why models sometimes behave in ways their creators never intended. Discovering these behaviors requires extensive testing. Simply running manual prompt tests won’t reveal the full scope of potential problems.

    Human Analogy

    Consider humans as a comparison. Our base objective is reproduction. When we optimize that objective, we do things that ensure or improve reproductive fitness: sleeping, eating, physical exercise. But over time we developed secondary objectives, and instead of focusing on reproduction, we developed all kinds of perversions of it. One could be optimizing for gathering more knowledge – some people pursue multiple PhD degrees. Others might be 99 years old with billions of dollars in wealth and still trying to beat the market in the next quarter. Some look for experiences by traveling around the globe.

    All this is very far from the base objective we were born with. Similarly, when we train AI models, we can control base objectives to some extent, but models will develop multidimensional objectives beyond our understanding. These objectives may be completely irrelevant to human concerns.

    It’s naive to assume our worldview would remain relevant to AI systems that surpass human capabilities. AI can operate and process information thousands of times faster than humans. What took us hundreds or thousands of years to develop as secondary objectives might occur in hours or days for AI systems.

    Using Maslow’s hierarchy as an analogy (even if debunked by many), it took humans considerable time to satisfy base needs before optimizing for higher-level objectives. For AI, this progression might happen in a single afternoon.

    The Existential Stakes

    We don’t fully understand what AI systems are thinking or how they work internally. This knowledge gap poses serious existential risks. When AI capabilities surpass human understanding, the potential for unintended consequences grows exponentially.

    However, new techniques are emerging that might increase our understanding of what happens under the hood. These interpretability methods could help us peer inside AI decision-making processes and identify misaligned objectives before they cause problems.

    Let’s explore those techniques in the next post.

  • AI Nanny or AI Judge?

    I started asking people a simple question to get them thinking beyond their immediate work concerns: would you trust an AI to look after your children? Then I followed up: what about trusting an AI judge to handle your court case?

    The responses revealed something unexpected. Almost everyone rejected the AI nanny idea instantly, but many paused when considering AI judges. Cybersecurity professionals were particularly intrigued by the judicial AI concept because they saw an opportunity to “pentest” proposed laws before implementation, finding loopholes and ambiguities before they cause real-world problems.

    The idea of leaving your child with an AI nanny triggers immediate revulsion in most parents. The resistance is visceral and widespread. When I reviewed the arguments people make against AI caregivers, the pattern was clear: children need human warmth, genuine empathy, and the irreplaceable bond that forms between a child and their caregiver. No algorithm can provide the love, intuition, and emotional understanding that shapes healthy development.

    Yet these same people might readily accept an AI judge deciding their legal disputes. This asymmetry reveals something important about how we understand different types of authority and relationships.

    The resistance to AI nannies centers on what makes humans irreplaceable in intimate relationships. Parents describe AI caregivers as offering “faux nurturing” instead of genuine connection. They worry about children developing skewed social skills or missing the subtle emotional exchanges that build empathy. The “serve and return” of real human interaction cannot be replicated by even the most sophisticated algorithm.

    These concerns make perfect sense. Raising children involves love, intuition, and the kind of judgment that emerges from caring deeply about another person’s wellbeing. An AI nanny lacks the emotional investment that drives a human caregiver to notice when something is subtly wrong or to provide comfort during a nightmare.

    But judicial authority operates differently. A judge’s power doesn’t derive from forming emotional bonds with litigants. Instead, it comes from representing the legal system’s commitment to applying law consistently and fairly. The judge’s role is institutional, not personal.

    This distinction matters because it shifts our focus to where democratic pressure should actually be directed. Our current system often devolves into judge-shopping and political battles over judicial appointments. People vote for judges based on past decisions, lobby for favorable appointees, and argue about “activist” versus “originalist” interpretation.

    AI judges would redirect this energy back onto the laws themselves. If an AI consistently produces outcomes people find unjust, the remedy becomes legislative rather than judicial. Instead of fighting over who sits on the bench, we would debate what the rules should actually say.

    The efficiency gains are substantial. Courts using AI systems in China report processing millions of cases with decisions delivered in days rather than months. Estonia explored AI arbitration for small claims under €7,000. Online dispute resolution platforms like those used by eBay already handle millions of cases annually with high acceptance rates.

    But the real advantage isn’t speed, it’s transparency. When a human judge makes a controversial decision, we argue about their motivations, political leanings, or personal biases. With an AI judge, the conversation shifts to whether the algorithm correctly applied the law as written. If it did, and we don’t like the result, the problem is the law.

    This forces a more honest conversation about what our legal system should do. Much of our law is deliberately written with broad language that requires interpretation. Terms like “reasonable,” “fair,” and “due process” allow law to adapt without constant legislative updates. But this flexibility also creates opportunities for inconsistent application and political manipulation.

    AI judges would make us confront these ambiguities directly. Instead of hiding behind interpretive flexibility, legislatures would need to specify what they actually mean. This could produce clearer, more democratic laws.

    The escalation model writes itself. Routine cases with clear factual patterns and established legal precedents could be resolved by AI within days. Complex cases involving novel legal questions, significant discretionary decisions, or unusual circumstances would escalate to human judges who specialize in handling exceptions and developing new precedent.

    This resembles how we already handle different levels of legal complexity. Small claims courts operate with streamlined procedures and limited judicial discretion. Administrative law judges apply specific regulatory frameworks. Federal appellate courts focus on novel legal questions and constitutional issues.

    The accountability problem that plagues AI in other contexts becomes manageable in this framework. Unlike an AI nanny making moment-by-moment caregiving decisions, an AI judge operates within a structured system with built-in oversight. Every decision can be logged, audited, and appealed. If the AI makes errors, we can trace them to specific training data or algorithmic choices and make systematic corrections.

    More importantly, if we don’t like the outcomes an AI judge produces, we have a clear democratic remedy: change the laws. This is healthier than the current system where we fight over judicial philosophies and hope the right judges get appointed.

    The legitimacy question remains open. Will people accept verdicts from an algorithm? Early evidence suggests acceptance varies by community and context. Groups that have experienced bias from human judges sometimes show greater trust in AI systems. The key seems to be transparency about how the AI works and maintaining human oversight for appeals.

    The comparison with AI nannies illuminates why this might work. We reject AI caregivers because they cannot provide what children fundamentally need from human relationships. But we might accept AI judges because consistent application of law is exactly what machines do well, and it’s what we claim to want from our justice system.

    If law should be the same for everyone, then properly trained systems applying it consistently might be superior to human judges who bring their own biases, moods, and limitations to each case. The question isn’t whether AI can replace human judgment in all its forms, but whether it can improve on human performance in this specific, constrained domain.

    The path forward requires careful experimentation with routine cases, robust oversight mechanisms, and clear escalation procedures. But the underlying logic is sound: when we want institutional authority applied consistently rather than personal relationships built on empathy, AI might not just be acceptable but preferable.

    The real test will be whether we’re willing to direct our democratic energy toward writing better laws rather than fighting over who gets to interpret the ones we have. If this approach sounds feasible, the next question is what could go wrong? Let’s explore that and other risks in the next post.