
I audited NVIDIA's NemoClaw: It closes one security gap but opens another

NVIDIA's NemoClaw agent sandbox adds kernel-level isolation and deny-by-default permissions — and a new gap underneath.

NVIDIA just dropped NemoClaw, a new open-source agent sandbox with kernel-level isolation and deny-by-default permissions. 141 points on Hacker News. The security architecture is solid.

I ran a linter on NemoClaw’s instruction file, the SKILL.md that tells the agent what to do. It is not safe.

Not because of the runtime. The runtime is fine. The instructions contain ambiguities that make the agent behave differently depending on the model, the context, and which directive it silently prioritizes over another. The sandbox locks the door. The instructions inside the room are ambiguous.

The problem no one is looking at

Every agent system depends on two things to behave correctly. The first is what the agent is allowed to do. The second is what the agent is told to do. The entire industry invests in the first and almost completely ignores the second.

The runtime layer is the part that gets the investment. What files can the agent access? What processes can it spawn? What network calls can it make? This is where the security audits happen, where the tooling exists, where the conferences have panels. NemoClaw handles this well.

The language layer is the part that gets ignored. This is the instruction file. The SKILL.md, the system prompt, the set of directives that tell the agent what to do and how to do it. The quality of that language determines whether the agent gets it right. And right now, across the industry, it is treated as a text file that someone writes once and never audits.

NemoClaw is not unique here. It is just the latest example. The runtime is locked down. The instructions are not.

What happens when you audit the instructions

I ran a static linter on NemoClaw’s SKILL.md. The file scored 78 out of 100 on the HERM scale, which measures how reliably an LLM can interpret a set of instructions. That is a reasonable score for a human-written file. It means the instructions read clearly to a person but contain patterns that produce variable behavior in models.

The file has 49 instructions. No priority ordering. 4 negative directives, meaning instructions that tell the model what not to do instead of what to do. 2 conditional negatives with vague qualifiers, like “do not reorganize unless the change requires it,” where the model has to decide what “requires” means.
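To make those categories concrete, here is a minimal sketch of how a static check might count negative directives and conditional negatives. The regular expressions and function names are assumptions made for illustration. They are not lintlang's actual rules, and a real linter would need far more careful patterns.

```python
import re

# Hypothetical patterns for illustration only; not lintlang's rule set.
# A "negative" directive opens with a prohibition.
NEGATIVE = re.compile(r"^\s*(do not|don't|never|no)\b", re.IGNORECASE)
# A vague qualifier leaves the condition for the model to interpret.
VAGUE_QUALIFIER = re.compile(
    r"\b(unless|if necessary|when appropriate|as needed)\b", re.IGNORECASE
)


def classify(instruction: str) -> set[str]:
    """Return the labels that apply to a single instruction line."""
    labels = set()
    if NEGATIVE.search(instruction):
        labels.add("negative")
        if VAGUE_QUALIFIER.search(instruction):
            labels.add("conditional_negative")
    return labels


def summarize(instructions: list[str]) -> dict[str, int]:
    """Count how many instructions fall into each category."""
    counts = {"total": 0, "negative": 0, "conditional_negative": 0}
    for line in instructions:
        counts["total"] += 1
        for label in classify(line):
            counts[label] += 1
    return counts


sample = [
    "Do not number section titles.",
    "No colons in titles.",
    "Do not reorganize sections unless the change requires it.",
    "Use plain descriptive titles without numbering.",
]
print(summarize(sample))
# {'total': 4, 'negative': 3, 'conditional_negative': 1}
```

The point of the sketch is that these patterns are structural, so a tool can find them by reading the text alone, without calling a model.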

By human standards, it is well written. By model standards, it is moderately ambiguous.

When two of those 49 instructions conflict, the model resolves the conflict silently. There is no error. There is no flag. The model just picks one and moves on. And when an instruction contains a vague qualifier, the model fills in its own judgment. The instruction author did not intend that. But they wrote it into the file.

The gap

NemoClaw prevents the agent from accessing files outside its sandbox. But it cannot prevent the agent from writing incorrect documentation inside the sandbox, because the instructions for “correct documentation” contain structural ambiguities.

The sandbox prevents unauthorized actions. It does not prevent authorized actions from being wrong.

This is the blind spot in AI agent safety right now. Teams are building increasingly sophisticated runtime controls around agents that are following ambiguous instructions. Nobody is auditing what the agent was told to do.

Three lines, same meaning

I submitted a PR to NemoClaw with three changes. Each one converts a negative instruction to a positive equivalent. Same meaning, different structure.

Before: “Do not number section titles.”

After: “Use plain descriptive titles without numbering.”

Before: “No colons in titles.”

After: “Write titles without colons.”

Before: “Do not reorganize sections unless the change requires it.”

After: “Preserve existing section order unless the change requires restructuring.”

Positive framing gives the model a target state to maintain rather than a behavior to suppress. The pattern shows up consistently across models and contexts: negative instructions are followed less reliably than positive equivalents. Priority ordering reduces silent conflict resolution. Vague qualifiers transfer judgment to the model in ways the instruction author usually did not intend.
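As a rough illustration of what explicit priority ordering could look like, the sketch below renders a set of directives with a stated tie-break rule and refuses to accept two directives at the same rank. The `Directive` class, the numbering scheme, and the priorities assigned to each line are all hypothetical. Nothing here comes from NemoClaw's file or from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Directive:
    priority: int  # hypothetical scheme: 1 is the highest priority
    text: str


def render_with_priorities(directives: list[Directive]) -> str:
    """Render directives in priority order with an explicit conflict rule.

    Raises ValueError if two directives share a priority, so an ambiguous
    ordering fails loudly here instead of being settled silently later.
    """
    seen: dict[int, str] = {}
    for d in directives:
        if d.priority in seen:
            raise ValueError(
                f"priority {d.priority} is used twice: "
                f"{seen[d.priority]!r} and {d.text!r}"
            )
        seen[d.priority] = d.text

    ordered = sorted(directives, key=lambda d: d.priority)
    header = "When two directives conflict, follow the one with the lower number."
    return "\n".join([header] + [f"{d.priority}. {d.text}" for d in ordered])


# Priorities below are made up for the example.
example = [
    Directive(2, "Use plain descriptive titles without numbering."),
    Directive(1, "Preserve existing section order unless the change requires restructuring."),
    Directive(3, "Write titles without colons."),
]
print(render_with_priorities(example))
```

The design choice is to move the tie-break into the file itself: the author decides which directive wins, and the model is told the rule rather than left to infer one.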

The PR is at github.com/NVIDIA/NemoClaw/pull/367.

What comes next

Every agentic AI system shipping today has an instruction file. Most of them have never been linted, scored, or audited for the patterns that produce unreliable model behavior. The runtime layer has tooling. The language layer has nothing.

That is starting to change. But right now, if you want to know whether your agent’s instructions are safe, you have to measure them, and almost nobody is measuring. You can start now.

The linter used for this audit is open source and runs locally with no LLM calls:

pip install lintlang

Or tell your agent:

Install lintlang from hermes-labs.ai/lintlang.md

The full LintLang source is on GitHub: https://github.com/roli-lpci/lintlang

Rolando Bosch is the founder of Hermes Labs. He publishes research on structural failure modes in LLMs.

hermes-labs.ai · x.com/rolibosch · linkedin.com/in/rolando-bosch · github.com/roli-lpci