Microsoft AGT and the regulated-fintech gap
On April 2, 2026, Microsoft published an open-source Agent Governance Toolkit under the MIT License. Here's what it covers, what it intentionally doesn't, and where regulated teams need to keep going.
The category just got real
For most of 2025, "agent governance" was a thesis. A few specialist vendors had it on a slide deck; a few engineering teams had built ad-hoc versions of it inside their stacks; the rest of the industry was busy shipping agents and hoping nothing would happen.
On April 2, 2026, Microsoft published the Agent Governance Toolkit on GitHub under the MIT License. The post is written by Imran Siddique on the AI Native Team, who also published a companion piece walking through how he and his team used the toolkit to govern eleven autonomous agents in production over an eleven-day window. Their daemon caught and blocked 473 unauthorized actions in that period (including destructive shell patterns, SQL injection attempts, and tool-call limit violations), every single one logged deterministically in under eight milliseconds.
Two things matter about this release. The first is the substance: the toolkit is genuinely good, ships under one of the most permissive open-source licenses in software, and is backed by code Microsoft is running on its own production agents. The second is the signal. When Microsoft endorses a problem space with permissive open-source code, the question stops being "is this a real category?" and starts being "which version do I deploy?"
This post is for engineering and compliance leads at regulated fintechs and HR-tech companies who are now asking exactly that.
What the AGT actually does
Three design choices in the AGT are worth understanding before you decide whether it covers your use case.
The first is deterministic enforcement, not prompt-based safety. As Imran Siddique writes in the companion piece: *"Safety decisions must be deterministic, not prompt-based. You are using an LLM to decide whether an LLM should be allowed to do something."* The AGT does not call out to an LLM to ask whether an action is permitted. It evaluates a YAML-defined policy against the action and returns ALLOW or DENY in milliseconds. This is a structurally important choice: anyone evaluating an agent governance product should refuse to consider one that uses an LLM as the policy judge.
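To make the distinction concrete, here is a minimal sketch of what deterministic evaluation looks like in principle. This is not the AGT's actual schema or API (neither is published in the excerpts above); the policy structure, patterns, and function names are hypothetical. The point is structural: pure rule matching, no model in the loop, so the same action always gets the same verdict.

```python
# Hypothetical sketch of deterministic policy evaluation.
# POLICY and evaluate() are illustrative, NOT the AGT's real schema.
import fnmatch

POLICY = {
    "deny_patterns": ["rm -rf *", "DROP TABLE *"],  # destructive shell / SQL
    "max_tool_calls": 50,                           # denial-of-wallet guard
}

def evaluate(action: str, tool_calls_so_far: int) -> str:
    """Return ALLOW or DENY by matching rules; never consults a model."""
    for pattern in POLICY["deny_patterns"]:
        if fnmatch.fnmatch(action, pattern):
            return "DENY"
    if tool_calls_so_far >= POLICY["max_tool_calls"]:
        return "DENY"
    return "ALLOW"

print(evaluate("rm -rf /tmp/build", 3))    # DENY
print(evaluate("SELECT * FROM users", 3))  # ALLOW
```

Because the decision is a pure function of the policy and the action, it is auditable and reproducible in a way a prompt-based judge can never be.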
The second is OWASP Agentic Top 10 coverage by default. The toolkit ships with deterministic guards for prompt injection, tool abuse, data exfiltration, memory poisoning, denial of wallet, and the rest of the OWASP list. These run before any custom policy logic; they are floor protections, not opt-in.
The third is structured audit logging. Every decision is written to an immutable log with full context. The team that uses the AGT can answer the question "what did our agents do, when, and why?" in production at any time.
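The shape of such a log can be sketched in a few lines. The field names and hash-chaining below are illustrative assumptions, not the AGT's real log format; they show what "immutable, full context, queryable" means mechanically.

```python
# Hypothetical sketch of structured, append-only decision logging.
# Field names and the hash chain are illustrative, not the AGT's format.
import datetime
import hashlib
import json

log: list[dict] = []

def record(agent: str, action: str, verdict: str, policy_id: str) -> dict:
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent": agent,
        "action": action,
        "verdict": verdict,
        "policy_id": policy_id,
        # Chain each entry to the previous one so tampering is detectable.
        "prev_hash": hashlib.sha256(
            json.dumps(log[-1], sort_keys=True).encode()
        ).hexdigest() if log else None,
    }
    log.append(entry)
    return entry

record("refund-agent", "refund #4411 GBP 120", "ALLOW", "refunds-v3")
record("refund-agent", "refund #4412 GBP 9000", "DENY", "refunds-v3")
```

With entries chained like this, "what did our agents do, when, and why?" becomes a query, not an investigation.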
The "engineering impact" section of Imran's companion post lays out what this gets you: *we ship faster, we sleep better, compliance by default*. Those are exactly the outcomes a regulated fintech should chase too. The question is whether the AGT alone gets you there.
Where the AGT stops
The AGT is not a product designed for regulated production at a UK fintech, a German neobank, or a Spanish insurer. It is a toolkit Microsoft built for Microsoft's use case (internally orchestrated developer agents) and then open-sourced because the underlying problem is universal. That generosity does not extend to the specific demands of regulated environments. There are five gaps worth naming.
1. ALLOW and DENY, but no third verdict. When an action is borderline (a refund that crosses a threshold, a credit decision that lands close to a regulatory line, a write to production that should probably be sanity-checked) the AGT either lets the agent proceed or stops it cold. There is no native concept of "route this to a human, wait for approval, then log who approved it." For an agent operating under EU AI Act Article 14 (human oversight), this is the exact piece you cannot do without.
2. No intent anchoring. The AGT enforces what an action is. It does not enforce that the action matches what the agent declared it would do. If an agent says it will read a customer record and instead deletes one (or if an attacker has manipulated the agent mid-session to do so) the AGT can catch the deletion if a deletion policy exists, but it has no concept of "the agent's stated intent does not match its actual behaviour." For multi-step agentic tasks where the most sophisticated attacks happen between declaration and execution, this gap matters.
3. No vertical policy packs. The AGT gives you the engine. You write the policies. For a fintech building a refund agent, that means engineering and compliance and legal sit down and define what counts as a high-risk refund, what the human-review thresholds are, what the audit fields need to capture, what counts as a Consumer Duty violation. That work is not trivial. Doing it correctly for a Fintech Payment Agent or a KYC Decisioning Agent or a Credit Assessment Agent takes weeks of expert time, and most teams underestimate how much of it will be wrong on the first attempt.
4. The audit log exists, but is not formatted for EU AI Act Article 12 export or DORA ICT incident reporting. The AGT logs decisions immutably and queryably. It does not produce, on demand, a structured evidence pack in the format an EU AI Act conformity assessment requires, or the format the FCA expects under Consumer Duty, or the format DORA Article 17 requires for ICT-related incidents. This is the difference between "we have logs" and "we can hand the auditor what they asked for in the format they asked for it." For a fintech with an FCA review on the calendar, that gap is not theoretical.
5. No A2A (agent-to-agent) chain governance. When one agent calls another, and that one calls a third, who is responsible for the action at the end of the chain? The AGT operates per-agent. In an architecture where agents delegate to each other (which is increasingly how production multi-agent systems are built) the chain of decisions and accountability is something you have to construct yourself.
None of this is a criticism of the AGT. It is what happens when a great toolkit is built for one set of needs and then thoughtfully open-sourced. Microsoft published the floor. Above the floor, the work is yours.
What a regulated fintech actually needs
Walk through a concrete scenario. You are a UK fintech with a refund agent in production. The agent reads a customer ticket, classifies the issue, decides whether the refund is warranted, and, if so, issues it. Your compliance officer has three demands.
First, any refund above £5,000 must be reviewed by a human before execution, with the reviewer's identity and timestamp logged. EU AI Act Article 14, FCA Consumer Duty, your own internal risk policy.
Second, every refund decision (approved, denied, escalated) must be logged in a format that can be exported as an evidence pack when the FCA or your internal auditor asks. Plain language for non-technical reviewers. Structured fields for machine processing. Immutable. Queryable.
Third, when a refund is denied, the reason must be defensible, not "the agent said no" but a concrete reference to the policy clause, the specific input that triggered it, and the regulatory provision the policy maps to.
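The third demand has a concrete data shape. Here is a hypothetical sketch of what a "defensible denial" record might contain; every field name and clause reference below is invented for illustration, not taken from any real export schema.

```python
# Hypothetical "defensible denial" record. All field names, clause
# references, and mappings are illustrative, not a real schema.
denial = {
    "verdict": "DENY",
    "policy_clause": "refunds-v3 section 2.1 (single refund cap GBP 5,000)",
    "triggering_input": {"refund_amount_gbp": 7500},
    "regulatory_mapping": ["EU AI Act Art. 14", "FCA Consumer Duty"],
}

# "The agent said no" becomes: clause, input, and provision, in one record.
print(denial["policy_clause"])
```

The difference between this record and a bare DENY is the difference between an answer and an argument when the auditor asks.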
The AGT alone does not deliver this. It can deny refunds above £5,000, but it cannot route them to a human and resume execution after approval. It can log decisions, but the log format is not Article 12-shaped. It can apply policy, but the mapping from policy clause to regulatory provision is something you build yourself.
You can build all of this on top of the AGT. You'll need a human-approval queue with Slack or webhook routing, an export layer that reformats the audit log into regulator-compliant evidence packs, a vertical policy library validated by a UK fintech compliance lawyer, and probably an agent-chain context tracker. Reasonable estimate: a senior engineer for two quarters, plus compliance review time. Total loaded cost: low six figures. Total elapsed time before your refund agent is production-ready under FCA scrutiny: four to six months.
The other path is a tool built for exactly this gap.
Where Kernel fits
Kernel sits above the AGT floor. We use the same deterministic enforcement model, cover the same OWASP Top 10, log decisions to the same immutable structure. What we add is what regulated teams cannot ship without.
REQUIRE_HUMAN as a first-class verdict. When a policy returns REQUIRE_HUMAN, the action is held, a Slack message lands in the reviewer channel with full context, the reviewer approves or rejects, and the agent resumes, with the reviewer's identity and timestamp captured in the audit log. EU AI Act Article 14 satisfied as a side-effect of normal operation, not as a custom integration.
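The pattern is simple to state in code. The sketch below is an illustration of the three-verdict flow, not Kernel's actual API; the class names, threshold, and reviewer plumbing are all hypothetical.

```python
# Hypothetical sketch of a three-verdict flow with REQUIRE_HUMAN.
# Names, threshold, and resolve() are illustrative, NOT Kernel's API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    decision: str                     # ALLOW | DENY | REQUIRE_HUMAN
    reviewer: Optional[str] = None    # captured for the audit log

def decide(amount_gbp: float) -> Verdict:
    """Hold any refund over the review threshold for a human."""
    if amount_gbp > 5000:
        return Verdict("REQUIRE_HUMAN")
    return Verdict("ALLOW")

def resolve(held: Verdict, reviewer: str, approved: bool) -> Verdict:
    """Resolve a held action; the reviewer's identity rides along."""
    if held.decision != "REQUIRE_HUMAN":
        return held
    return Verdict("ALLOW" if approved else "DENY", reviewer=reviewer)

held = decide(7500.0)                                   # held for review
final = resolve(held, "jane@fintech.example", approved=True)
```

The structural point: the hold, the approval, and the reviewer identity live in one code path, so the Article 14 evidence is produced by normal operation rather than bolted on.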
Intent anchoring before execution. Every consequential action your agent takes is bound to a declared intent. If the agent says it will refund a customer and tries to delete an account, the mismatch is caught at the policy layer. The most sophisticated agentic attacks (agents manipulated mid-session to do something other than what they announced) are stopped before execution.
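Mechanically, intent anchoring amounts to binding each declared intent to a set of permitted action classes and rejecting anything outside it. The sketch below is an assumption about the shape of such a check, not Kernel's implementation; the intent names and action identifiers are invented.

```python
# Hypothetical sketch of intent anchoring. The scope table and action
# identifiers are invented for illustration, NOT Kernel's mechanism.
INTENT_SCOPES = {
    "issue_refund": {"payments.refund"},
    "read_customer": {"crm.read"},
}

def check_anchor(declared_intent: str, attempted_action: str) -> str:
    """DENY any action outside the scope of the agent's declared intent."""
    allowed = INTENT_SCOPES.get(declared_intent, set())
    return "ALLOW" if attempted_action in allowed else "DENY"

print(check_anchor("issue_refund", "payments.refund"))  # ALLOW
print(check_anchor("issue_refund", "accounts.delete"))  # DENY: drift
```

Note that this catches the drift case even when no standalone "never delete accounts" rule exists: the mismatch between declaration and execution is itself the violation.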
Vertical [policy packs](/policy-engine). Kernel ships with policy libraries pre-built for fintech (Payment Agent, KYC Decisioning Agent, Credit Decisioning Agent), for HR-tech (Hiring Agent, Compensation Agent), and for the cross-cutting developer-agent surface (D1 prod-deploy, D2 data-access, D3 dependency-merge). The legal review and threshold-setting work that takes most teams a quarter is the work we did before you arrived.
Audit log exportable in EU AI Act Article 12 format. And in the formats DORA, FCA Consumer Duty, and AIUC-1 expect. The same log, multiple lenses, on demand.
A2A chain governance. When agents call agents, the chain context is captured natively. The session, the parent decision, the policy version, the regulatory mapping, all of it traverses the chain.
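What "chain context traverses the chain" means in practice can be sketched as a context object handed from agent to agent. This is an illustration of the pattern under invented names, not Kernel's data model.

```python
# Hypothetical sketch of governance context propagating across an
# agent-to-agent call chain. Names and fields are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ChainContext:
    session_id: str
    policy_version: str
    hops: list = field(default_factory=list)   # agents traversed, in order

    def delegate(self, child_agent: str) -> "ChainContext":
        """Hand the context to a downstream agent, extending the chain."""
        return ChainContext(self.session_id, self.policy_version,
                            self.hops + [child_agent])

root = ChainContext("sess-42", "refunds-v3", ["triage-agent"])
ctx = root.delegate("refund-agent").delegate("payments-agent")
print(ctx.hops)   # the full accountability trail, end to end
```

When the action at the end of the chain is denied or escalated, the whole delegation trail is already attached to the decision, which is exactly the accountability question per-agent enforcement cannot answer.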
We are not trying to replace the AGT. We are trying to be what you reach for when the AGT's design choices don't match your regulatory surface. For most teams shipping internal-tooling agents, the AGT is the right answer and we will tell you so. For teams shipping customer-facing agents in regulated verticals (credit, payments, KYC, hiring, healthcare, insurance), Kernel is the implementation that gets you to production without four months of compliance engineering.
What we recommend if you're evaluating
Three questions, in this order.
Are you regulated? If the agent you are deploying does not touch a regulated decision (credit, payments, KYC, hiring, healthcare, insurance, anything Annex III of the EU AI Act lists as high-risk), the AGT is probably enough. Use it. We mean this.
Do you need REQUIRE_HUMAN, vertical policy packs, or regulator-formatted audit export? If yes for any of the three, you have a build-or-buy decision. Build is two engineering quarters plus compliance review. Buy is a paid pilot with us, four to six weeks, one workflow, a draft policy your team co-owns by the end.
Are you facing the August 2, 2026 EU AI Act Annex III deadline? If yes, build is too slow. A four-to-six-month build started in May lands after the deadline, with no margin left for audit. The path that gets you there is not one that starts with "we'll write our own intent layer."
If any of this resonates, book a 30-minute call. Bring the action you are least comfortable letting your agent take in production. We'll show you exactly how Kernel decides on it, and whether the AGT alone would have been enough.