Proving AI Content Provenance at Scale with Merkle Batching
Hash each AI output, prompt, manifest, or Content Credentials record, batch the hashes into Merkle roots, and publish timestamped Label 309 commitments — proving any single item existed without putting every asset or private prompt on chain.

If your team generates AI content at scale, you can prove what you made and when you made it without putting every asset on chain. Hash each output or provenance manifest, batch those hashes into Merkle roots, and publish timestamped Label 309 commitments on a regular cadence. Later you can prove that a specific image, video, text file, prompt-and-output manifest, or Content Credentials manifest was part of a committed batch, using only the item itself, its stored inclusion proof, the transaction reference, and a public Cardano explorer.
What this gives you is a Proof of Existence: evidence that exact bytes existed by a public time. It does not prove the content is true, lawful, or human-made. It proves a timestamped commitment to specific bytes, anchored outside your own editable systems.
Why does AI content need a separate proof layer?
AI content is easy to create, edit, remix, and regenerate — and that is exactly the problem.
If a company produces thousands of AI-generated assets, how does it later prove which outputs it made, when it made them, which prompt or model context was recorded, and which version was shown to a customer or published online?
An internal database log often is not enough on its own. Logs can be rewritten. Storage gets migrated. Assets can be regenerated byte-for-byte. Metadata gets stripped in transit. A customer, auditor, regulator, partner, or court may ask for evidence that existed outside the company's own editable systems — and at a verifiable time.
Proof of Existence gives those records an external timestamp that does not depend on trusting the company, its servers, or its domain.
What should an AI team hash?
Hash the evidence you might need to produce later.
For AI-generated content, that often includes:
- the generated output file;
- the prompt and the system prompt or policy profile;
- the model name and version;
- seed or generation parameters, where relevant;
- edit history;
- the moderation result;
- the user or request identifier;
- the output manifest;
- the Content Credentials (C2PA) manifest;
- dataset or retrieval-context references;
- the approval or publication event;
- the customer delivery package.
Not all of this belongs in public. Sensitive details can stay in a private manifest that you hash and commit through a Merkle root. Later you reveal only the subset needed for a specific dispute, audit, or customer verification — the rest stays private while still being provably committed.
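As a minimal sketch of the hashing step, the Python below digests a file in fixed-size chunks so that large video outputs do not need to fit in memory. SHA-256 is an assumption here, since this article does not name the hash function; check the Label 309 specification for the algorithm and encoding it actually expects. The function names are hypothetical.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large files do not fill memory


def hash_file(path: Path) -> str:
    """Return the hex SHA-256 digest of the exact bytes stored at `path`."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(CHUNK_SIZE):
            digest.update(chunk)
    return digest.hexdigest()


def hash_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of an in-memory value, such as a manifest."""
    return hashlib.sha256(data).hexdigest()
```

The digest covers exact bytes: re-encoding, resizing, or re-saving an asset produces a different hash, so the original file has to be retained alongside the proof.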
Why batch with a Merkle root instead of one record per output?
A platform may produce thousands or millions of outputs. Publishing a separate on-chain record for each one would be slow and wasteful. A Merkle root lets you commit many hashes in a single record.
The workflow looks like this:
- Generate or receive the outputs.
- Build a canonical manifest for each output.
- Hash the asset and its manifest into a leaf.
- Add the leaf to an ordered list.
- Publish a Merkle root every hour, day, release, or batch.
- Keep the leaf list and the inclusion proofs.
Later, you can prove that one output or manifest was included in a specific batch without publishing the entire batch on chain. Building the tree and verifying an inclusion proof are fully offline operations; only publishing the root touches a gateway. With the open-source tooling, an inclusion proof grows with the logarithm of the batch size, so a proof for one item out of a million leaves stays small (about 20 sibling hashes if the tree is binary). For the detailed mechanics, see Anchor thousands of files under one root.
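To make the batching idea concrete, here is a small, self-contained Merkle tree in Python. It is an illustrative sketch, not the Label 309 tree construction: SHA-256, the leaf and node prefix bytes (in the style of RFC 6962), and the rule that an unpaired node is promoted unchanged are all assumptions, and every function name is hypothetical. Use the published standard and SDKs for real commitments.

```python
import hashlib

LEAF_PREFIX = b"\x00"  # domain separation: a leaf can never pass as an inner node
NODE_PREFIX = b"\x01"


def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(LEAF_PREFIX + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(NODE_PREFIX + left + right).digest()


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Return every tree level, from the leaf hashes up to the single root."""
    if not leaves:
        raise ValueError("a batch needs at least one leaf")
    levels = [[leaf_hash(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        current = levels[-1]
        parents = []
        for i in range(0, len(current), 2):
            if i + 1 < len(current):
                parents.append(node_hash(current[i], current[i + 1]))
            else:
                parents.append(current[i])  # unpaired node is promoted unchanged
        levels.append(parents)
    return levels


def inclusion_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from the leaf up to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):  # a promoted node has no sibling on this level
            proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Rebuild the root from one leaf and its proof, fully offline."""
    current = leaf_hash(leaf)
    for sibling, sibling_is_left in proof:
        current = node_hash(sibling, current) if sibling_is_left else node_hash(current, sibling)
    return current == root


leaves = [b"manifest-a", b"manifest-b", b"manifest-c"]
levels = build_levels(leaves)
root = levels[-1][0]
proof = inclusion_proof(levels, 1)
assert verify(b"manifest-b", proof, root)
assert not verify(b"manifest-x", proof, root)
```

Only `root` would be published. The ordered leaf list and the per-leaf proofs stay in your own storage, which is why the leaf order has to be recorded and never reshuffled after the root is committed.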
How does this work alongside C2PA and Content Credentials?
C2PA and Label 309 solve different problems, and they compose well.
C2PA — the Coalition for Content Provenance and Authenticity, whose user-facing form is Content Credentials — is a structured provenance layer. A C2PA manifest can carry assertions, claims, signatures, and bindings that describe the origin and edit history of a media asset.
Label 309 anchors a hash — of that manifest, or of the asset plus the manifest — to an independent Cardano timestamp. So:
- C2PA describes provenance inside or alongside the media asset.
- Label 309 proves that a particular manifest or asset commitment existed by a public time, with no issuer server to trust or outlive.
C2PA gives the content a provenance vocabulary; Label 309 gives the evidence a public time anchor. For a closer comparison of the two, see Proof of Existence vs C2PA and why C2PA benefits from a time anchor.
Why not rely on embedded metadata alone?
Embedded metadata can be stripped, lost, or transformed in transit. When a social platform re-encodes an upload, the embedded C2PA manifest is often dropped entirely.
That does not make embedded provenance useless. Content Credentials are valuable precisely because they travel with the content and let consumers inspect its origin. But an external, timestamped commitment helps when the metadata is removed, disputed, or separated from the asset.
In practice, a team keeps:
- the original generated asset;
- the C2PA manifest;
- the output manifest;
- the Label 309 transaction reference;
- the Merkle inclusion proof.
If a copy later circulates without its metadata, you can still connect the original asset or manifest back to the public commitment by recomputing the hash.
What about AI transparency rules?
Regulatory pressure on AI provenance is rising. The European Commission's AI Act overview states that providers of generative AI must ensure AI-generated content is identifiable, and that the AI Act's transparency rules come into effect in August 2026.
This is not legal advice, and requirements vary by jurisdiction and use case. But the direction is clear: companies producing AI content need stronger evidence practices.
Proof of Existence is not a compliance program by itself. It is an evidence layer that can support compliance work by making records harder to silently rewrite after the fact. Whether it helps in any specific regulatory context depends on the rule and your jurisdiction, and it does not replace counsel.
What can a Label 309 proof actually prove here?
It can prove that exact data existed by a public time. For AI content, that data might be an output file, a prompt-and-output manifest, a C2PA manifest, a batch root over many generated assets, a moderation report, an approval record, or a publication manifest.
Three optional features extend what a single record can carry:
- Signed records. If the record carries an optional signature, it also shows that a specific key vouched for the record. Authorship is always optional in Label 309 — it is never required to publish.
- Sealed records. Sensitive files can be encrypted and preserved without being made public, with the content-encryption key wrapped to one or more recipient keys.
- Merkle batching. One root can cover very large volumes of output.
What does it not prove?
A timestamped commitment is narrow on purpose. It does not prove that:
- the content is truthful;
- the output came from a specific model, unless the model context is recorded and trusted as part of your workflow;
- the content was lawfully generated or lawfully published, or that the model behind it was lawfully trained;
- a C2PA manifest is trustworthy, unless C2PA validation and the signer's trust model also check out;
- your internal pipeline was honest, unless that pipeline is itself controlled, logged, and auditable.
The proof is a timestamped commitment to specific bytes. The surrounding provenance system is what gives the commitment meaning. For more on this boundary, see what a proof does not prove.
How should teams structure the manifest?
Keep it boring, canonical, and stable. An AI output manifest might include:
- the asset hash and asset type;
- the system's creation timestamp;
- the model identifier and version;
- generation parameters;
- a prompt hash or an encrypted prompt reference;
- the user or workflow identifier;
- the moderation decision;
- the C2PA manifest hash;
- publication status;
- the batch identifier;
- an internal approval reference.
Sensitive values do not need to be public. The manifest can be private, sealed, or selectively disclosed later; the public proof commits to the manifest hash, or to a Merkle root over many manifest hashes. The key is consistency: if every team invents a new manifest shape every week, future verification becomes painful.
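One way to keep manifests stable is to fix the serialization before hashing, so the same logical manifest always yields the same bytes. The sketch below uses Python's standard `json` module with sorted keys, compact separators, and UTF-8 output. The field names and values are hypothetical, SHA-256 is an assumption, and this is not a full canonicalization scheme; RFC 8785 (JSON Canonicalization Scheme) defines a stricter one.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialize with sorted keys, no extra whitespace, and UTF-8 encoding."""
    text = json.dumps(manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def manifest_hash(manifest: dict) -> str:
    """Return the hex SHA-256 digest of the canonical manifest bytes."""
    return hashlib.sha256(canonical_bytes(manifest)).hexdigest()


# Hypothetical field names and values, for illustration only.
manifest = {
    "asset_sha256": "ab" * 32,
    "asset_type": "image/png",
    "created_at": "2026-01-15T09:30:00Z",
    "model": {"id": "example-model", "version": "1.2.0"},
    "moderation": "approved",
    "batch_id": "2026-01-15-daily",
}

# Key order in memory must not change the committed hash.
reordered = dict(reversed(list(manifest.items())))
assert manifest_hash(manifest) == manifest_hash(reordered)
```

Floating-point values are a common source of drift between serializers, so it is safer to store generation parameters as strings or integers. Whatever rules you pick, version them inside the manifest so a future verifier knows how to rebuild the bytes.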
Should prompts be public?
Usually not. Prompts can contain customer data, trade secrets, personal data, safety-testing material, or internal policy details. You can hash prompts or prompt manifests without ever publishing the prompt text.
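A plain hash has one weakness worth planning for: a short or templated prompt can be guessed and then confirmed against the published hash. A common mitigation, shown here as an assumption rather than anything Label 309 prescribes, is to hash the prompt together with a random salt that you keep private. The function names are hypothetical and SHA-256 is assumed.

```python
import hashlib
import hmac
import secrets


def commit_prompt(prompt: str) -> tuple[bytes, str]:
    """Return (salt, commitment). Store the salt privately next to the prompt."""
    salt = secrets.token_bytes(32)  # fixed length, so salt + prompt is unambiguous
    commitment = hashlib.sha256(salt + prompt.encode("utf-8")).hexdigest()
    return salt, commitment


def check_prompt(prompt: str, salt: bytes, commitment: str) -> bool:
    """Recompute the commitment from a revealed prompt and salt."""
    recomputed = hashlib.sha256(salt + prompt.encode("utf-8")).hexdigest()
    return hmac.compare_digest(recomputed, commitment)


salt, commitment = commit_prompt("Draw a red fox in watercolor")
assert check_prompt("Draw a red fox in watercolor", salt, commitment)
assert not check_prompt("Draw a blue fox in watercolor", salt, commitment)
```

The commitment is what goes into the manifest. To disclose a prompt later, you hand the verifier both the prompt text and its salt; losing the salt makes the commitment impossible to open.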
For sensitive workflows, a sealed record can preserve an encrypted prompt-and-output package. A later verifier holding the right key can decrypt the package, recompute the hash, and confirm it matches the public commitment. This gives you evidence without making the evidence public on day one. Note the limitation: once a recipient decrypts a sealed package, they hold the plaintext and can share it — sealing controls who can open the record, not what they do afterward. The pattern is covered in confidential disclosure without public files.
What is a good first implementation?
Start with batch commitments. For each day or release:
- Collect the generated outputs that matter.
- Build a manifest per output.
- Include C2PA manifest hashes where available.
- Hash each manifest into a leaf.
- Build a Merkle root.
- Publish a signed Label 309 record.
- Store the leaf list, inclusion proofs, and transaction reference.
Then layer on sealed preservation for sensitive packages and customer-facing verification for public assets. The goal is not to build the perfect provenance universe on day one — it is to stop losing the timeline. The same batching pattern shows up in CI/CD build proofs and AI dataset manifests.
Who needs this?
This pattern fits any team that produces content at scale and may later need to prove what it generated and when:
- AI media companies and generative design tools;
- AI video and image platforms;
- marketing automation platforms;
- enterprise AI teams;
- synthetic-data companies and model-evaluation teams;
- publishers using AI-assisted workflows;
- companies preparing for AI provenance audits.
The short version
AI provenance at scale needs batching. Hash your outputs and manifests, fold the hashes into Merkle roots, and publish Label 309 records on a cadence. Keep the leaf lists and inclusion proofs. Use C2PA and Content Credentials for media provenance where they fit, and use Label 309 as the public time anchor underneath.
The proof does not establish truth or legality. It establishes the timeline of exact bytes — and that is often the piece you can no longer reconstruct after the fact.
Further reading
- Anchor thousands of files under one root
- Proof of Existence vs C2PA and why C2PA needs a time anchor
- AI dataset manifests and prove training data without revealing it
- What a proof does not prove
- C2PA / Content Credentials: c2pa.org, the C2PA technical specification, and contentcredentials.org
- European Commission, Regulatory framework for AI
- The open standard at label309.org and the open-source SDKs and CLI at github.com/cardanowall