

How to Prove Training Data Without Revealing It

Commit a private dataset snapshot to a hash or Merkle root and publish one timestamped proof. The data stays private; later you can selectively prove a file, row, or version was in the committed set.

Yes, you can prove a private dataset existed without publishing the dataset.

The pattern is short: build a dataset manifest, hash its entries, fold those hashes into a single Merkle root, and publish one Label 309 Proof-of-Existence record on Cardano. The dataset itself never leaves your control. Later, you can reveal exactly one file, row, manifest entry, or inclusion proof to show it was part of the committed snapshot — and nothing more.

That proves prior possession at a point in time. It does not, on its own, prove ownership, copyright status, consent, or lawful use. Those are separate questions that need separate records.

Why would an AI team need this?

Training data has become a board-level question. A model provider may need to show what data it held, when it held it, where it came from, how it was processed, and which datasets fed a given model version — for investors, partners, customers, regulators, auditors, licensors, or litigation.

At the same time, the company often cannot publish the dataset. It may contain licensed content, customer data, personal data, proprietary sources, internal annotations, trade secrets, safety evaluations, retrieval corpora, synthetic data, or sensitive filtering rules.

Proof of Existence resolves that tension. It lets you commit to the dataset's state and timeline without disclosing the dataset publicly. You publish a fingerprint; the bytes stay home.

What should you commit to: raw data or a manifest?

Commit to a manifest, not raw bytes alone.

A dataset manifest describes the snapshot in a structured, machine-readable way. It can record:

  • dataset name and snapshot id;
  • collection window;
  • source categories and rights metadata;
  • per-file and per-row hashes;
  • deduplication and filtering versions;
  • annotation and preprocessing pipeline versions;
  • the model or training run that used it;
  • retention policy and internal ownership.

The manifest does not have to expose any sensitive detail publicly. It can live entirely inside the company. The public proof commits only to its hash, or to a Merkle root over many manifest entries. The goal is narrow and durable: freeze evidence of the dataset's state at a known time.
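As a rough illustration of the idea above, here is a minimal Python sketch of a manifest and its hash. The field names, the JSON encoding, and the choice of SHA-256 are assumptions made for this example, not a prescribed manifest format. The point it shows is that the manifest needs one deterministic byte encoding so the same snapshot always produces the same hash.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    # Sorted keys and fixed separators give one deterministic encoding,
    # so the same manifest content always hashes to the same value.
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hypothetical manifest: every field name and value here is illustrative.
manifest = {
    "dataset": "example-corpus",
    "snapshot_id": "snapshot-0001",
    "collection_window": {"start": "2024-01-01", "end": "2024-03-31"},
    "pipeline_versions": {"dedup": "v3", "filter": "v7"},
    "entries": [
        {"path": "docs/a.txt", "sha256": sha256_hex(b"contents of a")},
        {"path": "docs/b.txt", "sha256": sha256_hex(b"contents of b")},
    ],
}

# Only this fingerprint would need to be committed publicly.
manifest_hash = sha256_hex(canonical_bytes(manifest))
print(manifest_hash)
```

Whatever encoding you choose, write it down and version it. A verifier can only recompute the hash later if they can reproduce the exact bytes.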

Why use a Merkle root instead of one record per file?

Datasets are large, and publishing one record per file or row does not scale. A Merkle root solves this: it commits to an ordered list of many hashes under a single 32-byte value, anchored in one transaction.
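To make the folding step concrete, here is a small sketch of one way to build such a root. It assumes SHA-256, distinct prefixes for leaves and interior nodes (the same domain-separation bytes RFC 6962 uses), and promotion of an unpaired node to the next level. These are choices made for the example. The construction a given Label 309 tool uses may differ, and the verifier must use the same one.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    # The 0x00 prefix marks a leaf, so a leaf can never be mistaken for a node.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # The 0x01 prefix marks an interior node.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    if not leaves:
        raise ValueError("cannot build a tree over zero leaves")
    level = [leaf_hash(item) for item in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(node_hash(level[i], level[i + 1]))
            else:
                # An unpaired last node is promoted unchanged (an assumption).
                nxt.append(level[i])
        level = nxt
    return level[0]


# Illustrative leaves; in practice these would be manifest entries or hashes.
root = merkle_root([b"entry-0", b"entry-1", b"entry-2"])
print(root.hex())
```

Because the list is ordered, changing the order of the leaves changes the root, so the leaf order needs to be stored alongside the leaf list.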

Later, to prove a single item was included, you reveal only:

  • the item or its hash;
  • the relevant manifest entry;
  • a Merkle inclusion proof;
  • the Label 309 transaction reference.

A verifier recomputes the path from that leaf up to the root and confirms the root was published at a specific Cardano block time. The proof grows with the logarithm of the batch size, so it stays small even for millions of leaves. Crucially, building the tree and checking an inclusion proof is pure offline computation: once the verifier holds the proof material, no server, no account, and no further cooperation from you is needed. The only external lookup is the on-chain record itself.

This is what makes selective disclosure possible. You never have to reveal the whole dataset to prove one item belonged to the committed snapshot.
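Here is a hedged sketch of what generating and checking an inclusion proof can look like. It assumes the same illustrative tree as a typical SHA-256 Merkle construction: prefixed leaf and node hashes, with an unpaired node promoted to the next level. The function names and the proof layout (a list of side and sibling-hash pairs) are invented for this example.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    # Returns every level of the tree, from leaf hashes up to the root.
    level = [leaf_hash(item) for item in leaves]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(node_hash(level[i], level[i + 1]))
            else:
                nxt.append(level[i])  # unpaired node promoted unchanged
        level = nxt
        levels.append(level)
    return levels


def inclusion_proof(leaves: list[bytes], index: int) -> list[tuple[str, bytes]]:
    if not 0 <= index < len(leaves):
        raise IndexError("leaf index out of range")
    proof = []
    for level in build_levels(leaves)[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            side = "L" if sibling < index else "R"
            proof.append((side, level[sibling]))
        # A promoted node has no sibling at this level and adds no step.
        index //= 2
    return proof


def verify_inclusion(item: bytes, proof: list[tuple[str, bytes]], root: bytes) -> bool:
    # Needs only the item, the proof, and the published root.
    h = leaf_hash(item)
    for side, sibling in proof:
        h = node_hash(sibling, h) if side == "L" else node_hash(h, sibling)
    return h == root


entries = [b"entry-0", b"entry-1", b"entry-2", b"entry-3", b"entry-4"]
root = build_levels(entries)[-1][0]
proof = inclusion_proof(entries, 2)
print(verify_inclusion(b"entry-2", proof, root))
```

In a tree like this, a proof holds at most one sibling hash per level. For a snapshot of a million leaves that is at most 20 hashes of 32 bytes each, or 640 bytes.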

What does the public actually see?

Only the proof record on chain. Depending on how you publish, that can include a manifest hash, a Merkle root, a leaf count, the transaction time, an optional signature from your company or system, and optional content-addressed URIs (ar://, ipfs://) for public or encrypted supporting material.

The public does not see the dataset files, the full leaf list, source metadata, customer data, licensing details, annotations, or internal notes. Those stay inside your evidence system until a specific question forces disclosure.
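To illustrate the split between public and private material, here is a sketch of how a public record might be assembled from an internal archive. The field names are invented for this example and are not the Label 309 schema. The only structural assumption is that Cardano transaction metadata is a map keyed by integer labels, with 309 being the label this article refers to.

```python
import hashlib

# Internal evidence: stays in the company's archive and is never published.
internal_archive = {
    "manifest_bytes": b'{"dataset":"example-corpus"}',
    "leaf_list": [b"entry-0", b"entry-1", b"entry-2"],
}


def build_public_record(manifest_bytes: bytes, merkle_root_hex: str, leaf_count: int) -> dict:
    # HYPOTHETICAL field names; a real Label 309 record may be shaped differently.
    return {
        "manifest_sha256": hashlib.sha256(manifest_bytes).hexdigest(),
        "merkle_root": merkle_root_hex,
        "leaf_count": leaf_count,
    }


# Placeholder root for the example; a real one comes from the Merkle tree.
placeholder_root = hashlib.sha256(b"placeholder root").hexdigest()

record = build_public_record(
    internal_archive["manifest_bytes"],
    placeholder_root,
    len(internal_archive["leaf_list"]),
)

# Only fingerprints and a count appear under the metadata label.
metadata = {309: record}
print(metadata)
```

Nothing in the published structure can be reversed into the manifest or the leaf list. It contains only hashes and a count.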

What would you reveal later, and when?

Reveal only what the question requires.

  • Was one file in the dataset? Reveal the file or its hash, the manifest entry, and an inclusion proof.
  • Was a source category included? Reveal the relevant manifest section and the proof that it belongs to the committed snapshot.
  • Did a model version use a particular snapshot? Reveal the training-run manifest that links the model version to the dataset root.
  • Is this a full audit? Reveal the whole manifest and leaf list under the appropriate confidentiality process.

The on-chain root proves the timeline. Your internal archive determines how much detail you can show, and to whom. When the supporting material itself must move to a third party but stay private, you can share it confidentially, for example as a sealed package encrypted to that recipient, rather than make it public.

How does this relate to AI regulation?

AI regulation is moving toward stronger documentation and transparency duties. The EU AI Act, for example, sets out transparency and copyright-related rules for general-purpose AI models, and the European Commission has published a template for the public summary of training content — described, in the Commission's own words, as a minimal baseline for the information to be made publicly available.

A private dataset proof is not the same thing as that public summary. It does not replace regulatory reporting, legal review, consent management, or licensing records, and whether any of this helps in a given case depends on your jurisdiction and your counsel.

What it can support is the evidence layer behind those processes. If a company later needs to show what it had, what it knew, or which snapshot a published summary was based on, a timestamped manifest commitment is concrete, third-party-anchored evidence of timing and integrity.

What does a dataset proof actually prove?

It proves that a specific dataset commitment existed by a public block time. Depending on the evidence you preserve, that can help show:

  • a file was in a dataset snapshot;
  • a manifest existed before a dispute;
  • a dataset version existed before a model release;
  • a training run referenced a particular snapshot;
  • a source category was documented at the time;
  • a preprocessing or filtering pipeline was recorded.

If the record is signed (Label 309 supports optional record-level signatures), it can also show that a company key or system key vouched for the commitment. Signing is never required: an unsigned commitment is just as valid as a timestamp, and the signature only adds attributable authorship.

What does it not prove?

This is the part to be honest about, because the gaps matter.

A dataset proof does not prove the data was lawful to use. It does not prove you owned the data, that it was collected with consent, or what its copyright status is. It does not prove the data was actually used for training — unless your training pipeline and model records are themselves tied to the dataset snapshot. And it does not prove the manifest is complete; only your process and controls can make completeness credible.

Proof of Existence is timeline-and-integrity evidence. It establishes that exact bytes existed by a public time. It says nothing about truth, ownership, rights, or compliance — those require additional records and legal analysis. If you want the full picture of where the line sits, see what a proof does and doesn't prove.

How should you design the workflow?

Design for the question you expect to answer later, not just for hashing today.

A workable shape:

  1. Define a canonical dataset manifest format.
  2. Hash every dataset item or manifest entry.
  3. Build a Merkle root for the snapshot.
  4. Publish a Label 309 record, signed if you want attributable authorship.
  5. Store the manifest, leaf list, and inclusion-proof material.
  6. Link model training runs back to dataset roots.
  7. Seal sensitive evidence packages for legal or compliance recipients.
  8. Record superseding snapshots when the dataset changes.

The hard part is rarely the cryptography. The hard part is deciding which evidence will be meaningful when someone asks for it months or years from now.
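Steps 6 and 8 in the list above are mostly record-keeping, and a small sketch may help show what that means in practice. The record shapes, field names, and placeholder roots below are hypothetical. The idea is that a training-run record names the dataset root it consumed, and a new snapshot record names the root it supersedes, so both links can themselves be hashed and committed.

```python
import hashlib
import json


def commit(obj) -> str:
    # Deterministic JSON encoding, then SHA-256 (both are assumptions).
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


# Placeholder roots; real ones would come from each snapshot's Merkle tree.
dataset_root_v1 = hashlib.sha256(b"placeholder root v1").hexdigest()
dataset_root_v2 = hashlib.sha256(b"placeholder root v2").hexdigest()

# Step 8: the newer snapshot records which root it replaces.
snapshot_v2 = {
    "snapshot_id": "snapshot-0002",
    "merkle_root": dataset_root_v2,
    "supersedes": dataset_root_v1,
}

# Step 6: the training run records which dataset root it used.
training_run = {
    "model_version": "model-1.0",
    "run_id": "run-0001",
    "dataset_root": dataset_root_v2,
}


def run_references_root(run: dict, root_hex: str) -> bool:
    return run.get("dataset_root") == root_hex


# Each record can be committed in the same way as a dataset manifest.
print(commit(snapshot_v2))
print(commit(training_run))
print(run_references_root(training_run, dataset_root_v2))
```

A record like this only shows that the run record named the snapshot. Whether the pipeline actually consumed that data still depends on your internal controls.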

How often should you commit a snapshot?

Commit whenever the dataset meaningfully changes — typically after a new ingestion, before a training run, after deduplication or filtering, after labeling, before a model release, at a governance checkpoint, or before sharing the dataset with a partner.

The cadence should match the questions you expect to answer. Commit once a year and you may not be able to prove which intermediate snapshot existed. Commit on every trivial change and you generate operational noise. Because Merkle batching lets one root stand in for an entire snapshot — one transaction, no matter how many files it covers — the cost stays roughly flat per commit, so you can choose a cadence that fits the evidence you need rather than one dictated by price.

How does sealed storage fit in?

Sometimes hashing is not enough — you want to preserve the evidence itself, not just a fingerprint of it.

A sealed PoE lets you do that. The public record still commits to the plaintext hash, exactly as a normal proof would. The sensitive payload is encrypted and stored at a content-addressed URI, with the content-encryption key wrapped to one or more recipient keys. Authorized recipients can decrypt it later and confirm that the recovered plaintext matches the on-chain commitment by recomputing the hash.
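The recipient's final check can be sketched in a few lines. This example covers only the comparison step and assumes the commitment is a SHA-256 hash published as a hex string. Decryption and key unwrapping are out of scope here, and the function name is invented for the example.

```python
import hashlib
import hmac


def matches_commitment(plaintext: bytes, committed_sha256_hex: str) -> bool:
    # Recompute the hash of the recovered plaintext and compare it with
    # the value read from the on-chain record.
    recomputed = hashlib.sha256(plaintext).hexdigest()
    return hmac.compare_digest(recomputed, committed_sha256_hex.lower())


# Stand-ins for the example: the bytes a recipient recovered after decrypting,
# and the hash that was committed when the package was sealed.
recovered_plaintext = b'{"dataset":"example-corpus"}'
committed_hash = hashlib.sha256(b'{"dataset":"example-corpus"}').hexdigest()

print(matches_commitment(recovered_plaintext, committed_hash))
```

If even one byte of the recovered file differs from what was sealed, the comparison fails, which is what ties the encrypted package to the public commitment.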

The chain never carries the plaintext and never reveals who the recipients are; it shows only that a sealed commitment was made at time T. This matters when losing the original manifest would weaken your proof. A hash-only record is useful only as long as you can still produce the exact bytes that were hashed. A sealed record can preserve the encrypted file itself, so the evidence and the commitment travel together.

One limitation worth stating plainly: sealing keeps the content private from everyone except the chosen key holders, but it does not make anyone anonymous, and a recipient can always leak the plaintext after decrypting it. Sealing controls who can read, not what they do next.

Who should own the process?

A dataset-proof process should not be an unowned engineering script. It touches legal, security, data governance, compliance, and model development, and a good process makes the boundaries explicit: who can create snapshots, who can sign commitments, where manifests are stored, who can decrypt sealed packages, how inclusion proofs are generated, how model runs link to roots, how superseded snapshots are handled, and how evidence is produced during an audit or dispute.

The proof is cryptographic. The governance is organizational. You need both.

The short version

To prove training data without revealing it, commit to the snapshot, not the dataset. Build a manifest, hash its entries, publish a Merkle root in a Label 309 record, and keep the leaf list and inclusion proofs. Seal sensitive supporting files when losing them would weaken the proof. Then reveal only the evidence each question actually requires.

That gives you durable, third-party-anchored proof of prior possession and timing. It does not, by itself, prove ownership, lawful use, or compliance, and it is most useful when you are explicit about which claims it supports and which it does not.
