How to Anchor an AI Dataset Manifest with Label 309

Hash a dataset manifest, batch it with a Merkle root, and anchor it on Cardano with Label 309 — so you can later prove what a dataset snapshot contained without publishing the dataset.

To prove later what a dataset snapshot contained, anchor its manifest: hash the manifest file, publish that hash on Cardano in a Label 309 record, and keep the dataset itself private. From then on, anyone with the transaction reference can confirm the manifest existed in that exact form on or before a public block time — without trusting your servers, and without seeing your data.

A dataset manifest is the stable inventory of a dataset snapshot: which files, records, URLs, licenses, hashes, sources, and processing steps were included at one point in time. Label 309 lets you hash or Merkle-anchor that manifest so the commitment is fixed in public, while the data stays where it is.

That matters for AI companies, research teams, legal and compliance teams, and anyone who may later have to explain where a model's training or evaluation data came from — long after the data lake has moved on.

What is an AI dataset manifest?

An AI dataset manifest is a structured inventory.

It does not have to contain the full training data. It can contain stable references and hashes for the data. The goal is to make a dataset snapshot auditable and reproducible enough that a future reviewer can understand what was included.

A manifest may describe:

  • files;
  • rows;
  • documents;
  • images;
  • audio clips;
  • videos;
  • web pages;
  • licenses;
  • source systems;
  • collection dates;
  • transformations;
  • filtering rules;
  • deduplication steps;
  • hash algorithms;
  • model-training split assignments;
  • internal dataset version ids.

Without a manifest, a dataset is often just a folder, bucket, table, or archive. That may work during experimentation, but it is weak evidence later, because nothing records what the folder held on a given date.

Why should AI teams timestamp manifests?

Because dataset history becomes hard to reconstruct.

AI teams continuously add, remove, clean, filter, deduplicate, label, redact, and re-split data. A dataset snapshot that trained a model in March may not exist in the same form in July.

The team may later need to answer:

  • what data did this model train on?
  • which evaluation set was used?
  • did this customer data exist in the dataset?
  • when did we remove restricted content?
  • which sources were included before a policy change?
  • did we possess this data before a dispute?
  • did this model use data covered by a particular license?

A timestamped manifest gives the answer a fixed point.

How does Label 309 fit?

Label 309 ties the manifest to a public point in time. The simple version:

  1. create a deterministic manifest;
  2. hash the manifest file;
  3. publish that hash in a Label 309 record on Cardano;
  4. keep the manifest and source data private;
  5. verify later by recomputing the manifest hash and matching it to the record.
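Steps 2 and 5 can be sketched in a few lines of Python. This is a minimal sketch, not the reference implementation: it assumes SHA-256 as the digest (check the Label 309 specification for the algorithm it actually requires), and the function names `manifest_digest` and `matches_anchor` are hypothetical. Building and submitting the Cardano transaction is out of scope here.

```python
import hashlib
from pathlib import Path


def manifest_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file at `path`.

    SHA-256 is an assumption for this sketch; use whatever algorithm
    the record format you anchor with actually specifies.
    """
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large manifests do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_anchor(path: Path, anchored_hex: str) -> bool:
    """Recompute the digest and compare it with the hash read from the record."""
    return manifest_digest(path) == anchored_hex.strip().lower()
```

Run `manifest_digest` once at anchoring time and publish the result. Later, `matches_anchor` returns True only if the preserved manifest is byte-identical to the one that was hashed.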

For large datasets, anchor a Merkle root instead of one flat hash. A Label 309 record can carry a Merkle commitment — an ordered list of 32-byte leaves bound to a single root plus a leaf count — so one root on the chain stands in for an arbitrarily large off-chain leaf list:

  1. hash each manifest entry into a leaf;
  2. order the leaves deterministically;
  3. build the Merkle tree;
  4. publish the root in the record;
  5. preserve the leaf list and the inclusion proofs.

The public record proves that a dataset commitment existed by a given block time. The private manifest explains what was committed. This is the same batching pattern that lets one record stand in for thousands of files.
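The five Merkle steps can be sketched as follows. Treat this as an illustration under stated assumptions rather than the Label 309 tree construction: it assumes SHA-256, canonical JSON for each entry, a one-byte prefix to separate leaf hashes from interior hashes, and an odd node carried up unchanged. The real specification and the cardanowall tooling may differ on any of these points, and the entry fields are made up for the example.

```python
import hashlib
import json


def leaf_hash(entry: dict) -> bytes:
    """Hash one manifest entry into a 32-byte leaf (assumed: canonical JSON, SHA-256)."""
    encoded = json.dumps(
        entry, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    # 0x00 marks a leaf so a leaf can never be confused with an interior node.
    return hashlib.sha256(b"\x00" + encoded).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold an ordered leaf list into one root; an odd node is carried up as-is."""
    if not leaves:
        raise ValueError("cannot build a Merkle tree from zero leaves")
    level = list(leaves)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            # 0x01 marks an interior node built from a left and a right child.
            nxt.append(hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest())
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]


# Hypothetical manifest entries, deliberately out of order.
entries = [
    {"id": "doc-0002", "sha256": "bb" * 32, "split": "train"},
    {"id": "doc-0001", "sha256": "aa" * 32, "split": "test"},
    {"id": "doc-0003", "sha256": "cc" * 32, "split": "train"},
]
ordered = sorted(entries, key=lambda e: e["id"])  # step 2: deterministic order
leaves = [leaf_hash(e) for e in ordered]          # step 1: one leaf per entry
root = merkle_root(leaves)                        # steps 3 and 4: the value to publish
print(len(leaves), root.hex())
```

Whatever construction you use, write it down next to the leaf list. A root is only verifiable by someone who can rebuild the tree the same way.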

What should go into the manifest?

The manifest should be boring, deterministic, and useful.

Good fields include:

  • dataset id;
  • snapshot id;
  • creation time;
  • creator or pipeline id;
  • source system;
  • source URI or neutral source reference;
  • file or record id;
  • byte length;
  • content hash;
  • hash algorithm;
  • media type;
  • license or rights status;
  • consent or policy status, if applicable;
  • collection date;
  • transformation pipeline version;
  • deduplication group;
  • train/validation/test split;
  • exclusion reason for removed items;
  • Merkle leaf index.

Do not put sensitive personal data into a public manifest. If the manifest is sensitive, keep it private or seal it.
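One way to keep those fields consistent is to give each manifest row a fixed, typed shape. The sketch below covers a subset of the fields listed above. The class name, field names, allowed split values, and sample values are all hypothetical choices for illustration, not part of Label 309.

```python
from dataclasses import asdict, dataclass
from typing import Optional

ALLOWED_SPLITS = {"train", "validation", "test"}  # assumed vocabulary


@dataclass(frozen=True)
class ManifestEntry:
    dataset_id: str
    snapshot_id: str
    record_id: str
    source_ref: str        # neutral reference, not a local machine path
    byte_length: int
    content_hash: str      # hex digest of the file or record bytes
    hash_algorithm: str    # recorded explicitly, e.g. "sha256"
    media_type: str
    license_status: str
    collection_date: str   # ISO 8601 date string
    pipeline_version: str
    split: str
    exclusion_reason: Optional[str] = None
    leaf_index: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject rows that would make the manifest ambiguous later.
        if self.byte_length < 0:
            raise ValueError("byte_length must be non-negative")
        if self.split not in ALLOWED_SPLITS:
            raise ValueError(f"unknown split: {self.split!r}")


# Hypothetical sample row.
entry = ManifestEntry(
    dataset_id="support-tickets",
    snapshot_id="2025-03-01",
    record_id="doc-0001",
    source_ref="crm-export/tickets",
    byte_length=18432,
    content_hash="aa" * 32,
    hash_algorithm="sha256",
    media_type="application/json",
    license_status="internal",
    collection_date="2025-02-27",
    pipeline_version="clean-v4",
    split="train",
)
row = asdict(entry)  # plain dict, ready to serialize into the manifest
```

A frozen dataclass rejects accidental edits after construction, which suits a record that is meant to be hashed.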

What makes a manifest deterministic?

Determinism means the same input produces the same manifest, byte for byte.

That requires clear rules:

  • normalize paths;
  • choose a stable character encoding;
  • define sort order;
  • define timestamp formats;
  • avoid local machine paths when possible;
  • record exact hash algorithms;
  • freeze transformation versions;
  • include schema version;
  • avoid fields that change every time the export runs.

If the export tool adds a new random id or timestamp on every run, the manifest's hash changes even when the data did not, and a later re-export will no longer match the anchored record.

The manifest should be designed for evidence, not only convenience.
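Several of these rules can be enforced in the export step itself. The sketch below is one possible approach, assuming a JSON manifest whose entries each carry a `path` field. The function names, the `schema_version` value, and the choice of Unicode NFC are illustrative assumptions.

```python
import json
import posixpath
import unicodedata

SCHEMA_VERSION = "1"  # hypothetical schema version


def normalize_path(path: str) -> str:
    """Use forward slashes, collapse './' segments, and apply Unicode NFC."""
    cleaned = posixpath.normpath(path.replace("\\", "/"))
    return unicodedata.normalize("NFC", cleaned)


def canonical_manifest_bytes(snapshot_id: str, entries: list[dict]) -> bytes:
    """Serialize a manifest so the same entries always yield the same bytes."""
    rows = [dict(e, path=normalize_path(e["path"])) for e in entries]
    paths = [r["path"] for r in rows]
    if len(set(paths)) != len(paths):
        # Duplicate paths would make the sort order depend on input order.
        raise ValueError("duplicate path after normalization")
    rows.sort(key=lambda r: r["path"])  # fixed sort order
    doc = {
        "schema_version": SCHEMA_VERSION,
        "snapshot_id": snapshot_id,
        "entries": rows,
    }
    # Sorted keys, no optional whitespace, UTF-8: no run-dependent bytes.
    return json.dumps(
        doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
```

Two exports that list the same files in a different order, or spell a path as `data\a.txt` in one and `data/./a.txt` in the other, produce identical bytes and therefore the same hash. Note what the function leaves out: no export time, no hostname, no run id.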

How can a private dataset stay private?

Publish the commitment, not the dataset.

A Label 309 record contains a hash or a Merkle root. Neither reveals the data on its own — a hash is a one-way digest, and a root commits to a leaf structure without exposing the leaves. The company keeps the manifest, files, and access controls internally.

Later, you can disclose selectively against that fixed commitment:

  • one file and its Merkle inclusion proof;
  • one manifest row;
  • one subset or source category;
  • one training snapshot;
  • the whole manifest under NDA;
  • a sealed package addressed to counsel, an auditor, or a regulator.

This lets a team prove prior commitment without turning a private dataset into a public one — the same approach as confidential disclosure without public files. A sealed record encrypts the payload to specific recipient keys, but be clear about its limits: it keeps plaintext readable only to key holders, it does not guarantee anonymity, and a recipient can always leak what they decrypt.
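Disclosing one item with its Merkle inclusion proof can be sketched like this. As before, the tree rules here are assumptions for illustration (SHA-256, a 0x01 prefix on interior nodes, an odd node carried up unchanged), not the construction defined by Label 309 or used by the cardanowall `merkle-verify` command. The function names and sample leaves are hypothetical.

```python
import hashlib


def node_hash(left: bytes, right: bytes) -> bytes:
    """Hash two child nodes into their parent (assumed 0x01 interior prefix)."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Return every tree level, from the leaves up to the single root."""
    if not leaves:
        raise ValueError("cannot build a Merkle tree from zero leaves")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        nxt = [node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur) - 1, 2)]
        if len(cur) % 2 == 1:
            nxt.append(cur[-1])  # odd node moves up without hashing
        levels.append(nxt)
    return levels


def inclusion_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling, sibling_is_left) pairs from the leaf up to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):  # a carried-up node has no sibling at this level
            proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_inclusion(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Rebuild the path from one leaf and check it ends at the anchored root."""
    acc = leaf
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root


# Hypothetical leaves standing in for five hashed manifest rows.
leaves = [hashlib.sha256(b"\x00" + f"row-{i}".encode()).digest() for i in range(5)]
levels = build_levels(leaves)
root = levels[-1][0]
proof = inclusion_proof(levels, 3)
print(verify_inclusion(leaves[3], proof, root))
```

The reviewer receives one row, its proof, and the transaction reference. The other four rows stay private, because the proof contains only sibling hashes.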

How does this help with AI governance?

Governance needs records that survive audits.

AI governance teams increasingly need to show how datasets were sourced, filtered, documented, approved, and changed. A manifest is not the whole governance program, but it gives the program something concrete to verify.

For example:

  • model cards can reference dataset snapshot ids;
  • internal approval tickets can reference manifest hashes;
  • data-retention workflows can prove when restricted data was removed;
  • red-team evaluations can anchor evaluation sets;
  • compliance reviews can compare claimed datasets to committed manifests;
  • customer contracts can reference auditable dataset snapshots.

The proof layer makes the dataset record harder to rewrite silently.

How does this relate to AI disclosure rules?

Rules are moving toward better documentation. The EU AI Act includes obligations around general-purpose AI, and in 2025 the European Commission published an explanatory notice and template for a public summary of training content for such models. Other jurisdictions and platforms keep evolving their own transparency and provenance expectations.

Label 309 does not decide what you must disclose, and anchoring a manifest does not satisfy any specific regulation on its own — that depends on your jurisdiction and your obligations. What it can do is preserve tamper-evident evidence behind whatever you later need to disclose, summarize, defend, or audit.

The distinction matters: a proof can support a transparency claim, but it is not a legal determination, and it does not replace counsel.

How often should manifests be anchored?

Anchor at the rhythm of decision-making.

Common patterns include:

  • every training run;
  • every evaluation run;
  • every dataset release;
  • every policy-filter update;
  • every customer-specific dataset build;
  • every daily or weekly ingestion batch;
  • every major deduplication pass;
  • every red-team dataset snapshot.

High-volume teams should use Merkle batching. Important single releases may also deserve signed records and sealed archives.

What does this not prove?

A timestamp proves timing and integrity — not truth, ownership, or rights. Be honest about the boundary:

  • It does not prove the data was lawfully collected.
  • It does not prove copyright ownership or licensing.
  • It does not prove consent.
  • It does not prove a model actually trained on the dataset — unless your pipeline and logs connect the model run to that manifest.
  • It does not prove the manifest is complete if your team omitted entries.

What it does prove is narrow and durable: the committed manifest or Merkle root existed in exactly that form by a public block time, and nobody can backdate or silently edit it afterwards. That is powerful, but only when it is wired into your process. For the full picture of the boundary, see what a proof does not prove.

The short version

AI datasets need stable inventories.

A dataset manifest turns a moving data lake into a snapshot you can verify later. Label 309 anchors that snapshot with a hash or a Merkle root, optionally signs it with an Ed25519 record signature, and can seal a private package to named recipients. The public chain never needs the dataset — only the commitment.

Keep the manifest. Preserve the leaf list. Document the pipeline. Then, when the dataset is challenged, you reach for evidence instead of memory.

Label 309 is an open, vendor-neutral standard, currently submitted to the Cardano CIP process and under review by the CIP editors as a Metadata-category proposal. The reference implementation — gateway, SDKs, and the cardanowall CLI, whose merkle-build and merkle-verify commands handle the leaf lists and inclusion proofs above — is open source at github.com/cardanowall.
