One Record for Thousands of Files: Merkle Batching

A Merkle root lets one Label 309 record commit to thousands or millions of file hashes, then prove any single item later with a small inclusion proof — without putting every item on chain.

One Label 309 record can prove that thousands or millions of files existed at a point in time.

Instead of publishing one blockchain transaction per file, you hash every file, fold those hashes into a Merkle tree, and publish a single 32-byte Merkle root. Later, you can prove that any one file was part of that batch with a small inclusion proof — without revealing the rest, and without putting every item on chain.

That is how Proof of Existence scales from one document to an arbitrarily large set.

What problem does Merkle batching solve?

Blockchains are poor places to store long lists. Every byte costs a fee, and a Cardano transaction has a hard size ceiling.

If a system generates a steady stream of artifacts — logs, model outputs, release builds, invoices, reports, evidence entries — publishing each hash as its own on-chain record gets expensive and noisy fast.

Merkle batching compresses an ordered list of hashes into one root commitment. The root is fixed-size (32 bytes), the batch can be unbounded, and the proof for any single item stays small — it grows only with the logarithm of the batch size. That makes regular commitments practical for high-volume workflows.
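To put a number on "small": in a binary Merkle tree a proof carries at most one sibling hash per level, so its size is bounded by the tree depth. The following is a rough sketch of that arithmetic, assuming 32-byte SHA-256 hashes; the helper name max_proof_bytes is illustrative and not part of any Label 309 tooling.

```python
HASH_BYTES = 32  # size of one SHA-256 hash


def max_proof_bytes(leaf_count: int) -> int:
    """Upper bound on the sibling hashes in one inclusion proof, in bytes."""
    if leaf_count < 1:
        raise ValueError("a batch needs at least one leaf")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1: the tree depth.
    depth = (leaf_count - 1).bit_length()
    return depth * HASH_BYTES


for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} leaves: at most {max_proof_bytes(n)} bytes of sibling hashes")
```

Under these assumptions, growing a batch a thousand-fold adds only about ten hashes to each proof.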

What is a Merkle root?

A Merkle root is a single hash that commits to a whole ordered list.

Start with a list of leaves. In a Proof of Existence workflow, each leaf is typically the hash of a file, an event, or a manifest entry. The leaves are combined pairwise up a binary tree, and the hash at the top is the Merkle root.

The commitment is exact in three ways:

  • If any leaf changes, the root changes.
  • If the order of the leaves changes, the root changes.
  • The commitment also records how many leaves there are, so a list of a different size cannot pass as the same batch.

Publishing the root is like publishing a fingerprint for the entire ordered list.

What actually goes on chain?

Only the root. The full list of leaves stays off chain.

A Label 309 record carries the commitment in a dedicated merkle field, separate from ordinary per-file hashes. Each commitment is a small, fixed structure:

  • the commitment algorithm (Label 309 v1 registers rfc9162-sha256: the RFC 9162 Merkle Tree Hash with SHA-256);
  • the 32-byte root;
  • the leaf count, which binds the root to the size of the off-chain list;
  • optional content-addressed URIs (ar:// or ipfs://) pointing at the leaves-list file.

The on-chain record stays tiny: the root is always 32 bytes, however long the list it commits to, and the fields around it are small. The detail lives off chain, in the ordered leaves-list. (For where the rest of a record lives, see what goes on the blockchain.)
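The shape of one commitment can be pictured as a small record type. This is only a sketch: the class name, the field names, and the validation rules are hypothetical and do not reflect the Label 309 wire format. Only the four parts themselves (algorithm, root, leaf count, optional URIs) come from the description above.

```python
from dataclasses import dataclass

ALGORITHM = "rfc9162-sha256"  # the algorithm Label 309 v1 registers


@dataclass(frozen=True)
class MerkleCommitment:
    # Hypothetical field names, chosen for illustration only.
    root: bytes
    leaf_count: int
    algorithm: str = ALGORITHM
    leaves_uris: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if self.algorithm != ALGORITHM:
            raise ValueError(f"unsupported algorithm: {self.algorithm}")
        if len(self.root) != 32:
            raise ValueError("root must be exactly 32 bytes")
        if self.leaf_count < 1:  # assumed lower bound, not taken from the spec
            raise ValueError("leaf_count must be positive")
        for uri in self.leaves_uris:
            if not uri.startswith(("ar://", "ipfs://")):
                raise ValueError(f"not a content-addressed URI: {uri}")


# A placeholder root of 32 zero bytes stands in for a real one.
commitment = MerkleCommitment(root=bytes(32), leaf_count=1_000_000)
print(commitment.algorithm, commitment.leaf_count, len(commitment.root))
```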

How do you prove one file later?

You produce an inclusion proof.

An inclusion proof is the short list of sibling hashes along the path from one leaf up to the root. A verifier folds the leaf and those siblings back up the tree and accepts the proof only if the recomputed root exactly equals the published root.

In practice the check has four inputs:

  1. the hash of the file or item you are proving;
  2. the inclusion proof (the sibling path, plus the item's position in the list);
  3. the Merkle root and leaf count in the Label 309 record;
  4. the Cardano block time of the transaction that carried it.

If the fold reproduces the published root, the item was in the committed list — and the block time witnesses when that list existed. The verifier needs the one item and its proof; it never needs the other files in the batch.

Two details are worth keeping straight. The construction is order-sensitive, so the leaves must be kept in the same sequence they were published in. And a single-leaf "tree" is not a useful timestamp: under RFC 9162 the root of a one-leaf tree is the SHA-256 of a 0x00 prefix byte followed by the leaf, not the leaf itself, so to prove a single file you publish a plain content hash, not a one-item Merkle commitment.
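The one-leaf case is easy to check by hand. A small sketch, assuming the leaf is the SHA-256 of some made-up file bytes and that the RFC 9162 leaf prefix applies:

```python
import hashlib

# Plain content hash of an illustrative file.
file_hash = hashlib.sha256(b"example file bytes").digest()

# Root of a one-leaf tree: the leaf is hashed again behind a 0x00 prefix.
one_leaf_root = hashlib.sha256(b"\x00" + file_hash).digest()

print(file_hash.hex())
print(one_leaf_root.hex())
```

The two values differ, so someone searching for the file's own hash would not find it in a one-item commitment.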

Why is building and verifying fully offline a strength?

Because the only step that touches a server is publishing the root.

Building the tree, deriving inclusion proofs, and verifying them are pure computation. Anyone who holds the ordered leaves-list can recompute the root and re-derive any proof — no account, no gateway, no network, and no cooperation from whoever originally published. The publisher is never in the loop at verification time.

That matters for evidence that has to outlive tools and vendors. The fold itself runs offline, and the only external lookup is reading the published root and its block time from any public Cardano explorer, so a proof keeps working long after any particular service is gone. You can verify a batch commitment the same way you'd verify any Label 309 record, and you can wire the inclusion check straight into CI with the open-source cardanowall command-line tool (merkle-build to fold a folder into a root, merkle-verify to confirm an item belongs to it).

Why is this useful for high-volume workflows?

Because many real proofs are batch-shaped, not single-file-shaped. A team may need to show:

  • what a CI/CD pipeline built and which artifacts shipped in a release;
  • which software bill of materials (SBOM) existed before a vulnerability was disclosed;
  • which AI outputs were produced on a given day;
  • which dataset snapshot existed before a training run;
  • which compliance logs existed before an audit;
  • which legal evidence files were preserved before a hold;
  • which reserves a balance sheet committed to at a point in time.

None of these fit a one-file-one-transaction model well. Merkle batching lets you publish a single commitment per batch — without exposing every private item, and without on-chain metadata that grows linearly with batch size.

Can the leaf list stay private?

Yes. The root is a one-way hash, so on its own it discloses none of the leaves; the record reveals only how many there are.

You can keep the leaves-list inside your own evidence system, release archive, data room, or compliance store, and later reveal only what you need:

  • one file and its inclusion proof;
  • one dataset row, document, or release artifact;
  • one audit-log entry;
  • a subset of the list;
  • or the whole leaves-list.

This is the pattern when you want a public, timestamped commitment without making the underlying data public — closely related to confidential disclosure without public files and proof of reserves with Merkle roots.

The tradeoff is responsibility. A root proves that some list of a known size existed at a known time; it does not, by itself, let anyone prove which specific items were in it. If you keep the leaves-list private, you must preserve it. Lose both the leaves-list and any saved inclusion proofs, and you keep the timestamp but lose the ability to prove a particular item was committed.

What should a leaf be?

A leaf should represent exactly the thing you may need to prove later.

For files, the leaf is the hash of the file bytes. For structured data, it is usually the hash of a canonical manifest entry. For CI/CD, a leaf might be an artifact digest, an SBOM digest, a build-log digest, a commit reference, or a signed release-manifest entry. For AI provenance, a leaf might be an output-file hash, a prompt/output manifest entry, a dataset-item commitment, or a content-provenance manifest hash.

The discipline that matters is consistency. If leaves are generated in different ways across runs, later inclusion proofs become hard to trust. Fix the leaf definition and the canonicalization once, and apply it the same way every time.
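One way to pin the leaf definition down is to put it in a pair of small functions that every run calls. This is a sketch under stated assumptions: the canonical form used here (UTF-8 JSON with sorted keys and no extra whitespace) is an illustrative choice rather than something Label 309 prescribes, and the function names and manifest fields are hypothetical.

```python
import hashlib
import json


def file_leaf(path: str, chunk_size: int = 1 << 20) -> bytes:
    """Leaf for a file: SHA-256 of its bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def manifest_leaf(entry: dict) -> bytes:
    """Leaf for structured data: SHA-256 of an assumed canonical JSON form."""
    canonical = json.dumps(
        entry, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).digest()


# The same entry with keys in a different order produces the same leaf.
first = manifest_leaf({"artifact": "app.tar.gz", "build": 412})
second = manifest_leaf({"build": 412, "artifact": "app.tar.gz"})
print(first == second)
```

Sorted-key JSON handles key order but not every edge case, such as number formatting across languages. A stricter scheme like RFC 8785 may be a better fit when leaves are produced by more than one toolchain.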

Should you publish the leaves-list, or keep it private?

It depends on what you are proving and who should see it.

Publishing the leaves-list makes third-party verification trivial: anyone can fetch the list, recompute the root, and inspect the committed set. Keeping it private gives you confidentiality and selective disclosure, because you reveal inclusion proofs only when needed. Many teams mix the two by workload: a public leaves-list for open-source releases, a private one for internal compliance logs, a sealed one for sensitive evidence.

The root is the public commitment. The leaves-list policy is a separate choice layered on top.

How often should you publish roots?

Match the cadence to the workflow.

A CI/CD system might publish one root per release, build, or deployment window. A compliance system might publish one root per hour, per day, or per control period. An AI platform might publish roots per batch of generated content, per training snapshot, or per model-version event. A legal-evidence system might publish a root per case bundle, intake window, or chain-of-custody milestone.

The right cadence balances cost, operational simplicity, and how precise a timeline you may later need to prove.

What does a Merkle root not prove?

A Merkle root proves commitment to a list at a point in time. It does not prove the business claims wrapped around that list. Like any Proof of Existence, it shows that specific bytes existed by a public time — not that they are true, lawful, authored by anyone in particular, or owned by anyone (see what a proof does not prove).

Concretely:

  • It does not prove a software build was secure. It can prove which artifacts or manifests were included in a release.
  • It does not prove a dataset was lawfully collected. It can prove a dataset commitment existed before a given time.
  • It does not prove a log entry is true. It can prove the entry was part of a committed batch.
  • It does not prove authorship — unless the record or the surrounding process adds signatures and identity controls.

In Label 309, authorship is optional and explicit: a record can carry detached signatures, but they are never required, and a Merkle commitment on its own makes no claim about who produced the list. The root gives the timestamped commitment; the process around it gives the business meaning.

How does Label 309 fit in?

Label 309 is the open, vendor-neutral standard for Proof of Existence on Cardano; it has been submitted to the Cardano CIP process and is under review by the CIP editors as a Metadata-category proposal. Merkle batching is not a separate product — it is the scale layer built into the standard.

A batch commitment uses the same record and the same verification path as a single-file proof. One record under metadata label 309 can carry a Merkle root, and alongside it the same optional pieces any record supports: ordinary per-file hashes, content-addressed storage URIs, a supersedence pointer to an earlier record, and authorship signatures. So a batch proof inherits everything a single proof has:

  • a Cardano block-time witness;
  • a standard, closed record structure;
  • standalone, offline verification against a public explorer;
  • the same open-source tooling, libraries, and CLI across languages.

Hash every item, build an ordered Merkle tree, publish one root in a Label 309 record, and keep the leaves-list and any proofs you may need. Later, prove that any single item was part of the committed batch — without putting every item on chain. That is what makes Proof of Existence practical for CI/CD, AI provenance, datasets, compliance, legal evidence, and other high-volume workflows.
