Inside Arweave: The Storage-Centric Blockchain That Wants to Outlive the Web

Most blockchains treat storage as a side effect. You submit a transaction, every full node copies your data onto a spinning disk, and that disk eventually fills up. The node operator prunes the old data, and the historical record quietly thins. The chain keeps producing blocks, but the content those blocks reference starts to decay.

A 67-page technical specification published in late 2019 set out to invert that assumption. The Arweave team argued that storage is not a cost, it is the mining resource. If you want to propose a new block, you have to prove you actually hold a piece of history, and the protocol pays you, forever, for keeping it.

This briefing walks through the Arweave Yellow Paper, the canonical document behind that idea. It introduces a graph-structured ledger called the blockweave, a consensus mechanism called Proof of Access, a propagation trick called blockshadows, and an economic model that attempts to solve a problem every other permanent-storage project dances around: how do you charge someone once for storage that has to last a hundred years? The paper's answer is a storage endowment funded by transaction fees, a perpetuity that rests on the assumption that storage keeps getting cheaper at roughly 30% a year. The mechanics underneath are the interesting part. Whether that 30% assumption holds for a century is the part the paper cheerfully admits is unresolved.

The paper at a glance

  • Title: Arweave: A Protocol for Economically Sustainable Information Permanence
  • Authors: Sam Williams, Viktor Diordiiev, Lev Berman, India Raybould, Ivan Uemlianin
  • Document version: DRAFT-1, compiled 5 November 2019, 67 pages
  • Where it lives: Hosted at arweave.org/yellow-paper.pdf, indexed on Semantic Scholar as the Arweave protocol specification
  • What it is: A protocol specification, not a peer-reviewed paper. The team's answer to the question, "How do you build a public ledger whose primary job is to hold data indefinitely?"
  • Companion short paper: Arweave: The Permanent Information Storage Protocol (the "lightpaper") by Williams and Berman, an elevator-pitch version of the same protocol

The problem with pretending the blockchain remembers everything

The web has a memory problem that is easy to miss because it happens so slowly. Pages that existed yesterday return 404 today. Newspapers quietly rewrite the lede. A government takes down a database and the records that used to live there go with it. Even Wikipedia, which is unusually well-defended, has deletion debates that resolve by editing history rather than preserving it. Every time a page is overwritten, the previous version becomes harder to retrieve, and after enough overwrites it is gone in any meaningful sense.

Blockchains looked, for a moment, like a possible fix. Bitcoin's white paper talks about a chain of "history" that, by construction, no one can alter. The word immutable does a lot of work in the marketing. In practice, blockchains are not very good at storing data. Every full node has to keep a copy of the whole ledger, and the ledger is expensive.

On Ethereum, putting a kilobyte of arbitrary data on-chain costs more than a coffee, and the data is only kept as long as the network keeps growing. Ethereum's state has been growing for years without a clean expiry path. The difficulty-bomb ("Ice Age") mechanism, which raises mining difficulty over time to push the community toward proof of stake, was repeatedly postponed, and it was never a storage fix in any case. The underlying pressure on storage never went away.

So we have a strange situation. The technology most loudly advertised as permanent is, technically, the least permanent place to put anything more than a few bytes. Arweave's response is to stop pretending the chain should hold everything, and instead build a separate ledger whose entire purpose is to hold everything. The "blockchain trilemma" tells you that you cannot be decentralised, secure, and scalable all at once. Arweave's answer is to be decentralised and secure, and to trade the last axis for a different one: permanence.

A weave, not a chain

The mechanism the paper proposes starts with a different data structure. In Bitcoin, every block points to exactly one previous block, and the result is a chain, a singly linked list. In Arweave, every block points to two previous blocks: the immediate predecessor (the "previous block", exactly as in Bitcoin) and a second block from somewhere deeper in history called the "recall block". The structure is a graph, not a list, and the team calls it the blockweave.

Which block becomes the recall block for a given new block is decided by a deterministic rule. You take a hash of the previous block and the previous block's height, and that gives you an index into history. The miner has no way to choose which block they have to fetch. They have to fetch it, and they have to prove they fetched it.
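The selection rule can be sketched in a few lines. This is a minimal illustration, not the paper's exact formula: it assumes the previous block's hash is read as a big-endian integer and reduced modulo the number of blocks that already exist. The function name `recall_block_index` is hypothetical.

```python
def recall_block_index(prev_block_hash: bytes, prev_height: int) -> int:
    """Map the previous block's hash to the height of an older block.

    Assumption (not taken from the paper): the hash is interpreted as a
    big-endian integer and reduced modulo the number of existing blocks.
    """
    if prev_height < 0:
        raise ValueError("height must be non-negative")
    # Blocks 0..prev_height exist, so there are prev_height + 1 candidates.
    return int.from_bytes(prev_block_hash, "big") % (prev_height + 1)


# A toy 48-byte hash whose integer value is 10, with the tip at height 6.
toy_hash = bytes(47) + bytes([10])
index = recall_block_index(toy_hash, 6)  # 10 % 7 == 3
```

The point of the sketch is that the miner supplies no input of their own: the index is a pure function of data that every node already agrees on.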

This sounds like a small change, but it is doing a lot of work. To be eligible to mine a new block, a node has to be holding, on disk, some piece of the network's history, and not just the last few thousand blocks. Which piece changes from block to block. A miner who is only storing the last week of activity will almost always fail to find the recall block for a future candidate and lose the race. A miner who is storing an unusual, rarely-replicated block gains a temporary advantage: if that rare block happens to be the recall block for the next candidate, that miner is competing against a smaller pool, and on average earns more per unit of storage. The protocol is paying miners to redundantly back up rare data, in proportion to how rare the data is.
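The rarity premium can be made concrete with a toy model. The sketch below assumes the recall block is drawn uniformly from history and that, among miners who hold it, the chance of winning is proportional to hash power. Both are simplifying assumptions for illustration, and the function name and miner labels are hypothetical.

```python
from fractions import Fraction
from typing import Dict, Set


def expected_win_share(miner: str,
                       holdings: Dict[str, Set[int]],
                       hashpower: Dict[str, int]) -> Fraction:
    """Expected fraction of blocks won by `miner` in the toy model."""
    all_blocks = set().union(*holdings.values())
    share = Fraction(0)
    for block in all_blocks:
        # Only miners holding this recall block can compete for it.
        holders = [m for m, held in holdings.items() if block in held]
        if miner in holders:
            total = sum(hashpower[m] for m in holders)
            share += Fraction(hashpower[miner], total)
    return share / len(all_blocks)


# Two miners with equal hash power; one stores all ten blocks,
# the other only the five most recent.
holdings = {"full": set(range(10)), "recent": set(range(5, 10))}
hashpower = {"full": 1, "recent": 1}
full_share = expected_win_share("full", holdings, hashpower)      # 3/4
recent_share = expected_win_share("recent", holdings, hashpower)  # 1/4
```

With identical hash power, the miner storing the whole history wins three blocks in four, because it faces no competition on the five blocks that only it holds.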

This is the inversion promised at the top. Storage is the resource the system is paying for, and the rarer a piece of data is, the more profitable it is to keep. That single mechanism is what the rest of the design rests on.

How a new block is actually built

Suppose a miner wants to propose a new block Bₙ. The recipe, in roughly the order the paper describes it, runs as follows.

1. Pick a recall block. The rule above (hash of the previous block plus its height) tells you which older block to fetch. If you do not have it, you cannot mine. Period.

2. Assemble the block header. A block in Arweave carries a great deal of information in its header. Appendix 10.2.1 of the paper spells out the full list. The important pieces are the nonce (a number you vary), the previous block, the timestamp, the difficulty, the height, the dependent hash, the hash list merkle root, the list of transaction IDs, the wallet list, the reward address, the storage endowment, and the serialised representation of the recall block.

3. Hash the whole thing. The header is fed into two different hash functions. The "Independent Hash" is computed with a custom algorithm the paper calls Deep Hash, a recursive 384-bit hash over a nested list. Think of it as a single commitment to a structured, possibly deeply-nested set of values. The "Dependent Hash" is then a plain SHA-384 over the concatenation of the block data segment and the nonce. The dependent hash is the value that has to come out below the difficulty target.

4. Search for a nonce. Miners iterate the nonce until SHA-384(data_segment ‖ nonce) lands below the target. This is the same idea as Bitcoin's proof of work, with one critical difference. The data segment includes the serialised recall block, so the data segment changes if the recall block changes. A miner who does not hold the recall block cannot search for a valid nonce.

5. If you win, broadcast the block. In practice, you broadcast its blockshadow (more on that in a moment). The network checks that the deep hash and dependent hash are self-consistent, that the difficulty is satisfied, that the recall block is genuine, and that all of the transactions in the block are valid against the wallet list. If everything checks out, your block is appended to the weave, and you get the block reward plus any transaction fees the user paid to be included.
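To give a feel for what "a recursive 384-bit hash over a nested list" means in step 3, here is a minimal sketch. The tagging scheme (a `blob` or `list` label plus a length) is an assumption made for illustration and should not be read as the paper's exact Deep Hash definition.

```python
import hashlib
from typing import List, Union

Nested = Union[bytes, List["Nested"]]


def deep_hash(value: Nested) -> bytes:
    """Commit to bytes or arbitrarily nested lists of bytes (48-byte digest).

    Assumption: the "blob"/"list" tags and length prefixes are illustrative.
    """
    if isinstance(value, bytes):
        tag = hashlib.sha384(b"blob" + str(len(value)).encode()).digest()
        body = hashlib.sha384(value).digest()
        return hashlib.sha384(tag + body).digest()
    if isinstance(value, list):
        # Start from a tag that encodes the list length, then fold in
        # the deep hash of each element in order.
        acc = hashlib.sha384(b"list" + str(len(value)).encode()).digest()
        for item in value:
            acc = hashlib.sha384(acc + deep_hash(item)).digest()
        return acc
    raise TypeError("deep_hash accepts bytes or lists of bytes")


flat = deep_hash([b"height", b"timestamp"])
nested = deep_hash([b"height", [b"timestamp"]])
```

Because the structure is folded into the digest, `flat` and `nested` differ even though they contain the same leaf values, which is what makes the result a commitment to the shape of the header as well as its contents.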

The whole thing reads like a Bitcoin mining algorithm, except the input to the hash function now includes a real chunk of the ledger's own history. A miner who turns off their storage farm can still hash, but they will never find a valid block again, because they will fail the recall-block check on every candidate.
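The nonce search in step 4 can be sketched as follows. This is a simplified model under stated assumptions: the data segment is taken to be the header bytes followed by the serialised recall block, the nonce is encoded as 8 big-endian bytes, and the names `dependent_hash` and `mine` are hypothetical.

```python
import hashlib
from typing import Optional


def dependent_hash(data_segment: bytes, nonce: int) -> int:
    """SHA-384 over data_segment || nonce, returned as an integer."""
    digest = hashlib.sha384(data_segment + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")


def mine(header_fields: bytes,
         recall_block: Optional[bytes],
         target: int,
         max_nonce: int = 1_000_000) -> Optional[int]:
    """Return a nonce whose dependent hash is below `target`, or None."""
    if recall_block is None:
        # Without the recall block there is no data segment to hash.
        return None
    data_segment = header_fields + recall_block  # assumed layout
    for nonce in range(max_nonce):
        if dependent_hash(data_segment, nonce) < target:
            return nonce
    return None


easy_target = 1 << 380  # top 4 bits must be zero: about 1 try in 16
nonce = mine(b"header", b"serialised-recall-block", easy_target)
no_storage = mine(b"header", None, easy_target)
```

A miner who passes `None` for the recall block never enters the loop, which mirrors the claim above: hashing power alone is not enough to produce a valid candidate.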

Here is the visual summary of all four moving parts of the protocol:

[Figure: Core Architecture/Flow]

The top-left panel shows the chain-versus-weave comparison. The middle panel walks through the PoA mining flow: recall block, deep hash, difficulty check, new block. The blockshadow panel illustrates the propagation trick. The bottom row is the storage endowment economic loop, which is the part most people do not realise the protocol needs.

Blockshadows: gossiping a postcard instead of a postcard AND a hard drive

The naive version of "miner finds a block, broadcasts it to everyone" runs into a problem. The block can be large, because the permaweb is full of content, not just currency movements. If you have to ship a hundred-megabyte block to every node on the network before the next block arrives, you create a fork risk. Nodes that are still receiving the block when the next block appears will follow whichever chain they saw first, and you get a network split.

The paper's fix is called blockshadows. The full block is replaced, on the wire, by a slim replacement containing only the wallet list merkle root, the hash list merkle root, and the list of transaction IDs. If the receiving node already has the transactions in its mempool (and it usually does, because transactions are gossiped separately, as soon as they are submitted), it can reconstruct the full block locally in a few milliseconds.
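A rough sketch of the receiving side, assuming a shadow carries exactly the three fields named above and that the mempool is a simple mapping from transaction ID to transaction bytes. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BlockShadow:
    wallet_list_root: bytes
    hash_list_root: bytes
    tx_ids: List[str]


def reconstruct_block(shadow: BlockShadow,
                      mempool: Dict[str, bytes]) -> Optional[List[bytes]]:
    """Rebuild the block's transaction list from the local mempool.

    Returns None when any referenced transaction is missing, in which
    case the node cannot reconstruct the block from the shadow alone.
    """
    missing = [tx_id for tx_id in shadow.tx_ids if tx_id not in mempool]
    if missing:
        return None
    # Preserve the order in which the shadow lists the transactions.
    return [mempool[tx_id] for tx_id in shadow.tx_ids]


mempool = {"tx-a": b"payload A", "tx-b": b"payload B"}
shadow = BlockShadow(b"wallet-root", b"hash-root", ["tx-b", "tx-a"])
block_txs = reconstruct_block(shadow, mempool)
```

The shadow's size depends only on the number of transaction IDs, not on the payloads behind them, which is why the wire cost stays small however large the block is.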

A blockshadow is a few kilobytes. The full block can be effectively unbounded. The gossip layer no longer cares how big a block is, because the gossip layer never sees the full block. Fork probability is roughly proportional to block distribution time, and the distribution time for a blockshadow is constant in the size of the underlying block. You can have a hundred-megabyte block and a fifty-millisecond propagation time, which is the kind of trick that took the Bitcoin community a decade to invent a partial version of (Compact Blocks / BIP-152 and the Graphene protocol both sit in the citation graph here).

The paper also notes that blockshadows are a game, in the mechanism-design sense. A miner is rewarded for waiting until most of the network has the transactions before mining them into a block, because otherwise the blockshadow will arrive at peers who cannot reconstruct it and the block will be rejected. Wait too long, and someone else mines the transaction first. The blockshadow mechanism is a soft real-time scheduling signal that keeps gossip and mining roughly in sync, without anyone explicitly coordinating the two.

Paying for a hundred years of storage with a transaction fee

The endowment is the part of Arweave that needs the most patience, because it has no obvious analog elsewhere in cryptocurrency.

The basic question the endowment answers is: what should a user pay to store one megabyte of data forever? Nobody knows how much storage will cost in 2120. The paper's move is to refuse to predict the future, and instead lean on a historical fact: the cost of storing one gigabyte for one hour on a consumer hard drive, which the paper calls P_GBH, has fallen at a remarkably steady 30.57% per year for the last fifty years. If you believe the decline continues, the present-value cost of storing a byte in perpetuity is the sum of an infinite geometric series, and that sum converges. It is small, but it is not zero.
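The convergence claim is easy to check numerically. The sketch below uses yearly steps rather than the per-block periods the protocol works in, which is a simplification, and the function names are hypothetical. If the cost falls by a fraction d each year, the perpetual cost is the first year's cost times 1 / d.

```python
def perpetual_cost_multiplier(annual_decline: float) -> float:
    """Closed-form sum of 1 + r + r^2 + ... with r = 1 - annual_decline."""
    if not 0.0 < annual_decline < 1.0:
        raise ValueError("annual_decline must be strictly between 0 and 1")
    ratio = 1.0 - annual_decline
    return 1.0 / (1.0 - ratio)


def partial_sum(annual_decline: float, years: int) -> float:
    """Cost of the first `years` years, in units of the first year's cost."""
    ratio = 1.0 - annual_decline
    return sum(ratio ** year for year in range(years))


forever = perpetual_cost_multiplier(0.3057)  # roughly 3.27
century = partial_sum(0.3057, 100)
```

At a 30.57% annual decline, storing data forever costs roughly 3.27 times the first year's storage bill, and the first hundred years already account for essentially all of that sum.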

So when a user submits a transaction with a data payload of size $S$, the protocol computes a storage cost as:

$$P_{store} = TX_{size} \times \sum_{i=0}^{\infty} P_{GBB}[i]$$

where the sum runs over every block period from now until forever. The endowment pool is credited with $P_{store}$. The user also pays an instant transaction reward $C_{fee} \times P_{store}$ that goes directly to the miner who includes the transaction, to compensate them for the work. Total paid by the user is $TX_{cost} + TX_{reward}$.
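As a worked example of the split between endowment and miner, here is a small sketch. It assumes the infinite sum has already been collapsed into a single per-gigabyte perpetual cost, and the numeric values are invented purely for illustration. The function name is hypothetical.

```python
from typing import Dict


def transaction_price(tx_size_gb: float,
                      perpetual_cost_per_gb: float,
                      c_fee: float) -> Dict[str, float]:
    """Split a user's payment into the endowment credit and miner reward."""
    # Storage cost: payload size times the summed per-period storage price.
    p_store = tx_size_gb * perpetual_cost_per_gb
    # Instant reward paid to the miner who includes the transaction.
    tx_reward = c_fee * p_store
    return {
        "endowment": p_store,
        "miner_reward": tx_reward,
        "total": p_store + tx_reward,
    }


# Illustrative numbers only: 2 GB, perpetual cost 0.5 per GB, fee factor 0.25.
price = transaction_price(2.0, 0.5, 0.25)
```

Under these made-up inputs the endowment receives 1.0, the miner receives 0.25, and the user pays 1.25 in total.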

Every block, miners are paid out of the endowment, but only when the network actually needs them to be. The protocol computes the per-block expenditure required to maintain the blockweave ($W_{size} \times P_{GBB}$ for that block period) and only draws from the endowment when the sum of fee revenue and the inflation reward falls short. When fees and inflation cover it, the endowment does not pay out.

This is the safety valve. If storage gets more expensive in some future period, the endowment will draw down. If storage gets less expensive, which is the historical trend, the endowment mostly sits there, growing, and paying out only in the lean years.
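The draw-down rule reduces to a single comparison per block. The sketch below assumes the draw equals the shortfall between required expenditure and miner income, and it additionally caps the draw at the remaining balance, which is an assumption of this example rather than something the text above specifies. The function name is hypothetical.

```python
def endowment_draw(weave_size_gb: float,
                   p_gbb: float,
                   fees: float,
                   inflation: float,
                   endowment_balance: float) -> float:
    """Amount taken from the endowment for one block period."""
    # Expenditure needed to keep the whole weave stored for this period.
    required = weave_size_gb * p_gbb
    # Draw only when fees plus the inflation reward fall short.
    shortfall = max(0.0, required - (fees + inflation))
    # Assumption: the pool cannot pay out more than it holds.
    return min(shortfall, endowment_balance)


good_year = endowment_draw(100.0, 0.5, fees=30.0, inflation=30.0,
                           endowment_balance=1000.0)  # income covers cost
lean_year = endowment_draw(100.0, 0.5, fees=10.0, inflation=15.0,
                           endowment_balance=1000.0)  # shortfall of 25
```

In the first case income of 60 exceeds the required 50 and nothing is drawn. In the second, income of 25 leaves a shortfall of 25, which the endowment covers.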

The paper is honest that this whole apparatus is a bet on the 30.57% figure. The protocol's "highly conservative" pricing is the explicit hedge. The endowment is meant to be over-funded, so that a temporary stall in the cost-decline curve (or, worse, a reversal) does not bankrupt the system overnight. The simulation in section 6.4 of the paper shows the endowment remaining solvent across a range of storage-cost scenarios, but the authors are also clear that the model is not a proof. It is a calibrated extrapolation, and like every other calibration, it is falsifiable.

What the paper claims it has built

Working through the paper, the substantive claims a reader should walk away with are these.

  • The blockweave turns storage into the scarce mining resource. Miners who hoard disk space and back up rare blocks earn a structurally higher expected reward than miners who only store popular blocks.
  • Proof of Access is a strict generalisation of proof of work. If every miner already has every block, the recall-block check is trivially satisfied, and the system degenerates to plain PoW.
  • Blockshadows decouple consensus from gossip cost. Block size and fork probability are no longer linked, because the network gossips a kilobyte-sized shadow, not a megabyte-sized block.
  • The endowment is the first end-to-end pricing mechanism for permanent storage on a public chain. It does not solve the problem of predicting future storage costs. It solves the problem of being conservative about that prediction in a way the protocol can live with if the prediction is wrong.
  • Replication in the live network exceeds 97%. That figure is well above the levels typical of contract-based storage systems, and it comes from the live system, not a model.
  • The system is Dominant Strategy Incentive Compatible (DSIC). A node doing what is best for itself (maximise expected mining reward) is also doing what is best for the network (replicate rare data, serve data quickly to peers, gossip generously). The proof is game-theoretic, not formal.

What is still open

The paper itself flags the open problems. They are worth restating.

  • The 30.57% per annum decline is a calibration, not a guarantee. If the cost of storing data ever flattens, the endowment absorbs the shock. The paper's safety margin is "highly conservative" pricing, but the longer the system runs, the more weight that assumption has to carry.
  • The AIIA / Wildfire mechanism is supported by a small simulation in section 6.4. A larger, adversarial simulation is on the future-work list.
  • Succinct proofs of access are explicitly labelled future work. These would shrink the on-chain footprint of the recall-block proof, which is currently a few hundred bytes per block. A logarithmic-size proof is a clean cryptographic improvement that the paper says is plausible but does not implement.
  • Wallet-log compression is also future work. Every block currently carries an updated wallet list, which is part of why the blockweave can be reconstructed from a single recent block. The cost is that the wallet list grows over time.
  • "Fast find" is the third item in the future-work section. It is a separate data structure that would let the network locate an arbitrary transaction quickly without scanning the whole weave, which becomes important as the weave grows into the petabyte range.

Governance is the biggest gap the paper leaves unspecified. Who, if anyone, can change the 30.57% assumption? Who decides when the endowment is "conservative enough"? The paper is silent on this. Arweave is a young system, and the operational answer to governance questions is probably "off-chain social consensus among the Arweave team and the larger mining community", which is fine for a system in its third year, but is a real question for a system that wants to outlive the team that built it.

Why this matters right now

December 2020 is a useful moment to read this paper, because the question it is trying to answer is getting louder. Cloudflare and AWS have become the de facto memory of the public web, and both have been visibly willing to drop customers on short notice when those customers run afoul of payment processors, regional regulators, or platform policies. Cloudflare's 2019 termination of 8chan is the most-cited example. The Library of Congress's efforts to archive the public web have been ongoing for two decades and still cover a fraction of it. The Internet Archive's Wayback Machine is heroic and underfunded and, in the last few years, has been on the receiving end of both lawsuits and copyfraud. None of those institutions is going to disappear tomorrow, but the structural risk that the public web's memory is held by a small number of mostly-private institutions is real and growing.

Arweave's pitch is the cleanest version of a thesis that is being tried several different ways: a public, permissionless, economically-sustainable store of arbitrary data, with no kill switch. Filecoin, Sia, and Storj are doing adjacent work with different trade-offs. What Arweave brings to the table that the others do not, in this paper at least, is the storage endowment: a specific, spelled-out mechanism for charging users once and paying miners, indefinitely, to keep the data. Whether the mechanism is sturdy enough to survive a century of hardware cycles is, again, a bet. The bet is explicit, the math is published, and the safety valves are at least sketched. That is more than most permanent-storage projects can claim.

If you want to follow the work, the mainnet has been live since 8 June 2018, the live-network replication rate is over 97%, and the permaweb already hosts a small but growing collection of permanent applications, including a serverless mail client (Weavemail) and a few file-archiving tools. The Yellow Paper is the place to start if you want to understand what is actually being promised.

Sources

  • Williams, S., Diordiiev, V., Berman, L., Raybould, I., Uemlianin, I. Arweave: A Protocol for Economically Sustainable Information Permanence. DRAFT-1, 5 November 2019, 67 pages. PDF · Semantic Scholar entry
  • Williams, S., Berman, L. Arweave: The Permanent Information Storage Protocol. Lightpaper companion document. PDF
  • Arweave project site and developer documentation. arweave.org · docs.arweave.org
  • Yellow Paper appendix 10.1: 50-year HDD cost-decline dataset (used to derive the 30.57% per annum figure).
  • Yellow Paper §3.2.3 (transaction pricing), §3.2.4 (storage endowment), §6 (AIIA / Wildfire), §8 (succinct proofs of access, wallet log reduction, fast find).
  • Background on compact-block and Graphene propagation: BIP-152 (Compact Blocks) and the Graphene protocol, both cited by the Yellow Paper as predecessors to blockshadows.
  • Investopedia, "What Is Ethereum's 'Difficulty Bomb'?" (June 2019). Reference for the Ethereum Ice Age / difficulty-bomb mechanism, which raises mining difficulty over time rather than deleting on-chain state.