---
eip: 7745
title: Trustless log and transaction index
description: An efficient, light client and DHT friendly replacement for block header bloom filters
author: Zsolt Felföldi (@zsfelfoldi)
discussions-to: https://ethereum-magicians.org/t/eip-7745-two-dimensional-log-filter-data-structure/20580
status: Draft
type: Standards Track
category: Core
created: 2024-07-17
requires: 7916
---

## Abstract

Replace the fixed 2048 bit log event bloom filters with a new lookup index data structure that can adapt to the changing number of events per block and consistently guarantee a sufficiently low false positive ratio, allowing efficient trustless proofs of log event queries, canonical block hash and transaction hash lookups.

The proposed structure maps all index entries (log events, transaction and block markers) onto a global linear index space and hashes them into a binary Merkle tree based on that index. It also contains a _filter map_ for every fixed length section of the index space. These are two dimensional sparse bit maps that provide an efficient probabilistic method for looking up indexed values or query patterns of values, yielding potential matches in the form of exact positions in the linear index space. Unlike the per-block bloom filters, they allow searching for specific events by accessing only a small portion of the entire dataset which can also be proven with a Merkle proof, making the search both efficient and light client friendly.

The proposed structure can be efficiently used both for local search and for remote proof generation/verification, thereby simplifying implementation of provers and verifiers. It also allows validators that are not interested in either searching or proving logs to generate the index root hash by maintaining a minimal index state with a relatively small (hard capped) size.

## Motivation

Adding logs has a significantly lower gas cost and should accordingly be less resource consuming than writing to the state. The original design of bloom filters in each block achieves this goal as there is no complex data structure like the state to update, the set of logs emitted in each block is all contained within the header and receipts belonging to that block. Logs mostly just have long term storage costs. On the other hand, searching logs in a long range of blocks is very expensive and the current bloom filters do not really help.

Bloom filters are only useful as long as they are sufficiently sparse. False positive ratio rises rapidly with the number of events per filter and the density of `1` bits in the filter bit vector. In the currently existing bloom filter each log address and topic sets 3 out of a fixed length of 2048 bits which resulted in sufficiently sparse filters in the beginning but with the increase of the block gas limits the false positive ratio soon made the filter practically useless. Mainnet blocks currently add over 1000 log addresses and topics in average and therefore the bloom filter size would need to increase about tenfold in order to achieve acceptable false positive rates again. This would raise block header size to about 3 kilobytes. Even if the size of the per-block bloom filters would be raised to a sufficient level, log search would still require accessing the entire header chain. Searching in just the most recent one year history would cost over 6 gigabytes of data, not counting the access of actual logs where the bloom filters have a match. The current situation is even worse, requiring a large portion of the full block receipt sets to be accessed due to the high false positive rate of the bloom filters. Transaction inclusion lookup is also very expensive, requiring all block bodies in the searched range.

While full nodes and RPC servers can and usually do build additional index structures to aid certain lookups, these structures cannot be verified remotely. While it is important to improve the transaction processing capabilities of Ethereum, trustless provability is required throughout the entire stack, from end user applications signing transactions, to the same or other end user applications getting the results they need. Scaling transaction processing only makes sense as long as there is also a trustless and scalable way to access the relevant results of those transactions. Users relying on trusted servers and indexers kind of beats the whole point of Ethereum.

## Specification

### Terms and definitions

- _index entry_: a single entry in the `index_entries` tree associated with an indexed event. It is either a _log entry_, a _transaction entry_ or a _block entry_. An _index entry_ generates one or more _map values_. Each _log entry_ adds one _address value_ and 0..4 _topic values_ (in this order). Each _transaction entry_ adds one _transaction value_ and each _block entry_ adds one _block value_. 
- _map value_: a searchable value associated with an _index entry_. It is either an _address value_, a _topic value_, a _transaction value_ or a _block value_. Each _map value_ is represented by a 32 byte hash (the _map value hash_) which is calculated as `sha2(address)`, `sha2(topic)`, `sha2(tx_hash + b"\x01")` or `sha2(block_hash + b"\x02")`.
- _map value index_: a single position in the global linear index space. A new _map value index_ is assigned to each indexed _map value_.
- _map entry index_: the position where an _index entry_ is located in the `index_entries` tree. Each _index entry_ is located at the position corresponding to the _map value index_ of the first _map value_ it generates.
- _filter map_: a `MAP_WIDTH` by `MAP_HEIGHT` sized sparse bit map assigned to a `VALUES_PER_MAP` length section of the _map value index_ space. Each _map value_ assigned to this range is marked on the map at a row and column that depends on the _map value index_, _mapping layer_ and the _map value hash_. Rows are sparsely encoded as a list of marked column indices (in strictly ascending order, which also coincides with the order of occurence). Each map contains at most `VALUES_PER_MAP` marks and therefore the chance of false positives is kept at a constant low level.
- _mapping layer_: each layer defines a (_map index_, _map value hash_) => _row index_ mapping and imposes a limit on the number of _map values_ mapped into each row using that particular mapping. Multiple _mapping layers_ can be applied simultaneously on the same _filter map_. Each _map value_ is mapped using the lowest possible _mapping layer_ where the number of marks in the assigned row is lower than the limit. On layer 0 the _row index_ mapping only changes once per _index epoch_. On higher layers the mapping changes more frequently and also the row size limit is higher.
- _index epoch_: a `MAPS_PER_EPOCH` sized group of consecutive _filter maps_ stored in the hash tree in a way so that multiple rows of adjacent _filter maps_ with the same _row index_ can be efficiently retrieved in a single Merkle multiproof.

### Consensus data format

#### Block headers

Beginning at the execution timestamp `FORK_TIMESTAMP`, execution clients MUST replace the `logs_bloom` field of the header schema with `log_index_root` which is the root hash of the `LogIndex` structure after adding the logs emitted in the given block.

The resulting RLP encoding of the header is therefore:

```
rlp([
    parent_hash,
    0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347, # ommers hash
    coinbase,
    state_root,
    txs_root,
    receipts_root,
    log_index_root,
    0, # difficulty
    number,
    gas_limit,
    gas_used,
    timestamp,
    extradata,
    prev_randao,
    0x0000000000000000, # nonce
    base_fee_per_gas,
    withdrawals_root,
    blob_gas_used,
    excess_blob_gas,
    parent_beacon_block_root,
])
```

#### Container types

These definitions use the `ProgressiveList` and `ProgressiveByteList` SSZ types as defined in [EIP-7916](./eip-7916.md).

Note that the `LogIndex` and `IndexEpoch` container definitions below are "naive" definitions suitable to define SSZ merkleization. The entire log index is typically too big to be represented in memory as a whole. A more efficient implementation can be found here: [log_index.py](../assets/eip-7745/log_index.py), [binary_tree.py](../assets/eip-7745/binary_tree.py)  (taken from the EELS implementation). This code keeps only the currently processed parts of the tree in memory, expanding untouched empty subtrees on demand and collapsing finalized subtrees.

Also note that the containers defined here (including `FilterRow`) have a non-initialized default value, represented as a null leaf in the Merkle tree. Initialization of these containers is explicit and happens at a well defined point. `IndexEpoch` is initialized when the first _index entry_ is added to the epoch. All rows of a map are initialized when the first _index entry_ is added to the map. `IndexEntry` is initialized when added; entry positions belonging to unused _map entry indices_ are left uninitialized. The `Log` container is only initialized in case of _log entries_. If it is initialized then `topics` and `data` lists in the `Log` container are always initialized.

```
class LogIndex(Container):
    epochs: Vector[IndexEpoch, MAX_EPOCH_HISTORY]
    next_index: uint64

class IndexEpoch(Container):
    filter_maps: Vector[Vector[FilterRow, MAPS_PER_EPOCH], MAP_HEIGHT]
    index_entries: Vector[IndexEntry, MAPS_PER_EPOCH * VALUES_PER_MAP]

type FilterRow = ProgressiveList[uint32]

class IndexEntry(Container):
    log_entry: Log
    entry_meta: Vector[Bytes32, 4] # LogMeta / TransactionMeta / BlockMeta

class Log(Container):
    address: ExecutionAddress
    topics: List[Bytes32, 4]
    data: ProgressiveByteList
```

The `entry_meta` vector has a fixed format but its interpretation depends on the type of _index entry_ and the type of entry can also be determined based on its contents (and based on whether `log_entry` is initialized or not). Whenever `IndexEntry` is initialized, `entry_meta` contains either one of these containers:

```
class LogMeta(Container):
    block_number: uint64
    transaction_hash: Root
    transaction_index: uint64
    log_in_tx_index: uint64

class TransactionMeta(Container):
    block_number: uint64
    transaction_hash: Root
    transaction_index: uint64
    receipt_hash: Root
 
class BlockMeta(Container):
    block_number: uint64
    block_hash: Root
    timestamp: uint64
    zero: uint64
```

#### Index entry types and corresponding map values

Three types of _index entries_ are added during state transition: log, transaction and block. For each processed transaction, a _transaction entry_ is added first, then _log entries_ are added in the order of EVM execution. The _block entry_ referring to the processed block is not added during the state transition because it is supposed to reference the block hash which is not known yet. It can be added either after block building/validaton or are the beginning of the next state transition. In either case, the `block entry` of block N first appears in the log index state of block N+1 as the first new `index entry`.

The following table shows an example of mapping all index entry types onto the _map value index_ space:

| Index entry  | Map entry index | Map values     | Map value indices  |
|--------------|-----------------|----------------|--------------------|
| Block #0     | 0               | Block          | 0                  |
| Tx    #1/0   | 1               | Transaction    | 1                  |
| Log   #1/0/0 | 2               | Addr, 3x Topic | 2, 3, 4, 5         |
| Log   #1/0/1 | 6               | Addr, 3x Topic | 6, 7, 8, 9         |
| Tx    #1/1   | 10              | Transaction    | 10                 |
| Log   #1/1/0 | 11              | Addr, 2x Topic | 11, 12, 13         |
| Log   #1/1/1 | 14              | Addr, Topic    | 14, 15             |
| Log   #1/1/2 | 16              | Addr, 2x Topic | 16, 17, 18         |
| Block #1     | 19              | Block          | 19                 |
| Tx    #2/0   | 20              | Transaction    | 20                 |
| Log   #2/0/0 | 21              | Addr, 4x Topic | 21, 22, 23, 24, 25 |

Note that the missing _map entry indices_ are represented in the `index_entries` tree with uninitialized `IndexEntry` containers (zero value leaves in the `index_entries` vector). Most of these empty entries correspond to the _topic value_ indices. Also, the last few entries of each map might be emtpy because the _map values_ generated by a single _index entry_ never cross a filter map boundary. So, for example, when `VALUES_PER_MAP` is 0x10000 and `next_index` is `0x2FFFE` and a _log entry_ with on address and two topics is added, the last two positions of the map are left empty and the new _log entry_ is added starting from position 0x30000.

#### Filter map row encoding

Each row of the _filter map_ is encoded as a series of little endian binary encoded column indices. For encoding simplicity the column indices are always encoded as 32 bit values, regardless of the fact that `MAP_WIDTH` is less than 32. This only applies to consensus encoding and hashing though while local storage and proof encoding can use a more tightly packed format.

#### Proposed constants

| Name                | Value                  |
|---------------------|------------------------|
| MAP_WIDTH           | 2**24                  |
| MAP_HEIGHT          | 2**16                  |
| VALUES_PER_MAP      | 2**16                  |
| MAPS_PER_EPOCH      | 2**10                  |
| MAX_EPOCH_HISTORY   | 2**24                  |
| MAX_ROW_LENGTH      | [8, 168, 2728, 10920]  |
| MAPPING_FREQUENCY   | [2\**10, 2\**6, 2\**2, 1] |

Note that `MAX_ROW_LENGTH` and `MAPPING_FREQUENCY` parameters are specified as lists because the applicable values depend on the used _mapping layer_. For layers higher than the index of the last elements of these lists the last element is applicable. Also note that the `MAX_ROW_LENGTH` values correspond to the cumulative capacity of `ProgressiveList` tree levels; a layer 0 row fits into the first level, a layer 1 row fits into the first three levels and so on.

### Constructing the filter map

For each `VALUES_PER_MAP` long section of the _map value index_ space a _filter map_ is generated. These are fixed size `MAP_WIDTH` by `MAP_HEIGHT` sparse bit maps and each _map value_ is marked on the map with a single bit being set to one. The number of marks in a row (the length of the sparse encoded row) is referred to as "row length", not to be confused with the constant `MAP_WIDTH`.

#### Epochs and mapping layers

In order to allow efficient search of a certain _map value_ in a long historical range, _filter maps_ are organized into _index epochs_, each consisting of a fixed `MAPS_PER_EPOCH` number of maps. In the most usual case (when row density is around average or below) row mapping stays the same  during an entire epoch. The hash tree is organized in a way that instead of putting all rows of a single map close to each other, rows of the same _row index_ throughout an entire epoch are close to each other and therefore are cheap to access and/or prove with a Merkle proof.

In order to mitigate collisions in densely populated rows, the concept of _mapping layers_ is introduced, meaning that if a certain row already has a certain number of entries then the row mapping is changed and very frequent _map values_ are mapped into multiple rows. Initially, when a map is empty, every _map value_ is mapped using _mapping layer_ 0 or the "base layer". If a row reaches `MAX_ROW_LENGTH[0]` then any further _map values_ mapped onto that row in the base layer mapping will use layer 1 mapping instead. On layer 1 a different row is assigned to the same _map value_ and the row length limit in increased to `MAX_ROW_LENGTH[1]`. If this row also reaches its limit, layer 2 mapping is used and so on. Note that a row filled at a certain _mapping layer_ can be grown further on a higher layer. Different _map values_ colliding in the same row on a certain layer are probably mapped in different rows on the next layer, which means that a very popular value might populate multiple rows (also very long ones) but an unlucky less popular one colliding with it on base layer will probably just have to move one layer up. The search process is similar, if the searcher finds that the row belonging to the searched value is full according to the base layer limit then it also has to check the next layer and so on, until it finds a non-full row.

If a row is longer than the limit according to the layer the searcher is looking at then it can safely ignore the extra entries assuming that they were added by another value on a higher layer. The `ProgressiveList` container makes it efficient to prove row data belonging to the lower layer even if there is much more data in the same row added on a higher layer.

#### Row mapping

While base layer row mapping stays the same for an entire epoch, higher layer mappings are changed more frequently. Each mapping change has a cost in terms of database access overhead and Merkle proof size overhead and epoch size is determined in a way that these overheads stay sufficiently low compared to the cost of accessing the actual useful data. On higher layers where the rows are longer, a more frequent remapping is possible because the useful data size per map is also higher. It is also desirable so that a less frequent _map value_ will only suffer from colliding with a more popular one for a shorter time.

The _row index_ is calculated as follows:

```
def get_row_index(map_index, layer_index: Uint, map_value_hash: Hash32) -> Uint:
    """
    Returns the row index where the given map value hash is mapped on the given
    map and mapping layer.
    """
    mf_index = min(layer_index, len(MAPPING_FREQUENCY) - 1)
    mapping_frequency = MAPPING_FREQUENCY[mf_index]
    masked_map_index = map_index - (map_index % mapping_frequency)
    row_hash = sha256(
        map_value_hash
        + masked_map_index.to_le_bytes4()
        + layer_index.to_le_bytes4()
    ).digest()
    return Uint.from_le_bytes(row_hash[0:4]) % MAP_HEIGHT
```

The following figure shows how _map values_ are mapped to rows on different _mapping layers_. Each dot represents a map row and the numbers indicate the _mapping layer_ on which the row has been assigned to the given _map value_. Note that it might happen that a higher layer mapping coincides with a lower layer mapping for the same value; this does not cause any problem though as the row is simply grown further on the higher layer. The search algorithm can also simply revisit the same row in a higher layer iteration if necessary and process the rest of the row that it first ignored.

```
map index        111111 1111222222222233 3333333344444444 4455555555556666
       0123456789012345 6789012345678901 2345678901234567 8901234567890123
      +----------------+----------------+----------------+----------------+
row 0 |2........2......|2...............|...2............|........2.......|
row 1 |........1111.2..|.....2..1111....|1111.....2...2..|2111..2......2..|
row 2 |0000000000000000|.2..2....2..2...|....2111....2...|..2.....11112...|
row 3 |....2..22...1111|..........2.1111|2.......2.2.1111|.......2...2....|
row 4 |.2..1111..2....2|1112..2.2....2..|0000000011110020|...21111.2....2.|
row 5 |...2........2...|...............2|.2...2.........2|.2...2..........|
row 6 |1111.22....2..2.|0020000000020000|.......2...2....|..........2.1112|
row 7 |..2.............|....1112......2.|..2...2.........|0000200000000000|
      +----------------+----------------+----------------+----------------+
            epoch 0          epoch 1          epoch 2          epoch 3

Fig 2. Row mapping of a single log value on different mapping layers

MAP_HEIGHT = 8
MAPS_PER_EPOCH = 16
MAPPING_FREQUENCY = [16, 4, 1]
```

#### Column mapping

Column mapping assumes that `MAP_WIDTH` is a multiple of `VALUES_PER_MAP`. _column index_ is calculated as follows:

```
def get_column_index(map_value_index: Uint, map_value_hash: Hash32) -> Uint:
    """
    Returns the column index where the given entry hash is mapped at the given
    entry index.
    """
    col_hash = fnv1a_64(map_value_index.to_le_bytes8() + map_value_hash)
    folded_hash = (col_hash >> 32) ^ (col_hash & 0xFFFFFFFF)
    hash_bits = LOG2_MAP_WIDTH - LOG2_VALUES_PER_MAP
    return (
        (map_value_index % VALUES_PER_MAP)
        << hash_bits + folded_hash
        >> (32 - hash_bits)
    )
```

As shown on the figure below, this mapping practically assigns a `MAP_WIDTH // VALUES_PER_MAP` by `MAP_HEIGHT` rectangle to each _map value index_ and ensures that each _map value_ places exactly one mark in its own rectangle (the letters A-D represent different _map values_). This property also ensures that _map value index_ can be restored from _map index_ and _column index_, column indices never collide and keep the original order of _map value_ indices, allowing efficient Merkle exclusion proofs of certain _column indices_ in long rows.

```
column             11 1111 1111 2222 2222 2233
       0123 4567 8901 2345 6789 0123 4567 8901
      +---------------------------------------+
row 0 |.... .... .... .... .... .... .... ....|
row 1 |.... .... .... .... .... .C.. .... ....|
row 2 |.A.. .... ...A .... A... .... ...A ....|
row 3 |.... .... .... .... .... .... .... ....|
row 4 |.... .... .... .... .... .... .... ....|
row 5 |.... .B.. .... ...B .... .... .... ....|
row 6 |.... .... .... .... .... .... .... ..D.|
row 7 |.... .... .... .... .... .... .... ....|
      +---------------------------------------+

Fig 3. A single filter map with 8 entries of 4 different log values

MAP_WIDTH = 32
MAP_HEIGHT = 8
VALUES_PER_MAP = 8
```

### Updating the log index and calculating the root hash

See the following functions in [log_index.py](../assets/eip-7745/log_index.py):

- `log_index_add_log_entries`
- `log_index_add_transaction_entry`
- `log_index_add_block_entry`
- `log_index_root`

Note that the _block entry_ may be added either after building/verifying the given block or before processing the next one, but `log_index_root` should always be calculated after adding transaction and log entries and before adding the block entry for the some block (block building would not even be possible otherwise as the log index root is referenced in the block header).

### Finding potential matches

Determining whether a _column index_ found in the appropriate row is relevant for the searched _map value_ is possible by restoring the _map value index_ and then calculating the _column index_ from it again in order to check whether the quasi-random collision filter part matches the expected value for the given _map value_:

```
def get_map_value_index(map_index, column_index: Uint) -> Uint:
    map_value_width = MAP_WIDTH // VALUES_PER_MAP
    return map_index * VALUES_PER_MAP + column_index // map_value_width

def is_potential_match(map_index, column_index: Uint, map_value: Hash32) -> bool:
    return get_column_index(get_map_value_index(map_index, column_index), map_value) == column_index
```

Iterating through all relevant _mapping layers_ and corresponding rows is similar to how new _map values_ are added. Filtering all potential matches from all relevant rows can be done with the following function:

```
def get_potential_matches((log_index: LogIndexReader, map_index: Uint, map_value: Hash) -> List[Uint]:
    matches = []
    layer_index = 0
    while True:
        mr_index = min(layer_index, len(MAX_ROW_LENGTH) - 1)
        max_row_length = MAX_ROW_LENGTH[mr_index]
        row_index = get_row_index(map_index, layer_index, map_value)
        row = get_filter_row(log_index, map_index, row_index)[:max_row_length]
        for column_index in row:
            if is_potential_match(map_index, column_index, log_value):
                matches.append(get_log_value_index(map_index, column_index))
        if len(row) < max_row_length:
            break
        layer_index += 1
    return matches
```

### Initialization, minimal state and wire protocol extension

It is a property of the log index tree that over time most of the tree nodes get finalized, which means that in a minimal log index state representation finalized subtrees can be collapsed into a single tree node as their descendants are no longer needed for adding new entries. At every single moment there is one _filter map_ being updated, maybe the few most recently completed maps might still be affected by a reorg, but if the chain finalizes properly then all but the few most recent maps are also finalized. Internal tree nodes with all finalized children are also finalized and can be collapsed into the lowest finalized ancestor. Similarly, internal nodes covering a range that has not been touched yet can be stored without having their children in memory, and they can be expanded on demand.

Certain types of clients (those only interested in validation and not in using/serving log index data) have the option to just maintain this minimal log index state. Those who are interested in event history should probably also only keep the minimal state in memory and store the finalized and collapsed subtrees on disk. Clients can also initialize their log index with such a minimal state and then optionally regenerate the older parts from blocks and receipts locally. When the log index state of a finalized block is represented and no room for rollback is needed, the minimal tree representation simplifies into a set of Merkle branches leading to the map rows and to the _index entry_ corresponding to `next_index` (with everything on the left side of these branches collapsed and nothing on their right sides expanded yet). Assuming an efficient encoding of the current `filter_map_rows` where the variable length of each filter row is encoded in two bytes and the column indices tightly packed into 3 bytes each, the minimal log index state can be serialized into 21300808 bytes. This number can be considered as an estimate of the amount of data required to initialize the log index data structure at any point.

Note that initialization is also possible on an epoch boundary, in which case the log index is very cheap to initialize (only requires the last _block entry_ branch before the epoch boundary and the first one after) but in this case all blocks and receipts between the boundary and the current head are required to generate the last unfinished epoch. Epoch boundary initialization can also be used to index further historical epochs after the initialization of the head state.

Both head state initialization and epoch boundary initialization requires the extension of the Ethereum Wire Protocol through which clients can request Merkle multiproofs to initialize the required parts of the log index tree, using a recently finalized block as the starting point. Note that since the total amount of initialization data is too big to fit in a single devp2p response, the protocol splits the data into multiple subsets, each obtainable in the form of a separately verifiable log index Merkle proof. See the [wire protocol extension](../assets/eip-7745/wire_protocol_extension.md) and [log index proof format](../assets/eip-7745/log_index_proof_format.md) specs for further details.

## Rationale

### Design goals

The proposed data structure is intended to realize a balance between the cost of adding items and accessing old ones. In a search structure of a constantly growing dataset there is typically a tradeoff between the cost of adding new data and the cost of searcing the existing dataset. One extreme is just linearly storing the data, which is practically the case now with logs, with the bloom filters being mostly useless. The other extreme is one big Merkle tree with all _map values_ ever used as keys and the list of all occurences (possibly in a further merkleized format) as values. With billions of unique _map values_, adding new entries here is expected to have costs similar to that of the state, with multiple lookups and modifications/insertions at random places in a database on the order of magnitude of hundreds of gigabytes. Another issue where this is similar to the state is that removing old entries is hard and expensive. Adding logs is supposed to be cheaper than writing the state so solutions between these two extremes were considered as potentially practical, with multiple smaller structures generated periodically.

The _filter maps_ have a fixed tree size and an efficient tree hashing scheme. Filter entries are sorted into rows based on content and position in a way that allows quick linear database access and size efficient Merkle proofs. The difficulties arising from certain types of events being much more frequent than others are also mitigated. Update and maintenance costs are also limited as tree nodes are eventually finalized and the number of non-finalized non-empty nodes is always hard capped, ensuring moderate memory requirements. Initialization costs of the data structure at any point of the chain are also capped. Additional database storage costs of _filter maps_ is about 15-20% of the size of the actual logs, depending on certain implementation tradeoffs. The Merkle tree structure also makes it easy to discard entire epochs along with the corresponding Merkle subtrees, making the implementation of history expiry of the log index simple.

### Map value index space

In each block a varying number of _map values_ are emitted. In addition to inefficient search, another drawback of per-block fixed size bloom filters is the varying filter utilization leading to over-utilized filters giving many false positives in some blocks and/or wastefully under-utilized filters in some blocks. Block gas limits also tend to change significantly over the long term so any future-proof solution has to be able to adapt to the varying number of _map values_ per block.

Mapping _map values_ on their own linear index space ensures uniform filter utilization of identically structured _filter maps_. Compared to the alternative of constantly changing adaptive filter size, this approach greatly improves the efficiency of database access and Merkle proofs. It also allows efficient search pattern filtering as each potential match provides exact position information.

### Index entries tree

Hashing every event into the `index_entries` tree keyed by _map entry index_ is additional work and complexity but it simplifies the proving process significantly. Proving the events based only on _map value indices_ would be possible if receipts had a _map value index_ pointer hashed into them (so the client could also verify that the received receipt actually covers the _map value index_ derived from the _filter map_; in this case though an extra mechanism would be necessary to prove that the receipt is referenced in a block that is part of the same canonical chain where the log index root was referenced. Proving old canonical blocks is currently theoretically possible through the beacon chain but painful and also would require extra infrastructure. On the other hand, the `index_entries` tree not only solves the problem of proving the matching canonical receipts, with the _block entries_ it also provides a convenient way to prove canonical EL blocks through EL data structures only, and even provides provable canonical block hash to block number lookups.

Note though that storing the entire`index_entries` trees directly in their proposed merkleized format on disk is not really efficient. This is not an issue for validators that want to maintain a minimal state; for them, updating the `index_entries` subtrees is really cheap as they only need to maintain a single Merkle branch pointing to the next _map value index_. Provers can implement `index_entries` efficiently by storing a subset of the Merkle tree nodes (the ones at least a few levels below leaf level in order to save space) and generate the rest on demand based on the receipts that encode the same data in a more compact form.

### Alternative filter structures considered

One question considered was whether to periodically build lookup trees of events with each unique _map value_ emitted in a given period being a separate key under which only the occurences of that specific event are listed, or to use a more compressed fixed size tree format where different _map values_ might collide (though preferably not too many of them). This decision mostly boiled down to data access efficiency, both in terms of local disk access and remote Merkle proof size. Identically structured trees can be efficiently arranged in larger units (called _index epochs_ here), with values belonging to the same key in subsequent trees of an epoch located close to each other. This improves database access speed. It also allows smaller Merkle proofs with a series of leaves encoded together in an efficient format and internal nodes on only two boundary branches. Database writes are also efficient as the order of adding tree entries is not random and all the non-finalized parts of the tree can be kept in memory with a hard capped memory requirement.

The other design decision considered here was whether to hash entire index entries into the list of _map value_ occurences or just store position info and have a separate tree of log entries. Though the separate `filter_maps` and `index_entries` structures do present some additional complexity, the second option was chosen because of the size of Merkle proofs. Looking up a rarely used _map value_ means encountering more irrelevant entries than relevant ones. By adding probabilistic filter information to position information, the vast majority of the irrelevant entries can be filtered out after accessing/proving 3 bytes of data instead of an entire _map value_ (32 bytes). When proving matches of multiple _map value_ patterns this 10x advantage exists even with more frequent individual lookup values. Tests have shown that realistic log searches often yield a lot more matches for the individual _map values_ themselves that the pattern itself. Pattern matching can be performed on the position information of individual potential matches and actual _index entries_ only need to be proven at the potential pattern matches.

In conclusion, for the given application the fixed tree size approach with separate position info plus probabilistic collision filter approach seemed to be the most appropriate.

### False positive rate

From the _filter maps_ a set of potential matches can be derived for any block range and _map value_ or pattern of _map values_. These matches can then be looked up in the corresponding `index_entries` trees and actually matching events can be added to the set of results. The design guarantees that the set of potential matches includes all actual matches but and also has a consistently low false positive rate.

False positives can happen when the quasi-random collision filter part of a _column index_ accidentally matches the expected value even though it was generated by a _map value_ other than the searched one. The chance of this happening is `VALUES_PER_MAP / MAP_WIDTH` per colliding enrty in a row that is relevant for the search. Assuming that most entries in a map are different from the searched one, assuming uniform random distribution of entries, the average number of colliding entries found in a relevant row is `VALUES_PER_MAP / MAP_HEIGHT` giving an average false positive rate of `VALUES_PER_MAP ** 2 / MAP_WIDTH / MAP_HEIGHT` per _filter map_.

Though certain _map values_ might be emitted a lot more than others and therefore the _row index_ distribution might not be entirely uniform, periodical remapping of rows and using multiple _mapping layers_ ensures that over a long enough search period random collisions with more frequent _map values_ do even out. _Mapping layers_ do have another consequence though; if any row has at least `MAX_ROW_LENGTH[0]` entries then the search logic requires looking into another row that is mapped to the searched _map value_ on the next _mapping layer_. The maximum possible number of such rows is `VALUES_PER_MAP / MAX_ROW_LENGTH[0]` and therefore the chance of randomly hitting one is `VALUES_PER_MAP / MAX_ROW_LENGTH[0] / MAP_HEIGHT` in the worst case. In this case an extra row has to be processed, with extra chance of finding false positives. A collision with a frequent value at a certain _mapping layer_ does not indicate a collision on the next layer though, therefore the expected number of entries in that row is no different from the first one. Having to process a third row would presume that the second one had at least `MAX_ROW_LENGTH[1]` entries. The chance of this happening after the first coincidence is practically negligible in the context of expected false positives.

The expected number of false positives for a single _map value_ search can be estimated as `VALUES_PER_MAP ** 2 / MAP_WIDTH / MAP_HEIGHT * (1 + VALUES_PER_MAP / MAX_ROW_LENGTH[0] / MAP_HEIGHT)` per _filter map_. With the proposed constants this roughly equals 0.0044 false positives per map. As of March 2025 the average number of _map values_ emitted in a mainnet block is slightly over 1000 while a _filter map_ consists of 65536 _map values_. This gives a rough estimate of one false positive per 14000 blocks, which costs the searcher an extra lookup in a `log_entries` tree. The expected number of false positives in the entire chain history is around 1200.

Note that this is only true for a single value search while a typical pattern search requiring certain values on multiple positions has an exponentially lower false positive rate. For example if the pattern is [Addr, Topic1, Topic2] then three _map value_ searches are performed and an actual log lookup is only necessary if the first search yields `N`, the second `N+1` and the third `N+2` simultaneously. If necessary, the rate can be easily reduced by using a higher `MAP_WIDTH`, at the cost of growing the size of encoded rows.

## Backwards Compatibility

The existing log filter API (`eth_getLogs`, `eth_newFilter`, `eth_getFilterLogs`, `eth_getFilterChanges`) can be implemented with the new filter data structure. Applications relying on this API can operate without any change, benefiting from a higher search performance. Repricing the `LOG` opcode might be considered after performing benchmarks but the extra processing cost is not significant while the extra storage cost is around 15%. Other than that the EVM is not affected in any way as it only emits logs but does not directly access them.

## Security Considerations

### Safe access with a remote prover

In order to prove a complete set of matches matching a given search pattern in a given block range, the prover needs to

- prove the _map value index_ range that corresponds to the searched block number range by proving the _block entries_ of `first_block - 1` and `last_block`
- prove the relevant rows of _filter maps_ based on _map index_ and _row index_ (verifier can determine the relevant rows in advance based on the _map values_ in the search pattern and the relevant _map value index_ range)
- prove the actual _index entry_ belonging to any potentially matching _map value index_ and also the _block entry_ with the same block number if the `blockHash` of the log position info is needed

Since all three steps can be realized with Merkle proofs of the same `LogIndex` structure referenced in the block headers, any search with a remove prover is as safe as the client's knowledge about the chain head.

### Deliberate false positive attacks

The design guarantees that false positive rates do even out statistically over several epochs, even in case of random collisions with very frequent values, ensuring that an excessive amount false positives will not make the bandwidth and processing costs of the search prohibitively high. All of this is true for random collisions only though, not deliberately created collisions. A deliberate attack on a certain important _map value_ in order to raise its false positive rate can not be ruled out entirely since with a low amount of filter data generated per _map value_ it is always possible to "mine" another value that generates colliding filter data. The column mapping used here makes this attack a lot harder though, since the _column index_ depends on both the _map value_ and the exact _map value index_, making this attack only possible for block builders who are probably offered MEV rewards for much more lucrative manipulations of the transaction set than making the search of certain events slightly more expensive.

## Copyright

Copyright and related rights waived via [CC0](../LICENSE.md).
