Partitioned Elias–Fano indexes

Partitioned Elias–Fano (PEF) indexes are a compressed data structure designed for efficiently representing sorted integer sequences, notably inverted indexes in information retrieval. Introduced by Giuseppe Ottaviano and Rossano Venturini in 2014, PEF indexes enhance classic Elias–Fano encoding by dividing sequences into partitions or chunks to leverage local clustering, thus achieving superior compression without sacrificing query speed.

Background

In search engines and information retrieval systems, an inverted index maps terms to "posting lists"—sorted sequences of document IDs where a specific term appears. Efficiently compressing these monotonically increasing sequences is critical for reducing memory consumption and improving query response times.

Elias–Fano encoding, proposed independently by Peter Elias and Robert Fano in the 1970s, provides a "quasi-succinct" method for this compression, using close to the theoretical minimum number of bits required. It operates by splitting each integer into high and low bits. The low bits are stored explicitly in a bit vector, while the high bits are encoded in unary. A major advantage of Elias-Fano over other delta-encoding schemes (like variable-byte or Golomb coding) is that it supports constant-time random access and fast skipping without needing to decompress the entire list. This makes it especially suitable for intersection and merging operations during boolean AND queries.

Partitioned Elias–Fano technique

While classic Elias-Fano encoding provides excellent query performance, it fails to exploit the local clustering often found in real-world inverted lists (i.e., long subsequences of documents with close IDs). The partitioned Elias–Fano method extends the standard encoding by dividing integer sequences into smaller sublists or chunks. Each chunk is compressed individually, which allows the encoding to better adapt to the local statistics of the data.

This two-level data structure consists of:

  • Top-level sequence: Encodes chunk descriptors (metadata and endpoints) using standard Elias–Fano encoding. This acts as a routing layer to allow quick navigation between chunks.
  • Chunk-level sequences: Compresses each chunk individually, choosing the most optimal encoding strategy based on the local structure of the data in that specific chunk.

Optimal Partitioning

Chunk partitioning can be implemented with fixed-length blocks or variable-length blocks. Variable-length partitioning provides significantly superior compression by adaptively setting chunk boundaries based on the data distribution. Because finding the absolute optimal partitioning configuration can be computationally expensive, Ottaviano and Venturini introduced a linear-time optimization algorithm. It reduces the partition problem to finding the shortest path on a specific directed acyclic graph (DAG), allowing the system to efficiently identify the minimum-space partition up to an arbitrarily small approximation factor.

Performance

In their 2014 study, Ottaviano and Venturini reported that partitioned Elias–Fano indexes significantly improved compression over standard Elias–Fano indexes, achieving up to double the compression efficiency on highly clustered datasets. Their benchmarks demonstrated that PEF maintained competitive query performance, particularly for intersection and skip-heavy queries common in web search applications. When compared to other state-of-the-art compressed encodings at the time—such as gamma-delta-Golomb coding and frame-of-reference encoding (PForDelta)—partitioned Elias–Fano exhibited the best trade-off between compression size and query speed.

Applications and adoption

Partitioned Elias–Fano indexes and their variations have become highly influential in modern search engine architecture, being cited by over 170 academic papers in the field of information retrieval. Academic implementations include the PISA toolkit (Performant Indexes and Search for Academia), an open-source search engine designed for high-performance information retrieval research.

Commercially, standard and partitioned Elias-Fano encoding concepts influence systems like Apache Lucene (which underpins popular enterprise search platforms like Elasticsearch and Solr) and Facebook’s Graph Search system (Unicorn), which utilized Elias–Fano encoding for rapid graph queries at scale.

Further developments in this space include clustered Elias–Fano indexes, which improve upon PEF by exploiting redundancy across multiple sequences, and dynamic Elias-Fano representations that allow for append-only operations while maintaining optimal space boundaries.