Description
Previously we thought that a 1 MB map can track page-level changes in 1 GB of data files precisely. However, it recently became evident that our ptrack map, being a basic hash table, behaves more like a Bloom filter with the number of hash functions k = 1; see https://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives. Such a filter naturally produces more collisions.
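To see why k = 1 behaves this way, here is a small stand-alone simulation (not the actual ptrack code; the hash function and sizes below are placeholders) that maps each changed page into a single slot and counts how many untouched pages falsely land in an occupied slot:

# Toy simulation of a k = 1 map: every page hashes into exactly one slot, so an
# untouched page that shares a slot with any changed page becomes a false positive.
MAP_SLOTS = 32*1024*1024 // 8   # 32 MB map, one 8-byte slot per entry
DB_PAGES = 22*1024*1024 // 8    # 22 GB database in 8 KB pages
CHANGED_PAGES = 1000*1024 // 8  # ~1 GB of changed pages

def slot(page_no):
    # stand-in for the real hash of a (relfilenode, blockno) pair
    return hash(("page", page_no)) % MAP_SLOTS

occupied = {slot(p) for p in range(CHANGED_PAGES)}  # slots hit by changed pages
false_hits = sum(slot(p) in occupied for p in range(CHANGED_PAGES, DB_PAGES))
print("Observed collision rate: ", false_hits / (DB_PAGES - CHANGED_PAGES))
# prints roughly 0.03, matching 1 - exp(-CHANGED_PAGES/MAP_SLOTS)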
For example, for a database of size 22 GB and ptrack.map_size = 32 MB we previously expected to get almost zero collisions. However, theory says that by tracking 1 GB of changed data we will also mark an extra ~640 MB:
import math

def col_proba(n, m, k=1):
    # Bloom filter false positive probability with k hash functions,
    # n inserted elements and m slots
    return (1 - math.exp(-k*n/m))**k

map_size = 1024*1024              # 1 MB in bytes
map_size = map_size*32            # ptrack.map_size = 32 MB, in bytes
map_pages = map_size // 8         # number of slots: one 64-bit uint (8 bytes) per slot
change_size = 1000*1024           # size of changes in data files, in KB (~1 GB)
change_pages = change_size // 8   # number of changed 8 KB pages
db_pages = 22*1024*1024 // 8      # 22 GB database in 8 KB pages
proba = col_proba(change_pages, map_pages, 1)
print("Collision probability: ", proba)
print("Extra data (MB): ", proba*(db_pages - change_pages)*8/1024)
Collision probability: 0.030056617868655988
Extra data (MB): 647.0588694764261
Unfortunately, this seems to be true: an experiment confirms that we mark ~1.6 GB of pages instead of 1 GB. The same experiment with a 16 MB map, a 24 GB database and a 1 GB insert: theory says there should be ~1375 MB of extra blocks, and in reality it turns out to be ~1275 MB, which is very close.
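For reference, the same formula can be re-run for this second configuration (reusing col_proba from the script above); the exact sizes below are my assumption, so the result differs slightly from the 1375 MB quoted, but it lands in the same ballpark:

map_pages = 16*1024*1024 // 8    # 16 MB map -> number of 8-byte slots
change_pages = 1000*1024 // 8    # ~1 GB of changes in 8 KB pages
db_pages = 24*1024*1024 // 8     # 24 GB database in 8 KB pages
proba = col_proba(change_pages, map_pages, 1)
print("Collision probability: ", proba)                             # ~0.059
print("Extra data (MB): ", proba*(db_pages - change_pages)*8/1024)  # ~1400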
Theoretically, we can substantially reduce the collision rate by storing each page's LSN in two cells (i.e. k = 2):
Collision probability: 0.003505804615154038
Extra data (MB): 75.47296175503612
Let's try to do that.
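As a starting point, here is a minimal conceptual sketch of how a two-slot (k = 2) map could mark and query pages; the slot hashing, page keys, map size and LSN values are placeholders, not the actual ptrack data structures:

# Conceptual sketch only: each page maps to two slots, marking updates both of
# them with the page's LSN, and a page counts as changed since start_lsn only
# if *both* slots hold an LSN >= start_lsn, so a false positive now requires
# two independent collisions.
MAP_SLOTS = 32*1024*1024 // 8   # placeholder map size
ptrack_map = [0] * MAP_SLOTS    # one LSN per slot, 0 = never touched

def slots(page_key):
    # two independent positions derived from one page identifier; the real map
    # would hash something like (relfilenode, forknum, blockno)
    return (hash(("h1", page_key)) % MAP_SLOTS,
            hash(("h2", page_key)) % MAP_SLOTS)

def mark_changed(page_key, lsn):
    for s in slots(page_key):
        ptrack_map[s] = max(ptrack_map[s], lsn)

def changed_since(page_key, start_lsn):
    return all(ptrack_map[s] >= start_lsn for s in slots(page_key))

# usage: page 42 of some relation is modified at LSN 200; the backup started at LSN 100
mark_changed(("rel_16397", 42), 200)
print(changed_since(("rel_16397", 42), 100))  # True: really changed
print(changed_since(("rel_16397", 43), 100))  # False, unless both of its slots collide

Note that with k = 2 each changed page is written into two slots, so the map fills up twice as fast; the (1 - exp(-k*n/m))**k formula above already accounts for that.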