Reduce number of collisions in the ptrack map #5

Closed
@ololobus

Description

Previously we assumed that 1 MB of map is enough to track changes page-by-page in 1 GB of data files. However, it recently became evident that our ptrack map, a basic hash table, actually behaves like a Bloom filter with the number of hash functions k = 1 (see https://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives). Such a filter naturally produces more collisions.
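
For reference, the standard false-positive estimate for such a filter is

p = (1 - e^(-k*n/m))^k

where n is the number of changed pages, m is the number of slots in the map, and k is the number of hash functions; the snippet below computes exactly this.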

For example, with a 22 GB database and ptrack.map_size = 32 MB we previously expected almost zero collisions. However, the theory says that tracking 1 GB of changes will mark roughly an extra 640 MB of unchanged data:

import math

def col_proba(n, m, k=1):
    # Bloom filter false positive probability: n inserted items, m slots, k hash functions
    return (1 - math.exp(-k*n/m))**k

map_size = 32*1024*1024    # ptrack.map_size = 32 MB, in bytes
map_pages = map_size // 8  # number of map slots: one 64-bit LSN (8 bytes) per slot

change_size = 1000*1024          # size of changes in data files, in KB (~1 GB)
change_pages = change_size // 8  # number of changed 8 KB pages

db_pages = 22*1024*1024 // 8  # 22 GB database in 8 KB pages

proba = col_proba(change_pages, map_pages, 1)

print("Collision probability: ", proba)
print("Extra data (MB): ", proba*(db_pages - change_pages)*8/1024)

Collision probability:  0.030056617868655988
Extra data (MB):  647.0588694764261

Unfortunately, this seems to be true: an experiment confirms that we mark 1.6 GB of pages instead of 1 GB. The same experiment with a 16 MB map, a 24 GB database, and a 1 GB insert: theory predicts about 1375 MB of extra blocks, while in reality it appears to be ~1275 MB, which is quite close.
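
For reference, the second case can be re-checked with the same col_proba helper from the snippet above; the sizes assumed below are approximate, so the printed value lands close to, but not exactly at, the 1375 MB figure quoted.

# Second scenario, reusing col_proba from the snippet above
map_slots_16 = 16*1024*1024 // 8   # 16 MB map, 8 bytes per slot
changed_pages_1g = 1000*1024 // 8  # ~1 GB of changes in 8 KB pages
db_pages_24 = 24*1024*1024 // 8    # 24 GB database in 8 KB pages

proba_16 = col_proba(changed_pages_1g, map_slots_16, 1)
print("Extra data (MB): ", proba_16*(db_pages_24 - changed_pages_1g)*8/1024)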

Theoretically, we can substantially reduce the collision rate by storing each page LSN in two cells (i.e. k = 2):

Collision probability:  0.003505804615154038
Extra data (MB):  75.47296175503612

Let's try to do that.
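
A minimal Python sketch of the k = 2 idea (the hashing, names, and structure are illustrative only, not the actual ptrack implementation): the write path stores the page LSN in two independently chosen slots, and the read path reports a page as changed only when both slots carry a newer LSN, so a single colliding slot is no longer enough to cause a false positive.

import hashlib

PAGE_LSN_SLOTS = 2  # k = 2

class PtrackMapSketch:
    # Toy model of the ptrack map: an array of LSN slots, k slots per page

    def __init__(self, n_slots):
        self.slots = [0] * n_slots  # 0 means "never changed"

    def _slot_indices(self, relfilenode, blocknum):
        # Derive k independent slot indices from the page identity.
        # Real ptrack hashes the relation/fork/block identifiers; this is a stand-in.
        for seed in range(PAGE_LSN_SLOTS):
            h = hashlib.blake2b(f"{relfilenode}:{blocknum}:{seed}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % len(self.slots)

    def mark_changed(self, relfilenode, blocknum, lsn):
        # Write the page LSN into *both* slots, keeping the newest value
        for i in self._slot_indices(relfilenode, blocknum):
            self.slots[i] = max(self.slots[i], lsn)

    def changed_since(self, relfilenode, blocknum, start_lsn):
        # Report the page as changed only if *all* its slots carry a newer LSN
        return all(self.slots[i] >= start_lsn
                   for i in self._slot_indices(relfilenode, blocknum))

Writing to two slots doubles the update cost per changed page, but by the estimate above it shrinks the expected extra data for the same 32 MB map from ~647 MB to ~75 MB.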

Labels: enhancement (New feature or request)