Description
Previously we thought that a 1 MB map can track page-level changes in 1 GB of data files precisely. However, it recently became evident that our ptrack map, being a basic hash table, behaves more like a Bloom filter with the number of hash functions k = 1; see https://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives. Such a filter naturally produces more collisions.
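To see why k = 1 behaves this way, here is a small stand-alone simulation (not the actual ptrack code; the hash function and sizes below are placeholders) that maps each changed page into a single slot and counts how many untouched pages falsely land in an occupied slot:

# Toy simulation of a k = 1 map: every page hashes into exactly one slot, so an
# untouched page that shares a slot with any changed page becomes a false positive.
MAP_SLOTS = 32*1024*1024 // 8   # 32 MB map, one 8-byte slot per entry
DB_PAGES = 22*1024*1024 // 8    # 22 GB database in 8 KB pages
CHANGED_PAGES = 1000*1024 // 8  # ~1 GB of changed pages

def slot(page_no):
    # stand-in for the real hash of a (relfilenode, blockno) pair
    return hash(("page", page_no)) % MAP_SLOTS

occupied = {slot(p) for p in range(CHANGED_PAGES)}  # slots hit by changed pages
false_hits = sum(slot(p) in occupied for p in range(CHANGED_PAGES, DB_PAGES))
print("Observed collision rate: ", false_hits / (DB_PAGES - CHANGED_PAGES))
# prints roughly 0.03, matching 1 - exp(-CHANGED_PAGES/MAP_SLOTS)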
For example, for a database of size 22 GB and ptrack.map_size = 32 MB we previously expected to get almost zero collisions. However, theory says that by tracking 1 GB of changed data we will also mark an extra ~640 MB:
import math

def col_proba(n, m, k=1):
    # Bloom filter false positive probability with k hash functions,
    # n inserted elements and m slots
    return (1 - math.exp(-k*n/m))**k

map_size = 1024*1024              # 1 MB in bytes
map_size = map_size*32            # ptrack.map_size = 32 MB, in bytes
map_pages = map_size // 8         # number of slots: one 64-bit uint (8 bytes) per slot
change_size = 1000*1024           # size of changes in data files, in KB (~1 GB)
change_pages = change_size // 8   # number of changed 8 KB pages
db_pages = 22*1024*1024 // 8      # 22 GB database in 8 KB pages
proba = col_proba(change_pages, map_pages, 1)
print("Collision probability: ", proba)
print("Extra data (MB): ", proba*(db_pages - change_pages)*8/1024)
Collision probability: 0.030056617868655988
Extra data (MB): 647.0588694764261
Unfortunately, this seems to be true: an experiment confirms that we mark ~1.6 GB of pages instead of 1 GB. The same experiment with a 16 MB map, a 24 GB database and a 1 GB insert: theory says there should be ~1375 MB of extra blocks, and in reality it turns out to be ~1275 MB, which is very close.
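For reference, the same formula can be re-run for this second configuration (reusing col_proba from the script above); the exact sizes below are my assumption, so the result differs slightly from the 1375 MB quoted, but it lands in the same ballpark:

map_pages = 16*1024*1024 // 8    # 16 MB map -> number of 8-byte slots
change_pages = 1000*1024 // 8    # ~1 GB of changes in 8 KB pages
db_pages = 24*1024*1024 // 8     # 24 GB database in 8 KB pages
proba = col_proba(change_pages, map_pages, 1)
print("Collision probability: ", proba)                             # ~0.059
print("Extra data (MB): ", proba*(db_pages - change_pages)*8/1024)  # ~1400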
Theoretically, we can substantially reduce the collision rate by storing each page's LSN in two cells (i.e. k = 2):
Collision probability: 0.003505804615154038
Extra data (MB): 75.47296175503612
Let's try to do that.
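As a starting point, here is a minimal conceptual sketch of how a two-slot (k = 2) map could mark and query pages; the slot hashing, page keys, map size and LSN values are placeholders, not the actual ptrack data structures:

# Conceptual sketch only: each page maps to two slots, marking updates both of
# them with the page's LSN, and a page counts as changed since start_lsn only
# if *both* slots hold an LSN >= start_lsn, so a false positive now requires
# two independent collisions.
MAP_SLOTS = 32*1024*1024 // 8   # placeholder map size
ptrack_map = [0] * MAP_SLOTS    # one LSN per slot, 0 = never touched

def slots(page_key):
    # two independent positions derived from one page identifier; the real map
    # would hash something like (relfilenode, forknum, blockno)
    return (hash(("h1", page_key)) % MAP_SLOTS,
            hash(("h2", page_key)) % MAP_SLOTS)

def mark_changed(page_key, lsn):
    for s in slots(page_key):
        ptrack_map[s] = max(ptrack_map[s], lsn)

def changed_since(page_key, start_lsn):
    return all(ptrack_map[s] >= start_lsn for s in slots(page_key))

# usage: page 42 of some relation is modified at LSN 200; the backup started at LSN 100
mark_changed(("rel_16397", 42), 200)
print(changed_since(("rel_16397", 42), 100))  # True: really changed
print(changed_since(("rel_16397", 43), 100))  # False, unless both of its slots collide

Note that with k = 2 each changed page is written into two slots, so the map fills up twice as fast; the (1 - exp(-k*n/m))**k formula above already accounts for that.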