
Comparing changes

base repository: postgresql-cfbot/postgresql
base: cf/5081~1
head repository: postgresql-cfbot/postgresql
compare: cf/5081
  • 6 commits
  • 35 files changed
  • 2 contributors

Commits on Apr 3, 2025

  1. Add nbtree skip scan optimizations.

    Teach nbtree composite index scans to opportunistically skip over
    irrelevant sections of the index given a query with an omitted
    prefix column.  When nbtree is passed input scan keys derived from a
    query predicate "WHERE b = 5", new nbtree preprocessing steps now output
    "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys.  That
    is, preprocessing generates a "skip array" (along with an associated
    scan key) for the omitted column "a", which makes it safe to mark the
    scan key on "b" as required to continue the scan.  This is far more
    efficient than a traditional full index scan whenever it allows the scan
    to skip over many irrelevant leaf pages, by iteratively repositioning
    itself using the keys on "a" and "b" together.
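    The transformation can be pictured with a toy Python sketch (not
    PostgreSQL code; all names are hypothetical).  A sorted list of
    (a, b) tuples stands in for a composite index, and the scan handles
    "WHERE b = 5" by repeatedly repositioning itself with binary search
    on the current "a" value, rather than reading every entry:

```python
# Toy simulation of a skip scan: "WHERE b = 5" on an index over (a, b)
# is executed as "WHERE a = ANY(<every a>) AND b = 5", repositioning
# with binary search instead of walking the whole "index".
import bisect

def skip_scan_b_equals(index, b_target):
    """index: list of (a, b) tuples in sorted order."""
    results = []
    descents = 0       # stand-in for index descents / repositionings
    pos = 0
    while pos < len(index):
        a_current = index[pos][0]
        # "Descend" to the first entry matching (a_current, b_target).
        descents += 1
        lo = bisect.bisect_left(index, (a_current, b_target))
        while lo < len(index) and index[lo] == (a_current, b_target):
            results.append(index[lo])
            lo += 1
        # Advance the skip array: seek past every (a_current, *) entry.
        pos = bisect.bisect_right(index, (a_current, float("inf")))
    return results, descents

index = sorted((a, b) for a in range(100) for b in range(10))
rows, descents = skip_scan_b_equals(index, 5)
```

    Here the scan performs one repositioning per distinct "a" value
    (100 of them) instead of reading all 1,000 entries; whether that is
    a win in a real index depends on how many leaf pages each
    repositioning lets the scan skip.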
    
    A skip array has "elements" that are generated procedurally and on
    demand, but otherwise works just like a regular ScalarArrayOp array.
    Preprocessing can freely add a skip array before or after any input
    ScalarArrayOp arrays.  Index scans with a skip array decide when and
    where to reposition the scan using the same approach as any other scan
    with array keys.  This design builds on the design for array advancement
    and primitive scan scheduling added to Postgres 17 by commit 5bf748b.
    
    The core B-Tree operator classes on most discrete types generate their
    array elements with the help of their own custom skip support routine.
    This infrastructure gives nbtree a way to generate the next required
    array element by incrementing (or decrementing) the current array value.
    It can reduce the number of index descents in cases where the next
    possible indexable value frequently turns out to be the next value
    stored in the index.  Opclasses that lack a skip support routine fall
    back on having nbtree "increment" (or "decrement") a skip array's
    current element by setting the NEXT (or PRIOR) scan key flag, without
    directly changing the scan key's sk_argument.  These sentinel values
    behave just like any other value from an array -- though they can never
    locate equal index tuples (they can only locate the next group of index
    tuples containing the next set of non-sentinel values that the scan's
    arrays need to advance to).
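    The two advancement strategies might be sketched like this in toy
    Python (illustrative only; NEXT stands in for the scan key flag,
    and the function name is made up):

```python
# With skip support, the next array element is generated directly by
# incrementing the current value.  Without it, a sentinel is produced
# that means "strictly after the current value": it can position the
# scan on the next group of tuples, but never equals a real value.
NEXT = object()  # stand-in for the NEXT scan key flag

def advance_skip_array(current, has_skip_support):
    """Generate a skip array's next element in a forward scan."""
    if has_skip_support:
        return current + 1      # e.g. integer opclasses: direct increment
    return (current, NEXT)      # sentinel: sorts just after `current`
```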
    
    Inequality scan keys can affect how skip arrays generate their values.
    Their range is constrained by the inequalities.  For example, a skip
    array on "a" will only use element values 1 and 2 given a qual such as
    "WHERE a BETWEEN 1 AND 2 AND b = 66".  A scan using such a skip array
    has almost identical performance characteristics to one with the qual
    "WHERE a = ANY('{1, 2}') AND b = 66".  The scan will be much faster when
    it can be executed as two selective primitive index scans instead of a
    single very large index scan that reads many irrelevant leaf pages.
    However, the array transformation process won't always lead to improved
    performance at runtime.  Much depends on physical index characteristics.
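    For a discrete type with skip support, constrained element
    generation can be sketched as a simple bounded increment (a toy
    model, not the real implementation; names are hypothetical):

```python
# Generate the elements a skip array on an int column could take,
# clamped by low_compare/high_compare-style inequality keys.
def skip_array_elements(low, high, low_strict=False, high_strict=False):
    current = low + 1 if low_strict else low    # "a > 5" starts at 6
    stop = high - 1 if high_strict else high
    while current <= stop:
        yield current
        current += 1    # skip support: increment to next possible value

# "WHERE a BETWEEN 1 AND 2 AND b = 66" constrains the skip array on "a":
elements = list(skip_array_elements(1, 2))
```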
    
    B-Tree preprocessing is optimistic about skipping working out: it
    applies static, generic rules when determining where to generate skip
    arrays, which assumes that the runtime overhead of maintaining skip
    arrays will pay for itself -- or lead to only a modest performance loss.
    As things stand, these assumptions are much too optimistic: skip array
    maintenance will lead to unacceptable regressions with unsympathetic
    queries (queries whose scan can't skip over many irrelevant leaf pages).
    An upcoming commit will address the problems in this area by enhancing
    _bt_readpage's approach to saving cycles on scan key evaluation, making
    it work in a way that directly considers the needs of = array keys
    (particularly = skip array keys).
    
    Author: Peter Geoghegan <[email protected]>
    Reviewed-By: Masahiro Ikeda <[email protected]>
    Reviewed-By: Heikki Linnakangas <[email protected]>
    Reviewed-By: Tomas Vondra <[email protected]>
    Reviewed-By: Matthias van de Meent <[email protected]>
    Reviewed-By: Aleksander Alekseev <[email protected]>
    Reviewed-By: Alena Rybakina <[email protected]>
    Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
    petergeoghegan authored and Commitfest Bot committed Apr 3, 2025
    Commit: 526b08a
  2. Enhance nbtree tuple scan key optimizations.

    Postgres 17 commit e0b1ee1 added two closely related nbtree
    optimizations: the "prechecked" and "firstpage" optimizations.  Both
    optimizations avoided needlessly evaluating keys that are guaranteed to
    be satisfied by applying page-level context.  These optimizations were
    adapted to work with the nbtree ScalarArrayOp execution improvements
    made a few months later, which became commit 5bf748b.
    
    The "prechecked" design had a number of notable weak points.  It didn't
    account for the fact that an = array scan key's sk_argument field might
    need to advance at the point of the page precheck (it didn't check the
    precheck tuple against the key's array, only the key's sk_argument,
    which needlessly made it ineffective in corner cases involving stepping
    to a page having advanced the scan's arrays using a truncated high key).
    It was also an "all or nothing" optimization: either it was completely
    effective (skipping all required-in-scan-direction keys against all
    attributes) for the whole page, or it didn't work at all.  This also
    implied that it couldn't be used on pages where the scan had to
    terminate before reaching the end of the page due to an unsatisfied
    low-order key setting continuescan=false.
    
    Replace both optimizations with a new optimization without any of these
    weak points.  This works by giving affected _bt_readpage calls a scankey
    offset that its _bt_checkkeys calls start at (an offset to the first key
    that might not be satisfied by every non-pivot tuple from the page).
    The new optimization is activated at the same point as the previous
    "prechecked" optimization (at the start of a _bt_readpage of any page
    after the scan's first).
    
    The old "prechecked" optimization worked off of the highest non-pivot
    tuple on the page (or the lowest, when scanning backwards), but the new
    "startikey" optimization always works off of a pair of non-pivot tuples
    (the lowest and the highest, taken together).  This approach allows the
    "startikey" optimization to bypass = array key comparisons whenever
    they're satisfied by _some_ element (not necessarily the current one).
    This is useful for SAOP array keys (it fixes the issue with truncated
    high keys), and is needed to get the most out of range skip array keys
    (we expect to be able to bypass range skip array = keys when a range of
    values on the page all satisfy the key, even when there are multiple
    values, provided they all "satisfy some range skip array element").
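    The page-wide reasoning can be illustrated with a toy Python sketch
    (a simplification of the real rules; names are hypothetical, and
    per-attribute "accepted value" sets stand in for scan keys):

```python
# If both the lowest and the highest tuple on a page satisfy a key
# (for an = array key, satisfy *some* array element), then every tuple
# in between must too, so key evaluation can start at a later offset.
def compute_startikey(lowest, highest, keys):
    """keys: one set of accepted values per leading index attribute.
    Returns the first key offset not provably satisfied page-wide."""
    for ikey, accepted in enumerate(keys):
        if lowest[ikey] not in accepted or highest[ikey] not in accepted:
            return ikey
    return len(keys)

# Page whose tuples all have "a" in a range skip array's range {1, 2},
# while "b" values vary: evaluation can start at the key on "b".
startikey = compute_startikey((1, 7), (2, 42), [{1, 2}, {5}])
```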
    
    Although this is independently useful work, the main motivation is to
    fix regressions in index scans that are nominally eligible to use skip
    scan, but can never actually benefit from skipping.  These are cases
    where a leading prefix column contains many distinct values, especially
    when the number of values approaches the total number of index tuples,
    where skipping can never be profitable.  The CPU cost of skip array
    maintenance is by far the main cost that must be kept under control.
    
    Skip scan's approach of adding skip arrays during preprocessing and then
    fixing (or significantly ameliorating) the resulting regressions seen in
    unsympathetic cases is enabled by the optimization added by this commit
    (and by the "look ahead" optimization introduced by commit 5bf748b).
    This allows the planner to avoid generating distinct, competing index
    paths (one path for skip scan, another for an equivalent traditional
    full index scan).  The overall effect is to make scan runtime close to
    optimal, even when the planner works off an incorrect cardinality
    estimate.  Scans will also perform well given a skipped column with data
    skew: individual groups of pages with many distinct values in respect of
    a skipped column can be read about as efficiently as before, without
    having to give up on skipping over other provably-irrelevant leaf pages.
    
    Author: Peter Geoghegan <[email protected]>
    Reviewed-By: Heikki Linnakangas <[email protected]>
    Reviewed-By: Masahiro Ikeda <[email protected]>
    Reviewed-By: Matthias van de Meent <[email protected]>
    Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com
    Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com
    petergeoghegan authored and Commitfest Bot committed Apr 3, 2025
    Commit: 6113e72
  3. Improve skip scan primitive scan scheduling.

    Fixes a few remaining cases where affected skip scans never quite manage
    to reach the point of being able to apply the "passed first page"
    heuristic added by commit 9a2e2a2.  They only need to manage to get
    there once to converge on full index scan behavior, but it was still
    possible for that to never happen, with the wrong workload.
    
    Author: Peter Geoghegan <[email protected]>
    Discussion: https://postgr.es/m/CAH2-Wz=RVdG3zWytFWBsyW7fWH7zveFvTHed5JKEsuTT0RCO_A@mail.gmail.com
    petergeoghegan authored and Commitfest Bot committed Apr 3, 2025
    Commit: a62d29e
  4. Apply low-order skip key in _bt_first more often.

    Convert low_compare and high_compare nbtree skip array inequalities
    (with opclasses that offer skip support) such that _bt_first is
    consistently able to use later keys when descending the tree within
    _bt_first.
    
    For example, an index qual "WHERE a > 5 AND b = 2" is now converted to
    "WHERE a >= 6 AND b = 2" by a new preprocessing step that takes place
    after a final low_compare and/or high_compare are chosen by all earlier
    preprocessing steps.  That way the scan's initial call to _bt_first will
    use "WHERE a >= 6 AND b = 2" to find the initial leaf level position,
    rather than merely using "WHERE a > 5" -- "b = 2" can always be applied.
    This has a decent chance of making the scan avoid an extra _bt_first
    call that would otherwise be needed just to determine the lowest-sorting
    "a" value in the index (the lowest that still satisfies "WHERE a > 5").
    
    The transformation process can only lower the total number of index
    pages read when the use of a more restrictive set of initial positioning
    keys in _bt_first actually allows the scan to land on some later leaf
    page directly, relative to the unoptimized case (or on an earlier leaf
    page directly, when scanning backwards).  The savings can be far greater
    when affected skip arrays come after some higher-order array.  For
    example, a qual "WHERE x IN (1, 2, 3) AND y > 5 AND z = 2" can now save
    as many as 3 _bt_first calls as a result of these transformations (there
    can be as many as 1 _bt_first call saved per "x" array element).
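    The conversion itself amounts to turning a strict bound into an
    inclusive one via skip support, which might be sketched like this
    (a toy int-only model; the function name is made up):

```python
# With skip support on a discrete type, "a > 5" is equivalent to
# "a >= 6", and an inclusive bound lets the initial positioning step
# also apply later keys such as "b = 2" when descending the tree.
def make_inclusive(op, value):
    """Rewrite a strict int inequality as its inclusive equivalent."""
    if op == ">":
        return ">=", value + 1   # skip support increment
    if op == "<":
        return "<=", value - 1   # skip support decrement
    return op, value             # already inclusive: unchanged

converted = make_inclusive(">", 5)
```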
    
    Author: Peter Geoghegan <[email protected]>
    Reviewed-By: Matthias van de Meent <[email protected]>
    Discussion: https://postgr.es/m/CAH2-Wz=FJ78K3WsF3iWNxWnUCY9f=Jdg3QPxaXE=uYUbmuRz5Q@mail.gmail.com
    petergeoghegan authored and Commitfest Bot committed Apr 3, 2025
    Commit: a0a9db4
  5. DEBUG: Add skip scan disable GUCs.

    petergeoghegan authored and Commitfest Bot committed Apr 3, 2025
    Commit: 055e60d
  6. [CF 5081] v33 - nbtree skip scan

    This branch was automatically generated by a robot using patch(es)
    from an email thread registered at:
    
    https://commitfest.postgresql.org//5081
    
    The branch will be overwritten each time a new patch version is posted to
    the thread, and also periodically to check for bitrot caused by changes
    on the master branch.
    
    Patch(es): https://www.postgresql.org/message-id/CAH2-WznK55udrFQm4umLjOmxcBh8k_5Ybxmt1_rXSYW9N8j64A@mail.gmail.com
    Author(s): Peter Geoghegan
    Commitfest Bot committed Apr 3, 2025
    Commit: d819362