
Comparing changes

base repository: postgresql-cfbot/postgresql
base: cf/5081~1
head repository: postgresql-cfbot/postgresql
compare: cf/5081
  • 6 commits
  • 35 files changed
  • 2 contributors

Commits on Apr 3, 2025

  1. Add nbtree skip scan optimizations.

    Teach nbtree composite index scans to opportunistically skip over
    irrelevant sections of the index given a query with an omitted
    prefix column.  When nbtree is passed input scan keys derived from a
    query predicate "WHERE b = 5", new nbtree preprocessing steps now output
    "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys.  That
    is, preprocessing generates a "skip array" (along with an associated
    scan key) for the omitted column "a", which makes it safe to mark the
    scan key on "b" as required to continue the scan.  This is far more
    efficient than a traditional full index scan whenever it allows the scan
    to skip over many irrelevant leaf pages, by iteratively repositioning
    itself using the keys on "a" and "b" together.
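    The transformation can be pictured with a toy Python sketch (not
    PostgreSQL code; all names are hypothetical).  A sorted list of
    (a, b) tuples stands in for a composite index, and the scan handles
    "WHERE b = 5" by repeatedly repositioning itself with binary search
    on the current "a" value, rather than reading every entry:

```python
# Toy simulation of a skip scan: "WHERE b = 5" on an index over (a, b)
# is executed as "WHERE a = ANY(<every a>) AND b = 5", repositioning
# with binary search instead of walking the whole "index".
import bisect

def skip_scan_b_equals(index, b_target):
    """index: list of (a, b) tuples in sorted order."""
    results = []
    descents = 0       # stand-in for index descents / repositionings
    pos = 0
    while pos < len(index):
        a_current = index[pos][0]
        # "Descend" to the first entry matching (a_current, b_target).
        descents += 1
        lo = bisect.bisect_left(index, (a_current, b_target))
        while lo < len(index) and index[lo] == (a_current, b_target):
            results.append(index[lo])
            lo += 1
        # Advance the skip array: seek past every (a_current, *) entry.
        pos = bisect.bisect_right(index, (a_current, float("inf")))
    return results, descents

index = sorted((a, b) for a in range(100) for b in range(10))
rows, descents = skip_scan_b_equals(index, 5)
```

    Here the scan performs one repositioning per distinct "a" value
    (100 of them) instead of reading all 1,000 entries; whether that is
    a win in a real index depends on how many leaf pages each
    repositioning lets the scan skip.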
    
    A skip array has "elements" that are generated procedurally and on
    demand, but otherwise works just like a regular ScalarArrayOp array.
    Preprocessing can freely add a skip array before or after any input
    ScalarArrayOp arrays.  Index scans with a skip array decide when and
    where to reposition the scan using the same approach as any other scan
    with array keys.  This design builds on the design for array advancement
    and primitive scan scheduling added to Postgres 17 by commit 5bf748b.
    
    The core B-Tree operator classes on most discrete types generate their
    array elements with the help of their own custom skip support routine.
    This infrastructure gives nbtree a way to generate the next required
    array element by incrementing (or decrementing) the current array value.
    It can reduce the number of index descents in cases where the next
    possible indexable value frequently turns out to be the next value
    stored in the index.  Opclasses that lack a skip support routine fall
    back on having nbtree "increment" (or "decrement") a skip array's
    current element by setting the NEXT (or PRIOR) scan key flag, without
    directly changing the scan key's sk_argument.  These sentinel values
    behave just like any other value from an array -- though they can never
    locate equal index tuples (they can only locate the next group of index
    tuples containing the next set of non-sentinel values that the scan's
    arrays need to advance to).
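    The two advancement strategies might be sketched like this in toy
    Python (illustrative only; NEXT stands in for the scan key flag,
    and the function name is made up):

```python
# With skip support, the next array element is generated directly by
# incrementing the current value.  Without it, a sentinel is produced
# that means "strictly after the current value": it can position the
# scan on the next group of tuples, but never equals a real value.
NEXT = object()  # stand-in for the NEXT scan key flag

def advance_skip_array(current, has_skip_support):
    """Generate a skip array's next element in a forward scan."""
    if has_skip_support:
        return current + 1      # e.g. integer opclasses: direct increment
    return (current, NEXT)      # sentinel: sorts just after `current`
```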
    
    Inequality scan keys can affect how skip arrays generate their values.
    Their range is constrained by the inequalities.  For example, a skip
    array on "a" will only use element values 1 and 2 given a qual such as
    "WHERE a BETWEEN 1 AND 2 AND b = 66".  A scan using such a skip array
    has almost identical performance characteristics to one with the qual
    "WHERE a = ANY('{1, 2}') AND b = 66".  The scan will be much faster when
    it can be executed as two selective primitive index scans instead of a
    single very large index scan that reads many irrelevant leaf pages.
    However, the array transformation process won't always lead to improved
    performance at runtime.  Much depends on physical index characteristics.
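    For a discrete type with skip support, constrained element
    generation can be sketched as a simple bounded increment (a toy
    model, not the real implementation; names are hypothetical):

```python
# Generate the elements a skip array on an int column could take,
# clamped by low_compare/high_compare-style inequality keys.
def skip_array_elements(low, high, low_strict=False, high_strict=False):
    current = low + 1 if low_strict else low    # "a > 5" starts at 6
    stop = high - 1 if high_strict else high
    while current <= stop:
        yield current
        current += 1    # skip support: increment to next possible value

# "WHERE a BETWEEN 1 AND 2 AND b = 66" constrains the skip array on "a":
elements = list(skip_array_elements(1, 2))
```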
    
    B-Tree preprocessing is optimistic about skipping working out: it
    applies static, generic rules when determining where to generate skip
    arrays, which assumes that the runtime overhead of maintaining skip
    arrays will pay for itself -- or lead to only a modest performance loss.
    As things stand, these assumptions are much too optimistic: skip array
    maintenance will lead to unacceptable regressions with unsympathetic
    queries (queries whose scan can't skip over many irrelevant leaf pages).
    An upcoming commit will address the problems in this area by enhancing
    _bt_readpage's approach to saving cycles on scan key evaluation, making
    it work in a way that directly considers the needs of = array keys
    (particularly = skip array keys).
    
    Author: Peter Geoghegan <[email protected]>
    Reviewed-By: Masahiro Ikeda <[email protected]>
    Reviewed-By: Heikki Linnakangas <[email protected]>
    Reviewed-By: Tomas Vondra <[email protected]>
    Reviewed-By: Matthias van de Meent <[email protected]>
    Reviewed-By: Aleksander Alekseev <[email protected]>
    Reviewed-By: Alena Rybakina <[email protected]>
    Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
    petergeoghegan authored and Commitfest Bot committed Apr 3, 2025
    Commit: 526b08a
  2. Enhance nbtree tuple scan key optimizations.

    Postgres 17 commit e0b1ee1 added two closely related nbtree
    optimizations: the "prechecked" and "firstpage" optimizations.  Both
    optimizations avoided needlessly evaluating keys that are guaranteed to
    be satisfied by applying page-level context.  These optimizations were
    adapted to work with the nbtree ScalarArrayOp execution improvements
    made a few months later, which became commit 5bf748b.
    
    The "prechecked" design had a number of notable weak points.  It didn't
    account for the fact that an = array scan key's sk_argument field might
    need to advance at the point of the page precheck (it didn't check the
    precheck tuple against the key's array, only the key's sk_argument,
    which needlessly made it ineffective in corner cases involving stepping
    to a page having advanced the scan's arrays using a truncated high key).
    It was also an "all or nothing" optimization: either it was completely
    effective (skipping all required-in-scan-direction keys against all
    attributes) for the whole page, or it didn't work at all.  This also
    implied that it couldn't be used on pages where the scan had to
    terminate before reaching the end of the page due to an unsatisfied
    low-order key setting continuescan=false.
    
    Replace both optimizations with a new optimization without any of these
    weak points.  This works by giving affected _bt_readpage calls a scankey
    offset that its _bt_checkkeys calls start at (an offset to the first key
    that might not be satisfied by every non-pivot tuple from the page).
    The new optimization is activated at the same point as the previous
    "prechecked" optimization (at the start of a _bt_readpage of any page
    after the scan's first).
    
    The old "prechecked" optimization worked off of the highest non-pivot
    tuple on the page (or the lowest, when scanning backwards), but the new
    "startikey" optimization always works off of a pair of non-pivot tuples
    (the lowest and the highest, taken together).  This approach allows the
    "startikey" optimization to bypass = array key comparisons whenever
    they're satisfied by _some_ element (not necessarily the current one).
    This is useful for SAOP array keys (it fixes the issue with truncated
    high keys), and is needed to get the most out of range skip array keys
    (we expect to be able to bypass range skip array = keys when a range of
    values on the page all satisfy the key, even when there are multiple
    values, provided they all "satisfy some range skip array element").
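    The page-wide reasoning can be illustrated with a toy Python sketch
    (a simplification of the real rules; names are hypothetical, and
    per-attribute "accepted value" sets stand in for scan keys):

```python
# If both the lowest and the highest tuple on a page satisfy a key
# (for an = array key, satisfy *some* array element), then every tuple
# in between must too, so key evaluation can start at a later offset.
def compute_startikey(lowest, highest, keys):
    """keys: one set of accepted values per leading index attribute.
    Returns the first key offset not provably satisfied page-wide."""
    for ikey, accepted in enumerate(keys):
        if lowest[ikey] not in accepted or highest[ikey] not in accepted:
            return ikey
    return len(keys)

# Page whose tuples all have "a" in a range skip array's range {1, 2},
# while "b" values vary: evaluation can start at the key on "b".
startikey = compute_startikey((1, 7), (2, 42), [{1, 2}, {5}])
```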
    
    Although this is independently useful work, the main motivation is to
    fix regressions in index scans that are nominally eligible to use skip
    scan, but can never actually benefit from skipping.  These are cases
    where a leading prefix column contains many distinct values, especially
    when the number of values approaches the total number of index tuples,
    where skipping can never be profitable.  The CPU cost of skip array
    maintenance is by far the main cost that must be kept under control.
    
    Skip scan's approach of adding skip arrays during preprocessing and then
    fixing (or significantly ameliorating) the resulting regressions seen in
    unsympathetic cases is enabled by the optimization added by this commit
    (and by the "look ahead" optimization introduced by commit 5bf748b).
    This allows the planner to avoid generating distinct, competing index
    paths (one path for skip scan, another for an equivalent traditional
    full index scan).  The overall effect is to make scan runtime close to
    optimal, even when the planner works off an incorrect cardinality
    estimate.  Scans will also perform well given a skipped column with data
    skew: individual groups of pages with many distinct values in respect of
    a skipped column can be read about as efficiently as before, without
    having to give up on skipping over other provably-irrelevant leaf pages.
    
    Author: Peter Geoghegan <[email protected]>
    Reviewed-By: Heikki Linnakangas <[email protected]>
    Reviewed-By: Masahiro Ikeda <[email protected]>
    Reviewed-By: Matthias van de Meent <[email protected]>
    Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com
    Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com
    petergeoghegan authored and Commitfest Bot committed Apr 3, 2025
    Commit: 6113e72
  3. Improve skip scan primitive scan scheduling.

    Fixes a few remaining cases where affected skip scans never quite manage
    to reach the point of being able to apply the "passed first page"
    heuristic added by commit 9a2e2a2.  They only need to manage to get
    there once to converge on full index scan behavior, but it was still
    possible for that to never happen, with the wrong workload.
    
    Author: Peter Geoghegan <[email protected]>
    Discussion: https://postgr.es/m/CAH2-Wz=RVdG3zWytFWBsyW7fWH7zveFvTHed5JKEsuTT0RCO_A@mail.gmail.com
    petergeoghegan authored and Commitfest Bot committed Apr 3, 2025
    Commit: a62d29e
  4. Apply low-order skip key in _bt_first more often.

    Convert low_compare and high_compare nbtree skip array inequalities
    (with opclasses that offer skip support) such that _bt_first is
    consistently able to use later keys when descending the tree within
    _bt_first.
    
    For example, an index qual "WHERE a > 5 AND b = 2" is now converted to
    "WHERE a >= 6 AND b = 2" by a new preprocessing step that takes place
    after a final low_compare and/or high_compare are chosen by all earlier
    preprocessing steps.  That way the scan's initial call to _bt_first will
    use "WHERE a >= 6 AND b = 2" to find the initial leaf level position,
    rather than merely using "WHERE a > 5" -- "b = 2" can always be applied.
    This has a decent chance of making the scan avoid an extra _bt_first
    call that would otherwise be needed just to determine the lowest-sorting
    "a" value in the index (the lowest that still satisfies "WHERE a > 5").
    
    The transformation process can only lower the total number of index
    pages read when the use of a more restrictive set of initial positioning
    keys in _bt_first actually allows the scan to land on some later leaf
    page directly, relative to the unoptimized case (or on an earlier leaf
    page directly, when scanning backwards).  The savings can be far greater
    when affected skip arrays come after some higher-order array.  For
    example, a qual "WHERE x IN (1, 2, 3) AND y > 5 AND z = 2" can now save
    as many as 3 _bt_first calls as a result of these transformations (there
    can be as many as 1 _bt_first call saved per "x" array element).
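    The conversion itself amounts to turning a strict bound into an
    inclusive one via skip support, which might be sketched like this
    (a toy int-only model; the function name is made up):

```python
# With skip support on a discrete type, "a > 5" is equivalent to
# "a >= 6", and an inclusive bound lets the initial positioning step
# also apply later keys such as "b = 2" when descending the tree.
def make_inclusive(op, value):
    """Rewrite a strict int inequality as its inclusive equivalent."""
    if op == ">":
        return ">=", value + 1   # skip support increment
    if op == "<":
        return "<=", value - 1   # skip support decrement
    return op, value             # already inclusive: unchanged

converted = make_inclusive(">", 5)
```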
    
    Author: Peter Geoghegan <[email protected]>
    Reviewed-By: Matthias van de Meent <[email protected]>
    Discussion: https://postgr.es/m/CAH2-Wz=FJ78K3WsF3iWNxWnUCY9f=Jdg3QPxaXE=uYUbmuRz5Q@mail.gmail.com
    petergeoghegan authored and Commitfest Bot committed Apr 3, 2025
    Commit: a0a9db4
  5. DEBUG: Add skip scan disable GUCs.

    petergeoghegan authored and Commitfest Bot committed Apr 3, 2025
    Commit: 055e60d
  6. [CF 5081] v33 - nbtree skip scan

    This branch was automatically generated by a robot using patch(es)
    from an email thread registered at:
    
    https://commitfest.postgresql.org//5081
    
    The branch will be overwritten each time a new patch version is posted to
    the thread, and also periodically to check for bitrot caused by changes
    on the master branch.
    
    Patch(es): https://www.postgresql.org/message-id/CAH2-WznK55udrFQm4umLjOmxcBh8k_5Ybxmt1_rXSYW9N8j64A@mail.gmail.com
    Author(s): Peter Geoghegan
    Commitfest Bot committed Apr 3, 2025
    Commit: d819362