@stuhood
Contributor

@stuhood stuhood commented Aug 7, 2025

What

Add a TopDocs::order_by method, which supports ordering by multiple fast fields and scores in one collection pass, as defined by the TopOrderable trait. The TopOrderable trait is implemented (by a macro) for tuples of length 1 through 3 (for now).

How

Add:

  • a TopOrderable trait which is implemented for tuples, and a TopOrderableCollector to collect for it.
  • a Feature trait which is implemented for Scores, and for fast fields.
    • To allow for boxing/dynamic dispatch of Features (which reduces code generation when the sort features are not known until runtime), Arc<dyn Feature> is implemented via ErasedFeature.
  • a TopNCompare trait which can be used together with a LazyTopNComputer to lazily fetch features during TopN.
    • This new interface is necessary because TopNComputer does not allow for lazily fetching additional fields for the comparison tuple. Lazy fetching can eliminate a lot of IO when tiebreakers only rarely come into play in the comparison (because most values are eliminated by earlier features).
    • It could also allow for making DocId/DocAddress tiebreaking optional (see quickwit-oss#2672 (comment)), via something like a "DocIdFeature".
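
A minimal sketch of the lazy-fetch idea from the list above, using only the standard library. The function name `lazy_top_n`, the slice-based "columns", and the fetch counter are hypothetical stand-ins for illustration; this is not the `LazyTopNComputer` API from the patch.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Returns the top `n` doc ids ordered by (primary, tiebreaker) descending,
/// plus how many times the tiebreaker column was read.
/// On a full tie the higher doc id wins (an arbitrary choice for this sketch).
fn lazy_top_n(primary: &[u64], tiebreaker: &[u64], n: usize) -> (Vec<u32>, usize) {
    let mut fetches = 0;
    if n == 0 {
        return (Vec::new(), fetches);
    }
    // Min-heap over (primary, tiebreaker, doc): the root is the current threshold.
    let mut heap: BinaryHeap<Reverse<(u64, u64, u32)>> = BinaryHeap::with_capacity(n);
    for (doc, &p) in primary.iter().enumerate() {
        if heap.len() == n {
            let Reverse((threshold_primary, _, _)) = *heap.peek().unwrap();
            if p < threshold_primary {
                // Rejected on the first feature alone: the tiebreaker is never read.
                continue;
            }
        }
        // Only documents that can still enter the top N pay for this read.
        fetches += 1;
        let candidate = (p, tiebreaker[doc], doc as u32);
        if heap.len() < n {
            heap.push(Reverse(candidate));
        } else if candidate > heap.peek().unwrap().0 {
            heap.pop();
            heap.push(Reverse(candidate));
        }
    }
    let mut hits: Vec<(u64, u64, u32)> = heap.into_iter().map(|Reverse(hit)| hit).collect();
    hits.sort_by(|a, b| b.cmp(a));
    (hits.into_iter().map(|(_, _, doc)| doc).collect(), fetches)
}
```

Once the heap is full, any document whose first feature is below the threshold is dropped without touching the second column, which is where the IO saving would come from when the tiebreaker lives in a separate fast field.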

This interface additionally could not use the CustomScorer APIs because they do not allow segments to Top-N a different type than their final output type (which is essential for ordering by Strings).

Note

This patch isolates everything to one module, but should almost certainly be split up into multiple modules, and better integrated with the existing modules. I was hoping to get some feedback on it before rearranging things, but I'm very happy to do so!

@stuhood stuhood force-pushed the stuhood.generic-order-by-upstream branch 2 times, most recently from 407a1d8 to 1c82ef0 Compare August 8, 2025 17:42
stuhood added a commit to paradedb/tantivy that referenced this pull request Aug 8, 2025
Upstream at quickwit-oss#2681
stuhood added 2 commits August 8, 2025 14:09
Uses:
* a `TopOrderable` trait which can be derived for tuples
* a `TopOrderableCollector` to collect for it.
* a `TopNCompare` trait which can be used together with a
  `LazyTopNComputer` to lazily fetch columns during TopN.

Note that this does not use the `CustomScorer` API because:
1. `TopNComputer` does not allow for lazily fetching additional fields
   for the comparison tuple, which is important when tiebreakers are
   only rarely actually coming into play in the comparison, and most
   values are being eliminated by earlier columns.
2. `CustomScoreTopCollector` does not allow segments to Top-N a different type
   than their final output type, which is essential for ordering by `String`s.
3. In order to include scores as one of the ordering columns, we need to be able
   to optionally enable scores.
4. The `CustomScoreTopCollector::merge_fruits` function needs to operate over a
   wrapper type in order to apply different ordering globally than per-segment.
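
To illustrate points 2 and 4 above, here is a small sketch of why a segment may want to Top-N one type (term ordinals) and emit another (strings). The `Segment` struct and the three functions are hypothetical and deliberately simplified; they do not mirror tantivy's column or collector types.

```rust
/// A toy segment: a sorted term dictionary plus one term ordinal per document.
struct Segment {
    terms: Vec<&'static str>,
    doc_ords: Vec<u64>,
}

/// Per-segment pass: pick the `n` smallest ordinals (ascending order)
/// by comparing integers only, without touching any string.
fn segment_top_n(segment: &Segment, n: usize) -> Vec<(u64, u32)> {
    let mut hits: Vec<(u64, u32)> = segment
        .doc_ords
        .iter()
        .enumerate()
        .map(|(doc, &ord)| (ord, doc as u32))
        .collect();
    hits.sort();
    hits.truncate(n);
    hits
}

/// Ordinals are only comparable inside their own segment, so the survivors
/// are decoded to strings before anything is merged across segments.
fn decode(segment: &Segment, segment_ord: u32, hits: Vec<(u64, u32)>) -> Vec<(String, (u32, u32))> {
    hits.into_iter()
        .map(|(ord, doc)| (segment.terms[ord as usize].to_string(), (segment_ord, doc)))
        .collect()
}

/// Global merge: compares decoded strings, with (segment, doc) as the tiebreaker.
fn merge_top_n(segments: &[Segment], n: usize) -> Vec<(String, (u32, u32))> {
    let mut all = Vec::new();
    for (segment_ord, segment) in segments.iter().enumerate() {
        all.extend(decode(segment, segment_ord as u32, segment_top_n(segment, n)));
    }
    all.sort();
    all.truncate(n);
    all
}
```

Ordinal 0 can mean a different term in every segment, so the per-segment type (`u64`) and the merged type (`String`) have to differ, and only the few surviving hits per segment pay for a dictionary lookup.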
@stuhood stuhood force-pushed the stuhood.generic-order-by-upstream branch from 1c82ef0 to 140c506 Compare August 8, 2025 21:09
@fulmicoton
Collaborator

I am surprised this requires macros @stuhood. Can we get away with doing only generics?

fn get(
    &self,
    column: &FeatureColumn,
    order: Order,
Collaborator

Suggested change
order: Order,

The comment is wrong: this method does not return the value for this feature (depending on the order, it can return the opposite).
At this point, I don't think this is a good idea to have order. We could integrate the order within the Feature trait.

Contributor Author

At this point, I don't think this is a good idea to have order. We could integrate the order within the Feature trait.

By making it a generic parameter, or as a field?

I think that a bit more performance could be gained by making it a generic parameter (as TopNComputer now does): will give that a shot.
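
For readers following along, the two options being weighed could look roughly like this. It is a sketch under assumptions: `Direction`, `Asc`, `Desc`, `compare_static`, and `RuntimeComparator` are made-up names, and neither variant is the code in this PR or in `TopNComputer`.

```rust
use std::cmp::Ordering;

/// Hypothetical marker trait: the sort direction is chosen at compile time.
trait Direction {
    fn apply(ordering: Ordering) -> Ordering;
}

struct Asc;
struct Desc;

impl Direction for Asc {
    fn apply(ordering: Ordering) -> Ordering {
        ordering
    }
}

impl Direction for Desc {
    fn apply(ordering: Ordering) -> Ordering {
        ordering.reverse()
    }
}

/// Variant A: direction as a generic parameter. Each instantiation is
/// monomorphized, so there is no branch on the direction at runtime.
fn compare_static<D: Direction>(lhs: u64, rhs: u64) -> Ordering {
    D::apply(lhs.cmp(&rhs))
}

#[derive(Clone, Copy)]
enum Order {
    Asc,
    Desc,
}

/// Variant B: direction as a field, checked on every comparison.
struct RuntimeComparator {
    order: Order,
}

impl RuntimeComparator {
    fn compare(&self, lhs: u64, rhs: u64) -> Ordering {
        match self.order {
            Order::Asc => lhs.cmp(&rhs),
            Order::Desc => lhs.cmp(&rhs).reverse(),
        }
    }
}
```

The generic variant trades extra code generation for the chance of a branch-free comparison; the field variant keeps one copy of the code and stays usable when the direction is only known at runtime. Whether the difference is measurable here would need benchmarking.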

/// NOTE: We don't require a `PartialOrd` bound on the output type in order to make it possible
/// to use a boxed type like `OwnedValue` without giving it a `PartialOrd` implementation which
/// might be unsafe (i.e. panicking) in other positions.
fn compare(&self, a: &Self::Output, b: &Self::Output) -> Option<Ordering>;
Collaborator

Suggested change
fn compare(&self, a: &Self::Output, b: &Self::Output) -> Option<Ordering>;
fn compare(&self, lhs: &Self::Output, rhs: &Self::Output) -> Option<Ordering>;

}
}

fn compare(&self, a: &Self::Output, b: &Self::Output) -> Option<Ordering> {
Collaborator

naming

@stuhood
Contributor Author

stuhood commented Aug 12, 2025

I am surprised this requires macros @stuhood. Can we get away with doing only generics?

I don't think so... AFAIK it is not possible to abstract over the length of a tuple, and doing this without tuples requires that the types be boxed/dynamic or wrapped in enums.

I think that this is similar to the "SegmentCollectors for tuples" pattern over here:

// -----------------------------------------------
// Tuple implementations.
-- except implemented with a macro.
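
As a rough illustration of the macro approach (not the macro from this PR), a trait can be implemented for tuples of length 1 through 3 like this. The `Compare` trait and the `Ascending`/`Descending` types are simplified, hypothetical stand-ins for the PR's feature comparators.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a per-feature comparator.
trait Compare {
    type Value;
    fn compare(&self, lhs: &Self::Value, rhs: &Self::Value) -> Ordering;
}

struct Ascending;
struct Descending;

impl Compare for Ascending {
    type Value = u64;
    fn compare(&self, lhs: &u64, rhs: &u64) -> Ordering {
        lhs.cmp(rhs)
    }
}

impl Compare for Descending {
    type Value = u64;
    fn compare(&self, lhs: &u64, rhs: &u64) -> Ordering {
        rhs.cmp(lhs)
    }
}

/// Implements `Compare` for one tuple arity. Each `(Type, index)` pair
/// contributes one generic parameter and one comparison step.
macro_rules! impl_compare_for_tuple {
    ($(($T:ident, $idx:tt)),+) => {
        impl<$($T: Compare),+> Compare for ($($T,)+) {
            type Value = ($($T::Value,)+);

            fn compare(&self, lhs: &Self::Value, rhs: &Self::Value) -> Ordering {
                // `then_with` only evaluates the next element on a tie.
                Ordering::Equal
                    $(.then_with(|| self.$idx.compare(&lhs.$idx, &rhs.$idx)))+
            }
        }
    };
}

impl_compare_for_tuple!((A, 0));
impl_compare_for_tuple!((A, 0), (B, 1));
impl_compare_for_tuple!((A, 0), (B, 1), (C, 2));
```

With these impls, `(Descending, Ascending)` is itself a comparator over `(u64, u64)` values; supporting a fourth element means adding one more macro invocation.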

@fulmicoton
Collaborator

fulmicoton commented Sep 3, 2025

@stuhood

I think this is possible:

struct LexicographicComparator<HeadComparator, TailComparator> {
    head: HeadComparator,
    tail: TailComparator,
}

impl<HeadComparator: TopNCompare, TailComparator: TopNCompare> TopNCompare for LexicographicComparator<HeadComparator, TailComparator> {
    type Accepted = (HeadComparator::Accepted, TailComparator::Accepted);

    fn accept(
        &self,
        threshold_value: &Self::Accepted,
        threshold_doc_id: DocId,
        score: Score,
        doc_id: DocId,
    ) -> Option<Self::Accepted> {
        todo!();
    }

    fn get(&self, score: Score, doc_id: DocId) -> Self::Accepted {
        todo!()
    }
}

struct LexicographicOrderable<Head, Tail> {
    head: Head,
    tail: Tail,
}

impl<Head, Tail> TopOrderable for LexicographicOrderable<Head, Tail>
where
    Head: TopOrderable,
    Tail: TopOrderable,
{
    type SegmentOutput = (Head::SegmentOutput, Tail::SegmentOutput);

    type Output = (Head::Output, Tail::Output);

    type SegmentComparator = LexicographicComparator<Head::SegmentComparator, Tail::SegmentComparator>;

    fn requires_scoring(&self) -> bool {
        todo!()
    }

    fn segment_comparator(
        &self,
        segment_reader: &SegmentReader,
    ) -> crate::Result<Self::SegmentComparator> {
        todo!()
    }

    fn feature_columns(
        &self,
        segment_reader: &SegmentReader,
    ) -> crate::Result<Vec<(FeatureColumn, Order)>> {
        todo!()
    }

    fn decode(
        &self,
        features: &Vec<(FeatureColumn, Order)>,
        segment_output: Vec<(Self::SegmentOutput, DocAddress)>,
    ) -> Vec<(Self::Output, DocAddress)> {
        todo!()
    }

    fn compare(&self, a: &(Self::Output, DocAddress), b: &(Self::Output, DocAddress)) -> bool {
        todo!()
    }

}

You could then write an adapter for 1,2,3 features without macros.
The current macro code is too hard for me to understand.
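
To make the head/tail idea concrete, here is a compilable sketch with the `todo!()` bodies replaced by a deliberately simplified trait. `Compare`, `Asc`, `Desc`, `Lexicographic`, and the `three` adapter are assumptions for illustration only; the real `TopNCompare` and `TopOrderable` traits carry more methods (accept, decode, and so on).

```rust
use std::cmp::Ordering;

/// Hypothetical, simplified comparator trait.
trait Compare {
    type Value;
    fn compare(&self, lhs: &Self::Value, rhs: &Self::Value) -> Ordering;
}

struct Asc;
struct Desc;

impl Compare for Asc {
    type Value = u64;
    fn compare(&self, lhs: &u64, rhs: &u64) -> Ordering {
        lhs.cmp(rhs)
    }
}

impl Compare for Desc {
    type Value = u64;
    fn compare(&self, lhs: &u64, rhs: &u64) -> Ordering {
        rhs.cmp(lhs)
    }
}

/// Compares with `head` first and consults `tail` only on ties.
struct Lexicographic<Head, Tail> {
    head: Head,
    tail: Tail,
}

impl<Head: Compare, Tail: Compare> Compare for Lexicographic<Head, Tail> {
    type Value = (Head::Value, Tail::Value);

    fn compare(&self, lhs: &Self::Value, rhs: &Self::Value) -> Ordering {
        self.head
            .compare(&lhs.0, &rhs.0)
            .then_with(|| self.tail.compare(&lhs.1, &rhs.1))
    }
}

/// Adapter for three features, built by nesting pairs instead of using a macro.
fn three<A: Compare, B: Compare, C: Compare>(
    a: A,
    b: B,
    c: C,
) -> Lexicographic<A, Lexicographic<B, C>> {
    Lexicographic {
        head: a,
        tail: Lexicographic { head: b, tail: c },
    }
}
```

The cost of this shape is that values nest as `(a, (b, c))` rather than `(a, b, c)`, so an adapter layer would still be needed to present flat tuples to callers.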

As for the collector over tuples: it used to be implemented that way.
After noticing that the API had stabilized, I ended up ditching it and creating the current implementation, which uses neither generics nor macros and is just verbose.

Adjusts `Feature` to use `Option` where necessary. `StringFeature::SegmentOutput` avoids being wrapped in an `Option` because the `u64::MAX` "niche" is safe to use in that case (since it would require that many terms to trigger a collision).

Surprisingly, no performance impact downstream.

I additionally needed to adjust the unit tests to:
* include a NULL
* use a stable id, because for some reason at three segments (but not two?) the fact that segment ords are not stable was exposed.
@stuhood
Contributor Author

stuhood commented Sep 3, 2025

@stuhood

I think this is possible:
...

Interesting: thanks! Will give that a shot.


In the meantime, I just pushed a change to properly handle NULLs in TopOrderable.

I think that TopDocs::order_by_fast_field and TopDocs::order_by_string_fast_field do not currently handle missing values / NULLs correctly: they use sort_column.first_or_default_col(default_value), but whether the default is 0 or u64::MAX, it can collide with a real value. And in the case of string ordering, u64::MAX will trigger a lookup error.

I've adjusted the TopOrderable interface to encode Option<u64>, and (surprisingly?) it did not make a noticeable difference in Top-N performance.
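
The collision can be shown with plain standard-library types. This is only a sketch: `key_with_default` and `sort_nulls_last` are hypothetical helpers, and the choice to sort NULLs last is an assumption for the example, not necessarily what the patch does.

```rust
/// Sort key with a sentinel default: a missing value becomes `default`,
/// which makes it indistinguishable from a real value equal to `default`.
fn key_with_default(value: Option<u64>, default: u64) -> u64 {
    value.unwrap_or(default)
}

/// Sorts doc ids ascending by value, with missing values last, by keeping
/// the `Option` in the key instead of collapsing it to a sentinel.
fn sort_nulls_last(values: &[Option<u64>]) -> Vec<usize> {
    let mut docs: Vec<usize> = (0..values.len()).collect();
    // `false < true`, so every `Some` sorts before every `None`.
    docs.sort_by_key(|&doc| (values[doc].is_none(), values[doc]));
    docs
}
```

With a default of `u64::MAX`, a document that really holds `u64::MAX` and a document with no value get the same key; the `Option<u64>` key keeps them apart at the cost of a slightly wider comparison.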

@stuhood
Contributor Author

stuhood commented Sep 26, 2025

This has been on the backburner, but I did a bit of work to apply this feedback a few weeks ago, and it looks good: thanks a lot! I'm also going to apply this feedback (#2681 (comment)) before posting, sometime in the next few weeks.

Leaving one note for myself about an additional helpful feature here: we'd additionally like the ability to preserve scores during a TopDocs::order_by, without necessarily including them in the feature columns (which would cause them to be included in the comparison). It's relatively minor, so it may be that the answer is to just go ahead and include the feature.

mdashti pushed a commit to paradedb/tantivy that referenced this pull request Oct 21, 2025
mdashti pushed a commit to paradedb/tantivy that referenced this pull request Oct 22, 2025
mdashti pushed a commit to paradedb/tantivy that referenced this pull request Oct 22, 2025