@MagellaX

PR description

This change introduces Near Real-Time indexing support:

Soft commit

  • Adds PreparedCommit::soft_commit() and IndexWriter::soft_commit() to publish current uncommitted segments to the in-memory committed set without persisting meta.json.
  • Deletes up to the soft-commit opstamp are applied; no GC or meta.json write happens.

NRT searcher

  • Adds IndexWriter::nrt_searcher() to construct a Searcher directly from the in-memory committed segments for instant visibility after soft commits.
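The split between soft commit and normal commit can be pictured with a small toy model. This is only a sketch under stated assumptions: `ToyWriter` and all of its fields and methods are hypothetical stand-ins written for illustration, not tantivy types and not the code in this PR. Segments are plain strings, and `persisted_meta` plays the role of the segment list in meta.json.

```rust
/// Hypothetical stand-in for the writer state described above (not tantivy's API).
#[derive(Default)]
struct ToyWriter {
    /// Segments written since the last (soft) commit.
    uncommitted: Vec<String>,
    /// In-memory committed set: what an NRT-style searcher would see.
    committed_in_memory: Vec<String>,
    /// Stand-in for the segment list persisted in meta.json.
    persisted_meta: Vec<String>,
}

impl ToyWriter {
    fn add_segment(&mut self, id: &str) {
        self.uncommitted.push(id.to_string());
    }

    /// Publish uncommitted segments to the in-memory committed set only.
    /// Nothing is written to `persisted_meta`.
    fn soft_commit(&mut self) {
        self.committed_in_memory.append(&mut self.uncommitted);
    }

    /// Publish and also persist the segment list.
    fn commit(&mut self) {
        self.soft_commit();
        self.persisted_meta = self.committed_in_memory.clone();
    }

    /// What a searcher built from the in-memory committed set would see.
    fn nrt_visible(&self) -> &[String] {
        &self.committed_in_memory
    }
}

fn main() {
    let mut writer = ToyWriter::default();

    writer.add_segment("seg-1");
    writer.soft_commit();
    // Visible in memory right away, but not yet durable.
    assert_eq!(writer.nrt_visible().to_vec(), vec!["seg-1".to_string()]);
    assert!(writer.persisted_meta.is_empty());

    writer.add_segment("seg-2");
    writer.commit();
    // A normal commit persists everything published so far.
    assert_eq!(writer.persisted_meta.len(), 2);
}
```

The model leaves out deletes, opstamps, and garbage collection; it only shows that visibility and durability are decoupled.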

NrtDirectory overlay

  • Adds NrtDirectory, an overlay that writes to an in-memory RamDirectory and reads from the overlay first, then falls back to the base directory.
  • sync_directory() persists overlay files to the base directory to make data durable when desired.
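The overlay read path and the sync step can be sketched with two in-memory maps. Everything here is an assumption made for illustration: `ToyOverlayDirectory` is a hypothetical type, both layers are `HashMap`s, and the real NrtDirectory in this PR wraps actual directory implementations and returns I/O results.

```rust
use std::collections::HashMap;

/// Hypothetical two-layer store: writes go to `overlay`, reads try `overlay` then `base`.
struct ToyOverlayDirectory {
    overlay: HashMap<String, Vec<u8>>,
    base: HashMap<String, Vec<u8>>,
}

impl ToyOverlayDirectory {
    fn new(base: HashMap<String, Vec<u8>>) -> Self {
        Self { overlay: HashMap::new(), base }
    }

    /// All writes land in the in-memory overlay.
    fn write(&mut self, path: &str, data: &[u8]) {
        self.overlay.insert(path.to_string(), data.to_vec());
    }

    /// Overlay first, then fall back to the base layer.
    fn read(&self, path: &str) -> Option<&[u8]> {
        self.overlay
            .get(path)
            .or_else(|| self.base.get(path))
            .map(|bytes| bytes.as_slice())
    }

    /// Copy every overlay file into the base layer (the "make it durable" step).
    /// The overlay is left in place in this sketch.
    fn sync_directory(&mut self) {
        for (path, data) in &self.overlay {
            self.base.insert(path.clone(), data.clone());
        }
    }
}

fn main() {
    let mut base = HashMap::new();
    base.insert("old.seg".to_string(), b"from base".to_vec());
    let mut dir = ToyOverlayDirectory::new(base);

    dir.write("new.seg", b"fresh");
    assert_eq!(dir.read("new.seg"), Some(&b"fresh"[..])); // served by the overlay
    assert_eq!(dir.read("old.seg"), Some(&b"from base"[..])); // falls back to base
    assert!(!dir.base.contains_key("new.seg")); // not durable yet

    dir.sync_directory();
    assert!(dir.base.contains_key("new.seg")); // now present in the base layer
}
```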

Compatibility

  • Default commit semantics are unchanged. Existing users are unaffected unless the new APIs are used.
  • Normal commit still persists meta.json and triggers garbage collection as before.

Why

  • Enables low-latency searchability for recently indexed documents in client applications without writing meta.json on every change.
  • Aligns with the design discussed in tantivy issue #494 (Near Real Time indexing).

How to use

  1. Wrap your base directory with NrtDirectory when creating/opening the index.
  2. Call index_writer.soft_commit() after indexing to publish for NRT search.
  3. Use index_writer.nrt_searcher() to obtain a Searcher that sees soft-committed data immediately.
  4. Call index_writer.commit() whenever you are ready to persist meta.json and make changes durable on disk.

@fulmicoton
Collaborator

@MagellaX can you fix the PR. I think the first two commits are unrelated.

let guard = self.overlay_paths.read().unwrap();
guard.iter().cloned().collect()
};
for path in snapshot_paths {
Collaborator

OK, but if we update the meta.json file before a segment file, and this function hits an I/O error partway through, won't the index be left corrupted as things stand?

Author

I’ve moved the meta publication to the end of sync_directory() and removed the early base write in atomic_write. We now snapshot overlay paths, persist all data files first, and atomically write meta.json last to avoid partial states. Build is green; please have another look.
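The ordering described in this reply (data files first, meta.json last) can be illustrated with a toy that injects a failure partway through. This is a sketch under stated assumptions, not the PR's code: `ToyBase`, `sync_in_order`, and the `fail_on` switch are hypothetical, files live in maps, and a single map insert stands in for the atomic write of meta.json.

```rust
use std::collections::BTreeMap;

const META: &str = "meta.json";

/// Hypothetical base store that can be told to fail when one path is written.
struct ToyBase {
    files: BTreeMap<String, Vec<u8>>,
    fail_on: Option<String>,
}

impl ToyBase {
    fn write(&mut self, path: &str, data: &[u8]) -> Result<(), String> {
        if self.fail_on.as_deref() == Some(path) {
            return Err(format!("simulated I/O error on {}", path));
        }
        self.files.insert(path.to_string(), data.to_vec());
        Ok(())
    }
}

/// Persist every data file first; publish meta.json only if all of them succeeded.
fn sync_in_order(
    overlay: &BTreeMap<String, Vec<u8>>,
    base: &mut ToyBase,
) -> Result<(), String> {
    for (path, data) in overlay.iter().filter(|(p, _)| p.as_str() != META) {
        base.write(path, data)?; // an error here returns before meta.json is touched
    }
    if let Some(meta) = overlay.get(META) {
        base.write(META, meta)?;
    }
    Ok(())
}

fn main() {
    let mut overlay = BTreeMap::new();
    overlay.insert("seg-1.store".to_string(), b"docs".to_vec());
    overlay.insert(META.to_string(), b"meta v2".to_vec());

    let mut base = ToyBase {
        files: BTreeMap::new(),
        fail_on: Some("seg-1.store".to_string()),
    };
    base.files.insert(META.to_string(), b"meta v1".to_vec());

    // The segment write fails, so the old meta.json is still in place.
    assert!(sync_in_order(&overlay, &mut base).is_err());
    assert_eq!(base.files[META], b"meta v1".to_vec());

    // Once writes succeed, meta.json is published after the data files.
    base.fail_on = None;
    assert!(sync_in_order(&overlay, &mut base).is_ok());
    assert_eq!(base.files[META], b"meta v2".to_vec());
}
```

With this ordering, a failure can leave extra segment files behind in the base store, but the published meta.json never refers to a file that was not written.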

@fulmicoton
Collaborator

This is a cool effort! ... But I stopped reviewing in the middle as too many problems have been overlooked.

@MagellaX
Author

MagellaX commented Sep 3, 2025

> @MagellaX can you fix the PR. I think the first two commits are unrelated.

just give me a sec, i am checking


2 participants