
Conversation


@sampaccoud sampaccoud commented Aug 7, 2025

Purpose

We want to add fulltext (and semantic in a second phase) search to Docs.

The goal is to enable efficient and scalable search across document content by pushing relevant data to a dedicated search backend, such as OpenSearch. The backend should be pluggable.

Proposal

  • Add indexing logic in a search indexer that can be declared as a backend (a minimal sketch follows at the end of this description)
  • Implement indexing for the Find backend. See corresponding PR in Find
  • Implement search views as a proxy
  • Implement triggers to update the search index when a document or its accesses change. Synchronization should be done asynchronously, as changing a document or its accesses affects all its descendants...

Fixes #322
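
A minimal sketch of the pluggable indexer idea, assuming a hypothetical SEARCH_INDEXER_CLASS setting and a BaseDocumentIndexer interface (names are illustrative, not the final API):

# Hypothetical sketch: the indexer backend is declared in settings and loaded dynamically.
from django.conf import settings
from django.utils.module_loading import import_string


class BaseDocumentIndexer:
    """Interface every search backend (OpenSearch, "Find", ...) would implement."""

    def push(self, data):
        """Send a batch of serialized documents to the search backend."""
        raise NotImplementedError

    def search(self, query, user):
        """Return the documents matching `query` that `user` can access."""
        raise NotImplementedError


def get_document_indexer():
    """Instantiate the indexer class declared in settings, or None when search is disabled."""
    dotted_path = getattr(settings, "SEARCH_INDEXER_CLASS", None)
    if not dotted_path:
        return None
    return import_string(dotted_path)()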

@sampaccoud sampaccoud requested a review from joehybird August 7, 2025 16:40
@sampaccoud sampaccoud added feature add a new feature backend labels Aug 7, 2025
@joehybird joehybird force-pushed the index-to-search branch 3 times, most recently from 10bfd94 to 5bd6b18 on September 8, 2025 12:38

gitguardian bot commented Sep 8, 2025

✅ There are no secrets present in this pull request anymore.

If these secrets were true positives and are still valid, we highly recommend that you revoke them.
While these secrets were previously flagged, we no longer have a reference to the specific commits where they were detected. Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately. Find more information about risks here.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Search in Docs relies on an external project like "La Suite Find".
We need to declare a common external network in order to connect to
the search app and index our documents.
We need to add content to our demo documents so that we can test
indexing.
Add an indexer that loops over the documents in the database, formats them
as JSON objects and indexes them in the remote "Find" micro-service.
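
A rough sketch of what that loop could look like, assuming a hypothetical serialize_document() helper, an accesses relation on Document, and a SEARCH_INDEXER_INDEX_URL setting (all names are illustrative):

import requests
from django.conf import settings

from core import models  # assumed app layout


def serialize_document(document):
    """Format a document as a JSON-serializable dict for the remote "Find" micro-service (illustrative fields)."""
    return {
        "_id": str(document.pk),
        "title": document.title,
        # Assumption: a plain-text version of the body is available on the model.
        "content": getattr(document, "plain_text", ""),
        # Users allowed to see the document, so "Find" can filter results per user.
        "users": [str(access.user_id) for access in document.accesses.all()],
    }


def index_documents(documents, timeout=10):
    """Push a batch of serialized documents to the remote indexing endpoint of "Find"."""
    payload = [serialize_document(document) for document in documents]
    response = requests.post(settings.SEARCH_INDEXER_INDEX_URL, json=payload, timeout=timeout)
    response.raise_for_status()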

@qbey qbey left a comment


First review, I know work is still ongoing and I did not read all the tests... :)

Comment on lines 829 to 837
    q = serializers.CharField(required=True)

    def validate_q(self, value):
        """Ensure the text field is not empty."""

        if len(value.strip()) == 0:
            raise serializers.ValidationError("Text field cannot be empty.")

        return value

Suggested change
-    q = serializers.CharField(required=True)
-    def validate_q(self, value):
-        """Ensure the text field is not empty."""
-        if len(value.strip()) == 0:
-            raise serializers.ValidationError("Text field cannot be empty.")
-        return value
+    q = serializers.CharField(required=True, allow_blank=False)

You may also add trim_whitespace=True
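
For reference, a minimal sketch of the field with both options (note that trim_whitespace is already True by default on DRF's CharField, so blank and whitespace-only values are both rejected):

from rest_framework import serializers


class SearchQuerySerializer(serializers.Serializer):
    # Illustrative serializer name. A whitespace-only "q" is trimmed to "" and then
    # rejected by allow_blank=False, so no custom validate_q is needed.
    q = serializers.CharField(required=True, allow_blank=False, trim_whitespace=True)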

        serializer.is_valid(raise_exception=True)

        try:
            indexer = FindDocumentIndexer()

I guess this class should come from settings, because not everyone will have an indexer. I also think this view might fall back to searching locally on the title if no indexer is configured.
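
Something along these lines, assuming a hypothetical SEARCH_INDEXER_CLASS setting and a plain title filter as the local fallback (access filtering omitted for brevity):

from django.conf import settings
from django.utils.module_loading import import_string

from core import models  # assumed app layout


def search_documents(user, query):
    """Use the configured indexer when available, otherwise fall back to a local title search."""
    indexer_path = getattr(settings, "SEARCH_INDEXER_CLASS", None)
    if indexer_path:
        indexer = import_string(indexer_path)()
        return indexer.search(query, user=user)
    # No indexer configured: degrade to a simple title lookup on documents.
    return models.Document.objects.filter(title__icontains=query)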

        url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)

        if not url:
            raise RuntimeError(

Suggested change
-            raise RuntimeError(
+            raise ImproperlyConfigured(

        Returns:
            dict: A JSON-serializable dictionary.
        """
        url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)

Suggested change
-        url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)
+        url = settings.SEARCH_INDEXER_QUERY_URL

Comment on lines 224 to 225
            logger.error("HTTPError: %s", e)
            logger.error("Response content: %s", response.text)  # type: ignore

Log the error only once
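
For instance, a single entry carrying both pieces of information (sketch):

        except requests.HTTPError as error:
            # One log entry with both the exception and the response body.
            logger.error("HTTPError: %s - response content: %s", error, error.response.text)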

@@ -1169,6 +1185,15 @@ def get_abilities(self, user):
        }


@receiver(signals.post_save, sender=DocumentAccess)

We try to follow "Use signals as a last resort" (Two Scoops of Django): is there a problem with calling this from the save method?
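
A structural sketch of the signal-free alternative, assuming a hypothetical enqueue_document_indexing() helper that schedules the async job (the base class name is also illustrative, so this is not runnable as-is):

class DocumentAccess(BaseModel):
    ...

    def save(self, *args, **kwargs):
        """Persist the access, then schedule reindexing of the affected document."""
        super().save(*args, **kwargs)
        # Explicit call instead of a post_save signal; hypothetical task trigger.
        enqueue_document_indexing(self.document_id)

    def delete(self, *args, **kwargs):
        """Remove the access and reindex so search results reflect the change."""
        document_id = self.document_id
        super().delete(*args, **kwargs)
        enqueue_document_indexing(document_id)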

"""
return models.Document.objects.filter(pk__in=[d["_id"] for d in data])

def push(self, data):

Same comments as before in this method

Comment on lines 37 to 38
    def sortkey(d):
        return d["id"]

    push_call_args = [call.args[0] for call in mock_push.call_args_list]

    assert len(push_call_args) == 1  # called once but with a batch of docs
    assert sorted(push_call_args[0], key=sortkey) == sorted(

Interesting, actually I think the document sorting should be deterministic, in case we need to run the index command several times => I think we should change the index command to sort documents by creation date or something ^^

We can keep it this way for now, but we surely need to add a comment in the "index" management command.


The sort is by id because the indexing is done in batches: loop + id__gt=prev_batch_last_id
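
i.e. roughly this kind of keyset pagination (sketch with illustrative names):

from core import models  # assumed app layout


def iter_document_batches(batch_size=500):
    """Yield documents in deterministic id order, batch by batch (id__gt keyset pagination)."""
    last_id = None
    while True:
        queryset = models.Document.objects.order_by("id")
        if last_id is not None:
            queryset = queryset.filter(id__gt=last_id)
        batch = list(queryset[:batch_size])
        if not batch:
            break
        yield batch
        last_id = batch[-1].id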

Comment on lines 43 to 45
    push_call_args = [call.args[0] for call in mock_push.call_args_list]

    assert len(push_call_args) == 1  # called once but with a batch of docs

Don't you want to simply check with assert_called_once, then use the first value?
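
i.e. something like this (sketch; expected_documents stands for whatever the test builds):

    mock_push.assert_called_once()
    pushed_documents = mock_push.call_args.args[0]
    assert sorted(pushed_documents, key=sortkey) == sorted(expected_documents, key=sortkey)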

@joehybird joehybird force-pushed the index-to-search branch 2 times, most recently from 42e69af to 41f4967 on September 11, 2025 13:32
sampaccoud and others added 4 commits September 11, 2025 15:39
On document content or permission changes, start a Celery job that calls the
indexing API of the "Find" app (see the sketch after this commit list).

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
New API view that calls the indexed-document search view
(resource server) of the "Find" app.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
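
A minimal sketch of such a Celery task, reusing the serialization helper sketched earlier; the descendants() call and module layout are assumptions, not the PR's actual code:

import requests
from celery import shared_task
from django.conf import settings


@shared_task(bind=True, max_retries=3)
def index_document_tree(self, document_id):
    """Reindex a document and its descendants in "Find" after a content or access change."""
    from core import models  # assumed app layout; imported lazily inside the task

    document = models.Document.objects.get(pk=document_id)
    # descendants() is hypothetical shorthand for walking the document tree.
    documents = [document, *document.descendants()]
    payload = [serialize_document(doc) for doc in documents]  # helper from the earlier sketch
    try:
        response = requests.post(settings.SEARCH_INDEXER_INDEX_URL, json=payload, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Retry later so a transient "Find" outage does not lose the update.
        raise self.retry(exc=exc, countdown=60)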
@joehybird joehybird force-pushed the index-to-search branch 4 times, most recently from 0290e02 to 2acbaa0 on September 12, 2025 05:27
New SEARCH_INDEXER_CLASS setting to define the indexer service class.
Raise ImproperlyConfigured errors instead of RuntimeError in the index service.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
@joehybird joehybird force-pushed the index-to-search branch 3 times, most recently from 2058095 to 6e9c6ec on September 12, 2025 12:22
Signed-off-by: Fabre Florian <ffabre@hybird.org>