Index to search #1276
Search in Docs relies on an external project like "La Suite Find". We need to declare a common external network in order to connect to the search app and index our documents.
We need content in our demo documents so that we can test indexing.
Add an indexer that loops across documents in the database, formats them as JSON objects and indexes them in the remote "Find" micro-service.
First review, I know work is still ongoing and I did not read all the tests... :)
src/backend/core/api/serializers.py
Outdated
q = serializers.CharField(required=True)

def validate_q(self, value):
    """Ensure the text field is not empty."""

    if len(value.strip()) == 0:
        raise serializers.ValidationError("Text field cannot be empty.")

    return value
- q = serializers.CharField(required=True)
- def validate_q(self, value):
-     """Ensure the text field is not empty."""
-     if len(value.strip()) == 0:
-         raise serializers.ValidationError("Text field cannot be empty.")
-     return value
+ q = serializers.CharField(required=True, allow_blank=False)
You may also add trim_whitespace=True
src/backend/core/api/viewsets.py
Outdated
serializer.is_valid(raise_exception=True)

try:
    indexer = FindDocumentIndexer()
I guess this class should come from settings, because not everyone will have an indexer. I also think this view could fall back to searching locally on the title if no indexer is configured.
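A minimal sketch of that suggestion, in plain Python so it runs without Django. The `SEARCH_INDEXER_CLASS` setting name comes from a later commit in this pull request; `load_indexer`, `search_documents`, the `search` method and the sample documents are hypothetical names used only for illustration, not the project's actual API.

```python
import importlib
from types import SimpleNamespace


def load_indexer(settings):
    """Return an indexer instance, or None when no indexer is configured."""
    dotted_path = getattr(settings, "SEARCH_INDEXER_CLASS", None)
    if not dotted_path:
        return None
    # Split "package.module.ClassName" and import the class dynamically.
    module_path, class_name = dotted_path.rsplit(".", 1)
    indexer_class = getattr(importlib.import_module(module_path), class_name)
    return indexer_class()


def search_documents(settings, documents, query):
    """Use the configured indexer, else fall back to a local title match."""
    indexer = load_indexer(settings)
    if indexer is not None:
        # Hypothetical method name: the real indexer interface may differ.
        return indexer.search(query)
    needle = query.strip().lower()
    return [doc for doc in documents if needle in doc["title"].lower()]


# Assumed sample data: no indexer configured, so the title fallback is used.
docs = [{"id": 1, "title": "Budget 2025"}, {"id": 2, "title": "Meeting notes"}]
no_indexer = SimpleNamespace(SEARCH_INDEXER_CLASS=None)
fallback_results = search_documents(no_indexer, docs, "budget")
```

In a Django project, `django.utils.module_loading.import_string` would likely replace the manual `importlib` lookup, and the fallback would be a queryset filter on the title rather than a list comprehension.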
url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)

if not url:
    raise RuntimeError(
- raise RuntimeError(
+ raise ImproperlyConfigured(
Returns:
    dict: A JSON-serializable dictionary.
"""
url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)
- url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)
+ url = settings.SEARCH_INDEXER_QUERY_URL
logger.error("HTTPError: %s", e)
logger.error("Response content: %s", response.text)  # type: ignore
Log the error only once
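One way to read this comment is to merge the two `logger.error` calls into a single record. The sketch below assumes that reading; `FakeHTTPError`, `push`, `ListHandler` and `failing_send` are hypothetical stand-ins so the example runs without any HTTP client.

```python
import logging

logger = logging.getLogger("indexer_sketch")


class FakeHTTPError(Exception):
    """Hypothetical stand-in for an HTTP client error carrying a response body."""

    def __init__(self, message, response_text):
        super().__init__(message)
        self.response_text = response_text


def push(send):
    """Call `send` and log a failure as one record before re-raising."""
    try:
        return send()
    except FakeHTTPError as e:
        # One log record carries both the error and the response body.
        logger.error("HTTPError: %s. Response content: %s", e, e.response_text)
        raise


class ListHandler(logging.Handler):
    """Collect emitted records in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


handler = ListHandler()
logger.addHandler(handler)
logger.propagate = False  # keep the demo output quiet


def failing_send():
    raise FakeHTTPError("400 Bad Request", '{"detail": "invalid"}')


try:
    push(failing_send)
except FakeHTTPError:
    pass  # the caller still sees the exception; only the logging changed
```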
src/backend/core/models.py
Outdated
@@ -1169,6 +1185,15 @@ def get_abilities(self, user):
        }


@receiver(signals.post_save, sender=DocumentAccess)
We try to follow "Use signals as a last resort" (Two Scoops of Django): is there a problem with calling this from the save method?
    """
    return models.Document.objects.filter(pk__in=[d["_id"] for d in data])

def push(self, data):
Same comments as before in this method
def sortkey(d):
    return d["id"]
push_call_args = [call.args[0] for call in mock_push.call_args_list]

assert len(push_call_args) == 1  # called once but with a batch of docs
assert sorted(push_call_args[0], key=sortkey) == sorted(
Interesting. Actually, I think the document sorting should be deterministic, in case we need to run the index command several times => I think we should change the index command to sort documents by creation date or something ^^
We can keep it this way for now, but we surely need to add a comment in the "index" management command.
The sort is by id because indexing is done in batches: each loop iteration filters on id__gt=prev_batch_last_id.
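The batching described in this reply can be sketched in plain Python as follows. This is an illustration under assumptions, not the management command's code: `iter_batches` is a hypothetical helper, and a list of dicts with integer ids stands in for the queryset.

```python
def iter_batches(documents, batch_size):
    """Yield id-ordered batches, resuming each batch after the previous last id."""
    last_id = None
    while True:
        # Roughly what filter(id__gt=last_id).order_by("id")[:batch_size]
        # would do on a queryset.
        candidates = sorted(
            (d for d in documents if last_id is None or d["id"] > last_id),
            key=lambda d: d["id"],
        )
        batch = candidates[:batch_size]
        if not batch:
            return
        yield batch
        last_id = batch[-1]["id"]


# Assumed sample data, deliberately stored out of order.
docs = [{"id": i} for i in (5, 1, 4, 2, 3)]
batches = [[d["id"] for d in batch] for batch in iter_batches(docs, 2)]
```

Because each batch resumes strictly after the last id already seen, the ordering is deterministic across runs as long as ids do not change, which is the property the previous comment asks for.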
push_call_args = [call.args[0] for call in mock_push.call_args_list]

assert len(push_call_args) == 1  # called once but with a batch of docs
Don't you want to simply check the assert_called_once, then use the first value?
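A small sketch of what this suggestion could look like with `unittest.mock`. `index_all` and the sample documents are hypothetical; only `assert_called_once` and `call_args` are standard mock APIs.

```python
from unittest import mock


def index_all(indexer, documents):
    """Hypothetical function under test: pushes all documents in one batch."""
    indexer.push(documents)


def sortkey(d):
    return d["id"]


indexer = mock.Mock()
index_all(indexer, [{"id": 2}, {"id": 1}])

# Fails with AssertionError unless push was called exactly once.
indexer.push.assert_called_once()

# call_args holds the single call; args[0] is the pushed batch.
pushed_batch = indexer.push.call_args.args[0]
sorted_batch = sorted(pushed_batch, key=sortkey)
```

This replaces the list comprehension over `call_args_list` plus the length check with a single assertion, and reads the batch straight from the one recorded call.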
On document content or permission changes, start a celery job that will call the indexation API of the app "Find". Signed-off-by: Fabre Florian <ffabre@hybird.org>
New API view that calls the indexed documents search view (resource server) of app "Find". Signed-off-by: Fabre Florian <ffabre@hybird.org>
New SEARCH_INDEXER_CLASS setting to define the indexer service class. Raise ImproperlyConfigured errors instead of RuntimeError in the index service. Signed-off-by: Fabre Florian <ffabre@hybird.org>
Purpose
We want to add full-text search (and semantic search in a second phase) to Docs.
The goal is to enable efficient and scalable search across document content by pushing relevant data to a dedicated search backend, such as OpenSearch. The backend should be pluggable.
Proposal
Fixes #322