-
Notifications
You must be signed in to change notification settings - Fork 24
Add support for semantic search to the MCP server #45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/mcp_server_datahub/mcp_server.py
Outdated
Returns both a truncated list of results and facets/aggregations that can be used to iteratively refine the search filters. | ||
To explore the data catalog and get aggregate statistics, use the wildcard '*' as the query and set `filters: null`. This provides | ||
facets showing platform distribution, entity types, and other aggregate insights across the entire catalog, plus a representative | ||
sample of entities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please note that I put slightly different explanation for '*' here - it made a little bit more sense to me
src/mcp_server_datahub/mcp_server.py
Outdated
Returns both a truncated list of results and facets/aggregations that can be used to iteratively refine the search filters. | ||
To explore the data catalog and get aggregate statistics, use the wildcard '*' as the query and set `filters: null`. This provides | ||
facets showing platform distribution, entity types, and other aggregate insights across the entire catalog, plus a representative | ||
sample of entities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw - semantic search doesn't have special treatment for *, it would compute embeddings for it and do the matching. Effectively I think the result will be correct (as the caller intends to get the facets which semantic search retrieves by calling keyword search with * internally) since the actual result of the search is not well defined except for facets. There is a possibility of updating semantic search to switch to keyword search when the query is *.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for calling this out.
src/mcp_server_datahub/mcp_server.py
Outdated
|
||
def _is_semantic_search_enabled() -> bool: | ||
"""Check if semantic search is enabled via environment variable.""" | ||
return os.environ.get("SEMANTIC_SEARCH_ENABLED", "false").lower() == "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make it more clear that this is only available on datahub cloud with a specific version, and ideally throw good error messages if it's used incorrectly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i updated the docstring - does it make sense to you? if not i would appreciate suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still have a number of questions and imo the tests are still not amazing
ultimate approval will come from @mayurinehate
src/mcp_server_datahub/mcp_server.py
Outdated
This function only checks the environment variable. Actual feature | ||
availability is validated when the DataHub client is used. | ||
""" | ||
return os.environ.get("SEMANTIC_SEARCH_ENABLED", "false").lower() == "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we also have a helper in the datahub codebase called something like get_boolean from env var - let's reuse that and then we can remove most of the parsing tests around this method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
from datahub.cli.env_utils import get_boolean_env_variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but then we will lose the nice docstring which we just prepare, i guess i can put it as a comment in the code
tests/test_semantic_search.py
Outdated
@pytest.mark.anyio | ||
async def test_tool_registration_without_semantic_search(): | ||
"""Test that regular search tool is available when semantic search is disabled.""" | ||
# Test that environment check works | ||
with mock.patch.dict(os.environ, {"SEMANTIC_SEARCH_ENABLED": "false"}): | ||
assert _is_semantic_search_enabled() is False | ||
|
||
|
||
@pytest.mark.anyio | ||
async def test_tool_registration_with_semantic_search(): | ||
"""Test that enhanced search tool is available when semantic search is enabled.""" | ||
# Test that environment check works | ||
with mock.patch.dict(os.environ, {"SEMANTIC_SEARCH_ENABLED": "true"}): | ||
assert _is_semantic_search_enabled() is True | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do these tests do? imo they're not super useful and the docstring does not match the actual implementation
tests/test_semantic_search.py
Outdated
This test sets SEMANTIC_SEARCH_ENABLED=false and reloads the module to verify | ||
the conditional registration logic works correctly with basic search mode. | ||
""" | ||
# Set environment variable for basic mode | ||
os.environ["SEMANTIC_SEARCH_ENABLED"] = "false" | ||
|
||
# Reload the module at the beginning to ensure fresh state | ||
print("Reloading mcp_server module for fresh state...") | ||
importlib.reload(mcp_server_module) | ||
|
||
# Re-import the reloaded objects | ||
from mcp_server_datahub.mcp_server import ( | ||
mcp as reloaded_mcp, | ||
with_datahub_client as reloaded_with_datahub_client, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this reloading logic seems scary - can we avoid doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont see a good way, the reason these 2 tests exist is to make sure we are correctly bind the call and we obey the environment variable. why do you think it is scary? is it because of the possible side effects that we (I) don't fully understand?
tests/test_semantic_search.py
Outdated
the conditional registration logic works correctly with basic search mode. | ||
""" | ||
# Set environment variable for basic mode | ||
os.environ["SEMANTIC_SEARCH_ENABLED"] = "false" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this permanently change the environ instead of temporarily patching it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
tests/test_semantic_search.py
Outdated
|
||
|
||
@pytest.mark.anyio | ||
async def test_tool_registration_with_semantic_search(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests should only be async if they use await
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall I still worry that these tests are extremely brittle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what makes you think the tests are brittle?
def assert_type(expected_type: Type[T], obj: Any) -> T: | ||
"""Assert that obj is of expected_type and return it properly typed.""" | ||
assert isinstance(obj, expected_type), ( | ||
f"Expected {expected_type.__name__}, got {type(obj).__name__}" | ||
) | ||
return obj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this instead of a simple assert isinstance(...)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the idea was to make it less verbose in the tests
tests/test_semantic_search.py
Outdated
the conditional registration logic works correctly with basic search mode. | ||
""" | ||
# Set environment variable for basic mode | ||
os.environ["SEMANTIC_SEARCH_ENABLED"] = "false" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
src/mcp_server_datahub/mcp_server.py
Outdated
This function only checks the environment variable. Actual feature | ||
availability is validated when the DataHub client is used. | ||
""" | ||
return os.environ.get("SEMANTIC_SEARCH_ENABLED", "false").lower() == "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
from datahub.cli.env_utils import get_boolean_env_variable
src/mcp_server_datahub/mcp_server.py
Outdated
Returns both a truncated list of results and facets/aggregations that can be used to iteratively refine the search filters. | ||
To explore the data catalog and get aggregate statistics, use the wildcard '*' as the query and set `filters: null`. This provides | ||
facets showing platform distribution, entity types, and other aggregate insights across the entire catalog, plus a representative | ||
sample of entities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for calling this out.
|
||
|
||
if __name__ == "__main__": | ||
pytest.main([__file__]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I acknowledge this test file has some value, I'd much rather prefer less but more focused tests from readability and maintainability point of view.
Mocking of graphql request and response seems like an overkill at this point. We can simply leave at making sure _search_implementation
is invoked with correct args, therefore removing the need of TestSearchImplementation
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should hopefully also ease need for patching execute_graphql
and get_datahub_client
.
Also, the tests are passing alright, however there is this error on running make test
, which should also go away with suggested changes.
ERROR tests/test_semantic_search.py::test_tool_binding_enhanced_search - ValueError: <Token var=<ContextVar name='_mcp_dh_client' at 0x7c53318b61b0> at 0x7c5333ac8d80> was created by a different ContextVar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that generally only test_tool_binding_enhanced_search and test_tool_binding_basic_search have real value - they actually verify that the search methods are bound correctly. However we need to have 75% code coverage and we need to have code coverage for _search_implementation. I don't think we are ready to challenge this 75% code coverage requirement - my personal choice is to not think too hard about what i think non essential tests. I'm open to suggestions and if the extra tests bother you much I can remove them I guess
@@ -0,0 +1,111 @@ | |||
fragment SearchEntityInfo on Entity { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is exactly same as search.gql except different endpoints and no scrollId input..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionality is fine.
Please address suggested changes sooner than later.
Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>
This PR adds semantic search capabilities via a new
enhanced_search
tool that supports both "semantic" (AI-powered concept matching) and "keyword" (traditional text search) strategies using DataHub'ssemanticSearchAcrossEntities
GraphQL API. TheSEMANTIC_SEARCH_ENABLED
environment variable controls tool exposure: when enabled, the enhanced tool replaces the standard search tool; when disabled, the original tool remains for full backward compatibility. Includes comprehensive tests, detailed LLM guidance on strategy selection, and maintains the same interface as the existing search functionality.