Skip to content

Conversation

alexsku
Copy link
Contributor

@alexsku alexsku commented Sep 3, 2025

This PR adds semantic search capabilities via a new enhanced_search tool that supports both "semantic" (AI-powered concept matching) and "keyword" (traditional text search) strategies using DataHub's semanticSearchAcrossEntities GraphQL API. The SEMANTIC_SEARCH_ENABLED environment variable controls tool exposure: when enabled, the enhanced tool replaces the standard search tool; when disabled, the original tool remains for full backward compatibility. Includes comprehensive tests, detailed LLM guidance on strategy selection, and maintains the same interface as the existing search functionality.

Returns both a truncated list of results and facets/aggregations that can be used to iteratively refine the search filters.
To explore the data catalog and get aggregate statistics, use the wildcard '*' as the query and set `filters: null`. This provides
facets showing platform distribution, entity types, and other aggregate insights across the entire catalog, plus a representative
sample of entities.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please note that I put slightly different explanation for '*' here - it made a little bit more sense to me

Returns both a truncated list of results and facets/aggregations that can be used to iteratively refine the search filters.
To explore the data catalog and get aggregate statistics, use the wildcard '*' as the query and set `filters: null`. This provides
facets showing platform distribution, entity types, and other aggregate insights across the entire catalog, plus a representative
sample of entities.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw - semantic search doesn't have special treatment for *, it would compute embeddings for it and do the matching. Effectively I think the result will be correct (as the caller intends to get the facets which semantic search retrieves by calling keyword search with * internally) since the actual result of the search is not well defined except for facets. There is a possibility of updating semantic search to switch to keyword search when the query is *.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for calling this out.

@acryldata acryldata deleted a comment from cursor bot Sep 3, 2025
hsheth2
hsheth2 previously requested changes Sep 3, 2025

def _is_semantic_search_enabled() -> bool:
"""Check if semantic search is enabled via environment variable."""
return os.environ.get("SEMANTIC_SEARCH_ENABLED", "false").lower() == "true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it more clear that this is only available on datahub cloud with a specific version, and ideally throw good error messages if it's used incorrectly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i updated the docstring - does it make sense to you? if not i would appreciate suggestions

@hsheth2 hsheth2 requested a review from mayurinehate September 3, 2025 22:47
Copy link
Contributor

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still have a number of questions and imo the tests are still not amazing

ultimate approval will come from @mayurinehate

This function only checks the environment variable. Actual feature
availability is validated when the DataHub client is used.
"""
return os.environ.get("SEMANTIC_SEARCH_ENABLED", "false").lower() == "true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also have a helper in the datahub codebase called something like get_boolean from env var - let's reuse that and then we can remove most of the parsing tests around this method

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
from datahub.cli.env_utils import get_boolean_env_variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but then we will lose the nice docstring which we just prepare, i guess i can put it as a comment in the code

Comment on lines 282 to 296
@pytest.mark.anyio
async def test_tool_registration_without_semantic_search():
"""Test that regular search tool is available when semantic search is disabled."""
# Test that environment check works
with mock.patch.dict(os.environ, {"SEMANTIC_SEARCH_ENABLED": "false"}):
assert _is_semantic_search_enabled() is False


@pytest.mark.anyio
async def test_tool_registration_with_semantic_search():
"""Test that enhanced search tool is available when semantic search is enabled."""
# Test that environment check works
with mock.patch.dict(os.environ, {"SEMANTIC_SEARCH_ENABLED": "true"}):
assert _is_semantic_search_enabled() is True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do these tests do? imo they're not super useful and the docstring does not match the actual implementation

Comment on lines 302 to 316
This test sets SEMANTIC_SEARCH_ENABLED=false and reloads the module to verify
the conditional registration logic works correctly with basic search mode.
"""
# Set environment variable for basic mode
os.environ["SEMANTIC_SEARCH_ENABLED"] = "false"

# Reload the module at the beginning to ensure fresh state
print("Reloading mcp_server module for fresh state...")
importlib.reload(mcp_server_module)

# Re-import the reloaded objects
from mcp_server_datahub.mcp_server import (
mcp as reloaded_mcp,
with_datahub_client as reloaded_with_datahub_client,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this reloading logic seems scary - can we avoid doing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont see a good way, the reason these 2 tests exist is to make sure we are correctly bind the call and we obey the environment variable. why do you think it is scary? is it because of the possible side effects that we (I) don't fully understand?

the conditional registration logic works correctly with basic search mode.
"""
# Set environment variable for basic mode
os.environ["SEMANTIC_SEARCH_ENABLED"] = "false"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this permanently change the environ instead of temporarily patching it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1



@pytest.mark.anyio
async def test_tool_registration_with_semantic_search():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests should only be async if they use await

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall I still worry that these tests are extremely brittle

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what makes you think the tests are brittle?

Comment on lines +25 to +29
def assert_type(expected_type: Type[T], obj: Any) -> T:
"""Assert that obj is of expected_type and return it properly typed."""
assert isinstance(obj, expected_type), (
f"Expected {expected_type.__name__}, got {type(obj).__name__}"
)
return obj
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this instead of a simple assert isinstance(...)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea was to make it less verbose in the tests

@hsheth2 hsheth2 dismissed their stale review September 4, 2025 04:28

Mayuri is the main reviewer

the conditional registration logic works correctly with basic search mode.
"""
# Set environment variable for basic mode
os.environ["SEMANTIC_SEARCH_ENABLED"] = "false"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

This function only checks the environment variable. Actual feature
availability is validated when the DataHub client is used.
"""
return os.environ.get("SEMANTIC_SEARCH_ENABLED", "false").lower() == "true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
from datahub.cli.env_utils import get_boolean_env_variable

Returns both a truncated list of results and facets/aggregations that can be used to iteratively refine the search filters.
To explore the data catalog and get aggregate statistics, use the wildcard '*' as the query and set `filters: null`. This provides
facets showing platform distribution, entity types, and other aggregate insights across the entire catalog, plus a representative
sample of entities.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for calling this out.



if __name__ == "__main__":
pytest.main([__file__])
Copy link
Contributor

@mayurinehate mayurinehate Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I acknowledge this test file has some value, I'd much rather prefer less but more focused tests from readability and maintainability point of view.

Mocking of graphql request and response seems like an overkill at this point. We can simply leave at making sure _search_implementation is invoked with correct args, therefore removing the need of TestSearchImplementation class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should hopefully also ease need for patching execute_graphql and get_datahub_client.

Also, the tests are passing alright, however there is this error on running make test, which should also go away with suggested changes.

ERROR tests/test_semantic_search.py::test_tool_binding_enhanced_search - ValueError: <Token var=<ContextVar name='_mcp_dh_client' at 0x7c53318b61b0> at 0x7c5333ac8d80> was created by a different ContextVar

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that generally only test_tool_binding_enhanced_search and test_tool_binding_basic_search have real value - they actually verify that the search methods are bound correctly. However we need to have 75% code coverage and we need to have code coverage for _search_implementation. I don't think we are ready to challenge this 75% code coverage requirement - my personal choice is to not think too hard about what i think non essential tests. I'm open to suggestions and if the extra tests bother you much I can remove them I guess

@@ -0,0 +1,111 @@
fragment SearchEntityInfo on Entity {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is exactly same as search.gql except different endpoints and no scrollId input..

Copy link
Contributor

@mayurinehate mayurinehate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionality is fine.

Please address suggested changes sooner than later.

@alexsku alexsku merged commit c313b2d into main Sep 5, 2025
1 check passed
@alexsku alexsku deleted the PFP-1641 branch September 5, 2025 04:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants