Add support for semantic search to the MCP server #45

alexsku · 2025-09-03T04:50:41Z

This PR adds semantic search capabilities via a new enhanced_search tool that supports both "semantic" (AI-powered concept matching) and "keyword" (traditional text search) strategies using DataHub's semanticSearchAcrossEntities GraphQL API. The SEMANTIC_SEARCH_ENABLED environment variable controls tool exposure: when enabled, the enhanced tool replaces the standard search tool; when disabled, the original tool remains for full backward compatibility. The PR includes comprehensive tests and detailed LLM guidance on strategy selection, and it keeps the same interface as the existing search functionality.

alexsku · 2025-09-03T19:33:10Z

src/mcp_server_datahub/mcp_server.py

+Returns both a truncated list of results and facets/aggregations that can be used to iteratively refine the search filters.
+To explore the data catalog and get aggregate statistics, use the wildcard '*' as the query and set `filters: null`. This provides 
+facets showing platform distribution, entity types, and other aggregate insights across the entire catalog, plus a representative 
+sample of entities.


please note that I put a slightly different explanation for '*' here - it made a little bit more sense to me

alexsku · 2025-09-03T19:35:33Z

src/mcp_server_datahub/mcp_server.py

+Returns both a truncated list of results and facets/aggregations that can be used to iteratively refine the search filters.
+To explore the data catalog and get aggregate statistics, use the wildcard '*' as the query and set `filters: null`. This provides 
+facets showing platform distribution, entity types, and other aggregate insights across the entire catalog, plus a representative 
+sample of entities.


Btw - semantic search has no special treatment for '*': it would compute embeddings for it and do the matching. In practice I think the result will still be correct, because the caller wants the facets (which semantic search retrieves by calling keyword search with '*' internally) and the actual search results are not well defined for this query anyway. We could update semantic search to switch to keyword search when the query is '*'.
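
A minimal sketch of that possible fallback, assuming a hypothetical `resolve_search_strategy` helper (not part of this PR) that runs before the GraphQL request is built:

```python
from typing import Literal

SearchStrategy = Literal["semantic", "keyword"]


def resolve_search_strategy(query: str, requested: SearchStrategy) -> SearchStrategy:
    """Pick the strategy to actually run (hypothetical helper, not in this PR).

    A bare "*" has no meaning worth embedding, so it is routed to keyword
    search, which treats it as a match-all wildcard and still returns facets.
    """
    if requested == "semantic" and query.strip() == "*":
        return "keyword"
    return requested
```

With this in place, `resolve_search_strategy("*", "semantic")` would pick keyword search, while any other query keeps the strategy the caller asked for.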

Thank you for calling this out.

hsheth2 · 2025-09-03T19:49:36Z

src/mcp_server_datahub/mcp_server.py


+def _is_semantic_search_enabled() -> bool:
+    """Check if semantic search is enabled via environment variable."""
+    return os.environ.get("SEMANTIC_SEARCH_ENABLED", "false").lower() == "true"


make it more clear that this is only available on datahub cloud with a specific version, and ideally throw good error messages if it's used incorrectly

i updated the docstring - does it make sense to you? if not i would appreciate suggestions


hsheth2

still have a number of questions and imo the tests are still not amazing

ultimate approval will come from @mayurinehate

hsheth2 · 2025-09-04T03:35:18Z

src/mcp_server_datahub/mcp_server.py

+        This function only checks the environment variable. Actual feature
+        availability is validated when the DataHub client is used.
+    """
+    return os.environ.get("SEMANTIC_SEARCH_ENABLED", "false").lower() == "true"


we also have a helper in the datahub codebase called something like get_boolean from env var - let's reuse that and then we can remove most of the parsing tests around this method

+1
from datahub.cli.env_utils import get_boolean_env_variable

but then we will lose the nice docstring which we just prepared, i guess i can put it as a comment in the code

hsheth2 · 2025-09-04T03:38:34Z

tests/test_semantic_search.py

+@pytest.mark.anyio
+async def test_tool_registration_without_semantic_search():
+    """Test that regular search tool is available when semantic search is disabled."""
+    # Test that environment check works
+    with mock.patch.dict(os.environ, {"SEMANTIC_SEARCH_ENABLED": "false"}):
+        assert _is_semantic_search_enabled() is False
+
+
+@pytest.mark.anyio
+async def test_tool_registration_with_semantic_search():
+    """Test that enhanced search tool is available when semantic search is enabled."""
+    # Test that environment check works
+    with mock.patch.dict(os.environ, {"SEMANTIC_SEARCH_ENABLED": "true"}):
+        assert _is_semantic_search_enabled() is True
+


what do these tests do? imo they're not super useful and the docstring does not match the actual implementation

hsheth2 · 2025-09-04T03:39:10Z

tests/test_semantic_search.py

+    This test sets SEMANTIC_SEARCH_ENABLED=false and reloads the module to verify
+    the conditional registration logic works correctly with basic search mode.
+    """
+    # Set environment variable for basic mode
+    os.environ["SEMANTIC_SEARCH_ENABLED"] = "false"
+
+    # Reload the module at the beginning to ensure fresh state
+    print("Reloading mcp_server module for fresh state...")
+    importlib.reload(mcp_server_module)
+
+    # Re-import the reloaded objects
+    from mcp_server_datahub.mcp_server import (
+        mcp as reloaded_mcp,
+        with_datahub_client as reloaded_with_datahub_client,
+    )


this reloading logic seems scary - can we avoid doing this?

i don't see a good way. the reason these 2 tests exist is to make sure we correctly bind the call and obey the environment variable. why do you think it is scary? is it because of possible side effects that we (I) don't fully understand?
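
One way to avoid the reload, sketched under the assumption that registration can move out of module import time into a function. `ToolRegistry`, `register_search_tools`, the tool bodies, and the registered name "search" are all hypothetical stand-ins, not the actual objects in mcp_server.py:

```python
from typing import Callable, Dict


class ToolRegistry:
    """Hypothetical stand-in for the MCP server's tool registry."""

    def __init__(self) -> None:
        self.tools: Dict[str, Callable[..., dict]] = {}

    def add_tool(self, name: str, fn: Callable[..., dict]) -> None:
        self.tools[name] = fn


def search(query: str = "*") -> dict:
    return {"strategy": "keyword", "query": query}


def enhanced_search(query: str = "*", search_strategy: str = "semantic") -> dict:
    return {"strategy": search_strategy, "query": query}


def register_search_tools(registry: ToolRegistry, semantic_enabled: bool) -> None:
    """Register exactly one search tool, chosen by an explicit flag.

    Because the flag is a parameter, a test can build a fresh registry per
    case instead of reloading the module after changing os.environ.
    """
    if semantic_enabled:
        registry.add_tool("search", enhanced_search)
    else:
        registry.add_tool("search", search)
```

A test would then call `register_search_tools(ToolRegistry(), semantic_enabled=True)` and inspect `registry.tools`, while the module itself would call the function once with the value of `_is_semantic_search_enabled()`.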

hsheth2 · 2025-09-04T03:39:30Z

tests/test_semantic_search.py

+    the conditional registration logic works correctly with basic search mode.
+    """
+    # Set environment variable for basic mode
+    os.environ["SEMANTIC_SEARCH_ENABLED"] = "false"


this permanently changes the environ instead of temporarily patching it
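
A small sketch of the temporary alternative, using `mock.patch.dict` as the earlier tests in this file already do. The function name is made up for the example:

```python
import os
from unittest import mock


def read_flag_with_temporary_env() -> str:
    """Set SEMANTIC_SEARCH_ENABLED only for the duration of the with-block."""
    with mock.patch.dict(os.environ, {"SEMANTIC_SEARCH_ENABLED": "false"}):
        # Code under test sees the patched value here.
        seen = os.environ["SEMANTIC_SEARCH_ENABLED"]
    # patch.dict has restored os.environ to its previous contents here.
    return seen
```

In a pytest test, the `monkeypatch.setenv` fixture method gives the same restore-on-exit behavior.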

hsheth2 · 2025-09-04T03:39:56Z

tests/test_semantic_search.py

+
+
+@pytest.mark.anyio
+async def test_tool_registration_with_semantic_search():


tests should only be async if they use await

hsheth2 · 2025-09-04T03:40:55Z

tests/test_semantic_search.py

overall I still worry that these tests are extremely brittle

what makes you think the tests are brittle?

hsheth2 · 2025-09-04T03:41:36Z

tests/test_semantic_search.py

+def assert_type(expected_type: Type[T], obj: Any) -> T:
+    """Assert that obj is of expected_type and return it properly typed."""
+    assert isinstance(obj, expected_type), (
+        f"Expected {expected_type.__name__}, got {type(obj).__name__}"
+    )
+    return obj


why do we need this instead of a simple assert isinstance(...)?

the idea was to make it less verbose in the tests

Mayuri is the main reviewer

mayurinehate · 2025-09-04T11:57:47Z

tests/test_semantic_search.py

+    the conditional registration logic works correctly with basic search mode.
+    """
+    # Set environment variable for basic mode
+    os.environ["SEMANTIC_SEARCH_ENABLED"] = "false"


mayurinehate · 2025-09-04T12:38:23Z

src/mcp_server_datahub/mcp_server.py

+        This function only checks the environment variable. Actual feature
+        availability is validated when the DataHub client is used.
+    """
+    return os.environ.get("SEMANTIC_SEARCH_ENABLED", "false").lower() == "true"


+1
from datahub.cli.env_utils import get_boolean_env_variable


mayurinehate · 2025-09-04T12:53:17Z

src/mcp_server_datahub/mcp_server.py

+Returns both a truncated list of results and facets/aggregations that can be used to iteratively refine the search filters.
+To explore the data catalog and get aggregate statistics, use the wildcard '*' as the query and set `filters: null`. This provides 
+facets showing platform distribution, entity types, and other aggregate insights across the entire catalog, plus a representative 
+sample of entities.


Thank you for calling this out.

mayurinehate · 2025-09-04T13:20:06Z

tests/test_semantic_search.py

+
+
+if __name__ == "__main__":
+    pytest.main([__file__])


While I acknowledge this test file has some value, I'd much prefer fewer but more focused tests, from a readability and maintainability point of view.

Mocking the graphql request and response seems like overkill at this point. We can simply stop at making sure _search_implementation is invoked with the correct args, which removes the need for the TestSearchImplementation class.

This should hopefully also ease the need for patching execute_graphql and get_datahub_client.
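
A rough sketch of what such a focused test could look like. `SearchTools` and its method signatures are hypothetical stand-ins for the functions in mcp_server.py, kept on one object only so the snippet is self-contained:

```python
from unittest import mock


class SearchTools:
    """Hypothetical stand-in for the module under test."""

    def _search_implementation(self, query, filters, num_results, search_strategy):
        raise RuntimeError("would call DataHub")

    def enhanced_search(
        self, query="*", search_strategy="semantic", filters=None, num_results=10
    ):
        # The tool wrapper only forwards its arguments.
        return self._search_implementation(
            query, filters, num_results, search_strategy
        )


def check_enhanced_search_delegates() -> dict:
    """Verify the wrapper forwards its args, without any GraphQL mocking."""
    tools = SearchTools()
    with mock.patch.object(
        tools, "_search_implementation", return_value={"total": 0}
    ) as impl:
        result = tools.enhanced_search("customer churn", search_strategy="semantic")
    impl.assert_called_once_with("customer churn", None, 10, "semantic")
    return result
```

Only the boundary between the tool and `_search_implementation` is patched, so neither `execute_graphql` nor `get_datahub_client` needs to be touched.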

Also, the tests pass, but running make test reports the error below, which should also go away with the suggested changes.

ERROR tests/test_semantic_search.py::test_tool_binding_enhanced_search - ValueError: <Token var=<ContextVar name='_mcp_dh_client' at 0x7c53318b61b0> at 0x7c5333ac8d80> was created by a different ContextVar
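
For context on that error: `ContextVar.reset()` raises `ValueError` when it is handed a token that came from a different `ContextVar` object. The sketch below reproduces the message in isolation. That the module reload is what creates a second `_mcp_dh_client` in this test suite is an assumption, not something confirmed in this thread.

```python
from contextvars import ContextVar


def reset_with_foreign_token() -> str:
    """Reset one ContextVar with a token created by another one."""
    original = ContextVar("_mcp_dh_client")
    token = original.set("client")

    # A reloaded module would define a new ContextVar with the same name;
    # it is still a different object as far as tokens are concerned.
    reloaded = ContextVar("_mcp_dh_client")
    try:
        reloaded.reset(token)
    except ValueError as exc:
        return str(exc)
    return "no error"
```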

I agree that generally only test_tool_binding_enhanced_search and test_tool_binding_basic_search have real value - they actually verify that the search methods are bound correctly. However, we need 75% code coverage, and that includes coverage for _search_implementation. I don't think we are ready to challenge the 75% requirement - my personal choice is to not think too hard about what I consider non-essential tests. I'm open to suggestions, and if the extra tests bother you much I can remove them, I guess.

mayurinehate · 2025-09-04T13:27:29Z

src/mcp_server_datahub/gql/semantic_search.gql

@@ -0,0 +1,111 @@
+fragment SearchEntityInfo on Entity {


so this is exactly the same as search.gql, except for a different endpoint and no scrollId input..

mayurinehate

Functionality is fine.

Please address the suggested changes sooner rather than later.

Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>

alexsku commented Sep 3, 2025

alexsku requested review from abedatahub and hsheth2 September 3, 2025 19:35

acryldata deleted a comment from cursor bot Sep 3, 2025

hsheth2 previously requested changes Sep 3, 2025

hsheth2 requested a review from mayurinehate September 3, 2025 22:47

hsheth2 reviewed Sep 4, 2025

mayurinehate reviewed Sep 4, 2025

mayurinehate approved these changes Sep 4, 2025

alexsku and others added 18 commits September 4, 2025 17:06

Add support for semantic search to the MCP server

cee8602

formatting

da40e24

more fixes

734219d

linter errors

0062a1c

added an integration test for semantic search

db1cdc5

linter fix

3cd8534

some fixes

c72d67a

fixed missing comments

b447791

formatting

862627f

another fix

7f767b5

formatting

8c88441

fixed the unit test

782a0b5

ruff+linter

a83928f

fixed another test

8b0beb9

more fixes

f5a15b7

Apply suggestion from @mayurinehate

d209f0e

Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>

feedback

f1cf6ed

fixes

bf04f47

alexsku force-pushed the PFP-1641 branch from 61e2f7c to bf04f47 Compare September 5, 2025 00:07

hsheth2 approved these changes Sep 5, 2025

alexsku merged commit c313b2d into main Sep 5, 2025
1 check passed

alexsku deleted the PFP-1641 branch September 5, 2025 04:51


