-
Couldn't load subscription status.
- Fork 25
feat: add schema tools #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
cbff0dc to
bc9711f
Compare
src/mcp_server_datahub/mcp_server.py
Outdated
| if schema_history := _get_schema_history(client, urn): | ||
| result["schemaHistory"] = { | ||
| "latestVersion": schema_history.latest_version.semantic_version, | ||
| "versions": sorted([v.semantic_version for v in schema_history.versions]), | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example
"schemaHistory": {
"latestVersion": "0.1.0",
"versions": [ "0.0.0", "0.1.0"]
}
src/mcp_server_datahub/mcp_server.py
Outdated
| variables = {"urn": dataset_urn, "versionStamp": target_version_stamp} | ||
| resp = _execute_graphql( | ||
| client._graph, | ||
| query=entity_details_fragment_gql, | ||
| variables=variables, | ||
| operation_name="getVersionedDataset", | ||
| ) | ||
| return resp.get("versionedDataset", {}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example
{
"schema": {
"fields": [
{
"fieldPath": "[version=2.0].[type=long].id",
"jsonPath": null,
"nullable": true,
"description": null,
"type": "NUMBER",
"nativeDataType": "BIGINT",
"recursive": false,
"isPartOfKey": false,
"isPartitioningKey": false,
"__typename": "SchemaField"
},
...(repeated every fields)
],
"lastObserved": 1749608884835,
"__typename": "Schema"
},
"editableSchemaMetadata": null,
"__typename": "VersionedDataset"
}
src/mcp_server_datahub/mcp_server.py
Outdated
| semantic_version: str | ||
| version_stamp: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example
{
"semanticVersion": "0.0.0",
"versionStamp": "browsePaths:0;dataPlatformInstance:0;datasetKey:0;schemaMetadata:1",
}
src/mcp_server_datahub/mcp_server.py
Outdated
|
|
||
|
|
||
| @mcp.tool(description="Get schema from a dataset by its URN and version.") | ||
| @lru_cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's seems better to use cache since versioned_dataset is immutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how frequently we would hit this cache. I don't however mind keeping this around, if we set a max size of few tens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not sure how effective it will be either. But it would be good to have it since it can reduce at least some traffic when querying the same version of the schema.
I’ll set max_size to around 20.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ll set max_size to around 20.
Changed
src/mcp_server_datahub/mcp_server.py
Outdated
| versions: list[SemanticVersionStruct] | ||
|
|
||
|
|
||
| def _get_schema_version_list( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: getVersionedDataset retrieves each version’s schema correctly whereas getSchemaBlame is not which I tried to use at first.
bc9711f to
00347a9
Compare
Additional Test: Fixing outdated query with LLMPurposeCheck if LLM can fix outdated(column name changed) query with datahub MCP. note: Pls let me know if you need to test on different MCP hosts or models. PromptClaude Desktop, Sonet 4 |
|
@hsheth2 Could you please review again? 🙏 FYI:
|
src/mcp_server_datahub/mcp_server.py
Outdated
| return cls( | ||
| semantic_version=data["semanticVersion"], | ||
| version_stamp=data["versionStamp"], | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may as well set alias for fields and use SemanticVersionStruct.model_validate with dict form instead of separate method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
src/mcp_server_datahub/mcp_server.py
Outdated
|
|
||
| _inject_urls_for_urns(client._graph, result, [""]) | ||
|
|
||
| if schema_version_list := _get_schema_version_list(client, urn): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm somewhat concerned about performance hit due to additional call to get schema versions every time for a dataset entity. I wonder if this needs its own separate tool, for performance reasons. cc: @hsheth2
Also we should skip this call for non-dataset entities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the previous PR, @hsheth2 already mentioned about this.
my main worry here is that every new tool consumes additional tokens on every request. The more tools we have, the more likely it is that the LLM gets confused / doesn't call our other tools when it should. So I'd like to think about what we can do to reduce the number of tools while keeping our responses simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eagle-25 I would like to run some tests about how addition of this get_schema_version_list affects overall tool timings of get_entity for dataset entity. I might get to this next week. In the meantime, if you can get some numbers or have any observations, feel free to share.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also we should skip this call for non-dataset entities.
Changed
src/mcp_server_datahub/mcp_server.py
Outdated
|
|
||
|
|
||
| @mcp.tool(description="Get schema from a dataset by its URN and version.") | ||
| @lru_cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how frequently we would hit this cache. I don't however mind keeping this around, if we set a max size of few tens.
|
@mayurinehate Added comments to your feedback. Could you check please? I will modify the code after resolve this conversation |
70239fc to
728948b
Compare
- Add a tool to retrieve the schema of a dataset - Modify get_entity so that when querying a dataset, it also returns the schema version
728948b to
0b61fa2
Compare
|
@mayurinehate Could you review these changes? Applied the following improvement feedbacks.
I also refactored the code into the DatasetSchemaAPI class to improve cohesion. |

Changes
get_entity().get_versioned_datasettool for retrieving schema by version.Motivation
Tests
I defined a test scenario and tried the available MCP hosts and models. Since I didn’t select them based on specific criteria, please suggest any additional tests.
Settings
DataHub: Self hosted (v1.1.0)
Test Dataset in DataHub
emailis renamed toemail_addressat v0.1.0)Test Scenario
0.0.0and0.1.0onsample.userstable.emailcolumn is renamed toemail_addresson0.1.0Prompt(common)
note: To also test that the version list is retrieved correctly, the prompt uses
latestandpreviousinstead of specifying concrete versions.Test Result Summary
Claude Desktop
Claude Sonnet4
Cursor
Claude Sonnet 4
gemini-2.5-pro
GPT-4.1
CLINE (VS Code)
GPT-4.1
note: This PR is recreated from #10