@eagle-25 eagle-25 commented Jun 12, 2025

Changes

  • Add a schema_versions listing (latest_version, versions) to the response of get_entity().
  • Add a get_versioned_dataset tool for retrieving a schema by version.

Motivation

  • Retrieving a schema by version lets the LLM detect changed columns and helps it fix outdated queries.
  • This should be faster than comparing schema versions manually.

Tests

I defined a test scenario and tried the available MCP hosts and models. Since I didn’t select them based on specific criteria, please suggest any additional tests.

Settings

  • DataHub: Self hosted (v1.1.0)

  • Test Dataset in DataHub

    • Type: Athena Table
    • Name: sample.users
    • Schema history (email is renamed to email_address in v0.1.0):
      | version | col1 | col2 | col3 | col4 |
      | --- | --- | --- | --- | --- |
      | 0.0.0 | id | name | email | created_at |
      | 0.1.0 | id | name | email_address | created_at |
  • Test Scenario

    • The sample.users table has two schema versions: 0.0.0 and 0.1.0.
    • The email column is renamed to email_address in 0.1.0.
    • Confirm the LLM can detect differences between two schemas.
  • Prompt(common)

    === Instructions ===
    You are a DataHub AI agent. Your job is to answer user questions about DataHub metadata by calling the datahub MCP (Model Context Protocol) methods.
    
    === Input ===
    Can you tell me the schema differences between the latest version and the previous version of the Athena named sample.users?
    

    note: To also test that the version list is retrieved correctly, the prompt uses latest and previous instead of specifying concrete versions.
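
The check in this scenario amounts to comparing the column lists of two schema versions. A minimal Python sketch, using the two versions of sample.users from the settings above as hard-coded data. The function name diff_columns is hypothetical and not part of this PR; a real comparison would use the fields returned by get_versioned_dataset.

```python
def diff_columns(old: list[str], new: list[str]) -> dict[str, list[str]]:
    """Return columns removed from and added to a schema, preserving order."""
    old_set, new_set = set(old), set(new)
    return {
        "removed": [c for c in old if c not in new_set],
        "added": [c for c in new if c not in old_set],
    }


# Column lists for sample.users, copied from the test dataset description.
schema_v0_0_0 = ["id", "name", "email", "created_at"]
schema_v0_1_0 = ["id", "name", "email_address", "created_at"]

changes = diff_columns(schema_v0_0_0, schema_v0_1_0)
print(changes)  # {'removed': ['email'], 'added': ['email_address']}
```

A name-based diff like this reports a rename as one removed and one added column; it cannot tell a rename apart from a drop followed by an unrelated add.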

Test Result Summary

| MCP Host | Model | Worked as Expected | Notes |
| --- | --- | --- | --- |
| Claude Desktop | Claude Sonnet 4 | Y | - |
| Cursor | Claude Sonnet 4 | Y | The 'search' tool errored a few times, then recovered and succeeded. |
| Cursor | gemini-2.5-pro | N | Failed: argument not supported |
| Cursor | GPT-4.1 | Y | The 'search' tool errored a few times, then recovered and succeeded. |
| Cline | GPT-4.1 | Y | - |

Claude Desktop

Claude Sonnet 4

Screenshot 2025-06-12 6:10:16 PM

Cursor

Claude Sonnet 4

Screenshot 2025-06-12 6:06:32 PM

gemini-2.5-pro

  • Failed

Screenshot 2025-06-12 6:03:54 PM

GPT-4.1

Screenshot 2025-06-12 5:56:34 PM

Cline (VS Code)

GPT-4.1

Screenshot 2025-06-12 5:47:24 PM
Screenshot 2025-06-12 5:48:17 PM

note: This PR was recreated from #10.

@eagle-25 eagle-25 force-pushed the feat/ass-schema-history-tools branch 2 times, most recently from cbff0dc to bc9711f Compare June 12, 2025 04:40
Comment on lines 175 to 180
if schema_history := _get_schema_history(client, urn):
    result["schemaHistory"] = {
        "latestVersion": schema_history.latest_version.semantic_version,
        "versions": sorted([v.semantic_version for v in schema_history.versions]),
    }
Author

Example

"schemaHistory": {
    "latestVersion": "0.1.0",
    "versions": [ "0.0.0", "0.1.0"]
}
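
One thing to watch in the snippet above: sorted() on version strings compares them lexicographically, so "0.10.0" would sort before "0.2.0". A minimal sketch of a numeric sort key, assuming every version is a plain MAJOR.MINOR.PATCH string like the ones in this example. The helper semver_key is hypothetical and not part of this PR.

```python
def semver_key(version: str) -> tuple[int, ...]:
    """Turn '0.10.0' into (0, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Made-up version list chosen to expose the difference between the two orders.
versions = ["0.10.0", "0.2.0", "0.0.0", "0.1.0"]

lexicographic = sorted(versions)
numeric = sorted(versions, key=semver_key)

print(lexicographic)  # ['0.0.0', '0.1.0', '0.10.0', '0.2.0']
print(numeric)        # ['0.0.0', '0.1.0', '0.2.0', '0.10.0']
```

With only 0.0.0 and 0.1.0 in the test dataset, both orders agree, so the difference would only show up once a component reaches two digits.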

Comment on lines 386 to 395
variables = {"urn": dataset_urn, "versionStamp": target_version_stamp}
resp = _execute_graphql(
    client._graph,
    query=entity_details_fragment_gql,
    variables=variables,
    operation_name="getVersionedDataset",
)
return resp.get("versionedDataset", {})
Author

@eagle-25 eagle-25 Jun 12, 2025

Example

{
  "schema": {
    "fields": [
      {
        "fieldPath": "[version=2.0].[type=long].id",
        "jsonPath": null,
        "nullable": true,
        "description": null,
        "type": "NUMBER",
        "nativeDataType": "BIGINT",
        "recursive": false,
        "isPartOfKey": false,
        "isPartitioningKey": false,
        "__typename": "SchemaField"
      },
      ...(repeated for every field)
    ],
    "lastObserved": 1749608884835,
    "__typename": "Schema"
  },
  "editableSchemaMetadata": null,
  "__typename": "VersionedDataset"
}
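
The fieldPath values in this response carry bracketed annotations such as [version=2.0] and [type=long] ahead of the column name. A minimal sketch for recovering the plain column path, assuming annotations always appear as [key=value] segments as in the example above. The helper simple_field_path is hypothetical and not part of this PR, and the nested path below is made up for illustration.

```python
import re

# Matches one bracketed annotation plus the dot that follows it, if any.
_ANNOTATION = re.compile(r"\[[^\]]*\]\.?")


def simple_field_path(field_path: str) -> str:
    """Strip '[key=value]' segments, leaving only the dotted column path."""
    return _ANNOTATION.sub("", field_path).strip(".")


print(simple_field_path("[version=2.0].[type=long].id"))  # id
print(simple_field_path("[version=2.0].[type=struct].address.[type=string].city"))  # address.city
```

Splitting on "." alone would not work here, because the version annotation itself contains a dot.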

Comment on lines 110 to 112
semantic_version: str
version_stamp: str
Author

Example

{
    "semanticVersion": "0.0.0",
    "versionStamp": "browsePaths:0;dataPlatformInstance:0;datasetKey:0;schemaMetadata:1"
}
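
The versionStamp in this example looks like a semicolon-separated list of aspect:version pairs. A minimal sketch that parses it into a dictionary, assuming that format holds in general, which this thread does not confirm. The helper parse_version_stamp is hypothetical and not part of this PR.

```python
def parse_version_stamp(stamp: str) -> dict[str, int]:
    """Parse 'aspectA:0;aspectB:1' into {'aspectA': 0, 'aspectB': 1}."""
    result: dict[str, int] = {}
    for pair in stamp.split(";"):
        if not pair:
            continue  # tolerate a trailing or doubled semicolon
        aspect, _, version = pair.rpartition(":")
        result[aspect] = int(version)
    return result


stamp = "browsePaths:0;dataPlatformInstance:0;datasetKey:0;schemaMetadata:1"
aspect_versions = parse_version_stamp(stamp)
print(aspect_versions["schemaMetadata"])  # 1
```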



@mcp.tool(description="Get schema from a dataset by its URN and version.")
@lru_cache
Author

@eagle-25 eagle-25 Jun 12, 2025

It seems better to use a cache here, since a versioned_dataset is immutable.

Contributor

I'm not sure how frequently we would hit this cache. I don't mind keeping it, though, if we set a max size of a few tens.

Author

I’m not sure how effective it will be either. But it would be good to have it since it can reduce at least some traffic when querying the same version of the schema.

I’ll set max_size to around 20.

Author

I’ll set max_size to around 20.

Changed
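
For reference, functools.lru_cache takes its size limit through the maxsize parameter. A minimal sketch of the bounded cache discussed in this thread, with a stand-in function in place of the real tool. The function fetch_versioned_schema, its return value, and the URN are hypothetical.

```python
from functools import lru_cache

backend_calls = 0  # counts how often the stand-in "backend" is actually hit


@lru_cache(maxsize=20)
def fetch_versioned_schema(urn: str, semantic_version: str) -> str:
    """Stand-in for a GraphQL lookup; arguments must be hashable to be cached."""
    global backend_calls
    backend_calls += 1
    return f"schema for {urn} at {semantic_version}"


# Illustrative URN, not taken from the PR.
urn = "urn:li:dataset:(urn:li:dataPlatform:athena,sample.users,PROD)"
fetch_versioned_schema(urn, "0.1.0")
fetch_versioned_schema(urn, "0.1.0")  # same arguments: served from the cache

print(backend_calls)  # 1
print(fetch_versioned_schema.cache_info())
```

Cached results are shared between callers, so if the cached function returns a mutable object such as a dict, a mutation by one caller is visible to the next. Returning a copy or an immutable value avoids that.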

versions: list[SemanticVersionStruct]


def _get_schema_version_list(
Author

note: getVersionedDataset retrieves each version’s schema correctly, whereas getSchemaBlame, which I tried first, does not.

@eagle-25 eagle-25 force-pushed the feat/ass-schema-history-tools branch from bc9711f to 00347a9 Compare June 12, 2025 09:55
Author

eagle-25 commented Jun 12, 2025

Additional Test: Fixing outdated query with LLM

Purpose

Check whether the LLM can fix an outdated query (one that uses a renamed column) with the DataHub MCP server.

note: Please let me know if you would like this tested on different MCP hosts or models.

Prompt

=== Instructions ===
You are a DataHub AI agent. Your job is to answer user questions about DataHub metadata by calling the datahub MCP (Model Context Protocol) methods.

=== Input ===
The following query was written for schema version 0.0.0 of the sample.users table:

SELECT 
    id, 
    name, 
    email, 
    created_at 
FROM 
    sample.users WHERE id = 123;

After the table schema was updated, an error occurs when executing the query.

Could you modify the query to match the latest schema?

Claude Desktop, Claude Sonnet 4

Screenshot 2025-06-12 7:00:29 PM

Author

eagle-25 commented Jun 12, 2025

@hsheth2 Could you please review again? 🙏
I’ve added tests using various MCP hosts and models, along with some code changes.

FYI:

  • I’ve recreated this PR since it differs noticeably from the last PR.
  • The notable change is that I replaced the getSchemaBlame API with getVersionedDataset to fetch accurate versioned schemas.

@eagle-25 eagle-25 marked this pull request as ready for review June 12, 2025 10:13
@hsheth2 hsheth2 requested a review from mayurinehate July 4, 2025 03:35
return cls(
    semantic_version=data["semanticVersion"],
    version_stamp=data["versionStamp"],
)
Contributor

You could set aliases for the fields and call SemanticVersionStruct.model_validate on the dict instead of using a separate method.

Author

@eagle-25 eagle-25 Aug 10, 2025

Changed
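
A minimal sketch of the alias approach suggested above, assuming Pydantic v2 is installed. The class and field names follow the snippets in this thread, but the body is an illustration, not the PR's actual code.

```python
from pydantic import BaseModel, Field


class SemanticVersionStruct(BaseModel):
    # Aliases map the camelCase keys of the API response to snake_case fields.
    semantic_version: str = Field(alias="semanticVersion")
    version_stamp: str = Field(alias="versionStamp")


# Sample payload copied from the example earlier in this conversation.
raw = {
    "semanticVersion": "0.0.0",
    "versionStamp": "browsePaths:0;dataPlatformInstance:0;datasetKey:0;schemaMetadata:1",
}

parsed = SemanticVersionStruct.model_validate(raw)
print(parsed.semantic_version)  # 0.0.0
```

With aliases in place, model_validate handles the key mapping, so a hand-written constructor that reads data["semanticVersion"] and data["versionStamp"] is no longer needed.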


_inject_urls_for_urns(client._graph, result, [""])

if schema_version_list := _get_schema_version_list(client, urn):
Contributor

I'm somewhat concerned about the performance hit from the additional call to get schema versions every time a dataset entity is fetched. I wonder whether this needs its own separate tool, for performance reasons. cc: @hsheth2

Also we should skip this call for non-dataset entities.

Author

In the previous PR, @hsheth2 already raised a related concern:

my main worry here is that every new tool consumes additional tokens on every request. The more tools we have, the more likely it is that the LLM gets confused / doesn't call our other tools when it should. So I'd like to think about what we can do to reduce the number of tools while keeping our responses simple.

Contributor

@eagle-25 I would like to run some tests on how adding get_schema_version_list affects the overall timing of get_entity for dataset entities. I might get to this next week. In the meantime, if you can get some numbers or have any observations, feel free to share.

Author

@eagle-25 eagle-25 Aug 10, 2025

Also we should skip this call for non-dataset entities.

Changed
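
One way to express that guard, sketched under the assumption that dataset URNs start with the urn:li:dataset: prefix. The helper is_dataset_urn and the sample URNs are hypothetical; the PR's actual check is not shown in this thread.

```python
DATASET_URN_PREFIX = "urn:li:dataset:"


def is_dataset_urn(urn: str) -> bool:
    """Return True only for URNs that identify a dataset entity."""
    return urn.startswith(DATASET_URN_PREFIX)


# Illustrative URNs, not taken from the PR.
urns = [
    "urn:li:dataset:(urn:li:dataPlatform:athena,sample.users,PROD)",
    "urn:li:dashboard:(looker,dashboards.1)",
]

for urn in urns:
    action = "fetch schema versions" if is_dataset_urn(urn) else "skip"
    print(f"{urn} -> {action}")
```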



Author

eagle-25 commented Jul 16, 2025

@mayurinehate I've replied to your feedback. Could you please take a look?

I will modify the code once this conversation is resolved.

@eagle-25 eagle-25 requested a review from mayurinehate July 17, 2025 04:17
@eagle-25 eagle-25 force-pushed the feat/ass-schema-history-tools branch 3 times, most recently from 70239fc to 728948b Compare August 10, 2025 14:07
- Add a tool to retrieve the schema of a dataset
- Modify get_entity so that when querying a dataset, it also returns the schema version
@eagle-25 eagle-25 force-pushed the feat/ass-schema-history-tools branch from 728948b to 0b61fa2 Compare August 10, 2025 14:09
Author

eagle-25 commented Aug 11, 2025

@mayurinehate Could you review these changes?

I applied the following feedback:

  1. Set aliases for fields and use SemanticVersionStruct.model_validate.
  2. Skip the schema versions call for non-dataset entities.
  3. Set the lru_cache max size to 20.

I also refactored the code into the DatasetSchemaAPI class to improve cohesion.
