Fix TaskEvaluation validation errors for missing quality and dict suggestions #3916
Summary
Fixes #3915 by making the `TaskEvaluation` Pydantic model more resilient to malformed LLM output. The issue reported validation errors when:
- the `quality` field is missing entirely
- `suggestions` is returned as `[{'point': '...', 'priority': 'high'}]` instead of `list[str]`
- `score` is used instead of `quality` as the field name (illustrative payloads below)
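For concreteness, payloads shaped like the following previously failed validation. The specific strings and numbers are invented; only the shapes mirror the issue report.

```python
# Illustrative payloads for the three failure modes above (values are made up).
missing_quality = {"suggestions": ["Add more context to the answer"], "entities": []}

dict_suggestions = {
    "quality": 8,
    "suggestions": [{"point": "Add more context to the answer", "priority": "high"}],
}

score_instead_of_quality = {"score": 7, "suggestions": ["Add more context to the answer"]}
```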
Changes:

- Made the `quality`, `suggestions`, and `entities` fields optional with sensible defaults
- Added `ConfigDict(extra="ignore")` to ignore unexpected fields like `relationships`
- Added a `@model_validator` to map `score` → `quality` when `quality` is missing
- Added a `@field_validator` for `suggestions` to extract `point` values from dict entries
- Added a `@field_validator` for `quality` to coerce int/string values to float (see the sketch after this list)
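A minimal sketch of how these changes could fit together in Pydantic v2 is shown below. It is not the actual crewai model: the field types, defaults, and exact validator logic are assumptions based on the bullet list above, and only the `normalize_suggestions` name is taken from the checklist further down.

```python
from typing import Any, Optional

from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator


class TaskEvaluation(BaseModel):
    # Ignore unexpected fields such as `relationships` instead of raising.
    model_config = ConfigDict(extra="ignore")

    quality: Optional[float] = None
    suggestions: list[str] = Field(default_factory=list)
    entities: list[Any] = Field(default_factory=list)

    @model_validator(mode="before")
    @classmethod
    def map_score_to_quality(cls, data: Any) -> Any:
        # Some responses use `score` instead of `quality` as the field name.
        if isinstance(data, dict) and "quality" not in data and "score" in data:
            data = {**data, "quality": data["score"]}
        return data

    @field_validator("quality", mode="before")
    @classmethod
    def coerce_quality(cls, value: Any) -> Any:
        # Accept ints or numeric strings such as 8 or "8.5".
        if isinstance(value, (int, str)):
            try:
                return float(value)
            except ValueError:
                return None
        return value

    @field_validator("suggestions", mode="before")
    @classmethod
    def normalize_suggestions(cls, value: Any) -> Any:
        # Pull the `point` text out of dict entries like
        # {'point': '...', 'priority': 'high'}; pass strings through unchanged.
        if isinstance(value, list):
            return [
                item["point"] if isinstance(item, dict) and "point" in item else item
                for item in value
            ]
        return value
```

With this shape, all three example payloads above validate without errors.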
Backward compatibility:

`LongTermMemoryItem` already accepts `quality=None`, and the strict crew evaluation path uses a separate `TaskEvaluationPydanticOutput` model that remains unchanged.

Review & Testing Checklist for Human
- [ ] `.quality` usages - Verify no code assumes `TaskEvaluation.quality` is always non-None (I checked the main ones but may have missed some)
- [ ] Run a crew with `memory=True` and verify the fix works end-to-end with actual LLM responses (my tests are unit tests only)
- [ ] Confirm the `normalize_suggestions` validator correctly handles all dict formats you've seen in production logs (especially the `point`/`priority` structure)
Test Plan

- Exercised with `memory=True` and `external_memory=ExternalMemory(...)`
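The unit tests mentioned in the checklist are not reproduced here; a hypothetical pytest-style version covering the three failure modes, written against the sketch above, might look like this (swap in the real crewai import path to run it against the library):

```python
# Hypothetical tests against the TaskEvaluation sketch above; replace the class
# reference with the real crewai import when testing the library itself.


def test_maps_score_to_quality():
    evaluation = TaskEvaluation.model_validate({"score": 7, "suggestions": []})
    assert evaluation.quality == 7.0


def test_extracts_point_from_dict_suggestions():
    evaluation = TaskEvaluation.model_validate(
        {"quality": "8.5", "suggestions": [{"point": "Add more context", "priority": "high"}]}
    )
    assert evaluation.quality == 8.5
    assert evaluation.suggestions == ["Add more context"]


def test_missing_quality_and_extra_fields():
    evaluation = TaskEvaluation.model_validate({"suggestions": [], "relationships": ["a", "b"]})
    assert evaluation.quality is None
```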
Notes

Devin session: https://app.devin.ai/sessions/8dc1309c760a4898bc9d347c1af9f702
Requested by: João ([email protected])