
Conversation

@devin-ai-integration
Contributor

Fix TaskEvaluation validation errors for missing quality and dict suggestions

Summary

Fixes #3915 by making the TaskEvaluation Pydantic model more resilient to malformed LLM output. The issue reported validation errors when:

  1. LLM omits the quality field entirely
  2. LLM returns suggestions as [{'point': '...', 'priority': 'high'}] instead of list[str]
  3. LLM uses score instead of quality as the field name

Changes:

  • Made quality, suggestions, and entities fields optional with sensible defaults
  • Added ConfigDict(extra="ignore") to ignore unexpected fields like relationships
  • Added @model_validator to map score → quality when quality is missing (see the model sketch after this list)
  • Added @field_validator for suggestions to extract point values from dict entries
  • Added @field_validator for quality to coerce int/string to float
  • Created 16 comprehensive unit tests covering all edge cases from the issue
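A minimal sketch of the revised model is below, assuming Pydantic v2 and a simplified placeholder Entity class; the exact field types, validator names, and docstrings in the PR diff may differ.

```python
from typing import Any

from pydantic import BaseModel, ConfigDict, field_validator, model_validator


class Entity(BaseModel):
    """Simplified placeholder for the existing entity schema."""
    name: str
    type: str


class TaskEvaluation(BaseModel):
    # Unexpected keys such as 'relationships' are dropped instead of failing validation.
    model_config = ConfigDict(extra="ignore")

    suggestions: list[str] = []
    quality: float | None = None
    entities: list[Entity] = []

    @model_validator(mode="before")
    @classmethod
    def map_score_to_quality(cls, data: Any) -> Any:
        # Some LLMs emit 'score' instead of 'quality'; copy it over when quality is absent.
        if isinstance(data, dict) and "quality" not in data and "score" in data:
            data["quality"] = data["score"]
        return data

    @field_validator("suggestions", mode="before")
    @classmethod
    def normalize_suggestions(cls, value: Any) -> list[str]:
        # Accept None, a single string/dict, or a mixed list; pull 'point' out of dict entries.
        if value is None:
            return []
        if not isinstance(value, list):
            value = [value]
        normalized: list[str] = []
        for item in value:
            if isinstance(item, dict) and "point" in item:
                normalized.append(str(item["point"]))
            else:
                normalized.append(str(item))
        return normalized

    @field_validator("quality", mode="before")
    @classmethod
    def coerce_quality(cls, value: Any) -> float | None:
        # Coerce ints and numeric strings to float; pass None through unchanged.
        if value is None:
            return None
        return float(value)
```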

Backward compatibility: LongTermMemoryItem already accepts quality=None, and the strict crew evaluation path uses a separate TaskEvaluationPydanticOutput model that remains unchanged.

Review & Testing Checklist for Human

  • Critical: Search codebase for .quality usages - Verify no code assumes TaskEvaluation.quality is always non-None (I checked the main ones but may have missed some)
  • Test with real LLM output - Run a crew with memory=True and verify the fix works end-to-end with actual LLM responses (my tests are unit tests only)
  • Verify suggestions normalization - Check that the normalize_suggestions validator correctly handles all dict formats you've seen in production logs, especially the point/priority structure (a quick check is sketched after this list)
  • Review uv.lock changes - I had to regenerate the lock file due to corruption; verify CI passes and no dependency issues arise
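As a quick check of the dict handling, the payload shape seen in production logs can be run through the model sketch above (illustrative only; the suggestion text is made up and the real validator is the one in the diff):

```python
# Hypothetical payload in the shape reported in #3915.
raw = {
    "suggestions": [
        {"point": "Cite the source document", "priority": "high"},
        "Keep answers concise",
    ],
    "quality": 7,
}
evaluation = TaskEvaluation.model_validate(raw)
assert evaluation.suggestions == ["Cite the source document", "Keep answers concise"]
assert evaluation.quality == 7.0
```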

Test Plan

  1. Create a crew with memory=True and external_memory=ExternalMemory(...)
  2. Run tasks and monitor logs for the "Failed to parse structured output" error
  3. Verify long-term memory saves succeed without validation errors
  4. Check that quality scores are properly recorded (or None when missing); spot checks are sketched below
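For step 4, a couple of spot checks against the model sketch above (these complement, not replace, the end-to-end run with a real LLM):

```python
# The 'score' alias is mapped and coerced to float; a missing quality stays None.
assert TaskEvaluation.model_validate({"score": 8, "suggestions": []}).quality == 8.0
assert TaskEvaluation.model_validate({"suggestions": ["Add concrete examples"]}).quality is None
```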

Notes

  • Pre-existing mypy errors in the file (lines 156, 185, 198) are unrelated to this PR
  • All 18 tests pass (16 new + 2 existing)
  • Ruff linter passes

Devin session: https://app.devin.ai/sessions/8dc1309c760a4898bc9d347c1af9f702
Requested by: João ([email protected])

…gestions

Fixes #3915

This commit addresses Pydantic validation errors that occur when the LLM
output doesn't match the expected TaskEvaluation schema:

1. Missing 'quality' field - LLM sometimes omits this field
2. 'suggestions' as list of dicts - LLM returns [{'point': '...', 'priority': 'high'}]
   instead of list[str]
3. 'score' field instead of 'quality' - LLM uses 'score' as alternate field name

Changes:
- Make 'quality' field optional (float | None) with default None
- Make 'suggestions' field optional with default empty list
- Make 'entities' field optional with default empty list
- Add ConfigDict(extra='ignore') to ignore unexpected fields
- Add model_validator to map 'score' to 'quality' when quality is missing
- Add field_validator for 'suggestions' to normalize dict format to list[str]
  - Extracts 'point' value from dicts with 'point' key
  - Handles single dict, single string, list of mixed types, and None
- Add field_validator for 'quality' to coerce int/str to float

The fix is backward compatible - LongTermMemoryItem already accepts
quality=None, and the strict crew evaluation path uses a separate
TaskEvaluationPydanticOutput model that remains unchanged.

Tests:
- Added 16 comprehensive unit tests covering all edge cases
- All existing tests continue to pass
- Tests replicate exact error scenarios from issue #3915

Co-Authored-By: João <[email protected]>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Co-Authored-By: João <[email protected]>

Development

Successfully merging this pull request may close these issues.

[BUG] ERROR:root:Failed to parse structured output from stream: 1 validation error for TaskEvaluation quality
