Andrew/readme updates #558
Conversation
Summary of Changes
Hello @JCamyre, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request comprehensively revamps the project's README.md to provide a more current and focused overview of Judgeval. The changes aim to better articulate the platform's core capabilities, particularly highlighting its new reinforcement learning integration and emphasizing its role in Agent Behavior Monitoring. The update also streamlines the content for clarity and removes redundant or outdated information.
Highlights
- README Content Overhaul: The README.md file has undergone a significant restructuring and content update to provide a more current and focused overview of the project.
- New Logo Integration: New dark and light mode logos have been implemented, replacing older versions, for improved visual presentation and branding consistency.
- Agent Reinforcement Learning (RL) Feature: A new section introduces Judgeval's integration with Fireworks AI for Reinforcement Fine-Tuning (RFT), highlighting its capability to train agents with minimal code changes.
- Updated Project Overview: The project's high-level description has been refined to emphasize its role as an Agent Behavior Monitoring (ABM) framework, focusing on tracking and judging agent behavior.
- Streamlined Information: Outdated sections, duplicate content, and previous feature descriptions (including GIFs and tables) have been removed or consolidated for improved clarity and conciseness.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request significantly revamps the README.md to highlight new features like Agent Reinforcement Learning and improve the overall structure. The changes make the README more modern and focused. I've left a few comments to address some issues, such as an orphaned HTML tag, placeholder links, and a minor grammatical error. I also have a question about some larger sections of content that have been commented out, as this removes potentially valuable information for users. Overall, a great update to the project's front page.
> `</td>`
> `</tr>`
>
> `</table>`
README.md (Outdated)
> | Custom Scorers | [Link to custom scorers cookbook] |
> | Online Monitoring | [Link to monitoring cookbook] |
> | RL | [Link to RL cookbook] |
README.md (Outdated)
> ## Why Judgeval?
>
> • **Custom Evaluators**: Judgeval provides simple abstractions for custom evaluators and their applications to your agents, supporting LLM-as-a-judge and code-based evaluators that connect to datasets our and metric-tracking infrastructure. [Learn more](https://docs.judgmentlabs.ai/documentation/evaluation/scorers/custom-scorers)
There's a minor grammatical error here. "datasets our and" should be "our datasets and" for better readability.
Suggested change (original → replacement):
> • **Custom Evaluators**: Judgeval provides simple abstractions for custom evaluators and their applications to your agents, supporting LLM-as-a-judge and code-based evaluators that connect to datasets our and metric-tracking infrastructure. [Learn more](https://docs.judgmentlabs.ai/documentation/evaluation/scorers/custom-scorers)
>
> • **Custom Evaluators**: Judgeval provides simple abstractions for custom evaluators and their applications to your agents, supporting LLM-as-a-judge and code-based evaluators that connect to our datasets and metric-tracking infrastructure. [Learn more](https://docs.judgmentlabs.ai/documentation/evaluation/scorers/custom-scorers)
Yes
> <!--
> <img src="assets/product_shot.png" alt="Judgment Platform" width="800" />
>
> ## Installation
>
> | | |
> |:---|:---:|
> | <h3>Evals</h3>Build custom evaluators on top of your agents. Judgeval supports LLM-as-a-judge, manual labeling, and code-based evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• Unit-testing <br>• A/B testing <br>• Online guardrails | <p align="center"><img src="assets/test.png" alt="Evaluation metrics" width="800"/></p> |
> | <h3>Monitoring</h3>Get Slack alerts for agent failures in production. Add custom hooks to address production regressions.<br><br>**Useful for:**<br>• Identifying degradation early <br>• Visualizing performance trends across agent versions and time | <p align="center"><img src="assets/errors.png" alt="Monitoring Dashboard" width="1200"/></p> |
> | <h3>Datasets</h3>Export environment interactions and test cases to datasets for scaled analysis and optimization. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations, enabling continuous learning from production interactions. <br><br>**Useful for:**<br>• Agent environment interaction data for optimization<br>• Scaled analysis for A/B tests | <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
> -->
This large block of commented-out HTML contains a visually rich "Features" section. Similarly, lines 102-121 comment out the "Self-Hosting" and "Development with Cursor" sections. Was it intended to remove this content? If this information is still relevant, it might be better to either restore it or ensure it's accessible elsewhere in the documentation and linked appropriately. Commenting it out hides valuable information from users browsing the README.
JCamyre left a comment
Left comments here
> `rft_provider="fireworks"`
> `)`
Link to the section in the docs about our `train` method.
> Train your agents with reinforcement learning using [Fireworks AI](https://fireworks.ai/)! Judgeval now integrates with Fireworks' Reinforcement Fine-Tuning (RFT) endpoint.
> Judgeval provides a simple harness for integrating GRPO into any Python agent, giving builders a quick method to **try RL with minimal code changes** to their existing agents!
Link to Ishan's cookbook once completed.
> **That's it!** Judgeval automatically manages trajectory collection and reward tagging - your agent can learn from production data with minimal code changes. You can view and monitor training progress for free via the [Judgment Dashboard](https://app.judgmentlabs.ai/).
Should also link to the Optimization dashboard section of the docs once we have it. Linking to the Judgment platform doesn't make sense to me - it would take too many steps to actually get to the optimization page.
Would rather see the Optimization page directly on the docs so I can conceptualize it more easily.
Another note to discuss: demoing each section of the Platform website on the docs. It helps with discoverability and is the fastest way for users to learn about features.
On your second point, do you mean that for Datasets, Tests, Monitoring, PromptScorer and more, we have some kind of section in the docs for each?
README.md (Outdated)
> await trainer.train(
>     agent_function=your_agent_function,
>     scorers=[RewardScorer()], # Custom scorer you define based on task criteria
Change comment to "Custom scorer(s) you define based on task criteria to serve as reward functions"
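To make that concrete, here is a minimal, self-contained sketch of how the quoted call might read with the reworded comment applied. Only `trainer.train(...)`, its `agent_function` and `scorers` arguments, and `rft_provider="fireworks"` come from the diff; the `AgentTrainer` stub and `your_agent_function` below are hypothetical stand-ins, not confirmed judgeval API.

```python
# Hypothetical sketch assembled from the fragments quoted in this thread.
# AgentTrainer and your_agent_function are placeholders, not confirmed judgeval
# API; only trainer.train(...) and rft_provider="fireworks" appear in the diff.
import asyncio

from judgeval.scorers.example_scorer import ExampleScorer


class RewardScorer(ExampleScorer):
    name: str = "Reward Scorer"

    async def a_score_example(self, example):
        # Custom scorer(s) you define based on task criteria to serve as reward functions
        return 1.0 if "resolved" in example.response.lower() else 0.0


async def your_agent_function(task: str) -> str:
    # Your existing agent logic, unchanged apart from being handed to the trainer
    return f"resolved: {task}"


class AgentTrainer:
    """Stub so the sketch is runnable on its own; the real trainer class is not shown in the diff."""

    async def train(self, agent_function, scorers, rft_provider):
        print(f"Would run RFT via {rft_provider} on {agent_function.__name__}")


async def main():
    trainer = AgentTrainer()
    await trainer.train(
        agent_function=your_agent_function,
        scorers=[RewardScorer()],  # Custom scorer(s) you define based on task criteria to serve as reward functions
        rft_provider="fireworks",
    )


asyncio.run(main())
```

The scorer doubles as the reward signal for RFT, which is why the reworded comment above calls out "to serve as reward functions".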
> ### Start monitoring with Judgeval
>
> ## ✨ Features
I think we should have a Custom scorers + async eval example here. We want to stress full customization that fits users' agent-specific behavior scorers.
```python
from judgeval.tracer import Tracer, wrap
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from openai import OpenAI

judgment = Tracer(project_name="default_project")
client = wrap(OpenAI())

# Define a custom example class
class CustomerRequest(Example):
    request: str
    response: str

# Define an agent-specific custom scorer
class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"
    server_hosted: bool = True

    async def a_score_example(self, example: CustomerRequest):
        # Custom scoring logic
        if "package" in example.response.lower():
            self.reason = "The response addresses the package inquiry"
            return 1.0
        else:
            self.reason = "The response does not address the package inquiry"
            return 0.0

@judgment.observe(span_type="tool")
def get_customer_request():
    return "Where is my package?"

@judgment.observe(span_type="function")
def main():
    customer_request = get_customer_request()

    # Generate response using LLM
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": customer_request}]
    ).choices[0].message.content

    # Run online evaluation with custom scorer
    judgment.async_evaluate(
        scorer=ResolutionScorer(threshold=0.8),
        example=CustomerRequest(
            request=customer_request,
            response=response
        )
    )
    return response

main()
```
Clarified the functionality of judgeval's scorer customization and added details about its secure container hosting.
> • **Custom Evaluators**: No restriction to only monitoring with prefab scorers. Judgeval provides simple abstractions for custom python evaluators and their applications, supporting any LLM-as-a-judge rubrics and code-based scorers that integrate to our live agent-tracking infrastructure. [Learn more](https://docs.judgmentlabs.ai/documentation/evaluation/scorers/custom-scorers)
>
> • **Production Monitoring**: Run any custom scorer to flag agent behaviors online in production. Group agent runs by behavior type into buckets for deeper analysis. Get Slack alerts for failures and add custom hooks to address regressions before they impact users. [Learn more](https://docs.judgmentlabs.ai/documentation/performance/online-evals)
Should we have an example here also?
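If an example is added for this bullet, a compact online-monitoring sketch along these lines could work. It reuses the `Tracer`/`async_evaluate` pattern from the custom scorer example earlier in this thread; `LoopDetectionScorer` and `AgentRun` are illustrative names invented for the sketch, not built-in judgeval classes.

```python
# Illustrative sketch reusing the Tracer/async_evaluate pattern shown earlier in
# this thread; LoopDetectionScorer is a made-up behavior scorer for the sketch,
# not a built-in judgeval scorer.
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from judgeval.tracer import Tracer

judgment = Tracer(project_name="default_project")

# Custom example class capturing the agent behavior we want to monitor
class AgentRun(Example):
    steps: list[str]

# Agent-specific behavior scorer: flag runs where the agent repeats a step
class LoopDetectionScorer(ExampleScorer):
    name: str = "Loop Detection Scorer"

    async def a_score_example(self, example: AgentRun):
        repeated = any(a == b for a, b in zip(example.steps, example.steps[1:]))
        self.reason = "Agent repeated a step back-to-back" if repeated else "No repeated steps"
        return 0.0 if repeated else 1.0

@judgment.observe(span_type="function")
def run_agent():
    steps = ["plan", "search", "search", "answer"]  # stand-in for your agent's real trajectory
    # Score the live run online; low scores can surface as flagged behaviors in production
    judgment.async_evaluate(
        scorer=LoopDetectionScorer(threshold=1.0),
        example=AgentRun(steps=steps),
    )
    return steps

run_agent()
```

Flagged runs could then feed the Slack alerts and custom hooks described in the Production Monitoring bullet on the platform side.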
seancfong left a comment
Left a couple suggestions
> [Demo](https://www.youtube.com/watch?v=1S4LixpVbcc) • [Bug Reports](https://github.com/JudgmentLabs/judgeval/issues) • [Changelog](https://docs.judgmentlabs.ai/changelog/2025-04-21)
> [](https://docs.judgmentlabs.ai/documentation)
> [](https://app.judgmentlabs.ai/register)
(minor) These badges are still our old colors. Should we use orange for these?
Also Judgment Cloud should be Judgment Platform
cc: @shunuen0
Agreed
> `</td>`
> `</tr>`
>
> `</table>`
Suggested change:
> `</table>`

agree with gemini
rishi763 left a comment
LGTM
Co-authored-by: Sean Fong <[email protected]>

Summary
Checklist