
Conversation

Collaborator

@JCamyre JCamyre commented Sep 16, 2025

πŸ“ Summary

  • 1. ...

✅ Checklist

@JCamyre JCamyre marked this pull request as draft September 16, 2025 20:06
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @JCamyre, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request comprehensively revamps the project's README.md to provide a more current and focused overview of Judgeval. The changes aim to better articulate the platform's core capabilities, particularly highlighting its new reinforcement learning integration and emphasizing its role in Agent Behavior Monitoring. The update also streamlines the content for clarity and removes redundant or outdated information.

Highlights

  • README Content Overhaul: The README.md file has undergone a significant restructuring and content update to provide a more current and focused overview of the project.
  • New Logo Integration: New dark and light mode logos have been implemented, replacing older versions, for improved visual presentation and branding consistency.
  • Agent Reinforcement Learning (RL) Feature: A new section introduces Judgeval's integration with Fireworks AI for Reinforcement Fine-Tuning (RFT), highlighting its capability to train agents with minimal code changes.
  • Updated Project Overview: The project's high-level description has been refined to emphasize its role as an Agent Behavior Monitoring (ABM) framework, focusing on tracking and judging agent behavior.
  • Streamlined Information: Outdated sections, duplicate content, and previous feature descriptions (including GIFs and tables) have been removed or consolidated for improved clarity and conciseness.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly revamps the README.md to highlight new features like Agent Reinforcement Learning and improve the overall structure. The changes make the README more modern and focused. I've left a few comments to address some issues, such as an orphaned HTML tag, placeholder links, and a minor grammatical error. I also have a question about some larger sections of content that have been commented out, as this removes potentially valuable information for users. Overall, a great update to the project's front page.

</td>
</tr>

</table>

Contributor

high

There's an orphaned closing </table> tag here. The table it belonged to seems to have been removed, but this tag was left behind. This can cause rendering issues in some Markdown parsers and should be removed.

README.md Outdated
Comment on lines 56 to 58
| Custom Scorers | [Link to custom scorers cookbook] |
| Online Monitoring | [Link to monitoring cookbook] |
| RL | [Link to RL cookbook] |

Contributor

high

The links in the "Cookbooks" table are placeholders (e.g., [Link to custom scorers cookbook]). These should be updated to point to the correct cookbook pages before merging to provide a good user experience.

README.md Outdated

## Why Judgeval?

• **Custom Evaluators**: Judgeval provides simple abstractions for custom evaluators and their applications to your agents, supporting LLM-as-a-judge and code-based evaluators that connect to datasets our and metric-tracking infrastructure. [Learn more](https://docs.judgmentlabs.ai/documentation/evaluation/scorers/custom-scorers)

Contributor

medium

There's a minor grammatical error here. "datasets our and" should be "our datasets and" for better readability.

Suggested change
• **Custom Evaluators**: Judgeval provides simple abstractions for custom evaluators and their applications to your agents, supporting LLM-as-a-judge and code-based evaluators that connect to datasets our and metric-tracking infrastructure. [Learn more](https://docs.judgmentlabs.ai/documentation/evaluation/scorers/custom-scorers)
• **Custom Evaluators**: Judgeval provides simple abstractions for custom evaluators and their applications to your agents, supporting LLM-as-a-judge and code-based evaluators that connect to our datasets and metric-tracking infrastructure. [Learn more](https://docs.judgmentlabs.ai/documentation/evaluation/scorers/custom-scorers)

Collaborator Author

Yes

Comment on lines 73 to 83
<!--
<img src="assets/product_shot.png" alt="Judgment Platform" width="800" />
## 🛠️ Installation
| | |
|:---|:---:|
| <h3>🧪 Evals</h3>Build custom evaluators on top of your agents. Judgeval supports LLM-as-a-judge, manual labeling, and code-based evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 A/B testing <br>• 🛡️ Online guardrails | <p align="center"><img src="assets/test.png" alt="Evaluation metrics" width="800"/></p> |
| <h3>📡 Monitoring</h3>Get Slack alerts for agent failures in production. Add custom hooks to address production regressions.<br><br> **Useful for:** <br>• 📉 Identifying degradation early <br>• 📈 Visualizing performance trends across agent versions and time | <p align="center"><img src="assets/errors.png" alt="Monitoring Dashboard" width="1200"/></p> |
| <h3>📊 Datasets</h3>Export environment interactions and test cases to datasets for scaled analysis and optimization. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations, enabling continuous learning from production interactions. <br><br> **Useful for:**<br>• 🗃️ Agent environment interaction data for optimization<br>• 🔄 Scaled analysis for A/B tests | <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
-->

Contributor

medium

This large block of commented-out HTML contains a visually rich "Features" section. Similarly, lines 102-121 comment out the "Self-Hosting" and "Development with Cursor" sections. Was it intended to remove this content? If this information is still relevant, it might be better to either restore it or ensure it's accessible elsewhere in the documentation and linked appropriately. Commenting it out hides valuable information from users browsing the README.

Collaborator Author

@JCamyre JCamyre left a comment

Left comments here

```python
rft_provider="fireworks"
)
```

Collaborator Author

Link to section in docs about our train

Contributor

(Screenshot attached: 2025-09-17 at 6:48:06 PM.) I don't think there is one rn?

Train your agents with reinforcement learning using [Fireworks AI](https://fireworks.ai/)! Judgeval now integrates with Fireworks' Reinforcement Fine-Tuning (RFT) endpoint.
Judgeval provides a simple harness for integrating GRPO into any Python agent, giving builders a quick method to **try RL with minimal code changes** to their existing agents!

Collaborator Author

Link to Ishan's cookbook once completed.

```python
)
```

**That's it!** Judgeval automatically manages trajectory collection and reward tagging - your agent can learn from production data with minimal code changes. You can view and monitor training progress for free via the [Judgment Dashboard](https://app.judgmentlabs.ai/).
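
For reviewers who want to see the whole call in one place, here is a minimal, self-contained mock of the API shape under discussion. Only the train(...) keyword arguments (agent_function, scorers, and rft_provider="fireworks") come from the hunks quoted in this thread; the Trainer class, its body, and the scorer below are placeholder assumptions, not judgeval's actual implementation.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical stand-in for judgeval's RL harness, shaped after the
# trainer.train(...) call quoted in this thread. The real class name and
# constructor are not shown in this PR.
class Trainer:
    async def train(
        self,
        agent_function: Callable[[str], Awaitable[str]],
        scorers: Sequence[object],
        rft_provider: str,
    ) -> None:
        # The real harness collects trajectories from agent_function, tags
        # them with rewards from the scorers, and submits a Reinforcement
        # Fine-Tuning job to the named provider (here, Fireworks).
        print(f"Submitting RFT job to {rft_provider}...")

class RewardScorer:
    """Custom scorer defined on task criteria, acting as the reward function."""

    async def a_score_example(self, example) -> float:
        return 1.0  # reward logic based on task criteria goes here

async def your_agent_function(prompt: str) -> str:
    # Existing agent logic stays unchanged; the harness only wraps it.
    return f"answer to: {prompt}"

async def main() -> None:
    trainer = Trainer()
    await trainer.train(
        agent_function=your_agent_function,
        scorers=[RewardScorer()],  # custom scorer(s) serving as reward functions
        rft_provider="fireworks",  # Fireworks' RFT endpoint
    )

asyncio.run(main())
```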

Collaborator Author

Should also link to the Optimization dashboard section of the docs once we have it. Linking to the Judgment platform doesn't make sense to me - it would take too many steps to actually get to the optimization page.

Would rather see the Optimization page directly on the docs so I can conceptualize it more easily.

Collaborator Author

Another note to discuss: demoing each section of the Platform website on the docs. Helps with discoverability; it's the fastest way for users to learn about features.

Contributor

On your second point, do you mean that for Datasets, Tests, Monitoring, PromptScorer and more, we have some kind of section in the docs for each?

README.md Outdated
```python
await trainer.train(
    agent_function=your_agent_function,
    scorers=[RewardScorer()], # Custom scorer you define based on task criteria
```

Collaborator Author

Change comment to "Custom scorer(s) you define based on task criteria to serve as reward functions"

### Start monitoring with Judgeval

## ✨ Features

Collaborator Author

I think we should have a Custom scorers + async eval example here. We want to stress full customization that fits users' agent-specific behavior scorers:

```python
from judgeval.tracer import Tracer, wrap
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from openai import OpenAI

judgment = Tracer(project_name="default_project")
client = wrap(OpenAI())

# Define a custom example class
class CustomerRequest(Example):
    request: str
    response: str

# Define an agent-specific custom scorer
class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"
    server_hosted: bool = True

    async def a_score_example(self, example: CustomerRequest):
        # Custom scoring logic
        if "package" in example.response.lower():
            self.reason = "The response addresses the package inquiry"
            return 1.0
        else:
            self.reason = "The response does not address the package inquiry"
            return 0.0

@judgment.observe(span_type="tool")
def get_customer_request():
    return "Where is my package?"

@judgment.observe(span_type="function")
def main():
    customer_request = get_customer_request()

    # Generate response using LLM
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": customer_request}]
    ).choices[0].message.content

    # Run online evaluation with custom scorer
    judgment.async_evaluate(
        scorer=ResolutionScorer(threshold=0.8),
        example=CustomerRequest(
            request=customer_request,
            response=response
        )
    )

    return response

main()
```

SecroLoL and others added 3 commits September 17, 2025 22:47
Clarified the functionality of judgeval's scorer customization and added details about its secure container hosting.

• **Custom Evaluators**: No restriction to monitoring with only prefab scorers. Judgeval provides simple abstractions for custom Python evaluators and their applications, supporting any LLM-as-a-judge rubrics and code-based scorers that integrate with our live agent-tracking infrastructure. [Learn more](https://docs.judgmentlabs.ai/documentation/evaluation/scorers/custom-scorers)

• **Production Monitoring**: Run any custom scorer to flag agent behaviors online in production. Group agent runs by behavior type into buckets for deeper analysis. Get Slack alerts for failures and add custom hooks to address regressions before they impact users. [Learn more](https://docs.judgmentlabs.ai/documentation/performance/online-evals)

Collaborator

Should we have an example here also?
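
One possible shape for such an example, reusing only the Tracer / wrap / async_evaluate pattern already shown earlier in this thread; the SupportReply example class, the PolicyScorer, and its threshold are illustrative assumptions rather than anything taken from this PR:

```python
from judgeval.tracer import Tracer, wrap
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from openai import OpenAI

judgment = Tracer(project_name="default_project")
client = wrap(OpenAI())  # LLM calls made through this client are traced

# Illustrative example class for a support-agent behavior
class SupportReply(Example):
    question: str
    answer: str

# Illustrative custom scorer: flags replies that never mention refunds
class PolicyScorer(ExampleScorer):
    name: str = "Policy Scorer"

    async def a_score_example(self, example: SupportReply):
        if "refund" in example.answer.lower():
            self.reason = "Reply cites the refund policy"
            return 1.0
        self.reason = "Reply omits the refund policy"
        return 0.0

@judgment.observe(span_type="function")
def answer_question(question: str) -> str:
    answer = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    # Online evaluation: the score attaches to the live trace, so failing
    # runs can trigger the Slack alerts and hooks configured on the platform.
    judgment.async_evaluate(
        scorer=PolicyScorer(threshold=0.5),
        example=SupportReply(question=question, answer=answer),
    )
    return answer

answer_question("Can I get my money back?")
```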

Contributor

@seancfong seancfong left a comment

Left a couple suggestions


[Demo](https://www.youtube.com/watch?v=1S4LixpVbcc) • [Bug Reports](https://github.com/JudgmentLabs/judgeval/issues) • [Changelog](https://docs.judgmentlabs.ai/changelog/2025-04-21)
[![Docs](https://img.shields.io/badge/Documentation-blue)](https://docs.judgmentlabs.ai/documentation)
[![Judgment Cloud](https://img.shields.io/badge/Judgment%20Cloud-brightgreen)](https://app.judgmentlabs.ai/register)

Contributor

(minor) These badges are still our old colors. Should we use orange for these?
Also, Judgment Cloud should be Judgment Platform.

e.g. Judgment Labs

cc: @shunuen0

Contributor

Agreed

</td>
</tr>

</table>

Contributor

Suggested change
</table>

agree with gemini

Contributor

@rishi763 rishi763 left a comment

LGTM

Co-authored-by: Sean Fong <[email protected]>