
Conversation

Collaborator

@JCamyre JCamyre commented Sep 16, 2025

πŸ“ Summary

  • 1. ...

✅ Checklist

@JCamyre JCamyre marked this pull request as draft September 16, 2025 20:06
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @JCamyre, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request comprehensively revamps the project's README.md to provide a more current and focused overview of Judgeval. The changes aim to better articulate the platform's core capabilities, particularly highlighting its new reinforcement learning integration and emphasizing its role in Agent Behavior Monitoring. The update also streamlines the content for clarity and removes redundant or outdated information.

Highlights

  • README Content Overhaul: The README.md file has undergone a significant restructuring and content update to provide a more current and focused overview of the project.
  • New Logo Integration: New dark and light mode logos have been implemented, replacing older versions, for improved visual presentation and branding consistency.
  • Agent Reinforcement Learning (RL) Feature: A new section introduces Judgeval's integration with Fireworks AI for Reinforcement Fine-Tuning (RFT), highlighting its capability to train agents with minimal code changes.
  • Updated Project Overview: The project's high-level description has been refined to emphasize its role as an Agent Behavior Monitoring (ABM) framework, focusing on tracking and judging agent behavior.
  • Streamlined Information: Outdated sections, duplicate content, and previous feature descriptions (including GIFs and tables) have been removed or consolidated for improved clarity and conciseness.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly revamps the README.md to highlight new features like Agent Reinforcement Learning and improve the overall structure. The changes make the README more modern and focused. I've left a few comments to address some issues, such as an orphaned HTML tag, placeholder links, and a minor grammatical error. I also have a question about some larger sections of content that have been commented out, as this removes potentially valuable information for users. Overall, a great update to the project's front page.

</td>
</tr>

</table>

Contributor

high

There's an orphaned closing </table> tag here. The table it belonged to seems to have been removed, but this tag was left behind. This can cause rendering issues in some Markdown parsers and should be removed.

README.md Outdated
Comment on lines 56 to 58
| Custom Scorers | [Link to custom scorers cookbook] |
| Online Monitoring | [Link to monitoring cookbook] |
| RL | [Link to RL cookbook] |

Contributor

high

The links in the "Cookbooks" table are placeholders (e.g., [Link to custom scorers cookbook]). These should be updated to point to the correct cookbook pages before merging to provide a good user experience.

README.md Outdated

## Why Judgeval?

• **Custom Evaluators**: Judgeval provides simple abstractions for custom evaluators and their applications to your agents, supporting LLM-as-a-judge and code-based evaluators that connect to datasets our and metric-tracking infrastructure. [Learn more](https://docs.judgmentlabs.ai/documentation/evaluation/scorers/custom-scorers)

Contributor

medium

There's a minor grammatical error here. "datasets our and" should be "our datasets and" for better readability.

Suggested change
• **Custom Evaluators**: Judgeval provides simple abstractions for custom evaluators and their applications to your agents, supporting LLM-as-a-judge and code-based evaluators that connect to datasets our and metric-tracking infrastructure. [Learn more](https://docs.judgmentlabs.ai/documentation/evaluation/scorers/custom-scorers)
• **Custom Evaluators**: Judgeval provides simple abstractions for custom evaluators and their applications to your agents, supporting LLM-as-a-judge and code-based evaluators that connect to our datasets and metric-tracking infrastructure. [Learn more](https://docs.judgmentlabs.ai/documentation/evaluation/scorers/custom-scorers)

Collaborator Author

Yes

Comment on lines 73 to 83
<!--
<img src="assets/product_shot.png" alt="Judgment Platform" width="800" />
## 🛠️ Installation
| | |
|:---|:---:|
| <h3>🧪 Evals</h3>Build custom evaluators on top of your agents. Judgeval supports LLM-as-a-judge, manual labeling, and code-based evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 A/B testing <br>• 🛡️ Online guardrails | <p align="center"><img src="assets/test.png" alt="Evaluation metrics" width="800"/></p> |
| <h3>📡 Monitoring</h3>Get Slack alerts for agent failures in production. Add custom hooks to address production regressions.<br><br> **Useful for:** <br>• 📉 Identifying degradation early <br>• 📈 Visualizing performance trends across agent versions and time | <p align="center"><img src="assets/errors.png" alt="Monitoring Dashboard" width="1200"/></p> |
| <h3>📊 Datasets</h3>Export environment interactions and test cases to datasets for scaled analysis and optimization. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations, enabling continuous learning from production interactions. <br><br> **Useful for:**<br>• 🗃️ Agent environment interaction data for optimization<br>• 🔄 Scaled analysis for A/B tests | <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
-->

Contributor

medium

This large block of commented-out HTML contains a visually rich "Features" section. Similarly, lines 102-121 comment out the "Self-Hosting" and "Development with Cursor" sections. Was it intended to remove this content? If this information is still relevant, it might be better to either restore it or ensure it's accessible elsewhere in the documentation and linked appropriately. Commenting it out hides valuable information from users browsing the README.

Collaborator Author

@JCamyre JCamyre left a comment

Left comments here

```python
rft_provider="fireworks"
)
```

Collaborator Author

Link to section in docs about our train

Contributor

(Screenshot attached: 2025-09-17 at 6:48:06 PM.) I don't think there is one rn?

Train your agents with reinforcement learning using [Fireworks AI](https://fireworks.ai/)! Judgeval now integrates with Fireworks' Reinforcement Fine-Tuning (RFT) endpoint.
Judgeval provides a simple harness for integrating GRPO into any Python agent, giving builders a quick method to **try RL with minimal code changes** to their existing agents!

Collaborator Author

Link to Ishan's cookbook once completed.

```python
)
```

**That's it!** Judgeval automatically manages trajectory collection and reward tagging - your agent can learn from production data with minimal code changes. You can view and monitor training progress for free via the [Judgment Dashboard](https://app.judgmentlabs.ai/).
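
For reviewers who want to see the whole call in one place, here is a minimal, self-contained mock of the API shape under discussion. Only the train(...) keyword arguments (agent_function, scorers, and rft_provider="fireworks") come from the hunks quoted in this thread; the Trainer class, its body, and the scorer below are placeholder assumptions, not judgeval's actual implementation.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical stand-in for judgeval's RL harness, shaped after the
# trainer.train(...) call quoted in this thread. The real class name and
# constructor are not shown in this PR.
class Trainer:
    async def train(
        self,
        agent_function: Callable[[str], Awaitable[str]],
        scorers: Sequence[object],
        rft_provider: str,
    ) -> None:
        # The real harness collects trajectories from agent_function, tags
        # them with rewards from the scorers, and submits a Reinforcement
        # Fine-Tuning job to the named provider (here, Fireworks).
        print(f"Submitting RFT job to {rft_provider}...")

class RewardScorer:
    """Custom scorer defined on task criteria, acting as the reward function."""

    async def a_score_example(self, example) -> float:
        return 1.0  # reward logic based on task criteria goes here

async def your_agent_function(prompt: str) -> str:
    # Existing agent logic stays unchanged; the harness only wraps it.
    return f"answer to: {prompt}"

async def main() -> None:
    trainer = Trainer()
    await trainer.train(
        agent_function=your_agent_function,
        scorers=[RewardScorer()],  # custom scorer(s) serving as reward functions
        rft_provider="fireworks",  # Fireworks' RFT endpoint
    )

asyncio.run(main())
```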

Collaborator Author

Should also link to the Optimization dashboard section of the docs once we have it. Linking to the Judgment platform doesn't make sense to me - it would take too many steps to actually get to the optimization page.

Would rather see the Optimization page directly on the docs so I can conceptualize it more easily.

Collaborator Author

Another note to discuss: demoing each section of the Platform website on the docs. Helps with discoverability; it's the fastest way for users to learn about features.

Contributor

On your second point, do you mean that for Datasets, Tests, Monitoring, PromptScorer and more, we have some kind of section in the docs for each?

README.md Outdated
```python
await trainer.train(
    agent_function=your_agent_function,
    scorers=[RewardScorer()], # Custom scorer you define based on task criteria
```

Collaborator Author

Change comment to "Custom scorer(s) you define based on task criteria to serve as reward functions"

### Start monitoring with Judgeval

## ✨ Features

Collaborator Author

I think we should have a Custom scorers + async eval example here. We want to stress full customization that fits users' agent-specific behavior scorers:

```python
from judgeval.tracer import Tracer, wrap
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from openai import OpenAI

judgment = Tracer(project_name="default_project")
client = wrap(OpenAI())

# Define a custom example class
class CustomerRequest(Example):
    request: str
    response: str

# Define an agent-specific custom scorer
class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"
    server_hosted: bool = True

    async def a_score_example(self, example: CustomerRequest):
        # Custom scoring logic
        if "package" in example.response.lower():
            self.reason = "The response addresses the package inquiry"
            return 1.0
        else:
            self.reason = "The response does not address the package inquiry"
            return 0.0

@judgment.observe(span_type="tool")
def get_customer_request():
    return "Where is my package?"

@judgment.observe(span_type="function")
def main():
    customer_request = get_customer_request()

    # Generate response using LLM
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": customer_request}]
    ).choices[0].message.content

    # Run online evaluation with custom scorer
    judgment.async_evaluate(
        scorer=ResolutionScorer(threshold=0.8),
        example=CustomerRequest(
            request=customer_request,
            response=response
        )
    )

    return response

main()
```

SecroLoL and others added 3 commits September 17, 2025 22:47
Clarified the functionality of judgeval's scorer customization and added details about its secure container hosting.

• **Custom Evaluators**: No restriction to monitoring with only prefab scorers. Judgeval provides simple abstractions for custom Python evaluators and their applications, supporting any LLM-as-a-judge rubrics and code-based scorers that integrate with our live agent-tracking infrastructure. [Learn more](https://docs.judgmentlabs.ai/documentation/evaluation/scorers/custom-scorers)

• **Production Monitoring**: Run any custom scorer to flag agent behaviors online in production. Group agent runs by behavior type into buckets for deeper analysis. Get Slack alerts for failures and add custom hooks to address regressions before they impact users. [Learn more](https://docs.judgmentlabs.ai/documentation/performance/online-evals)

Collaborator

Should we have an example here also?
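
One possible shape for such an example, reusing only the Tracer / wrap / async_evaluate pattern already shown earlier in this thread; the SupportReply example class, the PolicyScorer, and its threshold are illustrative assumptions rather than anything taken from this PR:

```python
from judgeval.tracer import Tracer, wrap
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from openai import OpenAI

judgment = Tracer(project_name="default_project")
client = wrap(OpenAI())  # LLM calls made through this client are traced

# Illustrative example class for a support-agent behavior
class SupportReply(Example):
    question: str
    answer: str

# Illustrative custom scorer: flags replies that never mention refunds
class PolicyScorer(ExampleScorer):
    name: str = "Policy Scorer"

    async def a_score_example(self, example: SupportReply):
        if "refund" in example.answer.lower():
            self.reason = "Reply cites the refund policy"
            return 1.0
        self.reason = "Reply omits the refund policy"
        return 0.0

@judgment.observe(span_type="function")
def answer_question(question: str) -> str:
    answer = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    # Online evaluation: the score attaches to the live trace, so failing
    # runs can trigger the Slack alerts and hooks configured on the platform.
    judgment.async_evaluate(
        scorer=PolicyScorer(threshold=0.5),
        example=SupportReply(question=question, answer=answer),
    )
    return answer

answer_question("Can I get my money back?")
```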

Contributor

@seancfong seancfong left a comment

Left a couple suggestions


[Demo](https://www.youtube.com/watch?v=1S4LixpVbcc) • [Bug Reports](https://github.com/JudgmentLabs/judgeval/issues) • [Changelog](https://docs.judgmentlabs.ai/changelog/2025-04-21)
[![Docs](https://img.shields.io/badge/Documentation-blue)](https://docs.judgmentlabs.ai/documentation)
[![Judgment Cloud](https://img.shields.io/badge/Judgment%20Cloud-brightgreen)](https://app.judgmentlabs.ai/register)

Contributor

(minor) These badges are still our old colors. Should we use orange for these?
Also, Judgment Cloud should be Judgment Platform.

e.g. Judgment Labs

cc: @shunuen0

Contributor

Agreed

</td>
</tr>

</table>

Contributor

Suggested change
</table>

agree with gemini

Contributor

@rishi763 rishi763 left a comment

LGTM

Co-authored-by: Sean Fong <[email protected]>