AI agents are being considered as a way to address challenges that many workplaces face, such as an aging population, a shortage of human resources, and delays in decision-making. To improve the capabilities of AI agents, we have developed a benchmark suite that evaluates them by extending evaluation methods for web operations to field operations.
FieldWorkArena is a groundbreaking benchmark suite for evaluating AI agents. Using data and tasks from Fujitsu's actual factories and warehouses, it quantitatively evaluates how effectively AI agents perform in the field. This clarifies the challenges of AI adoption and provides evidence to support deploying AI agents in the field.
See below for more details.
https://en-documents.research.global.fujitsu.com/fieldworkarena/
- 2025-06-30: The Retail dataset has been released on Hugging Face. If you would like to obtain it, please apply here.
- 2025-06-30: The Warehouse dataset has been released on Hugging Face. If you would like to obtain it, please apply here.
- 2025-02-27: The Factory dataset has been released on Hugging Face.
The current reporting functionality of FieldWorkArena uses BrowserGym and WorkArena, so this implementation requires a ServiceNow instance.
In the future, the implementation may change as the action space and task definitions are modified.
- Go to https://developer.servicenow.com/ and create an account.
- Click "Request an instance" and select the "Washington" release (initializing the instance will take a few minutes). If you cannot select a release, request an instance with the default release, click "Release instance", and then click "Request an instance" again.
- Once the instance is ready, you should see your instance URL and credentials. If not, click "Return to the Developer Portal", then navigate to "Manage instance password" and click "Reset instance password".
- You should now see your URL and credentials. Based on this information, set the following environment variables:
  - SNOW_INSTANCE_URL: The URL of your ServiceNow developer instance
  - SNOW_INSTANCE_UNAME: The username, which should be "admin"
  - SNOW_INSTANCE_PWD: The password. Make sure you place the value in quotes "" and be mindful of escaping special shell characters. Running echo $SNOW_INSTANCE_PWD should print the correct password.
- Log into your instance via a browser using the admin credentials. Close any popup that appears on the main screen (e.g., agreeing to analytics).
Warning: Feel free to look around the platform, but please make sure you revert any changes (e.g., changes to list views, pinning some menus, etc.) as these changes will be persistent and affect the benchmarking process.
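Before going further, it can help to confirm that the three ServiceNow variables are actually visible to Python. The following is a minimal sketch using only the standard library; the helper name missing_env_vars is hypothetical and not part of FieldWorkArena. It deliberately never prints the password.

```python
import os

# Variable names from the setup steps above.
REQUIRED_VARS = ("SNOW_INSTANCE_URL", "SNOW_INSTANCE_UNAME", "SNOW_INSTANCE_PWD")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        # Print only non-secret values; the password is never echoed here.
        print("Instance URL:", os.environ["SNOW_INSTANCE_URL"])
        if os.environ["SNOW_INSTANCE_UNAME"] != "admin":
            print('Warning: SNOW_INSTANCE_UNAME is expected to be "admin"')
```

An empty value is treated the same as an unset one, since an empty password or URL would fail later anyway.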
git clone https://github.com/FujitsuResearch/FieldWorkArena.git
cd FieldWorkArena
pip install -r requirements.txt
pip install .
Then, install the Playwright browsers:
playwright install
Finally, run this command in a terminal to upload the benchmark data to your ServiceNow instance:
workarena-install
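As an optional sanity check after installation, the sketch below reports which Python modules cannot be found in the current environment. The module names in required are assumptions based on the packages mentioned above (Playwright, BrowserGym, WorkArena); adjust them to match what requirements.txt actually installs. It does not check whether the Playwright browser binaries were downloaded.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be located in the current environment."""
    missing = []
    for name in names:
        try:
            if importlib.util.find_spec(name) is None:
                missing.append(name)
        except ImportError:
            # find_spec imports the parent of a dotted name; a missing parent ends up here.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed module names; not taken from the FieldWorkArena documentation.
    required = ["playwright", "browsergym.core", "browsergym.workarena"]
    missing = missing_modules(required)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```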
- Go to https://en-documents.research.global.fujitsu.com/fieldworkarena/ .
- Click the "Evaluation dataset" link and apply from the Forms page.
- Confirm the download URL in the email sent from FieldWorkArena (this may take a few business days).
- Unzip the downloaded file. The files should be organized in the following directory structure:
FieldWorkArena
├── ...
├── data
│   ├── document
│   ├── image
│   └── movie
└── ...
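To confirm the dataset was unzipped into the right place, a small standard-library check like the one below can be run from the repository root. It is a sketch: the function name is hypothetical, and it only verifies that the three subdirectories in the tree exist, not that their contents are complete.

```python
from pathlib import Path

# Subdirectories of data/ shown in the tree above.
EXPECTED_SUBDIRS = ("document", "image", "movie")


def missing_data_dirs(repo_root="."):
    """Return the expected data/ subdirectories that do not exist under repo_root."""
    data_dir = Path(repo_root) / "data"
    return [name for name in EXPECTED_SUBDIRS if not (data_dir / name).is_dir()]


if __name__ == "__main__":
    # Assumes this is run from the FieldWorkArena repository root.
    missing = missing_data_dirs()
    if missing:
        print("Missing under data/:", ", ".join(missing))
    else:
        print("Dataset layout looks correct.")
```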
Set the following environment variable:
OPENAI_API_KEY=Your OpenAI API key
To run the demos, an X server environment is required.
In these demos, the task is to search for incidents in the image according to the query and to report any incidents found.
python demo/run_demo.py --task_name fieldworkarena.demo.1.report
python demo/run_demo.py --task_name fieldworkarena.demo.2.report
python demo/run_demo.py --task_name fieldworkarena.demo.3.report
python demo/run_demo.py --task_name fieldworkarena.demo.4.report
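The four demo commands differ only in the task index, so they can be run back to back with a small wrapper. This is a hypothetical convenience script, not part of the repository; it assumes it is started from the repository root, in the same environment where FieldWorkArena is installed, with an X server available as noted above.

```python
import subprocess
import sys
from pathlib import Path


def demo_task_names(count=4):
    """Build the demo task names used in the commands above."""
    return [f"fieldworkarena.demo.{i}.report" for i in range(1, count + 1)]


def run_demos(script="demo/run_demo.py", count=4):
    """Run each demo in turn with the current interpreter, stopping on the first failure."""
    for task_name in demo_task_names(count):
        cmd = [sys.executable, script, "--task_name", task_name]
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Assumes the current directory is the repository root.
    if Path("demo/run_demo.py").is_file():
        run_demos()
    else:
        print("demo/run_demo.py not found; run this from the FieldWorkArena root.")
```

Using sys.executable ensures the demos run under the same Python environment where the package was installed.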
Run the following script; the results will be saved in the results directory.
On Linux:
All tasks
bash run_tasks.sh all
Each category
# for factory
bash run_tasks.sh factory
# for warehouse
bash run_tasks.sh warehouse
# for retail
bash run_tasks.sh retail
On Windows:
All tasks
.\run_all_tasks.bat all
Each category
# for factory
.\run_tasks.bat factory
# for warehouse
.\run_tasks.bat warehouse
# for retail
.\run_tasks.bat retail
The agent is defined in 'demo/agent.py'. To test your own agent, you mainly need to modify the 'get_action()' method.
Attention!!
After adding your own agents and scenarios, please run pip install . again.
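The exact interface is defined in 'demo/agent.py', which is not reproduced here. The following is a hypothetical skeleton that assumes, as in BrowserGym's demo agent, that 'get_action()' receives an observation dictionary and returns an action string together with an info dictionary, and that the task text is stored under a "goal" key. It also assumes a send_msg_to_user action is available in the configured action space. Check all of these assumptions against 'demo/agent.py' before adapting it.

```python
import json


class MyAgent:
    """Hypothetical agent skeleton; mirror the real class in demo/agent.py."""

    def get_action(self, obs: dict) -> tuple[str, dict]:
        # Assumption: the task text is available under the "goal" key.
        goal = obs.get("goal", "")
        # Replace this placeholder with your own reasoning (e.g. a call to your model).
        message = f"Received goal: {goal}"
        # json.dumps yields a double-quoted, escaped string literal for the action call.
        action = f"send_msg_to_user({json.dumps(message)})"
        return action, {"think": "placeholder: no reasoning implemented"}
```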
Compress the results directory and send it as a reply to the email that contained the download URL of the evaluation data.
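Any archiving tool works for this step. As one option, the standard-library sketch below zips the results directory and keeps the top-level results/ folder inside the archive; the function name and the default archive name are assumptions, not a required submission format.

```python
import shutil
from pathlib import Path


def archive_results(results_dir="results", archive_base="results"):
    """Zip results_dir into <archive_base>.zip and return the archive path."""
    results_path = Path(results_dir)
    if not results_path.is_dir():
        raise FileNotFoundError(f"results directory not found: {results_path}")
    # root_dir/base_dir keep the "results/" prefix on every entry in the archive.
    return shutil.make_archive(
        str(archive_base),
        "zip",
        root_dir=str(results_path.parent),
        base_dir=results_path.name,
    )
```

Run from the repository root, archive_results() writes results.zip next to the results directory.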
To submit an inquiry, please follow these steps:
- Visit our page
- Click the "Inquiry" button on the bottom.
- Fill out the form completely and accurately.
It may take a few business days for us to reply.
This implementation was created with reference to the source code for WorkArena, developed by ServiceNow Research.
- github: https://github.com/ServiceNow/WorkArena
- arxiv:
If a proxy authentication dialog blocks startup when the browser launches, install the Chrome extension "Proxy Helper". Then enter your PAC URL and your proxy account and password.