Description
Thank you for submitting an issue to the diego-release repository. We appreciate the feedback. To help us address your issue, please fill out the sections in the following template to the best of your ability:
Summary
In our very busy production environment we have seen the BBS crash and fail over. Once the failover occurs, the new BBS sometimes takes a while to recover and reach a performant state.
Expected Result
When the BBS is running there should be no unexpected crashes and failovers.
Actual Result
The BBS crashes :)
Context
This is the IBM Public Cloud us-south deployment. We have about 780 Cells. This is diego-release 2.25.0 with cf-deployment 6.8.0.
This is on SoftLayer. This is a very large 16-core VM. We are using Postgres as the backend and have 400 DB connections (min and max). We cannot change this due to limitations in the Postgres deployment.
The file descriptor limit is 100000, as set in bpm.yml.
Steps to Reproduce
Unable to reproduce in smaller envs; we suspect this is due to load or particular errors in some queries.
Possible Causes or Fixes (optional)
Additional Text Output or Screenshots (optional)
I will be attaching logs (or sending them via Slack, as they are very large) in the next day or two.
This was already discussed here ...
https://cloudfoundry.slack.com/archives/C02FM2BPE/p1556639064033100
There are some initial goroutine dumps there (small parts), but I will try to get the rest shortly.