BBS Crashing and Failing over #428

@andrew-edgar

Thank you for submitting an issue to the diego-release repository. We appreciate the feedback. To help us address your issue, please fill out the sections in the following template to the best of your ability:

Summary

In our very busy production environment, we have seen the BBS crash and fail over. Once the failover occurs, the new BBS sometimes takes a while to recover and reach a performant state.

Expected Result

When the BBS is running there should be no unexpected crashes and failovers.

Actual Result

The BBS crashes :)

Context

This is the IBM Public Cloud us-south deployment. We have about 780 Cells. It runs Diego 2.25.0 with CF-Deployment 6.8.0.

This is on SoftLayer, on a very large 16-core VM. We are using Postgres as the backend and have 400 DB connections (min and max). We cannot change this due to limitations in the Postgres deployment.
The file descriptor limit is 100000, as set in bpm.yml.
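
For anyone wanting to compare the configured limit with what a process actually uses, here is a minimal sketch, assuming a Linux host and permission to read the target process's `/proc` entries. The helper name `fd_usage` is hypothetical and not part of diego-release or bpm; it only reads standard `/proc` files.

```python
import os


def fd_usage(pid="self"):
    """Return (open_fds, soft_limit) for a process on Linux.

    Reads /proc/<pid>/fd and /proc/<pid>/limits. soft_limit is None
    when the limit is reported as "unlimited" or cannot be found.
    """
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))

    soft_limit = None
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Fields after the three-word name: soft limit, hard limit, units.
                value = line.split()[3]
                soft_limit = None if value == "unlimited" else int(value)
                break
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = fd_usage()
    print(f"open descriptors: {used}, soft limit: {limit}")
```

Passing the PID of the BBS process instead of `"self"` and sampling this periodically would show how close the process gets to the 100000 limit under load. Whether descriptor usage is related to the crashes is not established here.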

Steps to Reproduce

We are unable to reproduce this in smaller environments, so it is presumably due to load or to particular errors in some queries.

Possible Causes or Fixes (optional)

Additional Text Output or Screenshots (optional)

I will be attaching logs (or sending them via Slack, as they are very large) in the next day or two.

This was already discussed here:

https://cloudfoundry.slack.com/archives/C02FM2BPE/p1556639064033100

There are some initial goroutine dumps there (small parts), but I will try to get the rest shortly.
