After recent changes in bbs selection process, passive nodes randomly causing errors

## Issue

Recently moved to cf-deployment 23.5.0 with an ops-file to bump to capi-release 1.142.0. We found that some of our jobs which test regular app pushes were reporting errors that we haven't seen before.

Things like 

```
Staging app and tracing logs...

Stager is unavailable: ipaddr value changed, but session already started

FAILED
```

and 

```
Waiting for app to start...

Runner is unavailable: Process Guid: f4c41835-eaa0-44c9-9c96-026ae08e295e-7ff13535-d702-4997-b0c4-4a063c426a1a: ipaddr value changed, but session already started

FAILED
```


## Context

We think it's related to https://github.com/cloudfoundry/cloud_controller_ng/pull/3048

We looked into the cloud_controller_ng logs and inserted more logging. When the request goes is made to diego-api (BBS), occasionally it now goes to the passive BBS node which seems to cause these errors.

The passive BBS node looks like it's rejecting the connection. Here's the cloudcontroller logs for the request when it attempts to go to the passive node.

```
ERROR: #<Errno::ECONNREFUSED: Failed to open TCP connection to passive_node_ip:8889 (Connection refused - connect(2) for passive_node_ip:8889)>
```

## Steps to Reproduce

Push apps frequently and review the errors. you'll need to be using capi-release 1.142.0, bosh v276.1.1 and bosh-dns using the `/features/use_dns_addresses` flag set to true. Failure rate is v intermittent. You could have 20 successful pushes before the errors start coming in.

## Expected result

Successful pushes 100% of the time (I know I expect a lot)

## Current result

Staging failures during app pushes.

```
Staging app and tracing logs...

Stager is unavailable: ipaddr value changed, but session already started

FAILED
```

and 

```
Waiting for app to start...

Runner is unavailable: Process Guid: f4c41835-eaa0-44c9-9c96-026ae08e295e-7ff13535-d702-4997-b0c4-4a063c426a1a: ipaddr value changed, but session already started

FAILED
```

## Possible Fix

I managed to workaround this different behaviour by adding a `.sort` to https://github.com/cloudfoundry/cloud_controller_ng/blob/main/lib/diego/client.rb#L143 making it read `ips_remaining.sort.first`

This may just be working because my primary happens to be the lower IP so it's probably not a good fix.

Not sure if it's related but it looks like it's relying on Bosh DNS (bbs.service.cf.internal:8889) to report the active diego-api (BBS) node as the `.first` entry. This isn't always the case. Bosh DNS just gives a list of possibles, not necessarily active nodes and not necessarily in the "right" order.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

After recent changes in bbs selection process, passive nodes randomly causing errors #3109

Issue

Context

Steps to Reproduce

Expected result

Current result

Possible Fix

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

After recent changes in bbs selection process, passive nodes randomly causing errors #3109

Description

Issue

Context

Steps to Reproduce

Expected result

Current result

Possible Fix

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions