@GabrielBrascher
Member

Description

get_bridge_physdev(brname) in security_group.py returns the bridge device as brname: (with a trailing colon) instead of the expected brname.

We experienced this issue on CloudStack 4.13.1.0 with Security Groups enabled for Advanced Networking. Additionally, KVM nodes are running on Ubuntu 18.04.

PR #4303 (merged in 4.15) added support for Ubuntu 20.04 which turned out to fix get_bridge_physdev(brname); however, we faced the very same issue with the previous CloudStack and Ubuntu versions.

Even though we might not get a 4.13.2, and CloudStack 4.14.1.0 has just been released, this PR proposes merging the fix into branch 4.13 and then forwarding it into 4.14. This gives users a reference for the fix and leaves open the possibility of addressing it in a potential 4.14.2.0.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

We manually changed the security_group.py script on our KVM nodes. Everything has been working fine since applying this change.

@weizhouapache
Member

@GabrielBrascher
Should the target be 4.13?

@GabrielBrascher GabrielBrascher changed the base branch from master to 4.13 March 2, 2021 14:50
@GabrielBrascher
Member Author

Thanks for the ping @weizhouapache, updated base branch.

@wido wido requested a review from rohityadavcloud March 3, 2021 09:31
@wido
Contributor

wido commented Mar 3, 2021

@rhtyd I would like to stress that this issue also occurs on Ubuntu 18.04 systems and not only on 20.04 as @weizhouapache mentioned in the original commit.

Two Ubuntu 18.04 systems running with VXLAN:

root@hv-138-e14-33:~# bridge -o link show | grep vxlan500
28: vxlan500 state UNKNOWN :  mtu 1500 master brvx-500 state forwarding priority 32 cost 100 
root@hv-138-e14-25:~# bridge -o link show | grep vxlan500
9: vxlan500:  mtu 1500 master brvx-500 state forwarding priority 32 cost 100 
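
The two output formats above differ only in where the colon after the device name ends up. As a minimal sketch (not the change merged in this PR), a parser could tolerate both forms as shown below. The helper name parse_physdev is hypothetical, and the handling of an "@parent" suffix is an assumption based on how iproute2 usually prints VLAN sub-interfaces.

```python
# Sample lines copied from the two Ubuntu 18.04 hosts above.
SAMPLE_LINES = [
    "28: vxlan500 state UNKNOWN :  mtu 1500 master brvx-500 state forwarding priority 32 cost 100",
    "9: vxlan500:  mtu 1500 master brvx-500 state forwarding priority 32 cost 100",
]


def parse_physdev(line, brname):
    """Hypothetical helper: return the device enslaved to brname, or None.

    Works on a single line of `bridge -o link show` output.
    """
    fields = line.split()
    if len(fields) < 2:
        return None
    # Find the bridge this device is attached to ("master <bridge>").
    try:
        master = fields[fields.index("master") + 1]
    except (ValueError, IndexError):
        return None
    if master != brname:
        return None
    # The second field is the device name; depending on the host it may
    # carry a trailing colon ("vxlan500:"). Assumption: VLAN devices may
    # also appear as "eth1.50@eth1:", so drop anything after "@".
    return fields[1].split("@")[0].rstrip(":")


if __name__ == "__main__":
    for sample in SAMPLE_LINES:
        print(parse_physdev(sample, "brvx-500"))
```

Both sample lines yield the same device name, so a caller that builds firewall rules from the result would see no difference between the two hosts.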

I checked their kernels:

Linux hv-138-e14-33.ams02.cldin.net 5.4.0-60-generic
Linux hv-138-e14-25.ams02.cldin.net 5.4.0-66-generic

Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.5 LTS
Release:	18.04
Codename:	bionic

This causes instances to lose their network connectivity after an initial boot or a migration.

A very small fix which prevents major issues for (end-)users.

I would urge merging this into 4.13 and then 4.14 so that we can release 4.14.2 with this fix included.

@rohityadavcloud
Member

Thanks for submitting, @GabrielBrascher @wido, and for sharing the circumstances under which this happens. Is this specific to VXLAN, or does it also affect VLAN (that is, is it agnostic of the isolation method)?
@blueorangutan package

@rohityadavcloud
Member

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2859

@wido
Contributor

wido commented Mar 3, 2021

> Thanks for submitting @GabrielBrascher @wido and sharing under which circumstances this happens. Is this a case specific to VXLAN or VLAN as well? (agnostic of isolation method)
> @blueorangutan package

I think it's agnostic of the isolation method, as these are just Linux bridges.

Deeper inside the kernel there is knowledge of whether this is VXLAN or not, but the bridge utility just outputs information from the Linux bridges.

@blueorangutan

@wido a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2861

@rohityadavcloud
Member

@blueorangutan test

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3645)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33630 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4740-t3645-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 75 look OK, 2 have error(s)
Only failed tests results shown below:

Test                                       Result   Time (s)  Test File
test_02_vpc_privategw_static_routes        Failure  168.64    test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup  Failure  165.56    test_privategw_acl.py
test_04_rvpc_privategw_static_routes       Failure  220.48    test_privategw_acl.py
test_01_migrate_VM_and_root_volume         Error    52.84     test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks     Error    46.88     test_vm_life_cycle.py

@rohityadavcloud rohityadavcloud merged commit 6e7516c into apache:4.13 Mar 4, 2021
@rohityadavcloud
Member

The test failures seem unrelated to the change, which only affects security groups (SG).

nlgordon pushed a commit to ippathways/cloudstack that referenced this pull request Aug 2, 2022
…t "device" (apache#4740)