Recently we migrated thousands of servers to multiple 4 vCPU ESGs. After cutting the gateways over from legacy physical devices, we started to get 100% CPU usage alerts (ahhh!!). When SSHing to the edge, though, the console reported only 2-3% CPU usage (ehh???). After lots of head scratching and talking to GSS, they confirmed this is a known issue with ESGs. Long story short, esxtop shows the actual usage; see KB2137685.

So for those of you fortunate enough to have experienced this issue, freaked out, and then realised that vCenter and vROps are misreporting CPU usage for the edges, this article is for you. I wanted to track the CPU usage over time and possibly alert on “real” high usage, so I built a super metric for vROps and applied it only to our ESGs.

A breakdown: after comparing the individual per-core MHz usage values in vROps with the values in esxtop, I determined that the per-core values in vROps were accurate, so I could use them to calculate the average CPU usage % observed in esxtop for the VM, on ESGs up to 4 vCPUs. NOTE that on an X-Large Edge, the last two vCPUs are reserved for encryption, load balancing and management functions, meaning their % used may be very low compared to the other 4 vCPUs. So the first super metric I provide for an X-Large ESG only considers the CPU load on vCPU 0-3, and the second includes all 6 vCPUs. Pick one or use them both.
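To make the arithmetic concrete, here is a rough Python sketch of what the super metrics compute: sum the per-core MHz, divide by the provisioned capacity, multiply by 100. The function name, the `datapath_only` flag and the sample numbers are made up for illustration; only the metric names in the comments come from vROps.

```python
def esg_cpu_percent(core_mhz, capacity_mhz, datapath_only=True):
    """Approximate ESG CPU usage % from per-core MHz values.

    core_mhz      -- one cpu:N|usagemhz_average value per vCPU
    capacity_mhz  -- cpu|vm_capacity_provisioned (total MHz across all vCPUs)
    datapath_only -- hypothetical flag: on a 6 vCPU edge, count only vCPU 0-3
    """
    cores = len(core_mhz)
    if cores not in (1, 2, 4, 6):
        # ESGs only come with 1, 2, 4 or 6 vCPUs; anything else yields 0,
        # mirroring the final ": 0" branch of the super metric.
        return 0.0
    if cores == 6 and datapath_only:
        used = sum(core_mhz[:4])
        # Scale the capacity down to the share belonging to 4 of the 6 vCPUs.
        capacity = capacity_mhz / cores * 4
    else:
        used = sum(core_mhz)
        capacity = capacity_mhz
    return used / capacity * 100


if __name__ == "__main__":
    # Illustrative numbers only: 6 vCPUs at 2000 MHz each, first four busy.
    sample = [1800, 1800, 1800, 1800, 100, 100]
    print(round(esg_cpu_percent(sample, 12000), 1))
    print(round(esg_cpu_percent(sample, 12000, datapath_only=False), 1))
```

With these sample values the first-four-vCPU view lands at about 90%, while averaging over all six drops to roughly 62%, which is why the two X-Large variants can tell quite different stories.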

Here is the ESG super metric for vROps. After creating it, don’t forget to assign it to the “Virtual Machine” resource kind and then enable it in your ESG-specific policy.

Super Metric, X-LARGE only first 4 vCPUs

${this, metric=cpu|corecount_provisioned} == 1 ? (SUM(${this, metric=cpu:0|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 2 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 4 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average}+${this, metric=cpu:2|usagemhz_average}+${this, metric=cpu:3|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 6 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average}+${this, metric=cpu:2|usagemhz_average}+${this, metric=cpu:3|usagemhz_average})/(${this,metric=cpu|vm_capacity_provisioned}/${this, metric=cpu|corecount_provisioned}*4)*100) : 0)))

Super Metric, X-LARGE all 6 vCPUs

${this, metric=cpu|corecount_provisioned} == 1 ? (SUM(${this, metric=cpu:0|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 2 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 4 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average}+${this, metric=cpu:2|usagemhz_average}+${this, metric=cpu:3|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 6 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average}+${this, metric=cpu:2|usagemhz_average}+${this, metric=cpu:3|usagemhz_average}+${this, metric=cpu:4|usagemhz_average}+${this, metric=cpu:5|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : 0)))

And below is a generic super metric to calculate CPU usage from the per-core values for VMs with up to 8 vCPUs. Don’t use this one for the ESG, though, as the ESG only has 1, 2, 4 or 6 vCPUs.

${this, metric=cpu|corecount_provisioned} == 1 ? (SUM(${this, metric=cpu:0|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 2 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 3 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average}+${this, metric=cpu:2|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 4 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average}+${this, metric=cpu:2|usagemhz_average}+${this, metric=cpu:3|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 5 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average}+${this, metric=cpu:2|usagemhz_average}+${this, metric=cpu:3|usagemhz_average}+${this, metric=cpu:4|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 6 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average}+${this, metric=cpu:2|usagemhz_average}+${this, metric=cpu:3|usagemhz_average}+${this, metric=cpu:4|usagemhz_average}+${this, metric=cpu:5|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 7 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average}+${this, metric=cpu:2|usagemhz_average}+${this, metric=cpu:3|usagemhz_average}+${this, metric=cpu:4|usagemhz_average}+${this, metric=cpu:5|usagemhz_average}+${this, metric=cpu:6|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : (${this, metric=cpu|corecount_provisioned} == 8 ? (SUM(${this, metric=cpu:0|usagemhz_average}+${this, metric=cpu:1|usagemhz_average}+${this, metric=cpu:2|usagemhz_average}+${this, metric=cpu:3|usagemhz_average}+${this, metric=cpu:4|usagemhz_average}+${this, metric=cpu:5|usagemhz_average}+${this, metric=cpu:6|usagemhz_average}+${this, metric=cpu:7|usagemhz_average})/${this,metric=cpu|vm_capacity_provisioned}*100) : 0)))))))
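Typing that nested expression by hand is error prone, so here is a small Python sketch that generates the same pattern for any maximum vCPU count. The helper names `core_sum` and `build_super_metric` are my own invention, and the script only reproduces the text layout of the generic formula above; I have not tested whether vROps accepts it beyond the 8 vCPU version shown.

```python
COUNT = "${this, metric=cpu|corecount_provisioned}"
CAPACITY = "${this,metric=cpu|vm_capacity_provisioned}"


def core_sum(n):
    """Return the per-core usage terms for vCPU 0..n-1 joined with '+'."""
    return "+".join(
        "${this, metric=cpu:" + str(i) + "|usagemhz_average}" for i in range(n)
    )


def build_super_metric(max_vcpus):
    """Build the nested ternary expression, innermost branch first."""
    expr = "0"  # fallback when the core count matches no branch
    for n in range(max_vcpus, 0, -1):
        branch = (
            COUNT + " == " + str(n) + " ? "
            + "(SUM(" + core_sum(n) + ")/" + CAPACITY + "*100) : "
            + expr
        )
        # Every branch except the outermost one is wrapped in parentheses.
        expr = branch if n == 1 else "(" + branch + ")"
    return expr


if __name__ == "__main__":
    # Print the expression for VMs with up to 8 vCPUs.
    print(build_super_metric(8))
```

Paste the printed output into the super metric editor and validate it there before trusting it on real VMs.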

If you have any personal experience with this NSX Edge CPU usage issue, or general feedback on the super metric, your comments would be very welcome!

Hope you found this helpful.

vMan