July 9, 2018
Mystery of Health Score – Performance Monitoring Toolset
The mystery of calculating Health Score in Performance Monitoring Toolset
Performance Monitoring Toolset (PMT) measures how your system is performing. On various pages of our brand new toolset, you can find health scores for a node, application, cluster, or group. These health scores reflect the state of the various components in your ColdFusion setup.
This post explains how the health scores are calculated and how to configure them.
Node Health Score
The node Health Score depends on 4 parameters: ART (average response time), error rate, CPU usage, and memory usage. We calculate a score for each of the 4 parameters as shown below.
  • ART Score: To calculate the ART score, the ART of the last 5 minutes is calculated and compared against the baseline ART provided by you.
  • Let A_baseline be the baseline ART provided by you, let A_actual be the ART of the last 5 minutes, and let diff = (2*A_baseline - A_baseline) / 5, which is A_baseline / 5. Then,
    • ART Score = 100 if A_actual <= A_baseline
    • ART Score = 83 if A_baseline < A_actual < A_baseline + diff
    • ART Score = 66 if A_baseline + diff <= A_actual < A_baseline + 2*diff
    • ART Score = 50 if A_baseline + 2*diff <= A_actual < A_baseline + 3*diff
    • ART Score = 33 if A_baseline + 3*diff <= A_actual < A_baseline + 4*diff
    • ART Score = 16 if A_baseline + 4*diff <= A_actual < A_baseline + 5*diff
    • ART Score = 0 if A_actual >= A_baseline + 5*diff
  • Error Score: To calculate the error score, the error percentage of the last 5 minutes is calculated and compared against the baseline error percentage provided by you.
  • Let E_baseline be the error baseline provided by you, let E_actual be the error percentage of the last 5 minutes, and let diff = (Min(5*E_baseline, 100) - E_baseline) / 5. Then,
    • Error Score = 100 if E_actual <= E_baseline
    • Error Score = 83 if E_baseline < E_actual < E_baseline + diff
    • Error Score = 66 if E_baseline + diff <= E_actual < E_baseline + 2*diff
    • Error Score = 50 if E_baseline + 2*diff <= E_actual < E_baseline + 3*diff
    • Error Score = 33 if E_baseline + 3*diff <= E_actual < E_baseline + 4*diff
    • Error Score = 16 if E_baseline + 4*diff <= E_actual < E_baseline + 5*diff
    • Error Score = 0 if E_actual >= E_baseline + 5*diff
  • CPU Score: To calculate the CPU score, the CPU usage percentage of the last 5 minutes is calculated and compared against the baseline CPU usage percentage provided by you.
  • Let C_baseline be the CPU usage baseline provided by you, let C_actual be the CPU usage percentage of the last 5 minutes, and let diff = (Min(5*C_baseline, 100) - C_baseline) / 5. Then,
    • CPU Score = 100 if C_actual <= C_baseline
    • CPU Score = 83 if C_baseline < C_actual < C_baseline + diff
    • CPU Score = 66 if C_baseline + diff <= C_actual < C_baseline + 2*diff
    • CPU Score = 50 if C_baseline + 2*diff <= C_actual < C_baseline + 3*diff
    • CPU Score = 33 if C_baseline + 3*diff <= C_actual < C_baseline + 4*diff
    • CPU Score = 16 if C_baseline + 4*diff <= C_actual < C_baseline + 5*diff
    • CPU Score = 0 if C_actual >= C_baseline + 5*diff
  • Memory Score: To calculate the memory score, the heap usage percentage of the last 5 minutes is calculated and compared against the baseline heap usage percentage provided by you.
  • Let M_baseline be the heap usage baseline provided by you, let M_actual be the heap usage percentage of the last 5 minutes, and let diff = (Min(5*M_baseline, 100) - M_baseline) / 5. Then,
    • Memory Score = 100 if M_actual <= M_baseline
    • Memory Score = 83 if M_baseline < M_actual < M_baseline + diff
    • Memory Score = 66 if M_baseline + diff <= M_actual < M_baseline + 2*diff
    • Memory Score = 50 if M_baseline + 2*diff <= M_actual < M_baseline + 3*diff
    • Memory Score = 33 if M_baseline + 3*diff <= M_actual < M_baseline + 4*diff
    • Memory Score = 16 if M_baseline + 4*diff <= M_actual < M_baseline + 5*diff
    • Memory Score = 0 if M_actual >= M_baseline + 5*diff
  • Node Health Score = (ART Score * ART Weightage + Error Score * Error Weightage + CPU Score * CPU Weightage + Memory Score * Memory Weightage) / (ART Weightage + Error Weightage + CPU Weightage + Memory Weightage)

Note that the error score has veto power. This means that if the error score is zero, the entire health score becomes zero, regardless of the other three scores.
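The rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in this post, not the PMT's actual code: the function names are made up, the baselines (5000 ms, 10, 60, 60) are the default values a commenter reports below, the measured values are invented, and scoring a value that sits exactly at baseline + 5*diff as 0 is one reading of the tiers.

```python
# Scores for the five equal-width bands above the baseline, best to worst.
BAND_SCORES = (83, 66, 50, 33, 16)


def band_score(actual, baseline, ceiling):
    """Return 100, 83, 66, 50, 33, 16 or 0 for a measured value."""
    if actual <= baseline:
        return 100
    diff = (ceiling - baseline) / 5
    for k, score in enumerate(BAND_SCORES, start=1):
        if actual < baseline + k * diff:
            return score
    # Assumption: a value exactly at baseline + 5*diff also scores 0.
    return 0


def art_score(actual_ms, baseline_ms):
    # For ART the bands span baseline .. 2*baseline.
    return band_score(actual_ms, baseline_ms, 2 * baseline_ms)


def percent_score(actual_pct, baseline_pct):
    # For error, CPU and memory the bands span baseline .. min(5*baseline, 100).
    return band_score(actual_pct, baseline_pct, min(5 * baseline_pct, 100))


def node_health(scores, weights):
    """Weighted mean of the four scores, with the error-score veto."""
    if scores["error"] == 0:
        return 0.0
    total = sum(weights[name] for name in scores)
    if total == 0:
        return 0.0  # assumption: the post does not cover all-zero weightages
    return sum(scores[name] * weights[name] for name in scores) / total


scores = {
    "art": art_score(6500, 5000),     # 66
    "error": percent_score(2, 10),    # 100
    "cpu": percent_score(70, 60),     # 66
    "memory": percent_score(40, 60),  # 100
}
weights = {"art": 1, "error": 1, "cpu": 1, "memory": 1}
print(node_health(scores, weights))  # 83.0
```

With equal weightages, a single parameter scoring 0 (other than the error score) costs exactly 25 points, because each of the four scores contributes a quarter of the total.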

Application Health Score
The application Health Score depends on 2 parameters: ART (average response time) and error rate. We calculate a score for each of the 2 parameters as shown below.
  • ART Score: To calculate the ART score, the ART of the last 5 minutes is calculated and compared against the baseline ART provided by you.
  • Let A_baseline be the ART baseline provided by you, let A_actual be the ART of the last 5 minutes, and let diff = (2*A_baseline - A_baseline) / 5, which is A_baseline / 5. Then,
    • ART Score = 100 if A_actual <= A_baseline
    • ART Score = 83 if A_baseline < A_actual < A_baseline + diff
    • ART Score = 66 if A_baseline + diff <= A_actual < A_baseline + 2*diff
    • ART Score = 50 if A_baseline + 2*diff <= A_actual < A_baseline + 3*diff
    • ART Score = 33 if A_baseline + 3*diff <= A_actual < A_baseline + 4*diff
    • ART Score = 16 if A_baseline + 4*diff <= A_actual < A_baseline + 5*diff
    • ART Score = 0 if A_actual >= A_baseline + 5*diff
  • Error Score: To calculate the error score, the error percentage of the last 5 minutes is calculated and compared against the baseline error percentage provided by you.
  • Let E_baseline be the error baseline provided by you, let E_actual be the error percentage of the last 5 minutes, and let diff = (Min(5*E_baseline, 100) - E_baseline) / 5. Then,
    • Error Score = 100 if E_actual <= E_baseline
    • Error Score = 83 if E_baseline < E_actual < E_baseline + diff
    • Error Score = 66 if E_baseline + diff <= E_actual < E_baseline + 2*diff
    • Error Score = 50 if E_baseline + 2*diff <= E_actual < E_baseline + 3*diff
    • Error Score = 33 if E_baseline + 3*diff <= E_actual < E_baseline + 4*diff
    • Error Score = 16 if E_baseline + 4*diff <= E_actual < E_baseline + 5*diff
    • Error Score = 0 if E_actual >= E_baseline + 5*diff
  • Application Health Score = (ART Score * ART Weightage + Error Score * Error Weightage) / (ART Weightage + Error Weightage)

Please note that the error score has veto power. This means that if the error score is zero, the entire health score becomes zero.
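As a worked example with made-up numbers: with an ART baseline of 5000 ms and a measured ART of 6500 ms, diff is 1000, so the measurement falls between baseline + diff and baseline + 2*diff and the ART Score is 66. With an error baseline of 10% and a measured error rate of 12%, diff is (Min(50, 100) - 10) / 5 = 8, so the measurement falls below baseline + diff and the Error Score is 83. The short Python sketch below (the function name is hypothetical, and the handling of all-zero weightages is an assumption) shows how the weightages and the veto then combine those two scores.

```python
def application_health(art_score, error_score, art_weightage=1, error_weightage=1):
    """Weighted mean of the ART and error scores, with the error-score veto."""
    if error_score == 0:
        return 0.0  # veto: a zero error score zeroes the whole health score
    total = art_weightage + error_weightage
    if total == 0:
        return 0.0  # assumption: the post does not cover all-zero weightages
    return (art_score * art_weightage + error_score * error_weightage) / total


print(application_health(66, 83))        # 74.5  (equal weightages)
print(application_health(66, 83, 1, 3))  # 78.75 (error counts three times as much)
print(application_health(100, 0))        # 0.0   (veto, despite a perfect ART score)
```

Raising a parameter's weightage pulls the health score toward that parameter's score; it does not change the score of the parameter itself.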

 
Cluster Health Score: The cluster health score is the simple mean of the health scores of all nodes in the cluster.
Group Health Score: The group health score is the simple mean of the health scores of all nodes in the group.
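A minimal sketch of that averaging, assuming the node health scores are already available as a list of numbers (the function name is hypothetical, and returning None for an empty cluster or group is an assumption, since the post does not cover that case):

```python
from statistics import mean


def mean_health(node_scores):
    """Simple (unweighted) mean of the node health scores."""
    scores = list(node_scores)
    if not scores:
        return None  # assumption: no nodes means no score
    return mean(scores)


print(mean_health([100, 75, 50]))            # 75
print(round(mean_health([100, 100, 0]), 2))  # 66.67
```

Going by the description above, a node whose score was vetoed to zero lowers the cluster or group mean but does not zero it.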
Setting baselines for different parameters: In the Health Score and Baseline section of the settings, an admin can set the baseline and weightage parameters used to calculate the various health scores.
For node: [image]
For application: [image]
5 Comments
2021-11-06 04:03:22

Geoff, I did some testing (on a machine with no request activity at all), and I think I’ve found the bug. At least from my testing, it’s in the CPU baseline (on the server page). To see if this is it for you, there’s one thing you can easily do, and then I think I see what the mistake is in the PMT scoring process.

First, if you set the CPU “weightage” value to 0, do you see the score change from 75% to 100%? I do. BTW, I found that the score change took effect immediately (on the next refresh, which seems to be every 10 seconds). That confirms for me that it IS the CPU score that’s causing the drop of 25% (and as there are 4 measures, if they all have the same weight, that’s why one of them having an issue drops the score by 25%).

Second, as for the real bug, the CPU baseline is clearly indicated in a % value (with a default of 60), and you would think (and the blog post above here says) that the actual CPU use of CF would have to be above that 60% to degrade the score…but like you I can attest that throughout my testing of things my CPU use (in CF and even in total) was less than 10% the entire time of my testing. So clearly something’s not right with that scoring.

So I set the weight back to 1 (as all the other 3 were) and I made the CPU “%” value first 9, then 99, then 999, and it wasn’t until I made it 9999 that the score became 100%. So what can we make of that? Well, my guess is that they are mistakenly dividing the CPU value by 1000. (I can imagine a possible copy/paste coding error if they may have for some reason done the same thing with the ART value above it, which of course is in ms.) And for now, to correct for that, we need to make our % value 1000x larger than we’d think.

So anyone else wanting to try this: look at your avg CPU for CF (as reported in the OS or in the PMT’s CF Server>System page and its CPU graph). If yours might be typically 10% at the time you try this, then make it more like 10000…or even 11000 or above so that yours is BELOW that, once it’s divided by 1000.

And if you really feel you have 0% CPU use, Geoff, you might be tempted to try 999 (since 999/1000 would be less than 1), but if your CPU really creeps up to the 2-3% range, then it would always be above that and so contribute negatively to the score. So just try 9999, which would cover up to 10% CPU use, by my assertions above.

Let us know what you experience. (And if anyone from Adobe may see this before you do, perhaps they could look at the underlying code of the PMT to see if they spot this mistake in it.)

Finally, while we’re on the subject of the health score and these configuration values, I am finding that if I restart the PMT service (not CF or the PMT Datastore, but the PMT service itself), the values I put into this health score configuration screen are all reset to their defaults (values of 5000, 10, 60, and 60, for ART, Error, CPU, and Memory, respectively), all with a “weightage” of 1. That seems to be another bug.

If you might confirm that, Geoff (or anyone else), I could then file a bug report at tracker.adobe.com, for both matters. HTH.

2021-11-09 00:07:42, in reply to Charlie Arehart's comment:

I can confirm that restarting the PMT service does indeed reset the existing parameters to their default settings. I can also confirm that setting the CPU baseline to a number 100 times larger seems to correct the health score arithmetic.

2022-02-01 22:21:42, in reply to Charlie Arehart's comment:

Hi Charlie, did you ever file any bug reports for these issues? I can confirm that the CPU 100(0) x baseline workaround is still necessary in PMT 2021.0.03.329792. The baseline values getting reset after a restart doesn’t appear to happen in this version, however.

2021-11-05 21:29:14

Given the general complexity of this, maybe an example of a typical setting would help.

I have tried a variety of settings and never get a good health score. This is on a Saturday after a service restart with no one on the site, so I figured at this point the health score should be 100 or close to it. I have no errors, the CPU is at zero, and the memory is at 4 GB out of 16 GB. I really think something is wrong with the scoring. Maybe in the PMT a popup window could break down which of the weighted scores is having the biggest effect.

2021-11-06 04:06:53, in reply to gbarth's comment:

Sorry, I meant to offer my note above as a “reply” to your comment, Geoff, but instead I caused it to get posted as a new comment. See above (saying this as much for future readers who may find this post).
