
Default policies

Default policies are created by Netuitive to provide recommended ways to monitor the behavior of the elements in your environment. Default policies can be found on the Policies page and are marked as Netuitive in the Created By column. You can edit default policies as needed to suit the behavior of your environment. When new default policies are provisioned to your account, Netuitive will not overwrite any changes you have made to existing default policies. Furthermore, any new default policies added to your account are disabled by default.

Before reading about default policies, you should first understand the concepts of scope, conditions, duration, notifications, and event categories.

AWS

ASG

Table 1-4 below describes the default policies for AWS Auto Scaling Group (ASG) elements.

Policy name Duration Conditions Category Description
AWS ASG - Elevated CPU Activity (Normal Network Activity) 30 minutes
  1. aws.ec2.cpuutilization...
    1. ... has an
    2. ... has an
  2. AND

  3. netuitive.aws.ec2.bytesinpersec...
    1. ... does NOT have a deviation
    2. ... does NOT have a deviation
  4. AND

  5. netuitive.aws.ec2.bytesoutpersec...
    1. ... does NOT have a deviation
    2. ... does NOT have a deviation
INFO

Increases in CPU activity are not uncommon when there is a rise in network activity: increased traffic to a server means more work for that server to do. This policy is designed to catch cases where CPU activity is higher than normal and cannot be explained by a corresponding increase in network traffic. It may or may not represent a problem, but it is useful to know about.

Note   This policy is the same as the corresponding EC2 policy, but it operates on the average CPU and network utilization across all EC2 instances in the ASG.
AWS ASG - Elevated Network Activity 30 minutes
  1. netuitive.aws.ec2.bytesinpersec...
    1. ... has an
    2. ... has an
  2. AND

  3. netuitive.aws.ec2.bytesoutpersec...
    1. ... has an
    2. ... has an
INFO

Indicates an increase in network activity above what is considered to be normal.

Note   This policy is the same as the corresponding EC2 policy, but it operates on the average network utilization across all EC2 instances in the ASG.
AWS ASG - Elevated Ephemeral Disk Activity 30 minutes
  1. netuitive.aws.ec2.diskreadopspersec...
    1. ... has an
    2. ... has an
  2. AND

  3. netuitive.aws.ec2.diskwriteopspersec...
    1. ... has an
    2. ... has an
INFO

Indicates an increase in disk activity above what is considered to be normal.

Note   This policy is the same as the corresponding EC2 policy, but it operates on the average disk utilization across all EC2 instances in the ASG.

Table 1-4: ASG default policies.

DynamoDB

Policy Name Duration Conditions Category Description
AWS DynamoDB - Elevated Read Capacity Utilization 30 Minutes
  1. netuitive.aws.dynamodb.readcapacityutilization...
    1. ... has an
    2. ... has an
    3. ... has a ≥ 50
WARNING Read Capacity Utilization has been higher than expected for over 30 minutes; also, the actual value has been above 50% for that time.
AWS DynamoDB - Elevated Write Capacity Utilization 30 Minutes
  1. netuitive.aws.dynamodb.writecapacityutilization...
    1. ... has an
    2. ... has an
    3. ... has a ≥ 50
WARNING Write Capacity Utilization has been higher than expected for over 30 minutes; also, the actual value has been above 50% for that time.

EBS

Before reading about the EBS default policy, it is important to understand the following Netuitive computed metrics. For more information about computed metrics, see Computed metrics.

  • Average Latency: the average amount of time it takes for a disk operation to complete.
  • Queue Length Differential: the difference between the actual disk queue length and the "ideal" disk queue length. The ideal queue length is based on Amazon's rule of thumb that for every 200 IOPS you should have a queue length of 1. In theory, a well-optimized volume should have a queue length differential that hovers around 0. In practice, we have seen volumes with extremely low latency (< 0.0001) have queue length differentials higher than 0, presumably because the latency is much lower than Amazon assumes for its rule of thumb. Even in these cases, the differential is a fairly steady number.
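The two computed metrics described above can be sketched as simple functions. This is an illustrative derivation based on the definitions in the text, not Netuitive's actual implementation; the function names are hypothetical.

```python
# Illustrative sketches of the two EBS computed metrics described above.
# The 200-IOPS rule of thumb comes from the text; function names are
# hypothetical.

def average_latency(total_read_time, total_write_time, read_ops, write_ops):
    """Average seconds per disk operation over a sample interval."""
    total_ops = read_ops + write_ops
    if total_ops == 0:
        return 0.0
    return (total_read_time + total_write_time) / total_ops

def queue_length_differential(actual_queue_length, iops):
    """Actual queue length minus the 'ideal' queue length.

    Amazon's rule of thumb: 1 unit of queue length per 200 IOPS,
    so ideal = iops / 200. A well-optimized volume should hover
    near 0; sustained positive values suggest the disk is getting
    more traffic than it can keep up with.
    """
    ideal = iops / 200.0
    return actual_queue_length - ideal

# A volume doing 600 IOPS with a measured queue length of 5 has an
# ideal queue length of 3 and a differential of 2.
print(queue_length_differential(5, 600))  # → 2.0
```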

Table 1-5 below describes the default policy for EBS elements.

Policy name Duration Conditions Category Description
Elevated Queue Length Differential with Elevated Latency 30 minutes
  1. netuitive.aws.ebs.queuelengthdifferential...
    1. ... has an
    2. ... has a > 1
  2. AND

  3. netuitive.aws.ebs.averagelatency...
    1. ... has an
CRITICAL

Because the Queue Length Differential tends to be steady, the first condition of the policy looks for an upper deviation as the first indication that the disk may be getting more traffic than it can keep up with.

The second condition checks that the differential is greater than 1 in order to avoid false alarms in cases where the differential is very low.

The third condition exists because an elevated queue differential by itself is not necessarily a bad thing. It means you are queueing at a higher rate than Amazon's rule of thumb suggests, but if your latency is low enough, this is acceptable. Thus, the policy only alarms if your differential is higher than normal AND your latency is higher than normal.

Table 1-5: EBS default policies.

EC2

Table 1-6 below describes the default policies for AWS EC2 elements.

Policy name Duration Conditions Category Description
AWS EC2 - Elevated CPU Activity (Normal Network Activity) 30 minutes
  1. aws.ec2.cpuutilization...
    1. ... has an
    2. ... has an
  2. AND

  3. netuitive.aws.ec2.bytesinpersec...
    1. ... does NOT have a deviation
    2. ... does NOT have a deviation
  4. AND

  5. netuitive.aws.ec2.bytesoutpersec...
    1. ... does NOT have a deviation
    2. ... does NOT have a deviation
INFO Increases in CPU activity are not uncommon when there is a rise in network activity: increased traffic to a server means more work for that server to do. This policy is designed to catch cases where CPU activity is higher than normal and cannot be explained by a corresponding increase in network traffic. It may or may not represent a problem, but it is useful to know about.
AWS EC2 - Elevated Network Activity 30 minutes
  1. netuitive.aws.ec2.bytesinpersec...
    1. ... has an
    2. ... has an
  2. AND

  3. netuitive.aws.ec2.bytesoutpersec...
    1. ... has an
    2. ... has an
INFO Indicates an increase in network activity above what is considered to be normal.
AWS EC2 - Elevated Ephemeral Disk Activity 30 minutes
  1. netuitive.aws.ec2.diskreadopspersec...
    1. ... has an
    2. ... has an
  2. AND

  3. netuitive.aws.ec2.diskwriteopspersec...
    1. ... has an
    2. ... has an
INFO Indicates an increase in disk activity above what is considered to be normal.
AWS EC2 - CPU Threshold Exceeded 15 minutes

aws.ec2.cpuutilization has a > 95%

WARNING The CPU on the EC2 instance has exceeded 95% for at least 15 minutes.

Table 1-6: EC2 default policies.
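The compound conditions in the table above combine per-metric checks with AND. As an illustrative sketch (not Netuitive's implementation; function and parameter names are hypothetical), the "Elevated CPU Activity (Normal Network Activity)" logic amounts to:

```python
# Illustrative sketch of how the "AWS EC2 - Elevated CPU Activity
# (Normal Network Activity)" policy combines its conditions. This is
# not Netuitive's implementation; the names are hypothetical.

def elevated_cpu_normal_network(cpu_deviates_up,
                                bytes_in_deviates,
                                bytes_out_deviates):
    """INFO event fires only when CPU is above its baseline while
    neither network direction shows a deviation."""
    return cpu_deviates_up and not bytes_in_deviates and not bytes_out_deviates

# CPU deviating while network is quiet fires the policy...
print(elevated_cpu_normal_network(True, False, False))   # → True
# ...but the same CPU deviation with elevated inbound traffic does not,
# because the network activity explains the extra CPU work.
print(elevated_cpu_normal_network(True, True, False))    # → False
```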

ElastiCache

Policy name Duration Conditions Category Description
AWS Elasticache Memcached - CPU Threshold Exceeded 5 minutes aws.elasticache.cpuutilization has a > 90% CRITICAL The Memcached Node has exceeded the CPU threshold of 90%. The cache cluster may need to be scaled, either by using a larger node type or by adding more nodes.
AWS Elasticache Memcached - Elevated CPU Utilization 30 minutes
  1. aws.elasticache.cpuutilization...
    1. ... has an
    2. ... has a > 50%
WARNING CPU utilization for the Memcached Node has been higher than expected for at least 30 minutes.
AWS Elasticache Memcached - Elevated Swap Usage 5 minutes aws.elasticache.swapusage has a > 53428800 CRITICAL Swap usage on the Memcached Node has exceeded 50 MB. It is recommended that you increase the value of the ConnectionOverhead parameter.
AWS ElastiCache Redis - Elevated Command Executions 30 minutes
  1. aws.elasticache\..*cmds
    1. ... has an
    2. ... has an
WARNING

One or more command types on the Redis node have been experiencing a higher than expected number of executions for at least 30 minutes.

AWS ElastiCache Redis - Elevated CPU Utilization 30 minutes
  1. aws.elasticache.cpuutilization...
    1. ... has an
    2. ... has a > 30%
WARNING CPU utilization for the Redis Node has been higher than expected for at least 30 minutes.
AWS ElastiCache Redis - Elevated Network Activity 30 minutes
  1. aws.elasticache.networkbytesin
    1. ... has an
    2. ... has an
  2. AND

  3. aws.elasticache.networkbytesout
    1. ... has an
    2. ... has an
WARNING

Network activity to/from the Redis node has been higher than expected for at least 30 minutes.

AWS Elasticache Redis - Elevated Number of New Connections 30 minutes
  1. aws.elasticache.newconnections
    1. ... has an
    2. ... has an
WARNING

The number of new connections being opened to the Redis node has been higher than expected for at least 30 minutes.

AWS Elasticache Redis - Elevated Replication Lag 30 minutes

aws.elasticache.replicationlag has an

WARNING

Replication lag for the Redis node has been higher than expected for at least 30 minutes.

AWS Elasticache Redis - Elevated Swap Usage 30 minutes

aws.elasticache.swapusage has an

WARNING

Swap usage on the Redis Node has been higher than expected for at least 30 minutes. Extended swapping indicates a low physical memory condition, and can lead to performance degradation.

AWS Elasticache Redis - Extended Period of Evictions 30 minutes

aws.elasticache.evictions has a > 0

WARNING

Evictions for the Redis node have been greater than 0 for at least 30 minutes. This could indicate a low memory condition, and may impact performance.

AWS Elasticache Redis - Low Cache Hit Rate 30 minutes
  1. aws.elasticache.cachehitrate
    1. ... has an
    2. ... has an
WARNING

The cache hit rate for the Redis node has been lower than expected for at least 30 minutes.

ELB

Table 1-7 below describes the default policies for ELB elements.

Policy name Duration Conditions Category Description
AWS ELB - Elevated Backend Error Rate (Low Volume) 15 minutes
  1. netuitive.aws.elb.httpcodebackenderrorpercent...
    1. ... has an
    2. ... has an
  2. AND

  3. netuitive.aws.elb.requestcount...
    1. ... has a < 1,000
WARNING

This is the first of three policies that look at elevated backend error rates. This policy looks specifically at low traffic volume cases. When traffic volumes are low, elevated error rates tend to be less important. For example, a 50% error rate is pretty significant if the total number of requests is 1 million; it is less so if the total number of requests is 10. Thus, this policy will generate a Warning if error rates are higher than normal and traffic volumes are low. By default, "low" is defined as less than 1,000 requests; you may wish to tune this for your own environment.

AWS ELB - Elevated Backend Error Rate (High Volume, Low Error Rate) 15 minutes
  1. netuitive.aws.elb.httpcodebackenderrorpercent...
    1. ... has an
    2. ... has an
    3. ... has a < 2%
  2. AND

  3. netuitive.aws.elb.requestcount...
    1. ... has a ≥ 1,000
WARNING

This is the second of three policies that look at elevated backend error rates. For many customers, a sufficiently low error rate is not cause for concern even if it is higher than normal. For example, if the normal error rate is between 0.25% and 0.75%, an observed error rate of 1.1% is higher than expected but may not be worth more than a Warning. Thus, this policy looks for cases where the error rate is higher than expected but under 2%. It also requires that traffic volume not be low, since the low-traffic scenario is covered by the "Elevated Backend Error Rate (Low Volume)" policy. You may wish to tune the 1,000 request count threshold, the 2% error threshold, or both, to better suit your environment.

AWS ELB - Elevated Backend Error Rate (High Volume, High Error Rate) 15 minutes
  1. netuitive.aws.elb.httpcodebackenderrorpercent...
    1. ... has an
    2. ... has an
    3. ... has a ≥ 2%
  2. AND

  3. netuitive.aws.elb.requestcount...
    1. ... has a ≥ 1,000
CRITICAL This is the third of three policies that look at elevated backend error rates. In this case, we are looking for both high traffic volumes (> 1000) as well as error rates that are not just higher than normal, but are above the 2% threshold. In those cases, a Critical event will be generated. You may wish to tune either the 1,000 request count threshold, the 2% error threshold, or both, to better suit your environment.
AWS ELB - Elevated Latency 30 minutes
  1. aws.elb.latency...
    1. ... has an
    2. ... has an
  2. AND

  3. netuitive.aws.elb.requestcount...
    1. ... has a ≥ 1,000
CRITICAL This policy will generate a Critical event when average latency is higher than normal for half an hour or longer. Note that there must also be a minimum number of requests for this policy to trigger; this is because with too few requests, the average can tend to be skewed by outliers. The default request threshold is 1,000; you may wish to tune this for your environment.
AWS ELB - Surge Queue Utilization Greater Than 5% 15 minutes

netuitive.aws.elb.surgequeueutilization has a > 5%

WARNING The ELB surge queue holds requests until they can be forwarded to the backend servers. The surge queue can hold a maximum of 1,024 requests, after which it will be full and will start rejecting requests. Netuitive's Surge Queue Utilization metric reflects as a percentage how full the surge queue currently is. If the surge queue is more than 5% full for 15 minutes or longer, a Warning event is generated.
AWS ELB - Surge Queue Utilization Greater Than 50% 15 minutes

netuitive.aws.elb.surgequeueutilization has a > 50%

CRITICAL The ELB surge queue holds requests until they can be forwarded to the backend servers. The surge queue can hold a maximum of 1,024 requests, after which it will be full and will start rejecting requests. Netuitive's Surge Queue Utilization metric reflects as a percentage how full the surge queue currently is. If the surge queue is more than 50% full for 15 minutes or longer, a Critical event is generated.
AWS ELB - Unhealthy Host Percent Above 50% 15 minutes
  1. netuitive.aws.elb.unhealthyhostpercent...
    1. ... has a ≥ 50%
    2. ... has a < 75%
WARNING More than half (50%) of the hosts associated with this ELB are in an unhealthy state.
AWS ELB - Unhealthy Host Percent Above 75% 5 minutes

netuitive.aws.elb.unhealthyhostpercent has a ≥ 75%

CRITICAL More than three quarters (75%) of the hosts associated with this ELB are in an unhealthy state.
AWS ELB - Elevated ELB Error Rate 15 minutes
  1. netuitive.aws.elb.httpcodeelberrorpercent...
    1. ... has an
    2. ... has an
    3. ... has a ≥ 2%
  2. AND

  3. aws.elb.requestcount has a ≥ 1000

CRITICAL This is another error rate policy, but rather than looking at backend error rates, it is looking at errors from the ELB itself. In this case, we look for both high traffic volumes (> 1000) as well as error rates that are not just higher than normal, but are above a 2% threshold. In those cases, a Critical event will be generated. You may wish to tune either the 1,000 request count threshold, the 2% error threshold, or both, to better suit your environment.

Table 1-7: ELB default policies.
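The three backend-error-rate policies and the two surge-queue policies in Table 1-7 partition their inputs by threshold. A hedged sketch of that decision logic follows; the 1,000-request and 2% thresholds and the 1,024-slot queue capacity are the defaults from the table, but the functions themselves are illustrative, not Netuitive's implementation.

```python
# Illustrative decision logic for the ELB default policies in Table 1-7.
# Thresholds are the documented defaults; the functions are sketches.

LOW_VOLUME = 1_000           # requests; below this, volume is "low"
HIGH_ERROR_RATE = 2.0        # percent
SURGE_QUEUE_CAPACITY = 1024  # the surge queue's fixed maximum

def backend_error_event(error_rate_deviated, error_percent, request_count):
    """Which of the three backend-error-rate policies (if any) fires."""
    if not error_rate_deviated:
        return None              # error rate is within its baseline
    if request_count < LOW_VOLUME:
        return "WARNING"         # Elevated Backend Error Rate (Low Volume)
    if error_percent < HIGH_ERROR_RATE:
        return "WARNING"         # High Volume, Low Error Rate
    return "CRITICAL"            # High Volume, High Error Rate

def surge_queue_utilization(queue_length):
    """How full the surge queue is, as a percentage of its 1,024 slots."""
    return 100.0 * queue_length / SURGE_QUEUE_CAPACITY

print(backend_error_event(True, 3.5, 50_000))  # → CRITICAL
print(surge_queue_utilization(512))            # → 50.0, trips the 50% policy
```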

Lambda

Policy name Duration Conditions Category Description
AWS Lambda - Elevated Invocation Count 30 minutes
  1. aws.lambda.invocations...
    1. ... has an
    2. ... has an
WARNING The number of calls to the function (invocations) have been greater than expected for at least the last 30 minutes.
AWS Lambda - Depressed Invocation Count 10 minutes
  1. aws.lambda.invocations...
    1. ... has an
    2. ... has an
WARNING The number of calls to the function (invocations) have been lower than expected for at least the last 10 minutes.
AWS Lambda - Elevated Latency 30 minutes
  1. aws.lambda.duration...
    1. ... has an
    2. ... has an
WARNING The average duration per function call (latency) has been higher than expected for at least the past 30 minutes.

RDS

Table 1-8 below describes the default policies for RDS elements.

Policy name Duration Conditions Category Description
Elevated RDS CPU Activity (Normal Network Activity) 30 minutes
  1. netuitive.aws.rds.cpuutilization...
    1. ... has an
    2. ... has an
    3. ... has a > 20

  2. AND

  3. netuitive.aws.rds.networkreceivethroughput...
    1. ... does NOT have a deviation
    2. ... does NOT have a deviation
  4. AND

  5. netuitive.aws.rds.networktransmitthroughput...
    1. ... does NOT have a deviation
    2. ... does NOT have a deviation
INFO Increases in CPU activity are not uncommon when there is a rise in network activity: increased traffic to a server means more work for that server to do. This policy is designed to catch cases where CPU activity is higher than normal and cannot be explained by a corresponding increase in network traffic. It may or may not represent a problem, but it is useful to know about.
Elevated RDS Network Activity 30 minutes
  1. netuitive.aws.rds.networkreceivethroughput...
    1. ... has an
    2. ... has an
  2. AND

  3. netuitive.aws.rds.networktransmitthroughput...
    1. ... has an
    2. ... has an
INFO Indicates an increase in network activity above what is considered to be normal.
Elevated RDS Disk Activity 30 minutes
  1. netuitive.aws.rds.readiops...
    1. ... has an
    2. ... has an
  2. AND

  3. netuitive.aws.rds.writeiops...
    1. ... has an
    2. ... has an
INFO Indicates an increase in disk activity above what is considered to be normal.
Elevated RDS Latency 30 minutes
  1. netuitive.aws.rds.readlatency...
    1. ... has an
    2. ... has an
  2. AND

  3. netuitive.aws.rds.writelatency...
    1. ... has an
    2. ... has an
  4. AND

  5. netuitive.aws.rds.totalthroughput...
    1. ... has a ≥ 1,000
CRITICAL This policy will generate a Critical event when both read and write latency is higher than normal for half an hour or longer. Note that there must also be a minimum number of requests for this policy to trigger; this is because with too few requests, the average can tend to be skewed by outliers. The default request threshold is 1,000; you may wish to tune this for your environment.
AWS RDS - Elevated Number of Connections 15 minutes
  1. netuitive.aws.rds.databaseconnections...
    1. ... has an
    2. ... has an
WARNING The number of database connections open on the RDS instance is higher than expected.
AWS RDS - Elevated Read IOPS 15 minutes
  1. netuitive.aws.rds.readiops...
    1. ... has an
    2. ... has an
WARNING Read activity on the RDS instance is greater than expected.
AWS RDS - Elevated Write IOPS 15 minutes
  1. netuitive.aws.rds.writeiops...
    1. ... has an
    2. ... has an
WARNING Write activity on the RDS instance is greater than expected.

Table 1-8: RDS default policies.

SQS

Table 1-9 below describes the default policies for AWS Simple Queue Service (SQS) elements.

Policy name Duration Conditions Category Description
AWS SQS - Queue Falling Behind 2 hours

netuitive.aws.sqs.arrivalrate has a > netuitive.aws.sqs.completionrate

CRITICAL The arrival rate for the queue has been greater than the completion rate for at least 2 hours. This is an indication that processing of the queue is falling behind.

Table 1-9: SQS default policies.
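The single condition in the SQS policy above compares two computed rates. A minimal sketch of that check (names and units are hypothetical; rates are messages per minute):

```python
# Illustrative check of the SQS "Queue Falling Behind" condition:
# the queue loses ground whenever messages arrive faster than they
# complete. Names and units are hypothetical.

def queue_falling_behind(arrival_rate, completion_rate):
    """True when the backlog is growing."""
    return arrival_rate > completion_rate

# 120 arrivals/min vs 90 completions/min means the backlog grows by
# 30 messages/min; sustained for 2 hours, this raises the CRITICAL event.
print(queue_falling_behind(120.0, 90.0))  # → True
```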

Microsoft Azure

Note   Some of the policies below require that you enable basic metric collection on your virtual machine. To learn how to enable basic metrics, see the Azure integration help page.
Policy name Metrics Required Duration Conditions Category Description
Azure VM - CPU Threshold Exceeded Boot Diagnostics 15 minutes

Processor.PercentProcessorTime has a > 50%

WARNING The CPU on the Azure Virtual Machine has exceeded 50% for at least 15 minutes.
Azure VM - Elevated CPU Activity (Normal Network Activity) Boot Diagnostics 30 minutes
  1. Processor.PercentProcessorTime...
    1. ... has an
    2. ... has an
    3. ... has a > 20%
  2. AND

  3. NetworkInterface.BytesReceived
    1. ... has
  4. AND

  5. NetworkInterface.BytesTransmitted
    1. ... has
INFO Increases in CPU activity are not uncommon when there is a rise in network activity: increased traffic to a server means more work for that server to do. This policy is designed to catch cases where CPU activity is higher than normal and cannot be explained by a corresponding increase in network traffic. It may or may not represent a problem, but it is useful to know about. Note, however, that this policy will not fire if CPU utilization is below 20%.
Azure VM - Elevated Disk Activity Boot Diagnostics 30 minutes
  1. PhysicalDisk.ReadsPerSecond...
    1. ... has an
    2. ... has an
  2. AND

  3. PhysicalDisk.WritesPerSecond...
    1. ... has an
    2. ... has an
INFO Disk activity has been higher than expected for at least 30 minutes.
Azure VM - Elevated Memory Utilization Basic Metrics 15 minutes
  1. Memory.PercentUsedMemory...
    1. ... has an
    2. ... has an
WARNING The memory utilization on the Azure Virtual Machine is higher than expected.
Azure VM - Elevated Network Activity Boot Diagnostics 30 minutes
  1. NetworkInterface.BytesReceived...
    1. ... has an
    2. ... has an
  2. AND

  3. NetworkInterface.BytesTransmitted...
    1. ... has an
    2. ... has an
INFO Network activity has been higher than expected for at least 30 minutes.
Azure VM - Heavy Disk Load Basic Metrics 5 minutes
  1. PhysicalDisk.AverageDiskQueueLength...
    1. ... has an
    2. ... has an
WARNING Average disk queue length is greater than expected, which could indicate a problem with heavy disk load.

Cassandra

Policy name Duration Conditions Category Description
Cassandra - Depressed Key Cache Hit Rate 30 minutes
  1. cassandra.Cache.KeyCache.HitRate...
    1. ... has an
    2. ... has a ≤ 0.85
WARNING The hit rate for the key cache is lower than expected and is less than 85%. This condition has been persisting for at least the past 30 minutes.
Cassandra - Elevated Node Read Latency 30 minutes

cassandra.Keyspace.ReadLatency.OneMinuteRate has an

WARNING The overall keyspace read latency on this Cassandra node has been higher than expected for at least 30 minutes.
Cassandra - Elevated Node Write Latency 30 minutes

cassandra.Keyspace.WriteLatency.OneMinuteRate has an

WARNING The overall keyspace write latency on this Cassandra node has been higher than expected for at least 30 minutes.
Cassandra - Elevated Number of Pending Compaction Tasks 15 minutes

cassandra.Compaction.PendingTasks has an

WARNING The number of pending compaction tasks has been higher than expected for at least the past 15 minutes. This could indicate that the node is falling behind on compaction tasks.
Cassandra - Elevated Number of Pending Thread Pool Tasks 15 minutes

cassandra.ThreadPools.*.PendingTasks has an

WARNING For at least the past 15 minutes, the number of pending tasks for one or more thread pools has been higher than expected. This could indicate that the pools are falling behind on their tasks.
Cassandra - Unavailable Exceptions Greater Than Zero 5 minutes

cassandra.*Unavailables.OneMinuteRate has a ≥ 1

CRITICAL The required number of nodes were unavailable for one or more requests.

Collectd

Table 1-10 below describes the default policies for collectd elements.

Policy name Duration Conditions Category Description
Elevated Memory Usage (Collectd) 30 minutes

netuitive.collectd.memory.utilizationpercent has an

INFO Indicates an increase in memory usage above what is considered to be normal.
Elevated Process Count 30 minutes

netuitive.collectd.processes.total has an

INFO Indicates that the total number of processes has increased above what is considered to be normal.
Elevated Percentage of Blocked Processes 30 minutes

netuitive.collectd.processes.blockedpercent has an

WARNING Indicates a higher-than-normal percentage of blocked processes.
Elevated Percentage of Zombie Processes 30 minutes

netuitive.collectd.processes.zombiepercent has an

WARNING Indicates a higher-than-normal percentage of zombie processes.

Table 1-10: Collectd default policies.

Diamond / Linux

Table 1-11 below describes the default policies for Diamond and Linux elements.

Important    Before reading about these default policies, note that both the Elevated User CPU and Elevated System CPU policies assume that the CPU Collector is configured to collect aggregate CPU metrics rather than per-core metrics, and that the metrics are being normalized.

This is done by setting percore to False (it is True by default) and normalize to True (it is False by default) in your configuration file. After adjusting these settings, save the configuration file and restart the agent to apply the changes. See the Linux or Diamond agent documentation for more information.
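With the Diamond agent, the collector settings described above might look like the following fragment. The file path is an assumption for illustration; check your agent's documentation for the actual location of the CPU collector configuration.

```ini
# Hypothetical Diamond CPU collector configuration, e.g.
# /etc/diamond/collectors/CPUCollector.conf (path is an assumption).
enabled = True
# Report aggregate CPU metrics instead of one set per core.
percore = False
# Normalize CPU metrics across cores.
normalize = True
```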

Policy name Duration Conditions Category Description
Linux - CPU Threshold Exceeded 15 minutes

cpu.total.utilization.percent has a > 95%

CRITICAL The CPU on the server has exceeded 95% for at least 15 minutes.
Linux - Elevated System CPU 30 minutes
  1. netuitive.linux.cpu.total.system.normalized...
    1. ... has an
    2. ... has a ≥ 30%
INFO This policy will generate an Informational event when CPU usage by system processes is higher than normal, but only if the actual value is also above 30%. Customers typically don't want to be informed of deviations in CPU behavior when the actual values are too low; you may want to tune the 30% threshold for your environment.
Linux - Elevated User CPU 30 minutes
  1. netuitive.linux.cpu.total.user.normalized...
    1. ... has an
    2. ... has a ≥ 50%
INFO This policy will generate an Informational event when CPU usage by user processes is higher than normal, but only if the actual value is also above 50%. Customers typically don't want to be informed of deviations in CPU behavior when the actual values are too low; you may want to tune the 50% threshold for your environment.
Linux - Heavy CPU Load 15 minutes
  1. netuitive.linux.cpu.total.user.normalized...
    1. ... has an
    2. ... has an
  2. AND

  3. netuitive.linux.loadavg.05.normalized has a > 2
CRITICAL This is a CRITICAL event indicating that the server's CPU is under heavy load, based on upper deviations in CPU utilization percent and the normalized loadavg.05 metric being greater than 2. The rule of thumb is that the run queue size (represented by the loadavg) should not be greater than 2x the number of CPUs.
Linux - Disk Utilization Threshold Exceeded 15 minutes

netuitive.linux.diskspace.*.byte_percentused has a > 95%

CRITICAL The consumed disk space on the server has exceeded 95% for at least 15 minutes.
Linux - Heavy Disk Load 15 minutes
  1. iostat.*\.average_queue_length...
    1. ... has an
    2. ... has an
WARNING This is a WARNING which indicates that the disk is experiencing heavy load, but performance has not yet been impacted.
Linux - Heavy Disk Load with Slow Performance 15 minutes
  1. iostat.*\.await...
    1. ... has an
    2. ... has an
  2. AND

  3. iostat.*\.average_queue_length...
    1. ... has an
    2. ... has an
CRITICAL This is a CRITICAL event which indicates that the disk is not only experiencing heavy load, but performance is suffering.
Linux - Memory Utilization Threshold Exceeded 15 minutes netuitive.linux.memory.utilization.percent has a > 95% CRITICAL This is a CRITICAL event which is raised when memory utilization exceeds 95%.
Elevated Memory Usage 30 minutes
  1. netuitive.linux.memory.utilizationpercent...
    1. ... has an
    2. ... has a > 50%
INFO This policy will generate an Informational event when memory usage is higher than normal, but only if the actual value is also above 50%. Customers typically don't want to be informed of deviations in memory usage when the actual values are too low; you may want to tune the 50% threshold for your environment.

Table 1-11: Diamond/Linux default policies.
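The normalized loadavg metric used by the Heavy CPU Load policy above divides the raw load average by the number of CPUs, matching the 2x rule of thumb. A minimal sketch of that normalization (the derivation is illustrative; the "should stay at or below 2" rule comes from the text):

```python
# Sketch of the load-average normalization that the "Linux - Heavy CPU
# Load" policy relies on. Illustrative, not Netuitive's implementation.

def normalized_load(load5, num_cpus):
    """5-minute load average divided by CPU count."""
    return load5 / num_cpus

# A 4-CPU server with a 5-minute load average of 10 has a normalized
# load of 2.5, which satisfies the policy's "> 2" condition.
print(normalized_load(10.0, 4))  # → 2.5
```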

Docker

Policy name Duration Conditions Category Description
Docker Container - CPU Throttling 15 minutes

netuitive.docker.cpu.container_throttling_percent has a > 0

WARNING

The Docker container has had its CPU usage throttled for at least the past 15 minutes.
Docker Container - Elevated CPU Utilization 30 minutes
  1. netuitive.docker.cpu.container_cpu_percent...
    1. ... has an
    2. ... has an
INFO CPU usage on the Docker container has been higher than expected for 30 minutes or longer.
Docker Container - Elevated Memory Utilization 30 minutes
  1. netuitive.docker.cpu.container_memory_percent...
    1. ... has an
    2. ... has an
INFO Memory usage on the Docker container has been higher than expected for 30 minutes or longer.
Docker Container - Extensive CPU Throttling 1 hour 5 minutes

netuitive.docker.cpu.container_throttling_percent has a > 0

CRITICAL The Docker container has had its CPU usage throttled for over an hour.

Elasticsearch

Policy name Duration Conditions Category Description
Elevated CPU Activity 15 minutes

elasticsearch.process.cpu.percent has an

WARNING

This policy generates a warning event when Elasticsearch CPU activity is higher than expected.
Elevated JVM Heap Usage 15 minutes

elasticsearch.jvm.mem.heap_used_percent has an

WARNING This policy generates a warning event when the Elasticsearch JVM's heap usage is higher than expected.
Elevated JVM Threads 15 minutes

elasticsearch.jvm.threads.count has an

WARNING This policy generates a warning event when the number of threads used by the Elasticsearch JVM is higher than expected.
Elevated Processing Time 15 minutes

elasticsearch.indices._all.*time_in_millis has an

WARNING This policy generates a warning event if any of the "time in millis" metrics on the "_all" index deviate above the baseline for 15 minutes or more.
Reject Count Greater Than Zero 5 minutes

elasticsearch.thread_pool.*.rejected has a > 0

WARNING This policy generates a warning if any of the Elasticsearch thread pools has a "rejected" count greater than 0.

Java

Policy name Duration Conditions Category Description
Elevated JVM CPU Activity 15 minutes
  1. cpu.used.percent...
    1. ... has an
    2. ... has an
    3. ... has a > 50%
WARNING This policy will generate a WARNING event when the JVM's CPU activity is higher than expected and CPU usage is above 50%.
Elevated JVM Heap Usage 15 minutes
  1. netuitive.jvm.heap.utilizationpercent...
    1. ... has an
    2. ... has an
WARNING This policy will generate a WARNING event when the JVM's heap usage is higher than expected.
Elevated JVM System Threads 15 minutes
  1. system.threads...
    1. ... has an
    2. ... has an
WARNING This policy will generate a WARNING event when the number of system threads used by the JVM is higher than expected.

Windows

Table 1-12 below describes the default policies for Windows elements.

Policy name Duration Conditions Category Description
Windows - Elevated Disk Latency 15 minutes
  1. physical_disk._Total.avg_sec_per_read...
    1. ... has an
  2. AND

  3. physical_disk._Total.avg_sec_per_write...
    1. ... has an
WARNING This policy will generate a WARNING event when both disk read and write times are higher than their expected baselines.
Windows - Elevated Memory Utilization 10 minutes
  1. netuitive.winsrv.memory.utilizationpercent...
    1. ... has an
    2. ... has an
WARNING This policy will generate a WARNING event when memory utilization on the Windows server is higher than expected.
Windows - Heavy CPU Load 15 minutes
  1. netuitive.winsrv.system.processor_queue_length_normalized...
    1. ... has a > 2
  2. AND

  3. processor._Total.percent_processor_time...
    1. ... has an
    2. ... has an
  4. AND

  5. system.context_switches_per_sec...
    1. ... has an
    2. ... has an
CRITICAL High CPU values by themselves are not always a good indicator of a server being under heavy load. This policy looks for upper deviations not only in CPU, but in run queue size (system.processor_queue_length) and context switches as well. Taken together, upper deviations in all three of these key metrics are a good indication of an overloaded server.
Windows - Heavy Disk Load 15 minutes
  1. physical_disk._Total.avg_queue_length...
    1. ... has an
    2. ... has an
WARNING This policy will generate a WARNING event if the average disk queue length for the server is higher than expected, indicating a potential problem with heavy disk load.

Table 1-12: Windows default policies.