Alarm and Event Filtering

Let us first turn our attention to filtering, not just of alarms, but of events in general. To focus an operator's or a management application's attention on those events that really matter, it is important to block out as many irrelevant or less important events as possible. This is analogous to the way in which the human brain is able to deal with the massive flow of data that it is constantly exposed to, such as sounds, visual images, and sensory data. To focus, it filters out massive amounts of data that would otherwise be distracting, for example, background noise when following a conversation.

One way to enable filtering is to allow users (operators or management applications) to subscribe only to those alarms and events that are of potential relevance to them and what they need to accomplish, as specified by some criteria. This way, users receive only events that meet those criteria. Here are some examples of using this technique effectively:

Users might choose to subscribe only to alarms that involve a particular system or subsystem. For instance, they could be concerned with always ensuring that the company's CEO receives excellent communication service and, therefore, subscribe specifically to alarms that affect the port through which the CEO's office is connected.

Users might also choose to subscribe only to alarms of a certain type. For instance, operating personnel for voice services might be interested only in alarms that indicate problems that are related to voice service.

Finally, users might choose to receive only alarms that have a certain severity. They might decide to receive only critical alarms and to have everything else discarded (well, perhaps not discarded, but simply stored in a logfile so that it can be used for analysis when needed, as opposed to being brought directly to their attention). This could be important when high alarm volumes occur, so they can avoid the small stuff and ensure that high-impact items are dealt with.
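The three subscription criteria just described (system or subsystem, alarm type, and severity) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Alarm` and `Subscription` classes, the `dispatch` function, the port names, and the four-level severity scale are all hypothetical and are not taken from any particular management platform.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional, Tuple

# Assumed severity scale, ordered from least to most severe.
SEVERITIES = ["warning", "minor", "major", "critical"]


@dataclass(frozen=True)
class Alarm:
    source: str      # system, subsystem, or port that raised the alarm
    alarm_type: str  # for example "voice" or "link"
    severity: str    # one of SEVERITIES


@dataclass(frozen=True)
class Subscription:
    sources: Optional[FrozenSet[str]] = None      # None means "any source"
    alarm_types: Optional[FrozenSet[str]] = None  # None means "any type"
    min_severity: str = "warning"                 # lowest severity of interest

    def matches(self, alarm: Alarm) -> bool:
        """Return True if the alarm meets every criterion of the subscription."""
        if self.sources is not None and alarm.source not in self.sources:
            return False
        if self.alarm_types is not None and alarm.alarm_type not in self.alarm_types:
            return False
        return SEVERITIES.index(alarm.severity) >= SEVERITIES.index(self.min_severity)


def dispatch(alarms: Iterable[Alarm],
             subscription: Subscription) -> Tuple[List[Alarm], List[Alarm]]:
    """Split alarms into those delivered to the user and those only logged."""
    delivered: List[Alarm] = []
    logged_only: List[Alarm] = []
    for alarm in alarms:
        (delivered if subscription.matches(alarm) else logged_only).append(alarm)
    return delivered, logged_only


# Usage: watch only the (hypothetical) port serving the CEO's office,
# and only for alarms of severity "major" or higher.
ceo_port = Subscription(sources=frozenset({"port-7"}), min_severity="major")
delivered, logged_only = dispatch(
    [Alarm("port-7", "link", "critical"), Alarm("port-9", "voice", "critical")],
    ceo_port,
)
```

Alarms that do not match are returned in a second list rather than dropped, which mirrors the point above that filtered alarms are better kept in a logfile than discarded outright.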

Another way to filter alarms is deduplication. In some cases, the same alarm condition might cause the same alarm to be sent repeatedly. Because each new instance of the same alarm contains no new information, the new instances can simply be thrown away. The process of discarding the redundant alarms is referred to as deduplication. A similar scenario to which similar considerations apply is that of oscillating alarms. In that case, there is an underlying oscillating alarm condition, causing alarms to be sent and then cleared again immediately, over and over in rapid succession. Although oscillating alarms relate to only a single condition and are hence relatively easy to spot, they can lead to a high overall alarm volume that drowns out other events that are happening in the network. Therefore, these alarms should be filtered out.

An infamous example concerns the "door open" alarm. Equipment that is installed in publicly accessible locations often sends such an alarm whenever a sensor detects that its door has been opened. Having a door to a piece of equipment opened can indicate a serious problem because it could mean that an unauthorized person might be tampering with the equipment. The problem in this case is that thousands of alarms could be generated per hour when the sensor on a particular piece of equipment is faulty and mistakenly detects that the door is open, only to correct itself by reporting that it is closed, every other second. Until the faulty sensor is fixed, the oscillating alarms need to be filtered.

Of course, with oscillating alarms, it could still be useful to know the frequency with which oscillation occurs, or, with redundant alarms, how many duplicates there are. For example, is the door reported open three times in an hour? If so, the door might really have been opened three times because someone is in fact tampering with the equipment, perhaps while performing maintenance. Or is the door reported open 3,000 times in an hour, in which case a sensor or contact is probably bad? If the repeated occurrences of the alarm are simply filtered, this information is lost. A better solution is to record the information that duplicates or oscillations have been observed, along with how many there were, but to throw away the alarms themselves. Of course, if the rate at which oscillations or seemingly redundant alarms occur drops below a certain threshold, it might even be advisable to not filter those alarms at all.
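The idea of keeping the count while discarding the repeats, and of leaving low-rate alarms unfiltered, can be illustrated with a small batch sketch in Python. It assumes the alarms of one observation window (say, one hour) have already been collected; the function name `filter_window`, the alarm keys, and the threshold value are hypothetical choices made for the example.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple


def filter_window(alarm_keys: Iterable[Hashable],
                  threshold: int) -> Tuple[List[Hashable], Dict[Hashable, int]]:
    """Filter one window of alarms by repetition rate.

    Alarms whose key occurs at most `threshold` times in the window are
    forwarded unchanged. Alarms that repeat more often are thrown away,
    but their number of occurrences is recorded in a summary.
    """
    keys = list(alarm_keys)
    counts = Counter(keys)
    forwarded = [key for key in keys if counts[key] <= threshold]
    summaries = {key: n for key, n in counts.items() if n > threshold}
    return forwarded, summaries


# Usage: one cabinet reports "door open" 3 times in the window, another 3,000 times.
window = ["door-open@cabinet-1"] * 3 + ["door-open@cabinet-2"] * 3000
forwarded, summaries = filter_window(window, threshold=10)
# forwarded holds the 3 alarms from cabinet-1;
# summaries is {"door-open@cabinet-2": 3000}
```

A real system would have to decide as the alarms arrive rather than after the window has closed; the sketch only shows which information is kept and which is dropped.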

For example, here is one technique that can be applied in the case of redundant alarms: The first occurrence of the alarm message needs to be forwarded without delay to the intended recipient. If duplicates of the same alarm occur, at least 1 minute should be allowed to pass before notifying the recipient of the same alarm again. At this point, the alarm message is sent again, annotated with a counter that tells the recipient how many instances of the alarm message have actually occurred.

Figure 5-5 also illustrates this. If alarm A1 occurs in the initial state, S1, of the system, it is forwarded immediately and another state, S2, is entered. If no further A1 alarms are received within the 1-minute period, the system reverts to its initial state. However, if additional A1 alarms occur, they are not forwarded immediately. Instead, the system enters another state, S3, in which the duplicate counter is increased for each occurrence of A1. Eventually, the 1-minute timer expires.

At that point, the system enters state S4, in which it sends the alarm A1 along with the count of its number of occurrences. It then immediately reenters state S2, waiting for more duplicates of A1 or, if no more are received within the 1-minute interval, reverts to the initial state.

Figure 5-5 Deduplication of Alarms

[State diagram: states S1, S2, S3, and S4, connected by transitions labeled "A1 occurs" and "T1 expires"]

S1: Initial state; wait for A1 to occur

S2: Send A1; start timer T1; initialize duplicate count to 0

S3: Increment duplicate count by 1

S4: Send A1 annotated with number of occurrences
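The state machine of Figure 5-5 can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the class name `Deduplicator`, its methods, and the `sent` list that stands in for real delivery are all hypothetical. Time is passed in explicitly as a number of seconds so that the example needs no real timer, and S4 is modeled as an action performed when T1 expires rather than as a separate stored state. The sketch also assumes that re-entering S2 from S4 restarts the timer and resets the counter without sending A1 a second time.

```python
from enum import Enum
from typing import List, Tuple


class State(Enum):
    S1 = "initial state, waiting for A1"
    S2 = "A1 sent, timer T1 running, no duplicates yet"
    S3 = "timer T1 running, counting duplicates"


class Deduplicator:
    def __init__(self, window: float = 60.0) -> None:
        self.window = window      # duration of timer T1 in seconds
        self.state = State.S1
        self.deadline = 0.0       # time at which T1 expires
        self.duplicates = 0
        # Stand-in for delivery: (send time, duplicate count) per message.
        self.sent: List[Tuple[float, int]] = []

    def _expire(self, now: float) -> None:
        """Handle every expiry of T1 that has happened up to `now`."""
        while self.state is not State.S1 and now >= self.deadline:
            if self.state is State.S3:
                # S4: send A1 annotated with the count, then re-enter S2.
                self.sent.append((self.deadline, self.duplicates))
                self.duplicates = 0
                self.deadline += self.window
                self.state = State.S2
            else:
                # S2 and no duplicates arrived: revert to the initial state.
                self.state = State.S1

    def on_alarm(self, now: float) -> None:
        """Process one occurrence of A1 at time `now`."""
        self._expire(now)
        if self.state is State.S1:
            self.sent.append((now, 0))  # first occurrence goes out without delay
            self.duplicates = 0
            self.deadline = now + self.window
            self.state = State.S2
        else:
            self.duplicates += 1        # S3: count the duplicate, do not forward
            self.state = State.S3

    def tick(self, now: float) -> None:
        """Advance the clock when no alarm arrives."""
        self._expire(now)


# Usage: three occurrences within one minute, then a quiet minute, then one more.
dedup = Deduplicator(window=60.0)
for t in (0.0, 10.0, 20.0):
    dedup.on_alarm(t)
dedup.tick(60.0)      # T1 expires: one message annotated with 2 duplicates
dedup.tick(120.0)     # T1 expires again with no duplicates: back to S1
dedup.on_alarm(130.0)
```

After this sequence, `dedup.sent` contains three messages, `(0.0, 0)`, `(60.0, 2)`, and `(130.0, 0)`, instead of the four alarms that actually occurred.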

Of course, strictly speaking, we are now no longer simply filtering messages. Although it is true that we throw away many of the duplicates, we maintain a little counter for the number of occurrences and add this counter to the duplicate alarm message that we send. This means that we have actually started to aggregate and preprocess information across alarm messages. What we have here is really a very simple form of correlating alarms, which leads us to the next topic.
