Y.1731 Fault management

11 Feb in CCM, Fault Management, ITU-T Y.1731, Carrier Ethernet

In my latest blog on Y.1731 I finished by saying that the next article would cover the fault management functions.
So, in this article I'll start where I left off.
 

Fault management overview

Of the Ethernet OAM standards, only Y.1731 defines the ability to measure performance parameters; IEEE 802.1ag does not. The MEF plans to address performance management requirements that are not addressed by ITU-T or IEEE 802.1ag.
For the first phase, Ethernet Service OAM, the measurement of performance management parameters is limited to point-to-point MA/MEG. Y.1731 uses the same performance parameter definitions as the MEF 10 standard, “Ethernet Service Attributes Phase 1.”

The following performance parameters are measured:
 

Frame Loss Ratio (FLR)

FLR is defined as a ratio, expressed as a percentage, of the number of service frames not delivered divided by the total number of service frames during time interval T, where the number of service frames not delivered is the difference between the number of service frames sent to an ingress UNI and the number of service frames received at an egress UNI.
Two types of FLR measurement are possible: dual-ended LM and single-ended LM. Dual-ended LM is accomplished by exchanging CCM OAM frames that include appropriate counts of frames transmitted and frames received. These counts do not include OAM frames at the MEP's ME Level. Dual-ended LM enables the proactive measurement of both Near End and Far End FLR at each end of a MEG.
Single-ended LM is accomplished by the on-demand exchange of LMM and LMR OAM frames. These frames include appropriate counts of frames transmitted and received. Single-ended LM only provides Near End and Far End FLR at the end that initiated the LM Request.
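
To make the FLR definition concrete, here is a minimal sketch that turns two snapshots of frame counters into a loss percentage. The class and field names are my own and purely illustrative; they are not the counter names used in Y.1731, and counter wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSnapshot:
    """Frame counters sampled at one instant (hypothetical names)."""
    tx_frames: int  # service frames sent towards the peer so far
    rx_frames: int  # service frames the peer reports as received so far


def frame_loss_ratio(prev: CounterSnapshot, curr: CounterSnapshot) -> float:
    """Return the FLR in percent for the interval between two snapshots."""
    sent = curr.tx_frames - prev.tx_frames
    received = curr.rx_frames - prev.rx_frames
    if sent <= 0:
        return 0.0  # nothing was sent, so no loss can be attributed
    lost = max(sent - received, 0)
    return 100.0 * lost / sent


# Example: 10,000 frames sent during the interval, 9,950 of them delivered.
start = CounterSnapshot(tx_frames=10_000, rx_frames=9_990)
end = CounterSnapshot(tx_frames=20_000, rx_frames=19_940)
flr = frame_loss_ratio(start, end)
```

The same function would be applied once per direction to obtain Near End and Far End values, each with its own pair of counters.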
 

Frame Delay (FD)

FD is specified as the round-trip delay of a frame: the time elapsed from the start of transmission of the first bit of the frame by a source node until that same node receives the last bit of the looped-back frame, where the loopback is performed at the frame’s destination node.
 

Frame Delay Variation (FDV)

FDV is a measure of the variations in the FD between a pair of service frames, where the service frames belong to the same CoS instance on a point-to-point ETH connection.

There are two types of FD measurements, One-way and Two-way. One-way FD is measured by MEPs periodically sending 1DM frames, which include appropriate Transmit Time Stamps. FD is calculated at the receiving MEP by taking the difference between the Transmit Time Stamp and a Receive Time Stamp, which is created when the 1DM frame is received. One-way DM requires synchronized clocks between the two MEPs.

Two-way DM measures round trip delay and does not require synchronized clocks. It is accomplished by MEPs exchanging DMM and DMR frames. Each of these DM OAM frames includes Transmit Time Stamps. Y.1731 allows an option for inclusion of additional time stamps such as a Receive Time Stamp and a return Transmit Time Stamp. These additional time stamps compensate for DMR processing time.
FDV is calculated as the difference between two consecutive two-way FD measurements.
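
The delay calculations described above can be sketched as follows. The function and parameter names are assumptions of mine, timestamps are plain integers in microseconds, and the FDV is kept as a signed difference (an implementation might prefer the absolute value).

```python
from typing import List, Optional, Sequence


def one_way_frame_delay(tx_timestamp_us: int, rx_timestamp_us: int) -> int:
    """One-way FD; only meaningful when both MEP clocks are synchronized."""
    return rx_timestamp_us - tx_timestamp_us


def two_way_frame_delay(
    dmm_tx_us: int,
    dmr_rx_us: int,
    dmm_rx_us: Optional[int] = None,
    dmr_tx_us: Optional[int] = None,
) -> int:
    """Round-trip FD as seen by the MEP that sent the DMM.

    dmm_tx_us and dmr_rx_us come from the local clock. The optional
    dmm_rx_us and dmr_tx_us come from the remote clock; only their
    difference is used, so the two clocks need not be synchronized.
    """
    round_trip = dmr_rx_us - dmm_tx_us
    if dmm_rx_us is not None and dmr_tx_us is not None:
        round_trip -= dmr_tx_us - dmm_rx_us  # remove DMR processing time
    return round_trip


def frame_delay_variation(delays_us: Sequence[int]) -> List[int]:
    """Signed difference between each pair of consecutive FD measurements."""
    return [later - earlier for earlier, later in zip(delays_us, delays_us[1:])]


raw = two_way_frame_delay(1_000, 1_900)
compensated = two_way_frame_delay(1_000, 1_900, dmm_rx_us=1_400, dmr_tx_us=1_450)
fdv = frame_delay_variation([850, 870, 860])
```

In this example the remote MEP spent 50 µs handling the request, so the compensated delay is 850 µs rather than the raw 900 µs.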

These parameters are measured by using the following messages:
 

Continuity Check Messages (CCM)

CCMs are periodic hello messages multicast by a MEP within the maintenance domain to detect continuity failures. If a MEP stops receiving periodic CCMs from a peer MEP on a remote bridge, it assumes that either the remote bridge has failed or the continuity of the path between the two bridges has been interrupted.
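
As a rough illustration of how a MEP could notice that a peer has gone quiet, the sketch below keeps a last-seen time per peer MEP ID. The class is hypothetical, and the 3.5-interval timeout is an assumption on my part (it is the multiplier I recall from the standards, but it is not stated in this article).

```python
from typing import Dict, Iterable, List


class CcmMonitor:
    """Tracks when each expected peer MEP was last heard from."""

    def __init__(
        self,
        interval_s: float,
        peer_mep_ids: Iterable[int],
        start_s: float = 0.0,
        loss_multiplier: float = 3.5,  # assumed timeout factor
    ) -> None:
        self.timeout_s = interval_s * loss_multiplier
        self.last_seen: Dict[int, float] = {m: start_s for m in peer_mep_ids}

    def on_ccm(self, mep_id: int, now_s: float) -> bool:
        """Record a received CCM; return False for an unexpected MEP ID."""
        if mep_id not in self.last_seen:
            return False
        self.last_seen[mep_id] = now_s
        return True

    def lost_peers(self, now_s: float) -> List[int]:
        """Peers whose last CCM is older than the timeout."""
        return sorted(
            m for m, seen in self.last_seen.items() if now_s - seen > self.timeout_s
        )


monitor = CcmMonitor(interval_s=1.0, peer_mep_ids=[101, 102])
monitor.on_ccm(102, 1.0)           # peer 102 is heard once, then goes silent
for t in (1.0, 2.0, 3.0, 4.0, 5.0):
    monitor.on_ccm(101, t)         # peer 101 keeps sending every second
lost = monitor.lost_peers(5.0)
unexpected_accepted = monitor.on_ccm(999, 5.0)
```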
 

Loopback Messages (LBM/LBR)

LBM is a Unicast message used to verify the connectivity between a MEP and a peer MEP or MIP. Loopback messages are also used for fault localization.
To verify the connectivity between a MEP and a peer MEP or a MIP, an LBM is initiated by the source MEP with a destination MAC address set to the MAC address of desired peer MEP or MIP. The receiving MIP or MEP responds to the LBM with a (Unicast) Loopback Reply (LBR) addressed to the source MEP.
LBM helps a MEP identify the location of a continuity fault along a given MA. A MIP in front of the continuity fault responds with a loopback reply. A MIP or MEP behind the continuity fault does not respond. For loopback to work, the MEP must know the MAC address of the target MIP or MEP. These MAC addresses can be discovered using the Linktrace Message.
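
The localization logic described here can be sketched as a walk along the path, assuming the MAC addresses are already known and listed in order from the source MEP outwards. The `responds` callable is a stand-in for "send an LBM and wait for the LBR"; it is a hypothetical hook, not a real API, and the MAC addresses are documentation values.

```python
from typing import Callable, Optional, Sequence, Tuple


def localize_fault(
    path: Sequence[str],
    responds: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (last node that replied, first node that did not).

    The fault is assumed to sit between those two nodes. The second
    element is None when every node on the path answered.
    """
    last_ok: Optional[str] = None
    for mac in path:
        if responds(mac):
            last_ok = mac
        else:
            return last_ok, mac
    return last_ok, None


# Example path of two MIPs followed by the remote MEP.
path = ["00:00:5e:00:53:01", "00:00:5e:00:53:02", "00:00:5e:00:53:03"]
reachable = {"00:00:5e:00:53:01"}  # only the first MIP is in front of the fault
segment = localize_fault(path, lambda mac: mac in reachable)
```

Stopping at the first silent node keeps the number of loopbacks small, since every node behind the fault would stay silent as well.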
 

Linktrace Messages (LTM)

LTM is a multicast message used by a source MEP to trace the path to other MEPs in the same MA. All reachable MIPs and MEPs respond back with a Linktrace Reply (LTR) message addressed to the source MEP. The originating MEP can then determine the MAC addresses of all MIPs and MEPs belonging to the same MA.
Note that the source MEP sends a single LTM to the next hop along the trace path. However, it can receive many LTR messages from different MIPs along the trace path and different MEPs terminating the branches of the trace path.
Linktrace can also be used when no faults are apparent in order to discover the routes normally taken by data through the network.
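
Because the replies can arrive in any order, the source MEP has to sort them to reconstruct the path. The sketch below assumes each reply carries a TTL that was decremented at every hop, so a higher value means a node closer to the source; the `LinktraceReply` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LinktraceReply:
    """One LTR as seen by the source MEP (hypothetical field names)."""
    mac: str
    reply_ttl: int  # assumed to decrease by one per hop away from the source


def order_path(replies: Iterable[LinktraceReply]) -> List[str]:
    """Return responder MAC addresses ordered from nearest to farthest."""
    ordered = sorted(replies, key=lambda r: r.reply_ttl, reverse=True)
    return [r.mac for r in ordered]


# Replies received out of order.
replies = [
    LinktraceReply("00:00:5e:00:53:03", 61),
    LinktraceReply("00:00:5e:00:53:01", 63),
    LinktraceReply("00:00:5e:00:53:02", 62),
]
hops = order_path(replies)
```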

Other messages are also used, such as LTR, AIS, LCK, DMM and DMR.

 

Hierarchical Fault Management

As already mentioned, 802.1ag CFM defines a domain hierarchy in which customers, service providers, and operators use different MD levels. This hierarchy is also used for fault detection.
The figure below shows a complete network model with both inward- and outward-facing MEPs, MIPs and maintenance levels (shown as numbers inside the triangle or circle).

 

 

When a service continuity fault occurs inside Operator B’s network, the customer can detect an end-to-end service continuity fault using CCM, but it cannot determine the location of the fault within the operators’ network. Operator A can detect that a service continuity fault exists within Operator B’s network.
Note that the customer, Operator A, and Operator B can concurrently and independently detect the continuity fault and run Linktrace to determine the location of the fault. That is, Operator A does not have to wait for the customer to tell it that the service is broken before it runs its own Continuity Check.
Note that the CCMs shown in the above figure can be set up to run continuously to detect potential continuity faults or they can be set up on demand as needed.
  

Operational issues

Within the realm of fault management, Ethernet OAM can support fault detection, fault verification, fault isolation, fault notification, and fault recovery. In the realm of performance management, Ethernet OAM provides the tools to measure frame loss, delay, delay variation, and service availability.
For fault detection, Ethernet OAM provides a means to detect both hard failures and soft failures such as misconfiguration or software failure. Because CCMs are multicast, if a MEP receives a CCM with a MEP ID that is not within its configured MA/MEG, a misconfiguration or cross-connect error is likely.
A customer’s EVC may include an unauthorized site and an appropriate alarm will be generated. A good feature of Ethernet OAM is that if a service instance is taken out of service.
The key operational issue for Ethernet OAM is scalability. CCMs can be sent as often as every 3.3 ms. There can be 4,094 VLANs per port and up to eight maintenance levels, which yields a worst-case transmission rate of 9.8 million CCMs per second. Also, as previously noted, supporting an optional MIP CCM database may present some scalability issues.
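
The worst-case figure can be checked with a few lines of arithmetic. I am assuming here that the 3.3 ms interval stands for 300 CCMs per second per session (a 3 1/3 ms period), since that is the value that reproduces the 9.8 million figure, and that one session runs per VLAN per maintenance level.

```python
VLANS_PER_PORT = 4_094
MAINTENANCE_LEVELS = 8
CCMS_PER_SECOND_PER_SESSION = 300  # assumed: one CCM every 3 1/3 ms

# One CCM session per VLAN per maintenance level, all at the fastest rate.
sessions = VLANS_PER_PORT * MAINTENANCE_LEVELS
worst_case_ccms_per_second = sessions * CCMS_PER_SECOND_PER_SESSION

print(f"{sessions} sessions, {worst_case_ccms_per_second} CCMs per second")
```

That works out to 32,752 sessions and 9,825,600 CCMs per second, which rounds to the 9.8 million quoted above.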
Another operational issue related to Ethernet OAM is MEP and MIP provisioning and discovery. A MEP must be provisioned with information about its peer MEPs, although this information can potentially be discovered: MEPs can proactively discover other MEPs through CCM messages.
ITU-T has defined a multicast loopback, which can be used to discover other MEPs on an on-demand basis. MIPs can be discovered by using linktrace.
Another administrative issue is negotiation, agreement, and provisioning of ME Levels across customer, provider, and operator. An associated issue with MIPs and multiple administrative levels is this question: will service providers support customer MIP functions within their network?

Fault verification is accomplished by using loopback messages. The principal operational issue is MEP knowledge of remote MEP/MIP addresses. Fault isolation can be addressed by using the linktrace message.
The main operational issue for linktrace is Ethernet MAC address learning and aging. When there is a network fault, the MAC address of a target node can age out within minutes (typically five minutes).
Solutions are to launch linktrace within the age out time or to maintain a separate target MEP database at intermediate MIPs. However, this requires a MIP CCM database.

 

In my next blog I will talk about SOAM-PM to represent the performance attributes defined by MEF 10.2