During basic monitoring we saw the message below in the logs, indicating that sockets were being reset, and the managed server statuses shown on the admin console were outdated.
[weblogic.rjvm.PeerGoneException: No message was received for: '240' seconds] [ServiceException]>
On further investigation we found that the cause was a network failure between the machines and the managed servers. These connectivity issues left the cluster in an inconsistent state.
The AdminMBean (runtime bean) polls the members of the cluster periodically. The polling failed because of the PeerGoneException.
This exception usually occurs because of a network failure. The exceptions are thrown when the members of the cluster can no longer synchronize with each other.
Example exception stack trace:
Unexpected Exception - with nested exception: [weblogic.rjvm.PeerGoneException: No message was received for: '240' seconds] [ServiceException]> [weblogic.rjvm.PeerGoneException: No message was received for: '240' seconds] [ServiceException]> weblogic.jms.dispatcher.DispatcherException: Could not register a HeartbeatMonitorListener for [Delegate(32547220) [[email protected] RMI:weblogic.jms.dispatcher.DispatcherImpl:0000000000000000 @ipaddress:port , class weblogic.jms.dispatcher.DispatcherImpl]] for weblogic.jms.C:SP-Scale-93241:3kx:-i at weblogic.jms.dispatcher.DispatcherWrapperState.addPeerGoneListener(DispatcherWrapperState.java:580)
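To see how often the exception shows up, a server log can be scanned for the message above. The following is only a minimal sketch: the function name `peer_gone_timeouts` is hypothetical, the log path is passed on the command line, and the pattern assumes the message text has exactly the form shown in the excerpt.

```python
import re
import sys

# Matches the message format shown in the log excerpt above (assumed to be stable).
PEER_GONE = re.compile(
    r"weblogic\.rjvm\.PeerGoneException: "
    r"No message was received for: '(\d+)' seconds"
)


def peer_gone_timeouts(lines):
    """Return (line_number, seconds) for every PeerGoneException occurrence."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # One log line can repeat the message, as in the stack trace above.
        for match in PEER_GONE.finditer(line):
            hits.append((number, int(match.group(1))))
    return hits


# Usage (hypothetical): python peer_gone.py /path/to/server.log
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as handle:
        for number, seconds in peer_gone_timeouts(handle):
            print(f"line {number}: peer silent for {seconds} seconds")
```

Many hits clustered in a short time window point more towards a network outage than towards a single misbehaving server.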
To fix this kind of issue, make sure every network problem across the cluster configuration is resolved, for example by supplying valid IP addresses, DNS names and port numbers, and then restart the WebLogic Server.
To find out exactly what the network issue is, we can analyse it with tools such as the following.
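Before restarting, it can help to confirm that each host name used in the cluster configuration actually resolves from the machine in question. This is a rough sketch using only the Python standard library; `resolve_host` is a hypothetical helper name and the host names are whatever you pass on the command line.

```python
import socket
import sys


def resolve_host(name):
    """Return the sorted unique addresses a name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


# Usage (hypothetical): python resolve.py managed1.example.com managed2.example.com
if __name__ == "__main__" and len(sys.argv) > 1:
    for name in sys.argv[1:]:
        addresses = resolve_host(name)
        print(f"{name}: {', '.join(addresses) if addresses else 'does not resolve'}")
```

Running it on every machine in the cluster shows whether they all resolve a given name to the same address.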
mtr, tcpdump and tshark: useful for seeing what is happening on the network.
Netcat: tests whether TCP services are listening.
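Where Netcat is not installed, the same kind of listening check can be approximated with a few lines of Python. This is a sketch, not a replacement for the tools above: `tcp_port_open` is a hypothetical helper, and the host and port pairs are assumed to be the listen addresses of your admin and managed servers.

```python
import socket
import sys


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Usage (hypothetical): python portcheck.py adminhost:7001 managed1:8001
if __name__ == "__main__" and len(sys.argv) > 1:
    for target in sys.argv[1:]:
        host, _, port = target.rpartition(":")
        state = "listening" if tcp_port_open(host, int(port)) else "unreachable"
        print(f"{target}: {state}")
```

A port that is reachable from one cluster machine but not from another suggests a firewall or routing problem rather than a stopped server.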