Troubleshooting High CPU Utilization and Poor Application Server Throughput
The first step in resolving this problem is to identify the root cause of the high CPU utilization. Consider the following observations and recommendations:
– Most likely the problem will reside in the application itself, so a good starting point is to profile the application code to determine which areas of the application are consuming excessive processor resources. Once identified, these heavyweight operations or subsystems can be optimized or removed to reduce CPU utilization.
– Profile the garbage collection activity of the application. This can be accomplished using application profiling tools or starting your application with the -verbose:gc option set. If the application is spending more than 25 percent of its time performing garbage collection, there may be an issue with the number of temporary objects that the application is creating. Reducing the number of temporary objects should reduce garbage collection and CPU utilization substantially.
– Refer to tuning resources available from Oracle to make sure the application server is tuned properly.
– If the application and application server are already well tuned, add hardware capacity to meet the throughput requirements.
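The temporary-object point above can be illustrated with a short sketch (class and method names are illustrative, not from the text). The first method floods the collector with short-lived objects by concatenating strings in a loop; the second reuses a single buffer. Running either version with the -verbose:gc option makes the difference in collection activity visible.

```java
// Illustrative only: contrasts a pattern that creates many short-lived
// temporary objects with one that reuses a single buffer.
public class TempObjectDemo {

    // Each += on a String allocates a new String (plus an intermediate
    // StringBuilder), producing thousands of temporaries for the collector.
    static String concatWithStrings(int n) {
        String s = "";
        for (int i = 0; i < n; i++) {
            s += i;                      // allocates new objects every iteration
        }
        return s;
    }

    // Reusing one StringBuilder grows a single backing buffer in place,
    // so far fewer temporary objects ever reach the garbage collector.
    static String concatWithBuilder(int n) {
        StringBuilder sb = new StringBuilder(n * 4);
        for (int i = 0; i < n; i++) {
            sb.append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = concatWithStrings(1000);
        String b = concatWithBuilder(1000);
        System.out.println(a.equals(b)); // same result, far less garbage
    }
}
```

Both methods produce the same output; only the allocation behavior differs, which is exactly the kind of change that reduces garbage-collection overhead without altering application behavior.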
Troubleshooting Low CPU Utilization and Poor Application Server Throughput
This problem can result from bottlenecks or inefficiencies upstream, downstream, or within the application server. Correct the problem by walking through a process similar to the following:
1. Verify that the application server itself is functioning normally using the weblogic.Admin command-line administration tool to request a GETSTATE and a series of PING operations. Chapter 12 walked through the use of this tool and the various command-line options and parameters available. Because the GETSTATE and PING operations flow through the normal execute queue in the application server, good response times are an indication that all is well within the server. Poor response times indicate potential problems requiring additional analysis.
2. If the GETSTATE operation reports a healthy state but the PING operations are slow, check to see whether the execute queue is backed up by viewing the queue depth in the WebLogic Console.
3. A backed-up execute queue may indicate that the server is starved for execute threads. If all execute threads are active and CPU utilization is low, adding execute threads should improve throughput; check the relevant work managers for any maximum threads constraints and raise them, as appropriate.
4. If the queue appears starved but adding execute threads does not improve performance, there may be resource contention. Because CPU utilization is low, the threads are probably spending much of their time waiting for some resource, quite often a database connection. Use the JDBC monitoring facilities in the console to check for high levels of waiters or long wait times. Adding connections to the JDBC connection pool may be all that is required to fix the problem.
5. If database connections are not the problem, take periodic thread dumps of the JVM to determine whether the threads are routinely waiting for a particular resource. Take a series of four thread dumps about 5 to 10 seconds apart, and compare them with one another to determine whether individual threads are stuck, or are waiting on the same resource long enough to appear in multiple dumps. The problem threads may be waiting on a resource held by another thread, or may be waiting to update the same table in the database. The JRockit Latency Analyzer can identify resource contention issues like these without resorting to thread dumps and other types of monitoring. Once the contended resource is identified, you can apply the proper remedy to fix the problem.
6. If the application server is not the bottleneck, the cause is most likely upstream of the server, perhaps in the network or web server. Use the system monitoring tools you have in place to check all of the potential bottlenecks upstream of the application server and troubleshoot these components.
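The thread-dump comparison in step 5 can also be performed programmatically. The sketch below (class and method names are illustrative) uses the standard java.lang.management.ThreadMXBean API, not any WebLogic-specific tooling, to list threads blocked on a monitor, which is essentially what you look for when comparing successive thread dumps.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class ThreadDumpDemo {

    // Returns descriptions of threads currently blocked waiting for a monitor,
    // a programmatic equivalent of scanning a thread dump for contention.
    static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> blocked = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                blocked.add(info.getThreadName() + " waiting on " + info.getLockName());
            }
        }
        return blocked;
    }

    public static void main(String[] args) throws Exception {
        final Object lock = new Object();
        Thread holder = new Thread(() -> {
            synchronized (lock) {        // hold the lock long enough to cause contention
                try { Thread.sleep(2000); } catch (InterruptedException ignored) { }
            }
        }, "holder");
        Thread waiter = new Thread(() -> {
            synchronized (lock) { }      // blocks until holder releases the lock
        }, "waiter");
        holder.start();
        Thread.sleep(200);               // let holder acquire the lock first
        waiter.start();
        Thread.sleep(200);               // let waiter block on the monitor
        System.out.println(blockedThreads()); // the waiter thread typically appears here
        holder.join();
        waiter.join();
    }
}
```

A thread that shows up in this list across several samples taken 5 to 10 seconds apart is the same signal as a thread stuck on one resource in multiple thread dumps.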
Troubleshooting Low Activity and CPU Utilization on All Physical Components with Slow Throughput
If CPU utilization stays low even when user load on the system is increasing, you should look at the following:
1. Is there any asynchronous messaging in the system? If the system employs asynchronous messaging, check the message queues to make sure they are not backing up. If the queues are backing up and there are no message ordering requirements, try adding more dispatcher threads to increase throughput of the queue.
2. Check to see if the web servers or application servers are thread starved. If they are, increase the number of server processes or server threads to increase parallelism.
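As a rough model of the first point above, the following sketch (class and method names are hypothetical) pre-loads a queue to simulate a backed-up destination and drains it with a configurable number of dispatcher threads. Note that multiple dispatchers sacrifice ordering, which is why the text restricts this remedy to queues with no message ordering requirements.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DispatcherDemo {

    // Pre-loads 'messages' items into a queue (a backed-up destination) and
    // drains it with 'dispatchers' consumer threads; returns the count processed.
    static int drain(int messages, int dispatchers) {
        LinkedBlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < messages; i++) {
            queue.add(i);
        }
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(dispatchers);
        for (int i = 0; i < dispatchers; i++) {
            pool.submit(() -> {
                while (queue.poll() != null) {   // take the next message, if any
                    processed.incrementAndGet(); // simulate handling the message
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(drain(1000, 4)); // every message is consumed exactly once
    }
}
```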
Troubleshooting Slow Response Time from the Client and Low Database Usage
These symptoms are usually caused by a bottleneck upstream of the database, perhaps in the JDBC connection pooling. Monitor the active JDBC connections in the WebLogic Console and watch for excessive waiters and wait times; increase the pool size, if necessary. If the pool is not the problem, there must be some other resource used by the application that is introducing latency or causing threads to wait. Often, periodic thread dumps can reveal what the resource might be.
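The connection-pool behavior described above can be modeled with a semaphore standing in for the pool (this is an illustration, not the WebLogic JDBC API; all names are hypothetical). When concurrent demand exceeds the pool size, threads queue up as waiters, and growing the pool to match demand eliminates the waits, mirroring what the console's waiter counts tell you.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolContentionDemo {

    // Models a connection pool as a semaphore with 'size' permits and returns
    // how many of 'clients' concurrent threads had to wait for a connection.
    static int runWorkload(int size, int clients) {
        Semaphore pool = new Semaphore(size, true);
        CountDownLatch startGate = new CountDownLatch(1);
        AtomicInteger waiters = new AtomicInteger();
        ExecutorService exec = Executors.newFixedThreadPool(clients);
        for (int i = 0; i < clients; i++) {
            exec.submit(() -> {
                try {
                    startGate.await();          // all clients start together
                    if (!pool.tryAcquire()) {   // no connection free: record a waiter
                        waiters.incrementAndGet();
                        pool.acquire();         // block until a connection frees up
                    }
                    Thread.sleep(50);           // simulate a query using the connection
                    pool.release();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        startGate.countDown();
        exec.shutdown();
        try {
            exec.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return waiters.get();
    }

    public static void main(String[] args) {
        System.out.println(runWorkload(2, 10) > 0);   // undersized pool: threads wait
        System.out.println(runWorkload(10, 10) == 0); // pool sized to demand: no waiting
    }
}
```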
Troubleshooting Erratic Response Times and CPU Utilization on the Application Server
Throughput and CPU will always vary to some extent during normal operation, but large, visible swings indicate a problem. First look at the CPU utilization, and determine whether there are any patterns in the CPU variations. Two patterns are common:
– CPU utilization peaks or patterns coincide with garbage collection. If your application is running on a multiple CPU machine with only one application server, you are most likely experiencing the effects of non-parallelized garbage collection in the application server. Depending on your JVM settings, garbage collection may be causing all other threads inside the JVM to block, preventing all other processing. In addition, many garbage collectors use a single thread to do their work so that all of the work is done by a single CPU, leaving the other processors idle until the collection is complete. Try using one of the parallel collectors or deploying multiple application servers on each machine to alleviate this problem and use server resources more efficiently. The threads in an application server not performing the garbage collection will be scheduled on processors left idle by the server performing collection, yielding a more constant throughput and more efficient CPU utilization. Also consider tuning the JVM options to optimize heap usage and improve garbage collection using techniques described earlier in this chapter.
– CPU peaks on one component coincide with valleys on an adjacent component. You should also observe a similar oscillating pattern in the application server throughput. This behavior results from a bottleneck that is either upstream or downstream from the application server. By analyzing the potential bottlenecks being monitored on the various upstream and downstream components you should be able to pinpoint the problem. Experience has shown that firewalls, database servers, and web servers are most likely to cause this kind of oscillation in CPU and throughput. Also, make sure the file descriptor table is large enough on all Unix servers in the environment.
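For the garbage-collection pattern above, the exact remedy depends on the JVM in use. As one example, a multithreaded collector can be selected on the HotSpot JVM as shown below; MyServerClass is a placeholder, and flag names differ between JVM vendors and versions, so consult your JVM's documentation.

```shell
# HotSpot: use the multithreaded (parallel) young-generation collector and
# print GC activity so its cost can be observed
java -verbose:gc -XX:+UseParallelGC -Xms512m -Xmx512m MyServerClass

# JRockit-era equivalent for selecting a parallel collector
java -verbose:gc -Xgc:parallel MyServerClass
```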
Troubleshooting Performance Degrading with High Disk I/O
If a high disk I/O rate is observed on the application server machine, the most likely culprit will be excessive logging. Make sure that WebLogic Server is set to the proper logging level, and check to see that the application is not making excessive System.out.println() or other logging method calls.
Because System.out.println() synchronizes on the shared output stream for the duration of the disk I/O, such statements should not be used for logging. Unexpected disk I/O on the server may also be a sign that the application is writing large numbers of error messages to the logs; review the application server logs to determine whether there is a problem with the application.
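As a sketch of the alternative to System.out.println() (class and method names are illustrative), the standard java.util.logging API lets debug output be disabled by configuration, and a level guard avoids even building the message when logging is off:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggingDemo {

    static final Logger log = Logger.getLogger(LoggingDemo.class.getName());

    static String handleRequest(String user) {
        // Unlike System.out.println(), a logger's debug output can be turned
        // off in production by configuration; the guard below also skips the
        // cost of building the message string when FINE logging is disabled.
        if (log.isLoggable(Level.FINE)) {
            log.fine("handling request for " + user);
        }
        return "OK:" + user;
    }

    public static void main(String[] args) {
        // With the default JDK configuration the root level is INFO, so the
        // FINE message above is neither built nor written to disk.
        System.out.println(log.isLoggable(Level.FINE));
        System.out.println(handleRequest("alice"));
    }
}
```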