Monitoring Lock Contention on Solaris
Many Java applications that do not scale suffer from jvm lock contention. Identifying that lock contention in Java applications can be difficult and the tools to identify lock contention are limited.
In addition, optimizations have been made in modern JVMs to improve the performance of applications that experience lock contention. For example, in Java 5, optimizations were integrated into the Java HotSpot VM (also referred to HotSpot
VM hereafter) to implement much of the locking logic, the artifact resulting from Java synchronized methods and synchronized blocks, in user code rather than relying immediately on operating system lock primitives. Prior to Java 5, the HotSpot VM delegated almost all of the locking logic to operating system locking primitives.
This allowed for operating system tools such as Solaris mpstat to easily monitor a Java application for lock contention by observing the “smtx” (spin on mutex) column along with observing system or kernel CPU utilization. As a result of the Java 5 HotSpot VM optimizations to implement much of locking logic in user code, using Solaris mpstat and observing the “smtx” column and “sys” CPU utilization columns no longer work as well. Instead, an alternative approach is needed.
A high level simplistic description of the lock optimization added to Java 5 Hot- Spot VMs and later is given as follows; spin in a tight loop trying to acquire a lock, if not successful after a number of tight loop spins, park the thread and wait to be notified when to try acquiring the lock again. The act of parking a thread along with awaking a thread results in an operating system voluntary context switch. Hence, an application experiencing heavy lock contention exhibits a high number of voluntary context switches. The cost of a voluntary context switch at a processor clock cycle level is an expensive operation, generally upwards of about 80,000 clock cycles.
Context switching can be monitored on Solaris with mpstat by observing the “csw” column. The value reported by the “csw” column in mpstat is the total number of context switches including involuntary context switches. Involuntary context switching is also reported in mpstat in the “icsw” column. Hence, the number of voluntary context switches is the “csw” minus “icsw.” A general rule to follow is that any Java application experiencing 5% or more of its available clock cycles in voluntary context switches is likely to be suffering from lock contention. Even a 3% to 5% level is worthy of further investigation. An estimate of the number of clock cycles spent in voluntary context switching can be calculated by taking the number of thread context switches (csw) observed in an mpstat interval, minus the involuntary context switches observed in an mpstat interval, (icsw), multiplying that number by 80,000 (the general cost of a context switch in number clock cycles), and dividing it by the total number of clock cycles available in the mpstat interval.
To illustrate with an example, the following Solaris mpstat output captured at a 5 second interval from a 3.0GHz dual core Intel Xeon CPU executing a Java application shows context switches (csw) at about 8100 per 5 second interval and involuntary context switches (icsw) at about 100 per 5 second interval.
An estimate of the number of clock cycles wasted due to voluntary context switches is roughly (8100 – 100) * 80,000 = 640,000,000 clock cycles. The number of clock cycles available in a 5 second interval is 3,000,000,0001 * 5 = 15,000,000,000. Hence, 640,000,000 / 15,000,000,000 = 4.27%. About 4.27% of the available clock cycles are consumed in voluntary context switches. Based on the general rule of a Java application spending 3% to 5% or more of available clock cycles in voluntary clock cycles implies this Java application is suffering from lock contention. This lock contention is likely coming from areas where multiple threads are trying to access the same synchronized method or synchronized block of code, or a block of code that is guarded by a Java locking construct such as a java.util.concurrent.locks.Lock.
Monitoring Lock Contention on Linux
It is possible to monitor lock contention by observing thread context switches in Linux with the pidstat command from the sysstat package. However, for pidstat to report context switching activity, a Linux kernel version of 2.6.23 or later is required. The use of pidstat -w reports voluntary context switches in a “cswch/s” column. It is important to notice that Linux pidstat -w reports voluntary context switches, not a sum of all context switches that Solaris mpstat reports. Additionally, Linux pidstat -w reports the number of voluntary context switches per second, not per measurement interval like Solaris mpstat. Therefore, the estimate of the percentage of clock cycles wasted on voluntary context switching is calculated as the number of pidstat -w voluntary context switches divided by the number of virtual processors. Remember that pidstat -w reports voluntary context switches for all virtual processors. As a result, the number of voluntary context switches
times 80,000 divided by the number of clock cycles per second of the CPU provides the percentage of CPU clock cycles spent in voluntary context switches. The following is an example from pidstat -w monitoring a Java application having a process id of 9391 reporting results every 5 seconds.
$ pidstat -w -I -p 9391 5 Linux 2.6.24-server (payton) 07/10/2008 08:57:19 AM PID cswch/s nvcswch/s Command 08:57:26 AM 9391 3645 322 java 08:57:31 AM 9391 3512 292 java 08:57:36 AM 9391 3499 310 java
To estimate the percentage of clock cycles wasted on context switching, there are about 3500 context switches per second occurring on the system being monitored with pidstat -w, a 3.0GHz dual core Intel CPU. Hence, 3500 divided by 2, the number of virtual processors = 1750. 1750 * 80,000 = 140,000,000. The number of clock cycles in 1 second on a 3.0GHz processor is 3,000,000,000. Thus, the percentage of clock cycles wasted on context switches is 140,000,000/3,000,000,000 = 4.7%. Again applying the general guideline of 3% to 5% of clock cycles spent in voluntary context switches implies a Java application that may be suffering from lock contention.
Monitoring Lock Contention on Windows
On Windows, in contrast to Solaris and Linux, observing Java lock contention using built-in operating system tools is more difficult. Windows operating systems that include the Performance Monitor and typeperf have the capability to monitor context switches. But the capability to distinguish between voluntary and involuntary context switching is not available via a performance counter. To monitor Java lock contention on Windows, tools outside the operating system are often used, such as Intel VTune or AMD CodeAnalyst. Both of these tools have Java lock profiling capabilities along with capabilities to monitor other performance statistics and CPU performance counters.
Isolating Hot Locks
Tracing down the location in Java source code of contended locks has historically been a challenge. A common practice to find contended locks in a Java application has been to periodically take thread dumps and look for threads that tend to be blocked on the same lock across several thread dumps.
Oracle Solaris Studio Performance Analyzer, which is available for Linux and Solaris, is one of the best tools the authors have used to isolate and report on Java lock contention. Using Performance Analyzer to find contended locks in a Java application.
Other profilers can identify contended locks on Windows. Profilers that are similar in functionality to the Oracle Solaris Studio Performance Analyzer are Intel VTune and AMD CodeAnalyst.
Monitoring Involuntary Context Switches
Involuntary context switching was mentioned earlier but not explained in any detail, or how it differs from voluntary context switching. In contrast to voluntary context switching where an executing thread voluntarily takes itself off the CPU, involuntary thread context switches occur when a thread is taken off the CPU as a result of an expiring time quantum or has been preempted by a higher priority thread. Involuntary context switches can be monitored with Solaris mpstat by observing the “icsw” column.
Involuntary context switching can also be observed using Solaris prstat -m.
High involuntary context switches are an indication there are more threads ready to run than there are virtual processors available to run them. As a result it is common to observe a high run queue depth in vmstat, high CPU utilization, and a high number of migrations (migrations are the next topic in this section) in conjunction with a large number of involuntary context switches. Strategies to reduce involuntary context switches include using the Solaris command psrset to create processor sets for systems running multiple applications and assigning applications to specific processor sets, or reducing the number of application threads being run on the system.
An alternative strategy, and usually less effective, is profiling the application to identify areas of the application where you can reduce CPU usage by using improved algorithms so they consume fewer CPU cycles.
Involuntary context switches can also be monitored on Linux using pidstat -w.
But as mentioned earlier, pidstat -w reporting of involuntary context switching requires Linux kernel 2.6.23 or later. On Linux, creation of processor sets and assigning applications to those processor sets can be accomplished using the Linux taskset command. See your Linux distribution’s documentation for details on how to use Linux taskset.
On Windows systems, applications can be assigned to a processor or set of processors by using Task Manager’s Process tab. Select a target process, right-click, and select Set Affinity. Then choose the processors the selected process should execute on. On Windows Server operating systems, Windows Vista and Windows 7, an application can be launched from the command line with start /affinity <affinity mask>, where <affinity mask> is the processor affinity mask in hexadecimal. See the Windows operating system’s documentation for the use of start command and affinity mask.
Monitoring Thread Migrations
Migration of ready-to-run threads between processors can also be a source of observed performance degradation. Most operating systems’ CPU schedulers attempt to keep a ready-to-run thread on the same virtual processor it last executed. If that same virtual processor is busy, the scheduler may migrate that ready-to-run thread to some other available virtual processor. Migration of threads can impact an application’s performance since data, or state information, used by a ready-to-run thread may not be readily available in a virtual processor’s cache. On Solaris you can use mpstat and observe the “migr” column to monitor whether thread migrations are an issue to a Java application’s performance. If you are running a Java application on a multicore system and observing a high number of migrations, a strategy to reduce thread migrations is creating processor sets and assigning Java applications to those processor sets. As a general guideline, Java applications scaling across multiple cores or virtual processors and observing migrations greater than 500 per second could benefit from binding Java applications to processor sets. In extreme cases, the Solaris kernel tunable rechoose_interval can be increased as a means to reduce thread migrations. The former, creating processor sets, is the preferred strategy, and the latter, tuning the kernel, should be considered only as a secondary approach.