The IBM J9 JVM for Java 6, which is shipped with WAS 7 and other IBM software and is supplied for use on IBM hardware, is based on the third generation of the J9 JVM that was introduced with the Java 5 release. The core code is still derived from a licensed Sun JVM implementation that IBM enhances and builds upon for its own products.
For the Java 5 release, the product was rewritten to incorporate features from a traditional optimizing compiler, to support pluggable features and libraries suited to different types of platforms, to have enhanced type safe and generational garbage collection features, and to support class sharing of read-only immutable optimized class information. The Java 6 release essentially tuned the environment more in terms of improved optimization, further garbage collection enhancements, and improved class sharing.
The original features of the Java J9 JVM for Java 5 were:
• Shared classes
• Type safe and generational garbage collection
• Asynchronous, queued compilation in a background thread
• Five levels of code optimization with continuous sampling and profiling
• Support for specific hardware optimizations such as large page sizes
• Support for Reliability, Availability, and Serviceability (RAS) enhancements
• Common code base for all Java platform editions with pluggable interfaces
• Class library independence
• Java 5 language features (autoboxing, generics, enumerations, annotations, for each loop, library enhancements for concurrency and management, JVM Tool Interface [JVMTI])
The Java 6 release adds the following features:
• The use of some Apache Harmony code in place of some Sun code, which greatly improves performance in some areas (e.g., TreeMap).
• Compressed references (i.e., use offsets rather than full 64-bit addresses to improve memory footprint and performance)
• Incremental improvements to garbage collection and class sharing
• Exploitation of new hardware features on POWER and z/Architecture platforms
What Does a Java Virtual Machine Do?
In its purest sense a JVM simply runs an instruction set for a machine that doesn’t exist in hardware but rather in software. Essentially, it needs to do all of the things that a real machine, with its processor and operating system, does. So, it must
• Handle multitasking/multithreading
• Allocate and manage memory
• Load and free applications and classes
• Verify the classes loaded will not damage the virtual machine or other classes
• Secure the environment to ensure the code can do only what it is authorized to do
• Manage I/O
• Run the instructions of the given instruction set
The definition of what the JVM has to support in terms of instructions and behavior is provided in the Java Virtual Machine Specification available from Sun, and conformance can be verified with an extensive test suite. The general architecture of a generic JVM is outlined in the figure below.
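To make the idea of a machine that exists only in software concrete, the following is a minimal sketch of a stack-based interpreter. The four opcodes (PUSH, ADD, MUL, HALT) are invented for illustration and are not real JVM bytecodes; the sketch shows only the fetch-decode-execute loop and leaves out everything else a real JVM must do, such as class loading, verification, and memory management.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy stack machine. The opcodes are hypothetical and are not real JVM bytecodes. */
final class ToyStackMachine {
    static final int PUSH = 0; // push the next program word onto the operand stack
    static final int ADD  = 1; // pop two values, push their sum
    static final int MUL  = 2; // pop two values, push their product
    static final int HALT = 3; // stop and return the value on top of the stack

    /** Runs the program with a fetch-decode-execute loop and returns the final result. */
    static int run(int[] program) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0; // program counter
        while (true) {
            int opcode = program[pc++];
            switch (opcode) {
                case PUSH:
                    stack.push(program[pc++]);
                    break;
                case ADD:
                    stack.push(stack.pop() + stack.pop());
                    break;
                case MUL:
                    stack.push(stack.pop() * stack.pop());
                    break;
                case HALT:
                    return stack.pop();
                default:
                    throw new IllegalStateException("Unknown opcode " + opcode);
            }
        }
    }

    public static void main(String[] args) {
        // Encodes the expression (2 + 3) * 4.
        int[] program = {PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT};
        System.out.println(run(program)); // prints 20
    }
}
```

The real JVM instruction set is also stack based, which is why a loop of this shape is the usual mental model for the byte code interpreter.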
Inside the IBM J9 JVM for Java 6
The IBM J9 JVM is derived from the Sun JVM source code, with a considerable number of IBM extensions added. It supports various sizes and implementations of JVM for different platforms (MIDP up to Java SE) with a common code base by having some components be “pluggable”; that is, they are loaded dynamically into the JVM to provide additional functionality. The JVM Profiler, JIT Compiler, Debugger, and Realtime Profiler are all pluggable subsystems, with the JVM Profiler and JIT Compiler being essential for normal WAS operation. To support different runtime environments, the JVM also supports pluggable Java class libraries.
The IBM J9 JVM has a number of important subsystems, as shown in the figure below. When the JVM is set up for use by WAS, the library required is always Java SE 6 (which uses the Apache Harmony TreeMap implementation that greatly improves performance), and some of the pluggable subsystems can always be assumed to be present.
At the core of the J9 implementation is the Virtual Machine Services subsystem, along with the code generators and the runtime. The JIT Compiler with the optimization facilities and the Profiling Services are pluggable components that are key for WAS. The class-sharing mechanism is external, but with the help of the memory-mapped files implementation and the com.ibm.cds.jar code with the Eclipse runtime, this provides optimized class native code sharing across JVM instances and, with the help of Virtual Partition Memory on System p/AIX with the POWER6 Hypervisor, even across LPARs and OS instances.
The Virtual Machine Services include the memory-management facilities. These include mixed pointer support, in which 32-bit addresses are used for most operations and 64-bit addresses only where necessary, to combine the performance of 32-bit addresses with the memory reach of 64-bit addressing. This is known as compressed references support.
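The idea behind compressed references can be sketched as simple arithmetic: a 64-bit address is stored as a 32-bit offset from the start of the heap. The sketch below is an illustration only, not the J9 implementation. The 8-byte object alignment (and therefore the shift of 3) and the heap base address are assumptions made for the example.

```java
/** Sketch of reference compression: a 64-bit address stored as a 32-bit scaled offset. */
final class CompressedRefSketch {
    // Assumption: objects are 8-byte aligned, so the low 3 bits of every offset are zero.
    static final int SHIFT = 3;

    /** Encodes an address as a 32-bit scaled offset from heapBase. */
    static int compress(long address, long heapBase) {
        long offset = address - heapBase;
        boolean misaligned = (offset & ((1L << SHIFT) - 1)) != 0;
        if (offset < 0 || misaligned || (offset >>> SHIFT) > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("Address cannot be compressed");
        }
        return (int) (offset >>> SHIFT);
    }

    /** Rebuilds the full 64-bit address from the 32-bit form. */
    static long decompress(int ref, long heapBase) {
        // Mask to treat the int as unsigned before scaling it back up.
        return heapBase + ((ref & 0xFFFFFFFFL) << SHIFT);
    }

    public static void main(String[] args) {
        long heapBase = 0x0000700000000000L;        // hypothetical heap start
        long address = heapBase + 0x100000008L;     // more than 4GB into the heap
        int ref = compress(address, heapBase);
        System.out.println(decompress(ref, heapBase) == address); // prints true
    }
}
```

Under these assumptions a 32-bit reference can reach 2^32 x 8 bytes = 32GB of heap while occupying half the space of a full 64-bit pointer.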
The JVM also includes the garbage collection facilities that assist the memory management. The bootstrap and shared classes class loaders and class verification are implemented within the Virtual Machine Services subsystem, as would be expected from the generic JVM architecture.
Thread and Exception management and Platform Services access are managed to support the actual runtime facilities required by the language and the mapping of the Java implementation onto the target OS environment. Finally, the Byte Code Interpreter is part of these services.
The Optimizer is part of the JIT Compiler and provides a number of optimization techniques to improve code performance, and this requires the Profiling Services support to generate the statistics to support the cost-benefit heuristics to identify the most appropriate optimizations to apply for each piece of code.
The platform code generator is platform specific and supports the generation of the native code for the platform, and on the Java 6 release this native code is also shared between WAS instances within an OS image, if the platform supports it, and across OS images using the class-sharing mechanism.
Pluggable Class Libraries
To support the needs of different environments, the IBM J9 JVM supports pluggable class libraries that target the different types of environment (i.e., Java 5 SE, Java 6 SE, CLDC, CDC, or MIDP), as shown in the figure below. These are derivatives of the Sun class libraries, but for the Java 6 version, the more optimal Apache Harmony code is used in some areas (e.g., the TreeMap class). Each of these class libraries is implemented with JNI code using the Java Class Library (JCL) natives, a set of native implementations that can be composed to support the Java class libraries.
The Port library that underpins the JCL natives is a thin native layer that isolates the use of OS-specific services and resources and handles memory access, file management, threading, IP sockets, locks, I/O, and interrupts. This protects the higher layers of the class libraries from having to know the details of the underlying platform.
Multiple independent port libraries can be simultaneously supported. The shared objects and DLLs that are installed with the JVM provide this functionality. Under the WAS installation directory, there is a java subdirectory and under this a lib directory. For the given platform, in this case Linux, the DLLs or shared objects can be found in a directory that is named in relation to the platform type. For a Linux platform, this is shown in the listing below.
IBM J9 JVM Shared Objects
71198 2 Jun 05:04 libJdbcOdbc.so
12400 2 Jun 05:04 libattach.so
592021 2 Jun 05:04 libawt.so
398182 2 Jun 05:04 libcmm.so
181185 2 Jun 05:04 libdcpr.so
89831 2 Jun 05:04 libdeploy.so
18108 2 Jun 05:04 libdt_socket.so
636345 2 Jun 05:04 libfontmanager.so
231258 2 Jun 05:04 libhprof.so
45881 2 Jun 05:04 libinstrument.so
20138 2 Jun 05:04 libioser12.so
83316 2 Jun 05:04 libiverel24.so
98406 2 Jun 05:04 libj9bcv24.so
235828 2 Jun 05:04 libj9dbg24.so
160515 2 Jun 05:04 libj9dmp24.so
149695 2 Jun 05:04 libj9dyn24.so
662303 2 Jun 05:04 libj9gc24.so
81603 2 Jun 05:04 libj9gcchk24.so
18227 2 Jun 05:04 libj9hookable24.so
6479 2 Jun 05:04 libj9jar24.so
1068918 2 Jun 05:04 libj9jextract.so
5363542 2 Jun 05:04 libj9jit24.so
347232 2 Jun 05:04 libj9jitd24.so
130975 2 Jun 05:04 libj9jnichk24.so
255521 2 Jun 05:04 libj9jvmti24.so
170749 2 Jun 05:04 libj9prt24.so
20359 2 Jun 05:04 libj9rdbi24.so
322595 2 Jun 05:04 libj9shr24.so
58518 2 Jun 05:04 libj9thr24.so
63188 2 Jun 05:04 libj9trc24.so
106925 2 Jun 05:04 libj9ute24.so
597923 2 Jun 05:04 libj9vm24.so
185345 2 Jun 05:04 libj9vrb24.so
60154 2 Jun 05:04 libj9zlib24.so
6019 2 Jun 05:04 libjaas.so
8786 2 Jun 05:04 libjaasauth.so
161860 2 Jun 05:04 libjava.so
25123 2 Jun 05:04 libjava_crw_demo.so
80858 2 Jun 05:04 libjavaplugin_jni.so
262805 2 Jun 05:04 libjavaplugin_nscp.so
5316 2 Jun 05:04 libjawt.so
521396 2 Jun 05:04 libjclscar_24.so
275883 2 Jun 05:04 libjdwp.so
209044 2 Jun 05:04 libjpeg.so
147099 2 Jun 05:04 libjpkcs11.so
11306 2 Jun 05:04 libjsig.so
238170 2 Jun 05:04 libjsound.so
75182 2 Jun 05:04 libjsoundalsa.so
5508 2 Jun 05:04 libmanagement.so
833884 2 Jun 05:04 libmlib_image.so
99352 2 Jun 05:04 libnet.so
41461 2 Jun 05:04 libnio.so
12840 2 Jun 05:04 libnpt.so
4835 2 Jun 05:04 librmi.so
268767 2 Jun 05:04 libsplashscreen.so
132819 2 Jun 05:04 libunpack.so
77080 2 Jun 05:04 libzip.so
With the third-generation J9 JVM, the features of an optimizing compiler are used, but at runtime in the JVM. Byte code may run interpreted on the JVM, but in most cases commonly used code is optimized by the Testarossa JIT compiler. Traditionally, byte code is loaded by the JVM and either interpreted on an instruction-by-instruction basis or passed to a JIT compiler on a synchronous thread for compilation. The IBM J9 JVM handles things differently. Cost-based analysis is used to assess whether a class should be interpreted or compiled asynchronously on a background thread by the Testarossa JIT compiler and what level of optimization is worth performing. Optimization is expensive on CPU and memory resources, so heuristics determine the expense of each level of optimization for a particular piece of code, and this is compared with the time taken to execute the code at each level of optimization and an assessment as to how often it is likely to be run based on the previous number of executions.
There are five levels of execution, with interpreted being the lowest and slowest, and native code compilation and multilevel optimization making up the other four levels. The JVM identifies the potential benefits of optimization for a method by assigning it a temperature level from cold (poorly optimized), through hot, to scorching.
During code execution, the JVM profiles each class to gather the statistics, using a combination of methods. The interpreted code is profiled using JVM interpreter sampling of execution time. The JIT compiler also performs sampling. The JVM may even insert profiling hooks into the code itself.
The JVM, and thus WAS running on it, is self-optimizing, and the applications running on it may get faster over time as the statistics of application usage get ever more detailed. This is attributable to a process known as dynamic recompilation with runtime execution profiling where heavily used classes and methods are identified, assessed for the benefits of additional optimization, and then recompiled with further optimizations on a background thread with the resulting code transactionally replacing the code in the native compiled code pool.
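The self-optimizing behavior can be observed informally by timing the same method over several rounds. The sketch below is a rough illustration rather than a rigorous benchmark. The class and method names are invented for the example, and the timings depend entirely on the JVM, its options, and the hardware, so no particular numbers should be expected.

```java
/** Informal warm-up demonstration: times repeated rounds of calls to one hot method. */
final class WarmUpSketch {
    /** A small CPU-bound method that becomes a compilation candidate when called often. */
    static long checksum(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += ((long) i * 31) % 7;
        }
        return sum;
    }

    /** Calls checksum(n) the given number of times and returns the elapsed nanoseconds. */
    static long timeRound(int calls, int n) {
        long start = System.nanoTime();
        long sink = 0;
        for (int c = 0; c < calls; c++) {
            sink += checksum(n);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == -1) {
            System.out.println("unreachable"); // keeps the result live so the loop is not discarded
        }
        return elapsed;
    }

    public static void main(String[] args) {
        for (int round = 1; round <= 10; round++) {
            long nanos = timeRound(2_000, 10_000);
            System.out.println("round " + round + ": " + (nanos / 1_000) + " microseconds");
        }
    }
}
```

On a JVM with a JIT compiler, later rounds are typically faster than the first ones, because the method starts out interpreted and is recompiled in the background once it has been identified as hot.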
Every class and method starts out as interpreted code running directly on the JVM at level 1 optimization. The heuristics are applied to the statistics gathered as the code executes to assess the benefits of optimization, and this may result in commonly executed code being passed on a background thread to the Testarossa JIT compiler for optimization to cold or warm levels to replace the interpreted code execution. In the background, the sampling thread executes to identify hot methods that could benefit from further optimization. Hot methods may be considered worthy of profiling to gain statistics for a cost-benefit analysis of further scorching optimization, so for these, profiling hooks are inserted. Both the JVM interpreter and the JIT compiler perform profiling and analysis of methods to get the statistics used for optimization.
So, the five levels of optimization are interpreted, cold compiled, hot compiled, hot compiled and profiling, and scorching. This can best be understood with an example. If a method takes 0.2 second to execute and could be optimized to execute in 0.1 second if 0.2 second of effort was put into optimizing it, there is no benefit for the resource cost if the code is only executed once. However, if the code is executed more than twice, the benefits outweigh the cost because the optimization takes place only once but the benefits are realized on every execution.
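The arithmetic in this example can be written out as a small cost model. This is a sketch of the reasoning only. The real heuristics are internal to the JIT compiler, and the method names here are invented for illustration. Times are in milliseconds to keep the arithmetic exact.

```java
/** Cost-benefit sketch for the example above: 200 ms per run, 100 ms once optimized, 200 ms to optimize. */
final class BreakEvenSketch {
    /** Total time for the given number of runs if the code is left as it is. */
    static long costWithout(long perRunMs, int runs) {
        return perRunMs * runs;
    }

    /** Total time if the one-off optimization cost is paid and every run uses the faster code. */
    static long costWith(long optimizeMs, long optimizedPerRunMs, int runs) {
        return optimizeMs + optimizedPerRunMs * runs;
    }

    /** True only when optimizing gives a strictly lower total time. */
    static boolean worthOptimizing(long perRunMs, long optimizedPerRunMs, long optimizeMs, int runs) {
        return costWith(optimizeMs, optimizedPerRunMs, runs) < costWithout(perRunMs, runs);
    }

    public static void main(String[] args) {
        for (int runs = 1; runs <= 4; runs++) {
            System.out.println(runs + " run(s): worth optimizing = "
                    + worthOptimizing(200, 100, 200, runs));
        }
        // 1 run:  300 ms vs 200 ms -> false
        // 2 runs: 400 ms vs 400 ms -> false (break-even)
        // 3 runs: 500 ms vs 600 ms -> true
        // 4 runs: 600 ms vs 800 ms -> true
    }
}
```

Two executions is the break-even point, which is why the benefit appears only when the code is executed more than twice.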
The IBM J9 JIT compiler performs a large number of optimizations. Its functionality is similar to that of the IBM C/C++ optimizing compilers, and in many cases the dynamic nature of its operation and runtime environment leads to the code it produces running faster on the J9 JVM than if equivalent code was compiled and run natively.
When the Testarossa JIT compiler operates, it takes byte code and converts it to an intermediate language (IL) tree form at a level between that of machine code and byte code, and this level facilitates control flow graph analysis and optimization. The optimizations performed are outlined in the table below.
Inlining – The call to a method is replaced by the method body itself. This saves the setup of the call stack, but adds to code size in memory so is only for relatively small methods. Cache sizes affect the performance impact in that the bigger code sequence must fit in the cache.
Cold block outlining – Code that isn’t used often is moved out of the normal code sequence to reduce gaps in the instruction stream to make better use of the cache and to avoid jump instructions. Often the code moved out is for error or exception handling.
Value propagation – Variables and their values are tracked and the variables are replaced by the constant values, where it makes sense, to avoid extra memory/stack references.
Block hoisting – Code that is invariant and that need not be executed repeatedly in a loop is taken outside of the loop.
Loop unroller – The number of iterations of a loop is reduced by duplicating the body of the loop sequentially within it.
Asynchronous check removal – Java insists on checks on code to maintain its integrity. Those checks that are unnecessary because they have already been performed elsewhere in the code path are removed.
Copy propagation – Occurrences of direct assignment targets are replaced with their values.
Loop versioning – Array index checking is expensive, particularly when repeated in a loop, so this creates two versions of the loop: one that keeps the array bounds checks and one that omits them because the index range is known to be safe. The index range is checked on entry to the loop to decide which version of the code should be used.
Common subexpression elimination – Instances of identical expressions or subexpressions that equate to the same value are replaced with a variable and a single subexpression calculation. This sort of optimization is often used in looping.
Partial redundancy elimination – Partially redundant code exists in only some paths through a program, so the aim of partial redundancy elimination is to place the expression on all paths through the program and compensate for it to make it fully redundant. It reduces access to instance variables and moves invariant code out of loops.
Optimal store placement – Heuristics are applied to determine what variables should be cached in registers from the stack and local variables and what can be left in memory.
Simplifier – Complex expressions and code are replaced by simpler and faster equivalents.
Escape analysis – The scope of references is checked (i.e., whether they get passed to other methods) and any optimizations or simplifications are performed based on the results.
Dead tree removal – Code is compiled by building expression trees that are closer to machine code representations. Some trees are unreachable or have no effect on the behavior of the code and thus are removed.
Switch analysis – Branch conditions are optimized and combined and invariant code is moved out of the conditions.
Redundant monitor elimination – The synchronized keyword puts monitors into the code to handle re-entrancy, but sometimes this is unnecessary because re-entrancy is already handled, so the monitors are removed.
Devirtualization – Virtual method calls are replaced with faster nonvirtual or static calls to avoid the overhead of late binding.
Partial inlining – Large methods are broken up into hot (frequently executed) and cold (infrequently executed) portions, which may make the hot code small enough to benefit from being inlined.
Lock coarsening – In adjacent code protected by locks (synchronized), the lock overhead can be reduced by a merged set of code with a single lock.
Register allocators – Analysis is performed to make the best use of system registers to maximize performance.
Live range reduction – Where the range of a variable is known at runtime, its value bounds can be used to perform code optimization.
Idiom recognition – Commonly used bytecode sequences are replaced by optimized code that cuts down on the overhead of stack usage.
All of these features are added to by the additional facility of supporting hot code replace (HCR), where recently modified code is loaded transactionally to replace that currently executing to assist in development, and full-speed debug (FSD).
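Two of the tabulated optimizations, block hoisting and common subexpression elimination, can be illustrated by applying them by hand to source code. The sketch below is only an analogy: the JIT compiler works on IL trees rather than Java source, and the class and method names are invented for the example.

```java
/** Hand-applied illustration of block hoisting and common subexpression elimination. */
final class HandOptimizedSketch {
    /** Straightforward version: scale * scale is evaluated twice on every iteration. */
    static long naive(int[] data, int scale) {
        long total = 0;
        for (int i = 0; i < data.length; i++) {
            total += (long) data[i] * (scale * scale) + (scale * scale);
        }
        return total;
    }

    /** Equivalent version: the invariant expression is computed once, outside the loop. */
    static long optimized(int[] data, int scale) {
        int squared = scale * scale; // common subexpression, hoisted because it never changes in the loop
        long total = 0;
        for (int i = 0; i < data.length; i++) {
            total += (long) data[i] * squared + squared;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        System.out.println(naive(data, 2));     // prints 36
        System.out.println(optimized(data, 2)); // prints 36
    }
}
```

Both methods return the same result. The benefit of letting the JIT compiler do this is that the source can stay in its most readable form.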
Class Cache and Sharing
When a class is first used, it is loaded and split into ROMClass and RAMClass information, where the ROMClass information is the immutable information pertaining to the class, such as the byte code and constants, and the RAMClass information is the local data that changes with every instance of the class. The ROMClass information has a signature associated with it that is used to detect if the class has changed on disk during its lifetime and, if so, it is reloaded.
The ROMClass information is loaded in the class cache for class sharing. The com.ibm.cds.jar plug-in for Eclipse handles integrating this into the WAS component-loading mechanism.
In the previous, Java 5 version of the third-generation J9 JVM, the class-sharing mechanism was limited by using shared memory. This effectively limited its memory usage to about 50MB on a Unix system to avoid impacting other processes, but it did substantially improve startup time and reduce memory overhead because the shared class cache was shared between JVM instances running in a single OS image. It contained the ROMClass information, as this can safely be shared across JVMs, and optimized byte code/IL code to speed up all JVMs. So, in general usage, as one WAS instance on a node gets faster, the performance of all JVMs improves. Security is provided by sharing the class cache between users in the same primary group, i.e., the user IDs under which WAS runs. The semaphores protecting concurrent access to the class cache can be identified by looking in the /tmp/javasharedresources directory on a Unix platform.
The use of shared memory for the class-sharing implementation was a limiting factor in the Java 5 release, as the 50MB practical limit was brought about because shared memory gets mapped into every process address space. For the Java 6 implementation, memory-mapped files are used, the advantage of which is that the file backing up memory-mapped files can be stored in a shared location on the physical machine and mapped into multiple OS images to allow class sharing across virtualized partitions. With the Java 6 version, the optimized native code is stored in the shared cache if available. If this is coupled with IBM features like the POWER Virtual Partition Memory facility, then common code across partitions results in a single copy of the binaries being in real memory no matter how many OS images are running and no matter how many WAS instances are running.
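The memory-mapped file mechanism itself is available to Java code through the standard java.nio API, which makes it easy to see the principle in miniature. The sketch below is not how the J9 class cache is implemented. It only shows data written through one mapping of a file being read back through a second, independent mapping of the same file. The temporary file name and the payload are invented for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Miniature demonstration of sharing data through a memory-mapped file. */
final class MappedFileSketch {
    /** Writes the payload through one mapping and reads it back through a second mapping. */
    static String roundTrip(String payload) {
        byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
        try {
            Path file = Files.createTempFile("cache-sketch", ".bin"); // hypothetical backing file
            file.toFile().deleteOnExit();

            // "Writer" side: map the file read-write and store the data in it.
            try (FileChannel channel = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer out = channel.map(FileChannel.MapMode.READ_WRITE, 0, bytes.length);
                out.put(bytes);
                out.force(); // flush the mapped pages to the backing file
            }

            // "Reader" side: a separate read-only mapping of the same file.
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                MappedByteBuffer in = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                byte[] read = new byte[in.remaining()];
                in.get(read);
                return new String(read, StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("immutable class data"));
    }
}
```

In this sketch both mappings live in one process. The point made in the text is that the same backing file can equally be mapped by other processes, and by other OS images if it is placed in a shared location.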
Garbage collection is one of the key features of the JVM that a vendor can capitalize on. There are a number of algorithms implemented that vary as to how the objects are stored on the heap and how the applications running behave with respect to object creation.
The default mark-and-sweep mechanism that is suitable for most applications is given by the optthruput setting. This has some intermingling of garbage collection and application work for good raw throughput for an average application. The application is paused each time garbage collection occurs, so some GC (garbage collection) pause must be acceptable for this option to be used. As this is the default and all objects are considered in the same way, this is the starting point for tuning.
For many applications, particularly web services–based JAX-RPC applications or those using rules engines where a large number of objects are repeatedly created and destroyed, generational collection is advised and is set by the gencon setting. In such applications some objects at the application level are long lived, but repeated code or code answering a request creates large numbers of objects in a short time that are used and then released when the request completes, the transaction commits, or the loop ends. These objects are young objects, so they should be separated from the older, long-lived objects to avoid heap space memory fragmentation issues. With this setting the nursery, or young, objects are kept in their own space that can be managed and freed via direct copying and deallocation, and a separate garbage collection phase called a minor collection takes place on this space regularly. Older objects migrate to the old generation space, and this is managed via the traditional mark-and-sweep garbage collection that covers both the young and old objects followed by a compaction. This mechanism improves performance and reduces memory fragmentation.
Some applications (but not WAS normally unless it is operating with a heap above 4GB and is running on a 64-bit JVM) benefit from the use of the optavgpause garbage collection setting, which minimizes the GC pause time at the cost of a little overall performance reduction.
For use on large multiprocessor machines where there is an overhead on contention for the heap from multiple concurrent threads running on different processors, the subpool GC policy can be set, which effectively splits the heap into multiple pools that are managed separately.
All of the preceding mechanisms are implemented with a set of algorithms for garbage collection and a set of algorithms for memory management that are handled by the JVM according to a single policy set by the -Xgcpolicy command-line switch (for example, -Xgcpolicy:gencon), which causes the appropriate algorithms to be loaded.
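One way to see the effect of a policy choice is to ask the running JVM which collectors it has loaded and how often they have run. The sketch below uses the standard java.lang.management API. The collector names it prints are JVM specific and are not assumed here, and the class name and allocation sizes are invented for the example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Reports the garbage collectors active in the running JVM and their activity so far. */
final class GcReportSketch {
    /** Allocates many short-lived 1KB arrays, the pattern that generational collection targets. */
    static long churn(int count) {
        long totalBytes = 0;
        for (int i = 0; i < count; i++) {
            byte[] shortLived = new byte[1024];
            totalBytes += shortLived.length;
        }
        return totalBytes;
    }

    /** Returns one summary line per collector: name, collection count, and accumulated time. */
    static List<String> collectorSummaries() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        churn(2_000_000);
        for (String line : collectorSummaries()) {
            System.out.println(line);
        }
    }
}
```

Running the same class under different policies, for example with -Xgcpolicy:optthruput and then with -Xgcpolicy:gencon on an IBM JVM, gives a quick view of which collectors are in use and how the collection counts differ for an allocation-heavy workload.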
Java Standard Library and Apache Harmony
I have briefly mentioned the mechanism used to implement the Java standard libraries and how the native code and port mechanisms work. One of the issues IBM has had with the JVMs is the license restriction from Sun on using and deploying the IBM J9 JVM except with another IBM product. Some of the licensed code carries an additional burden in performance. IBM has found major performance enhancements for WAS in the use of the Apache Harmony TreeMap implementation instead of the Sun version. IBM is a contributor to Apache Harmony, so the integration was not problematic.
Much of the security, JavaBeans handling, JNDI handling, logging, JDBC, and the underlying JVM shared object/DLL layer implementation is implemented using Apache Harmony code. To see this, execute grep -R harmony * in the Java directory tree for the WAS installation. The use of Apache Harmony code in WAS is likely to continue to increase in the future.