Java from the trenches: improving reliability

Java and the JVM are great things. In contrast to writing native code, making a mistake in your Java code will not (or should not) crash the virtual machine. However, in my new position working for a SaaS company I have been closer to production systems than ever before, and in the short time I have been there I have already gained a lot of experience with the JVM. Crashes and hangs do occur, but there is something we can do about them. These experiences are based on running Java 1.6 update 29 on CentOS 6.2 and RHEL 6 as well as Windows Server 2003.

Java Service Wrapper

To start off with, I would like to recommend the Java Service Wrapper. This is a great little piece of software which allows you to run Java as a service, with both a Windows and a Linux implementation. The service wrapper monitors your Java process and restarts it when it crashes, or restarts it explicitly when it appears to be hung. The documentation is excellent and it works as advertised. It has given us no problems at all, apart from having to tweak the timeout after which a Java process is considered hung.
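The hang detection works through periodic ping messages between the wrapper and the JVM, and the corresponding settings live in wrapper.conf. The values below are only an illustration, not a recommendation; the right timeout depends on how long your application can legitimately be unresponsive, for example during a long garbage collection pause:

# Ping the JVM every 5 seconds; consider it hung if no response arrives within 300 seconds.
wrapper.ping.interval=5
wrapper.ping.timeout=300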

The service wrapper writes its own log file, but we found that it also contained every log statement written by the application. The cause of this turned out to be the ConsoleHandler of java.util.logging, which was still enabled. This problem was easily solved by setting the handlers property to empty in jre/lib/logging.properties:

handlers=
#handlers= java.util.logging.ConsoleHandler

This also solved a performance problem whereby, due to a bug in the application, excessive logging was being done and the Java Service Wrapper simply could not keep up anymore.

With a default JRE logging configuration, the logging output can also be disabled by setting the following properties in the wrapper.conf file:

wrapper.syslog.loglevel=NONE
wrapper.console.loglevel=NONE
wrapper.logfile.loglevel=STATUS
wrapper.java.command.loglevel=STATUS

Of course, with the console logging turned off, it should be possible to remove the wrapper.console.loglevel setting (not tried yet).

Garbage collection

Since we would like to achieve low response times and minimize server freezes due to garbage collection, we settled on the CMS (Concurrent Mark and Sweep) garbage collector.

Using the CMS collector we found one important issue: on Windows the server would run perfectly, but on Linux it would become unresponsive after just a couple of hours of traffic. The cause was quickly found to be permgen space. It turns out that garbage collection behavior on Windows differed from Linux. In particular, garbage collection of the permgen space was being done on Windows but not on Linux. After hours and hours of searching, we found the option that fixes this behavior:

-XX:+CMSClassUnloadingEnabled

The full list of options we use for garbage collection is now as follows:

-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+CMSClassUnloadingEnabled
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-verbose:gc
-Xloggc:/var/log/gc.log

The last four options are for garbage collection logging which is useful for troubleshooting potential garbage collection issues after the fact.

One of the issues with the above configuration is that upon a restart of the JVM, the garbage collection log file is overwritten instead of appended to, thereby losing information when the JVM crashes. This problem can be worked around by using a ‘tail -F gc.log > gc.log.all’ command, but this solution is not nice as it will create very large log files. An optimal solution would be for the JVM to cooperate with standard facilities on Linux such as logrotate. Similar to how, for instance, Apache handles logging, the JVM could simply close the gc.log file when it receives a signal and then reopen it again. That would be sufficient for logrotate to work. Unfortunately, this is not yet implemented in the JVM as far as I can tell.
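In the meantime, a simple mitigation is to give each JVM run its own GC log file, so that a restart never destroys the log of the previous (possibly crashed) run. Below is a minimal sketch assuming the JVM is started from a shell script; the script name, jar name and flag list are placeholders, and with the service wrapper the same idea can be applied by generating the -Xloggc value in whatever produces wrapper.conf:

# start-server.sh: give every JVM run a unique GC log file (hypothetical example)
GC_LOG=/var/log/gc-$(date +%Y%m%d-%H%M%S).log
exec java -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -verbose:gc -Xloggc:"$GC_LOG" -jar myserver.jar

Old GC log files can then be compressed or cleaned up by an ordinary cron or logrotate job, since the JVM never writes to them again.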

Crashes in libzip.so or zip.dll

It turns out that this problem can occur when a zip file is overwritten while it is being read. The cause of this could of course be in the application, but the JVM still should not crash on it. It appears to be a known issue which was fixed in 6u21-rev-b09, but the fix is disabled by default.

If you set the system property

-Dsun.zip.disableMemoryMapping=true

then memory-mapped I/O will no longer be used for zip files, which solves this issue. This system property only works on Linux and Solaris, not on Windows. Luckily, a colleague found this solution; even if you know what you are looking for, this setting is very difficult to find on the internet, which is full of stories about crashes in the zip library.

Crashes in networking libraries/general native code

Another issue we ran into was occasional crashes, mostly in the native code of networking libraries. This also appears to be a known issue with 64-bit JVMs. The cause is that insufficient stack space is left for native code to execute.

How it works is as follows. First of all, the Java virtual machine uses a fixed size for the stack of a thread. This size can be specified with the -Xss option if needed. While executing Java code, the JVM can figure out whether there is enough space left to execute a call and throw a StackOverflowError if there is not. For native code, however, the JVM cannot do that, so in that case it checks whether a minimum amount of space is left for the native code. This minimum space is configured using the StackShadowPages option. It turns out that by default this space is configured too low on older 64-bit JVMs, causing crashes in, for instance, socket libraries (e.g. when database access is being done). See for instance here. In particular, on JDK 1.6 update 29 the default value is 6, and on JDK 1.7 update 5 it is 20.

Therefore, a good value for this flag is 20:

-XX:StackShadowPages=20

The size of one page is 4096 bytes, so increasing the stack shadow pages from 6 to 20 means that you need 56KB of additional stack size. The page size can be verified by running java with a low stack size and passing different values for StackShadowPages, like this:

erik@pelican> java -Xss128k -XX:StackShadowPages=19 -version
The stack size specified is too small, Specify at least 156

The stack size per thread may be important on memory-constrained systems. For instance, with a stack size of 512KB, 1000 threads would consume about 500MB of memory. This may matter for smaller systems (especially 32-bit ones, if those are still around), but it is no issue at all for a modern server.

Debug JVM options

To find out what the final (internal) settings are for the JVM, execute:

java -XX:+PrintFlagsFinal <myadditionalflags> -version
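For example, to check the StackShadowPages default mentioned above (the exact output format differs per JVM version):

java -XX:+PrintFlagsFinal -version | grep StackShadowPages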

Logging

If your environment still uses log4j for some reason, then be aware that log4j synchronizes your entire application. We found an issue where an exception with a huge message string and stack trace was being logged. The toString() method of the exception in this case took about one minute, during which time the entire application froze. To reduce these synchronization issues in log4j, use an AsyncAppender, specify a larger buffer size (the default is 128), and set blocking to false. The async appender may have some overhead in single-threaded scenarios, but for a server application it is certainly recommended.
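As a sketch, the AsyncAppender is wrapped around an existing appender in log4j.xml roughly as follows; the appender named FILE is assumed to already exist in your configuration:

<!-- Wrap the existing FILE appender in an AsyncAppender; the BufferSize default is 128 -->
<appender name="ASYNC" class="org.apache.log4j.AsyncAppender">
    <param name="BufferSize" value="512"/>
    <param name="Blocking" value="false"/>
    <appender-ref ref="FILE"/>
</appender>

<root>
    <priority value="info"/>
    <appender-ref ref="ASYNC"/>
</root>

Note that with blocking set to false, log events can be discarded when the buffer is full, which is usually the right trade-off for a latency-sensitive server.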
