Garbage Collection and its performance impact

Hadoop is a beautiful abstraction that allows us to deal with the numerous complexities of data without delving into the details of the infrastructure. But once in a while to see why the performance of your applications are stalled, one has to look underneath the hood and find ways to extract performance or find why the application is failing. This blog tries to wade through the murky world of garbage collection in big data and how to identify and tune your application.

The reason why we all love Java is due to the fact that we can be careless with memory creations and the work of cleaning the mess is performed by the JVM. On a high level, Java heap memory is classified into two phases:

1) Young (eden) space

2)Old space

The eden space is where newly created objects goto. There are various algorithms for garbage collection, but all of them try to first free memory from the young space and for those long lasting memory objects, they are transferred to the old space.

One common issue that can be noticed in running Map Reduce Applications are GC overhead limit exceeded.

This error comes from java GC process when it finds out that it performed a full GC collection event (this includes removing objects from young(eden) space and old space) and the allocated free memory is less than 2% of the heap size. So this can be thought as a fail fast way to make sure cpu is not being over utlilized by the GC Threads. As the default garbage collection algorithm uses multiple threads and this causes cpu pressure.

One simple way to get this kind of error is to just keep allocating new objects in java tightly inside a while loop. Very soon you will find that the application process hits this error .

Normally in the bigdata world whenever this happens, it means the task is performing a memory intensive work and the current heap size which is controlled by for map tasks or for reduce tasks is not sufficient. The recommendation is to increase the heap sizes and thereby container sizes to fit the intensive nature of the task . Try increasing these values and see if your application succeeds.

If you are however interested in the intricate details of garbage collection, try spitting out these logs and analyze them.

To get these logs, you have to change the setting in ambari:

If you want for map task, then change -Xmx512m -verbose:gc -Xloggc:/tmp/@taskid@.gc

This means that 512Mb of heap size is allocated to the map task and verbose gc logs are enabled and the log location is specified. Note that the 512Mb is illustrative, it is best to keep your cluster defaults and append the next part of the line. The nice part about this setting is each task will have this log created separately in a different file and then one can look into the task of inerest.

A sample snippet of the logs can look like this:

0.378: [GC (Allocation Failure)  18432K->3854K(222208K), 0.0066910 secs]

0.477: [GC (Allocation Failure)  22286K->3662K(240640K), 0.0037145 secs]

0.849: [GC (Allocation Failure)  40526K->5684K(240640K), 0.0069826 secs]

1.209: [GC (Allocation Failure)  42548K->10778K(255488K), 0.0120971 secs]

1.654: [GC (Allocation Failure)  62490K->14866K(255488K), 0.0111095 secs]

1.666: [GC (Metadata GC Threshold)  16926K->14942K(243200K), 0.0039568 secs]

1.670: [Full GC (Metadata GC Threshold)  14942K->7255K(76288K), 0.0305284 secs]

As you can see here in the first line, eden gc happens and it is able to reduce it from 18Mb to 3Mb and it took 0.0066910. A full GC also happened in the last line. The Allocation failure means that no more space is available in the young generation, hence GC got triggered.There are a bunch of tools that can examine these kind of logs and give out nice histograms and other visualizations

Another thing that  is good to do is to monitor all your applications in the cluster, and use GC counters which say how much time is elapsed to find a trend in general with the GC times over a period of time. Correlate with memory usage and cpu usage. If one constantly sees that memory usage is pretty high and frequent GC is happening, it is time to increase heap size.

If one sees that during times of GC cpu utilization is high and that is bringing down application throughput, it is time to increase cores.

It is important to realize that the nice feature of Garbage collection does not come for free and monitoring and tweaking is just as important.