Java Garbage Collection | Part-1

Java Garbage Collection | Part-1

One of the many things that make Java special is its automatic garbage collection. As an application developer using Java, this comes in very handy as you don't have to think about memory leaks (for the most part)

For me, GC and JVM used to be like an Abracadabra 🦄🦄 . The main motto of this and the following blogs is to dig a bit deeper into the whats and hows of Java.

We will explore the basics of Java's garbage collection system, including how it works, different garbage collection algorithms, and ways to optimize and troubleshoot your application's performance.

This blog assumes that you have some working experience with Java.

Memory Model

Before getting into the nitty-gritty of GC it is important to have a basic understanding of how Java arranges data in RAM.

On a high-level JVM segregates the data into two parts Stack and Heap

  • All objects are stored on heaps.

  • Primitive data types are however stored on stacks.

  • Local variables are stored on the stack with a reference to the value inside the heap.

Important Differences

  • size of heaps >> stack

  • Heap is shared across threads

  • Each thread has got its own stack. Once the thread finishes stack is destroyed.

Let's explain the above using some examples -->

public static void main(String[] args) {
    List<String> myList = new ArrayList<String>();
    myList.add("One");
    myList.add("two");
    myList.add("Three"); // execution-stop 1

    printList(myList);
}

public static void printList(List<String> data) {
    System.out.println(data); // execution-stop 2
}

execution-stop 1

The memory would look something like the image below. The stack contains the variable myList which has a reference to the object in the heap containing the data which in turn has references to individual list elements.

execution-stop 2

When the code reaches this point, we have a new element on the stack pointing to the same reference

Good to know information -->

💡 final keyword means a variable can be assigned only once. However, objects in the heap can be changed.

💡 It’s not possible in Java to fix the state of an object. This is called const correctness in other languages.

💡Strings are immutable in Java. i.e. a new obj is created and reassigned every time.

Advanced Optimisations

JVM sometimes creates objects on the stack and may sometimes not create them at all and use the one already present on a heap

But the basics of what goes into the stack and what goes into a heap in the same in most places

String one = "hello";
String two = "hello";
if (one == two) {
    System.out.println("true");
}
// output --> true

Garbage Collection

Why the Need for Garbage Collection?

Let's consider you have a server exposing a way for others to access the below code and filter out foods. This ground-breaking algo became famous and thousands of health freaks are using this.

public List<Food> filterFood(int maxCal, Food[] food) {

    List<Food> collect = Arrays.stream(food)
                            .filter(f -> f.getCal() < maxCal)
                            .collect(Collectors.toList());

    System.out.println("collect = " + collect);
    return collect;
}

After learning about stacks and heaps, we know that data saved in stacks does not require cleanup as stacks are destroyed after the call is finished. However, objects in a heap, like the filtered list referenced by collect, are not in use by any variable and will remain in this state indefinitely.

If we do not free this memory for use, we may run out of memory after a while. Consider the potential for thousands of such memory spaces being created every second. 😱😱

Advantages:

  • No manual memory allocation/deallocation handling is required.

  • No overhead of handling Dangling Pointer

  • Automatic Memory Leak management (Soft leaks can still happen)

Disadvantages:

  • Requires more CPU power than the original application, which can affect performance

  • No control over the scheduling of CPU time dedicated to freeing objects.

  • Using some GC implementations might result in the application stopping unpredictably.

Garbage Eligibility

“Any object on the heap which cannot be reached through a reference from the stack is eligible for GC"

“The above statement holds true for cyclic dependency b/w heap obj as well”

👌 Unlike C and other languages, Java devs don’t have to call the free() method to clear the memory.

System.gc() —> is a suggestion to run garbage collection, however, it’s not a command/guarantee

Even if the GC runs it is not a guarantee that a particular object will be cleaned. The responsibility of GC is to keep the heap nice and tidy and might decide not to clean objects if the heap is already in a good state.

Soft Leaks (not an official term)

Soft Leaks - an object is referenced on the stack even though it will never be used again

Java avoids Memory leaks by —>

  • Running on a Virtual machine (JVM)

  • Garbage Collection

Detecting Soft Leaks —>

Eg tool: JavaVisualVM

For eg in the graph below for an app with soft-leak (max heap is 50MB) after a while, GC can’t find objects to delete and the app crashes
“It is constant at 40 and not at 50 because of young and old heap separation explained below”

JVM Xmx —> max heap size

How Garbage Collection Works?

Mark and Sweep Algo

Rather than searching for all the objects to remove, GC finds those objects which are required and saves them.

💡 GC doesn't collect any garbage but collects safe objects. GC is faster if there are a small number of referenced objects.

Mark Phase

In this phase, GC identifies all the live objects in memory by traversing the object graph from the root.

  • GC pauses all the threads in the application therefore it is called stop the world operation

  • GC then follows all the references from the stack and marks the objects that are referenced.

Sweep Phase

Objects which are not marked are removed from the heap

Compaction phase

The remaining objects are moved to a single contiguous block of memory. This process is known as compaction. (Makes the heap less fragmented)
Compaction makes it easier to allocate memory to new objects.

Generational Garbage Collection

Generational Garbage Collection is a way to divide the heap into young and old spaces.
The above mark & sweep algo becomes in-efficient as more and more objects are created, having to mark and sweep the complete JVM.

Young Heap

  1. New Objects are created in the young heap

  2. GC is faster (since most objects are garbage)

  3. GC in young is called a minor operation (b/c GC has fewer objects to mark)

Old Heap

  1. Objects saved by GC in Young are moved to the old heap (After about 8 cycles of GC)

  2. GC runs only when it is required

  3. GC here is called a major operation

Intuition →
- Most Objects in Java don't live for long
- If an object survives it is likely to live forever

In the above graph, the Y axis shows the number of bytes allocated over time.
Green Line --> Eden Space (part of young memory)
Yellow line --> Old Space

As you can see there are many dips in the green line suggesting that most of the objects were garbage whereas the yellow line remains almost constant.

VM args -->

The below image can be used to size the various components of the GC. Below in this blog, we would be discussing the various performance metrics for a GC however tuning with various parameters is out of scope and requires a separate blog.

For interested folks, more info can be found here --> https://docs.oracle.com/en/java/javase/11/gctuning/garbage-first-garbage-collector-tuning.html#GUID-43ADE54E-2054-465C-8376-81CE92B6C1A4

"This is for hotspot VM, but concepts are more or less the same"

Types of Garbage Collectors

Performance Metrics

Before starting with the types of GC. We would first touch on the metrics used for comparing different GCs or tuning a specific one for a particular task.

Footprint (Memory): Amount of memory assigned to a program

Throughput: More garbage collected per time unit (higher the better)

Latency: Amount of time code pauses for GC (lower the better)

Types

Won't go into detail for this section, that is for the next article in the series. Here is a high-level overview.

Serial

  1. Single Thread for both major and minor collections

  2. Smallest Memory Footprint as there is only one thread being used.

Parallel GC / Throughput Collector

  1. Multiple Threads for both minor and major GC.

  2. The name is a little confusing. This does not run with the application

  3. Great Throughput (Best for batch processes as latency is not a key metric there)

CMS

  1. Can run concurrently with the application. However, there are still stop-the-world operations

  2. A Thread that performs GC along with the application (sweeping and compaction). Marking however is still stopping the world.

  3. Throughput < Parallel Collector && Memory > Serial

  4. Best for General Operations

G1

  1. Region Based

  2. Good for apps with predictable latency

  3. The default for Java >=9

That's it

This was part-1 of the Java series. Where we introduce the Java memory model and the GC. In the next part, we would be exploring more on different GC algo (we only touched mark and sweep), G1GC implementation details and other stuff.