One of the many things that make Java special is its automatic garbage collection. As an application developer using Java, this comes in very handy as you don't have to think about memory leaks (for the most part)
For me, GC and JVM used to be like an Abracadabra 🦄🦄 . The main motto of this and the following blogs is to dig a bit deeper into the whats and hows of Java.
We will explore the basics of Java's garbage collection system, including how it works, different garbage collection algorithms, and ways to optimize and troubleshoot your application's performance.
This blog assumes that you have some working experience with Java.
Memory Model
Before getting into the nitty-gritty of GC it is important to have a basic understanding of how Java arranges data in RAM.
On a high-level JVM segregates the data into two parts Stack and Heap
All objects are stored on heaps.
Primitive data types are however stored on stacks.
Local variables are stored on the stack with a reference to the value inside the heap.
Important Differences
size of heaps >> stack
Heap is shared across threads
Each thread has got its own stack. Once the thread finishes stack is destroyed.
Let's explain the above using some examples -->
public static void main(String[] args) {
List<String> myList = new ArrayList<String>();
myList.add("One");
myList.add("two");
myList.add("Three"); // execution-stop 1
printList(myList);
}
public static void printList(List<String> data) {
System.out.println(data); // execution-stop 2
}
execution-stop 1
The memory would look something like the image below. The stack contains the variable myList
which has a reference to the object in the heap containing the data which in turn has references to individual list elements.
execution-stop 2
When the code reaches this point, we have a new element on the stack pointing to the same reference
Good to know information -->
💡 final keyword means a variable can be assigned only once. However, objects in the heap can be changed.
💡 It’s not possible in Java to fix the state of an object. This is called const correctness in other languages.
💡Strings are immutable in Java. i.e. a new obj is created and reassigned every time.
Advanced Optimisations
JVM sometimes creates objects on the stack and may sometimes not create them at all and use the one already present on a heap
But the basics of what goes into the stack and what goes into a heap in the same in most places
String one = "hello";
String two = "hello";
if (one == two) {
System.out.println("true");
}
// output --> true
Garbage Collection
Why the Need for Garbage Collection?
Let's consider you have a server exposing a way for others to access the below code and filter out foods. This ground-breaking algo became famous and thousands of health freaks are using this.
public List<Food> filterFood(int maxCal, Food[] food) {
List<Food> collect = Arrays.stream(food)
.filter(f -> f.getCal() < maxCal)
.collect(Collectors.toList());
System.out.println("collect = " + collect);
return collect;
}
After learning about stacks and heaps, we know that data saved in stacks does not require cleanup as stacks are destroyed after the call is finished. However, objects in a heap, like the filtered list
referenced by collect
, are not in use by any variable and will remain in this state indefinitely.
If we do not free this memory for use, we may run out of memory after a while. Consider the potential for thousands of such memory spaces being created every second. 😱😱
Advantages:
No manual memory allocation/deallocation handling is required.
No overhead of handling Dangling Pointer
Automatic Memory Leak management (Soft leaks can still happen)
Disadvantages:
Requires more CPU power than the original application, which can affect performance
No control over the scheduling of CPU time dedicated to freeing objects.
Using some GC implementations might result in the application stopping unpredictably.
Garbage Eligibility
“Any object on the heap which cannot be reached through a reference from the stack is
eligible for GC
"
“The above statement holds true for cyclic dependency b/w heap obj as well”
👌 Unlike C and other languages, Java devs don’t have to call the free()
method to clear the memory.
System.gc()
—> is a suggestion to run garbage collection, however, it’s not a command/guarantee
Even if the GC runs it is not a guarantee that a particular object will be cleaned. The responsibility of GC is to keep the heap nice and tidy and might decide not to clean objects if the heap is already in a good state.
Soft Leaks (not an official term)
Soft Leaks - an object is referenced on the stack even though it will never be used again
Java avoids Memory leaks by —>
Running on a Virtual machine (JVM)
Garbage Collection
Detecting Soft Leaks —>
Eg tool: JavaVisualVM
For eg in the graph below for an app with soft-leak (max heap is 50MB) after a while, GC can’t find objects to delete and the app crashes
“It is constant at 40 and not at 50 because of young and old heap separation explained below”
JVM Xmx —> max heap size
How Garbage Collection Works?
Mark and Sweep Algo
Rather than searching for all the objects to remove, GC finds those objects which are required and saves them.
💡 GC doesn't collect any garbage but collects safe objects. GC is faster if there are a small number of referenced objects.
Mark Phase
In this phase, GC identifies all the live objects in memory by traversing the object graph from the root.
GC pauses all the threads in the application therefore it is called
stop the world operation
GC then follows all the references from the stack and
marks
the objects that are referenced.
Sweep Phase
Objects which are not marked
are removed from the heap
Compaction phase
The remaining objects are moved to a single contiguous block of memory. This process is known as compaction
. (Makes the heap less fragmented)
Compaction makes it easier to allocate memory to new objects.
Generational Garbage Collection
Generational Garbage Collection is a way to divide the heap into young and old
spaces.
The above mark & sweep algo becomes in-efficient as more and more objects are created, having to mark and sweep the complete JVM.
Young Heap
New Objects are created in the young heap
GC is faster (since most objects are garbage)
GC in young is called a minor operation (b/c GC has fewer objects to mark)
Old Heap
Objects saved by GC in Young are moved to the old heap (After about 8 cycles of GC)
GC runs only when it is required
GC here is called a major operation
Intuition →
- Most Objects in Java don't live for long
- If an object survives it is likely to live forever
In the above graph, the Y axis shows the number of bytes allocated over time.
Green Line --> Eden Space (part of young memory)
Yellow line --> Old Space
As you can see there are many dips in the green line suggesting that most of the objects were garbage whereas the yellow line remains almost constant.
VM args -->
The below image can be used to size the various components of the GC. Below in this blog, we would be discussing the various performance metrics for a GC however tuning with various parameters is out of scope and requires a separate blog.
For interested folks, more info can be found here --> https://docs.oracle.com/en/java/javase/11/gctuning/garbage-first-garbage-collector-tuning.html#GUID-43ADE54E-2054-465C-8376-81CE92B6C1A4
"This is for hotspot VM, but concepts are more or less the same"
Types of Garbage Collectors
Performance Metrics
Before starting with the types of GC. We would first touch on the metrics used for comparing different GCs or tuning a specific one for a particular task.
Footprint (Memory): Amount of memory assigned to a program
Throughput: More garbage collected per time unit (higher the better)
Latency: Amount of time code pauses for GC (lower the better)
Types
Won't go into detail for this section, that is for the next article in the series. Here is a high-level overview.
Serial
Single Thread for both major and minor collections
Smallest Memory Footprint as there is only one thread being used.
Parallel GC / Throughput Collector
Multiple Threads for both minor and major GC.
The name is a little confusing. This does not run with the application
Great Throughput (Best for batch processes as latency is not a key metric there)
CMS
Can run concurrently with the application. However, there are still stop-the-world operations
A Thread that performs GC along with the application (sweeping and compaction). Marking however is still stopping the world.
Throughput < Parallel Collector && Memory > Serial
Best for General Operations
G1
Region Based
Good for apps with predictable latency
The default for Java >=9
That's it
This was part-1 of the Java series. Where we introduce the Java memory model and the GC. In the next part, we would be exploring more on different GC algo (we only touched mark and sweep), G1GC implementation details and other stuff.