#performance #java #memory
In this article we will look at different ways to figure out and fix unnecessary memory consumption by JVM. This is useful for JVM applications that store huge amount of data in memory and can help avoid OutOfMemoryError if application data cannot be partitioned and scaled horizontally.
In part 3, we will take a look on more advanced techniques used to reduce JVM heap size, which involves manual memory management. This is the last part of series following part 2.
Space occupied by Java object
Our best result in part 2 was 481300 orders on a heap of 64MB, in an example 11. This is on average 139 bytes per object. At this point we optimised the encoding of data and extra memory required by data structures. Let’s look at the heap dump and see how much space every Order
object is taking. Eclipse MAT is claiming it is 56 bytes, let’s see how this adds up:
Field name | Field type | Field size (bytes) |
---|---|---|
id | long | 8 |
user | reference (byte[]) | 4 |
articleNr | int | 4 |
count | int | 4 |
pricePence | int | 4 |
addressNumber | reference (String) | 4 |
addressStreet | reference (String) | 4 |
addressCity | reference (String) | 4 |
addressRegion | reference (String) | 4 |
addressPostCode | reference (String) | 4 |
TOTAL | 44 |
However, this is not a full picture as String
, for example, is an indirection over byte[]
and ads extra reference, byte and int field for each object.
Therefore, we are wasting:
- 5x (4 + 1 + 4) = 45 bytes for
String
objects we could be leaving off asbyte[]
- 12 bytes of object header
Let’s try to tackle these two.
Pooling byte arrays
In previous part we were able to deduplicate String
instances by using JVM’s pool. If we are to get rid of strings, we have to replace JVM pool with a custom one, handling byte[]
deduplication for us.
However, I would like to take this a step further, as eventually we want to store Order
object off heap. This means that we must avoid storing references on an Order
. Object references in Java can be mutated by JVM during GC cycles and if we want to avoid serializing duplicate string values, we must replace byte[]
references with keys to byte array pool.
In previous part we already tried to pool byte[]
(see example 12), but did not succeed because of tree-structure wrapper overheads. To make this pooling worthwhile, we can use open-addressing implementation with the following interface:
public class PooledByteArrayMap {
public int put(byte[] value);
public byte @ImmutableByteArray [] get(int key);
public void free(int key);
}
This data structure is efficient in deduplicating byte arrays with time complexity of a hash map and lookup/remove complexity of O(1). Source code is available here.
Using such pool, we are left without any references in Order
object, but we can create temporary String
objects during order processing:
public final class Order {
private final long id;
private final int user;
private final int articleNr;
private final int count;
private final int pricePence;
private final int addressNumber;
private final int addressStreet;
private final int addressCity;
private final int addressRegion;
private final int addressPostCode;
@JsonCreator
public Order(
@JsonProperty("id") long id,
@JsonProperty("user") byte[] user,
// other fields omitted
@JacksonInject PooledByteArrayMap byteArrayPool
) {
this.id = id;
this.user = byteArrayPool.put(user);
// other fields omitted
}
// other getters omitted
public String getAddressPostCode(PooledByteArrayMap byteArrayPool) {
return restoreString(addressPostCode, byteArrayPool);
}
@SuppressWarnings("byte.array.weakening")
private String restoreString(int key, PooledByteArrayMap byteArrayPool) {
return new String(byteArrayPool.get(key), StandardCharsets.US_ASCII);
}
}
With this approach we can fit 576700 orders (+20%) into memory, which is about 116 bytes (-23) per object. We were not able to get all 45 bytes worth of gain by ditching String
due to some overheads of a pool, but it is an improvement nonetheless.
This change has 2 unintended consequences which are discussed below.
Replacing object encapsulation guarantees with compile checks
One of key features of Java standard library String
class is its immutability. By removing it altogether we are not protected from accidentally changing byte[]
content, which would alter results of business transactions and compromise integrity of pool. The cost of safety was too high to bear at runtime, however we can move that cost to compile time.
Since Java 8, javac
allows adding custom type checkers which would be run as part of code compilation. For example, this allows developers to annotate certain variables as not nullable and avoid pesky null checks throughout the code. Similarly, we can declare all byte[]
instances, returned by the pool immutable using annotation:
public byte @ImmutableByteArray [] get(int key);
I’ve decided to implement such type as a pair of annotation/checker within Checker Framework. This ensures that all attempts to modify byte[]
returned by pool or to downcast them to unannotated version would be flagged as compiler error. Checker source can be found here.
This approach gives back confidence, without incurring run time costs. Similar approaches can be used to replace type-safe wrapper objects in Java (AKA tiny-types) with primitives to avoid runtime costs and maintain type-safety.
Array utilisation
For heap sizes like in this scenario there is another important consideration to take into account. Running previous example 14 with GC logging enabled we can see heap statistics at point of failure:
[33.920s][info][gc,heap,exit ] Heap
[33.920s][info][gc,heap,exit ] garbage-first heap total 65536K, used 61378K [0x00000000fc000000, 0x0000000100000000)
[33.920s][info][gc,heap,exit ] region size 1024K, 1 young (1024K), 0 survivors (0K)
[33.920s][info][gc,heap,exit ] Metaspace used 12518K, capacity 12795K, committed 13056K, reserved 1060864K
[33.920s][info][gc,heap,exit ] class space used 1139K, capacity 1197K, committed 1280K, reserved 1048576K
From this we can observe that about 4MB of heap space was not utilised. One reason for this is humongous object allocation. Every time JVM with G1 GC tries to allocate an object with size larger than region size (1MB in this case) it will combine multiple regions, which then will be used solely by that humongous object. A few megabytes wasted on a multiple gigabyte heap may be unnoticed, however for 64MB heap it makes a considerable difference.
Other thing to consider is array list utilisation. Some space can be wasted, if the list get resized too early and underlying array grows prematurely exacerbating issues with heap fragmentation. In our case, array lists get resized at 50% capacity (default for Agrona collections).
To counter all these effects, I’ve tried following improvements (see example 15):
- changing load factor for collections to 85%
- custom byte array pool would split underlying arrays into smaller chunks and allocate an extra chunk on resize
- do not use G1 to reduce memory fragmentation due to humongous allocations. This will not alleviate fragmentation completely as every GC has trouble defragmenting memory with object sizes approaching 10% of total heap. I’ve used serial GC, because for small heaps CMS and throughput collectors can take very long time to fail.
This allows fitting 686200 orders (+19%) into memory, which is about 98 bytes (-18) per object.
SLAB allocation
Since Order
object does not hold any object references we can manage allocation ourselves and avoid JVM object header overheads. To accomplish that, we need some sort of allocator to store/retrieve objects for us and provide an abstraction over plain memory access. SLAB is one of simpler allocators that can achieve that. I’ve used the simplest form of SLAB, similar to Linux SLOB allocator (see source), with a following interface:
public class OrderSlabAllocator {
public int allocate();
public OrderView get(int key);
public void free(int key);
}
Using this store we no longer need to store a reference to Order
object, but only an object key within allocator. In this case key would correspond to some contiguous chuck of memory. However, we still would like a simplicity of using familiar getter/setter API, instead of manipulating bits in memory. To do this we can utilize a flyweight pattern:
public final class OrderView {
private final PooledByteArrayMap byteArrayPool;
private ByteBuffer buffer;
private int startPosition;
OrderView(PooledByteArrayMap byteArrayPool) {
this.byteArrayPool = byteArrayPool;
}
OrderView wrap(ByteBuffer buffer, int startPosition) {
this.buffer = buffer;
this.startPosition = startPosition;
return this;
}
public OrderView set(
long id,
byte[] user,
int articleNr,
int count,
int pricePence,
String addressNumber,
String addressStreet,
String addressCity,
String addressRegion,
String addressPostCode
) {
buffer.putLong(startPosition + ID_OFFSET, id);
buffer.putInt(startPosition + USER_OFFSET, byteArrayPool.put(user));
// other fields write omitted
buffer.putInt(startPosition + ADDRESSNUMBER_OFFSET, internString(addressNumber));
return this;
}
public long getId() {
return buffer.getLong(startPosition + ID_OFFSET);
}
public byte @ImmutableByteArray [] getUser() {
return byteArrayPool.get(getUserPoolKey());
}
public String getAddressNumber() {
int addressNumber = buffer.getInt(startPosition + ADDRESSNUMBER_OFFSET);
return restoreString(addressNumber);
}
// the rest of getters are omitted
private int internString(String str) {
return byteArrayPool.put(str.getBytes(StandardCharsets.US_ASCII));
}
@SuppressWarnings("byte.array.weakening")
private String restoreString(int key) {
return new String(byteArrayPool.get(key), StandardCharsets.US_ASCII);
}
}
We only need single instance of OrderView
to manipulate orders. When we ask allocator for an order, it would wrap this view around correct memory location and allow us to use order as if it was bog-standard POJO.
This technique is one step away from doing off-heap storage, where byte buffers are replaced with appropriate DirectBuffer
or MappedByteBuffer
. Compression technique can also be more beneficial when applied after this step, because having multiple orders compressed together is more likely to contain repetitive data allowing for better compression.
This storage pattern also has an added performance benefit, as memory access is much more cache-friendly.
This approach lets us fit 814900 orders (+19%), which is 82 bytes (-16) per object.
Conclusion
Working through multiple optimization techniques we were able to reduce memory consumption by 300% (4 times). This is quite good result for data, which is not highly compressible. Of course, there is an inherent limit on what can be fit in given unit of memory. This limit is defined by the amount of data represented using minimal viable types, and can be reduced only by data sharing between objects or compression.
For this particular example, we have not reached that limit yet. Higher data density can be achieved by changing key types from int
to 3 bytes (we know we cannot have more than 2^23 object fit into memory anyway) and compressing SLABs.