Starting up Java Applications

Dasun Pubudumal
Nov 15, 2023

The Java internals that slow down, and speed up, your application's startup time.


There are several steps a Java application has to go through at startup. It needs to look up class files on disk, load them into memory, verify the byte-code (this matters especially for classes on the class path pulled in by dependencies, since they may have been compiled by a different compiler and could violate guarantees such as type safety), and turn each class file into an internal representation (a Class object) that the JVM can access. On aggregate, this whole process is called "loading classes" (loading, linking and initialisation via <clinit>). Following this step, there is another, quite important set of optimisations carried out by the JVM, depending on how it is configured. If you're running the JVM with its default configuration, then after class loading it will step into a procedure called "Just-In-Time" (JIT) compilation.
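
To make the loading/linking/initialisation split concrete, here is a minimal sketch (the nested ConfigHolder class exists only for illustration): Class.forName can load and link a class without running its <clinit>, which only fires on the first "active use".

    // Run with:  java InitDemo
    public class InitDemo {
        static class ConfigHolder {
            static { System.out.println("<clinit> of ConfigHolder ran"); }
            static int value() { return 42; }
        }
        public static void main(String[] args) throws Exception {
            // Load and link ConfigHolder, but do not initialise it (initialize = false):
            Class<?> c = Class.forName("InitDemo$ConfigHolder", false,
                                       InitDemo.class.getClassLoader());
            System.out.println("Loaded: " + c.getName()); // no <clinit> output yet
            // First active use: the static initialiser (<clinit>) runs now.
            System.out.println(ConfigHolder.value());
        }
    }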

As we've all learned, javac produces byte-code, which is the basis of Java's "write once, run anywhere" philosophy. That byte-code is then interpreted and executed by the JVM at runtime. The JVM, however, is quite capable of identifying optimisations that can be made at runtime; for instance, it is fairly intuitive that code regions in the byte-code that execute over and over again can be converted into a more "machine-friendly" set of instructions, so that they run considerably faster than under the interpreter. Such optimisations, the so-called "hot path" identifications, are bread and butter for the JIT compiler (this is also known as "adaptive optimisation"). The optimisations discussed here, however, happen at runtime, which means that every time you stop and restart your application, they have to be performed again.
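
As a quick way to see adaptive optimisation in action, the sketch below calls a small method a million times; running it with HotSpot's -XX:+PrintCompilation flag prints a line when the JIT compiler picks the method up (the exact output format varies across JDK versions):

    // Run with:  java -XX:+PrintCompilation HotPathDemo
    public class HotPathDemo {
        // A small method that becomes "hot" after enough invocations.
        static long square(long x) {
            return x * x;
        }
        public static void main(String[] args) {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += square(i); // repeated calls drive this method towards compilation
            }
            System.out.println(sum); // keep the result alive so the work isn't optimised away
        }
    }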

(Figure source: https://dl.acm.org/doi/10.1145/3067695.3082521)

For "hot-path" identification, the JVM uses two counters: an invocation counter and a backedge counter. Each method has one of each, and they are incremented as follows:

  • The invocation counter is incremented each time the method is invoked (i.e., each time the method is entered).
  • The backedge counter is incremented each time control flows from a higher byte-code index to a lower one. This is usually used to keep track of loops.

Each counter has its own threshold, and when a calculated combination of the two counters exceeds it, a request to optimise (compile) the method is queued into a compilation request queue. Idle consumers of the queue (compiler threads) pick the compilation jobs up, compile the method, and store the compiled code in memory. The JVM then dispatches subsequent invocations of the method to the compiled code, bypassing byte-code interpretation entirely. Note that this hot compilation step is asynchronous: the interpreter keeps running the method in parallel until the compilation is finished.
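
The backedge counter matters even within a single invocation: a long-running loop can trip the threshold mid-execution, in which case the JVM performs an on-stack replacement (OSR), swapping the interpreted frame for compiled code while the loop is still running. A minimal sketch (with -XX:+PrintCompilation, OSR compilations are marked with a '%'; flags such as -XX:CICompilerCount control how many compiler threads consume the queue):

    // Run with:  java -XX:+PrintCompilation OsrDemo
    public class OsrDemo {
        public static void main(String[] args) {
            long acc = 0;
            // One long invocation of main(), but many backedges: the loop itself gets hot.
            for (long i = 0; i < 500_000_000L; i++) {
                acc += i & 7;
            }
            System.out.println(acc);
        }
    }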

Note that there are circumstances where optimised code is de-optimised, but such conditions are out of scope for this article, as they have little effect on application startup.

Startup Problem

The JIT compiler is very good at runtime optimisations, but the catch is that this runtime work slows down application startup. Now, let's get a bit creative. There seem to be two solutions we can come up with.

  1. We can take a lazy approach, where work is done only when the required entities are actually used.
  2. Or, we can shift the optimisations in time. That is, we do the compilation work at an earlier phase: the build (compilation) phase.

Lazy class-loading is well documented and is already part of the JLS. For instance, if class A references class B, loading class A does not necessarily imply loading class B until execution within A reaches the first statement that references B (unless B is required for verification during the linking phase). Garbage Collection is another intuitively "late" mechanism we can think of: garbage, by definition, only comes into existence at runtime.
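
A minimal sketch of this laziness (class B here is just a nested helper for illustration): B's class file is not touched when LazyLoadDemo starts; it is loaded, linked and initialised only when the first statement that uses it executes. You can observe this with -Xlog:class+load=info (JDK 9+) or -verbose:class on older JDKs.

    // Run with:  java -Xlog:class+load=info LazyLoadDemo
    public class LazyLoadDemo {
        static class B {
            static { System.out.println("B initialised"); }
            void hello() { System.out.println("Hello from B"); }
        }
        public static void main(String[] args) {
            System.out.println("LazyLoadDemo (class A) is running; B not touched yet");
            new B().hello(); // B is loaded, linked and initialised only here
        }
    }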

In modern environments such as serverless (especially FaaS platforms like Lambda), which do not run long-lived processes, startup time becomes a significant portion of total execution time precisely because each invocation is short-lived. Long startups, such as cold starts in serverless infrastructure, degrade performance, which indirectly leads to a poor user experience and, worse, to inconsistent data if functions do not complete within the allotted timeout (make sure your data changes are executed as atomic transactions).

It is the second approach that this article focuses on. How can we perform the optimisations early and arrive at a better startup time?

Class-Data Sharing

Well, this startup-time issue has bothered Java developers since JDK 1.5 (marketed as J2SE 5.0, back when Java was developed by the folks at Sun). In Java 1.5 (2004), they came up with a concept called "Class Data Sharing" (CDS): much of the necessary system code (e.g. rt.jar) was loaded, converted into an internal representation that JVMs can consume quickly, so that each new application would not have to load those classes as if they were fresh class files, and dumped into a shared archive (located at $JAVA_HOME/lib/server/classes.jsa) that could be accessed even by multiple JVMs. (A JVM can run in one of two modes, client or server; CDS initially worked only with client-mode JVMs and the Serial Garbage Collector.) The feature remained largely hidden in commercial distributions until it was merged into OpenJDK 9 in 2017, and the merged CDS has since evolved to support more JVM configurations, such as the G1 garbage collector and server mode.

Note that, besides the JVM-level work discussed at the beginning of this section, there is plenty of application-specific work to be done at startup: opening sockets and files, establishing database and web connections, initialising beans (in IoC containers or application servers), and so on. These are specific to the application rather than to the JVM.

OpenJDK 10, released in 2018, introduced Application Class-Data Sharing (AppCDS) via JEP 310, which extended CDS to store application-level classes (in addition to system classes) in the shared archive. This allows multiple JVMs to share class data, reducing the overall memory footprint.
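
A minimal sketch of trying (App)CDS yourself, assuming JDK 13 or later where -XX:ArchiveClassesAtExit can create an application archive from a single trial run (the base JDK archive can be regenerated with java -Xshare:dump):

    // javac Hello.java
    // java -XX:ArchiveClassesAtExit=hello.jsa Hello   (trial run; dumps the archive on exit)
    // java -XX:SharedArchiveFile=hello.jsa Hello      (subsequent runs map the shared archive)
    public class Hello {
        public static void main(String[] args) {
            System.out.println("Hello, CDS");
        }
    }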

Ahead-Of-Time (AOT)

If the JIT compiler does everything on the fly, why couldn't we do some of the optimisation at compile time? This is what AOT aims to achieve. With an AOT compiler, when the build starts, the compiler performs a "static analysis" that collects all the classes reachable from a given entry point and compiles them into a native executable. Unlike the JIT pipeline, which ships portable byte-code to be interpreted (and later compiled) at runtime, AOT produces a platform-specific native executable (so not quite "write once, run anywhere").

Now, reading the above paragraph, the word "reachable" might sound a bit odd to you. Java, come to think of it, is actually a very dynamic language, even though we tend to think of it as a static one. It has plenty of dynamic features, such as dynamic class loading, garbage collection, and reflection, and this is precisely what makes AOT difficult. Using reflection, you can load classes and inject code (e.g., in Aspect-Oriented Programming) that only materialise at runtime. Because such classes aren't available at compile time, AOT operates under a "Closed-World Assumption" (CWA): all reachable classes must be known at compile time.

But this assumption, in essence, would seem to forbid using AOT and reflection together. Well, in many AOT implementations, GraalVM among them, the compiler can take into account the output of "tracing agents" that help trace the classes likely to be loaded dynamically. Many libraries are also adding support by shipping trace manifests (reachability metadata) that help the AOT compiler include classes in the final build even though they cannot be statically reached at build time.
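
The sketch below shows the kind of reflective code a closed-world build cannot see statically ("demo.Plugin" is a hypothetical class name). With GraalVM, a trial run under the tracing agent records the reflective access and emits reachability metadata (e.g. reflect-config.json) that native-image then consumes; the flag syntax follows recent GraalVM releases and may differ in yours.

    // java -agentlib:native-image-agent=config-output-dir=META-INF/native-image ReflectDemo
    // native-image -cp . ReflectDemo
    public class ReflectDemo {
        public static void main(String[] args) throws Exception {
            // The class name is only known at runtime, so static analysis cannot reach it.
            String className = System.getProperty("plugin", "demo.Plugin");
            Object plugin = Class.forName(className)
                                 .getDeclaredConstructor()
                                 .newInstance();
            System.out.println("Loaded plugin: " + plugin.getClass().getName());
        }
    }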

Now, since AOT compilation discards everything that is not reachable, AOT build artefacts are much more lightweight, which is great if you're running a k8s cluster with lightweight container requirements. Resource usage is also likely to be lower because of the build-time optimisations and the removal of unreachable code. In terms of security, since no dynamic code injection can occur, there is no "JIT spraying" vulnerability, so the attack surface of an AOT artefact is much smaller.

Is AOT a silver bullet?

It might seem that AOT is a silver bullet for the problems we face, especially in serverless environments. There are quite a few drawbacks, though, that one has to consider before diving in head-first.

AOT is great for short-running tasks, where startup time has a larger effect on the overall throughput of the application. It is not a requirement for an application deployed in an application server (WebSphere, JBoss, Apache, etc.), because application servers are usually long-running and startup time matters much less there.

AOT artefacts also take a certain amount of time to build. The steps needed to produce a native artefact are both time- and resource-consuming, and if you integrate them into your CI you should be aware of that cost. I think development environments are better off with JIT artefacts, while the UAT and production pipelines build fully native final artefacts.

Note that AOT does not necessarily improve throughput. If you compare the throughput of a fully warmed-up JIT artefact against a native artefact of the same application, the former is likely to outperform the latter by a certain margin. This is because JIT optimises on the fly and can exploit information that is visible only at runtime, such as the actual hardware and the observed behaviour of language constructs (appearance of null pointers, conditional branches, exceptions, et cetera). GraalVM's AOT compiler, however, offers "Profile-Guided Optimisations" (PGO): you build an instrumented artefact, collect profiles by running the application, and feed those profiles into the native build so that the AOT compiler has runtime measurements to drive further optimisations, which can bring performance close to on par with JIT. This adds an extra profiling and tuning step to the AOT compilation process. Reaching peak throughput also costs some memory for the optimisation data structures, so there is a trade-off between throughput and memory.
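
A minimal sketch of that PGO workflow, assuming Oracle GraalVM's native-image with PGO support (flag names follow recent releases and may differ; Main and the output names are placeholders):

    // native-image --pgo-instrument -cp . Main -o main-instrumented
    // ./main-instrumented        (run representative workloads; a default.iprof profile is written)
    // native-image --pgo=default.iprof -cp . Main -o main
    public class Main {
        public static void main(String[] args) {
            // Exercise realistic code paths here during the instrumented run, so the
            // collected profile reflects production-like behaviour.
            long acc = 0;
            for (int i = 0; i < 10_000_000; i++) {
                acc += Long.numberOfTrailingZeros(i | 1);
            }
            System.out.println(acc);
        }
    }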

Remember the trace manifests? Libraries that are not yet compatible with GraalVM but use Java's dynamic features such as reflection can also be a pain point. In such situations, the team may have to go through libraries it did not write, explicitly create the trace manifests for them, and put those manifests on the classpath.

It is important to recognise that what we have discussed above is "startup", not "warmup". Warmup is the time an application takes to reach its peak performance and throughput. In AOT mode, no further optimisation happens after compilation, so the optimisations that would otherwise improve performance during warmup are absent. Under JIT, by contrast, optimisations happen continually, identifying hot paths and optimising other structures.

Project Leyden, presented by Mark Reinhold in 2022, attempts to address exactly these startup and warmup issues, while attending to the problems inherent in "static image" approaches, such as AOT's closed-world assumption, which hurt Java's inherent dynamism. The idea is not to get rid of static images entirely, but, until the Java platform can embrace closed-world constraints without sacrificing its dynamism, to deliver incremental improvements through Leyden.

Disclaimer: I haven't written about the cloud-level mitigations that exist for problems such as cold starts in FaaS. Lambda, for instance, has a rather useful feature called "SnapStart" that lets you "cache" a snapshot of your initialised function via the Firecracker VM and restore that state with low latency, mitigating the cold-start problem to a certain extent. This article focuses on the Java side of startup.
