
LLVM: Everything you need to know about this compiler


While it is common practice for developers to write programs in a “high-level” language such as Python or Java, these same programs need to be “compiled” before the microprocessor can execute them directly. LLVM has been a major innovator in this field, promoting aspects such as modularity and just-in-time compilation.

Developers know that writing a sequence of instructions in a language such as Python, Java or C++ is just one step in the process leading to an executable program. The process that transforms the “source code” (what the developer writes) into code that can be used by the microprocessor is called compilation.

Languages such as Python or Java are called high-level languages. They feature instructions that are easy for humans to understand, such as the famous IF, which is used to test a condition and is found in most languages.

Writing a program directly in machine code is extremely arduous, which is why compilation is used to convert instructions written in a high-level language into machine code.
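
For example, the short C++ program below contains exactly this kind of condition. A compiler such as Clang (invoked for instance as `clang++ demo.cpp -o demo`; the file name is just an illustration) translates it into the target processor's machine instructions, typically a comparison followed by a conditional branch.

```cpp
// demo.cpp: a "high-level" condition that a human can read at a glance.
// After compilation, the IF becomes a compare instruction followed by a
// conditional jump in the machine code of the target processor.
#include <iostream>

int main() {
    int temperature = 25;
    if (temperature > 20) {
        std::cout << "Warm\n";
    } else {
        std::cout << "Cold\n";
    }
    return 0;
}
```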

LLVM was designed by computer scientist Chris Lattner around 2002 at the University of Illinois, as part of his graduate research, with the aim of renewing the approach to compilation and optimizing its use.

Optimizing the compilation process

Optimizing compilers is nothing new. The GCC compiler – a core component of the GNU project launched by Richard Stallman in 1984, and the standard compiler on Linux systems – has over time incorporated numerous optimization options to improve the efficiency of the code it generates.

LLVM, for its part, appeared in the early 2000s, at a time when multi-core microprocessor architectures were becoming commonplace, as were GPUs or computer graphics processors.

LLVM originally stood for Low Level Virtual Machine, and this name partly sums up what it achieves: the creation of virtual machines (capable of emulating the behavior of given processors) at a low level (and therefore close to the processor).

LLVM's innovative approach

LLVM has proposed a number of innovative approaches:

  • a modular architecture;
  • JIT (Just-In-Time) compilation;
  • an intermediate representation (IR);
  • a whole ecosystem of reusable tools.

Let’s take a look at these various points.

Modular architecture

Traditionally, most compilers consisted of a single program, making it difficult to optimize or extend their capabilities.

One of the main objectives for LLVM was to create a modular compiler architecture, in other words, to get different components to work harmoniously together. This has brought many benefits.

Compiler development can be entrusted to different teams, each responsible for developing one or more specific modules. This reduces development time and increases overall efficiency. The code produced by a team is usually cleaner and easier to understand for those in charge of maintaining it.

When a compiler comes in the form of a series of modules, any one of them can be optimized separately, and maintenance can be carried out on that isolated module without touching the rest. The result is greater flexibility.

If, over time, it turns out that an important feature is missing from the compiler, it can be added, either by modifying one of the existing modules or by creating a new one.

In this way, the modular structure encourages experimentation and the implementation of new capabilities. It also facilitates the integration of external tools.
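
As a minimal sketch of this reuse of components (file and function names are illustrative, and the build flags depend on your LLVM installation), the program below links only LLVM's IR library and its textual parser, without any Clang front end or code generator, yet it manipulates the same data structures a full compiler would:

```cpp
// modular_demo.cpp: uses only LLVM's IR component and its textual parser;
// no front end (Clang) and no back end (code generator) is involved.
// Possible build command (flags and component names vary by installation):
//   clang++ modular_demo.cpp $(llvm-config --cxxflags --ldflags --libs core asmparser) -o modular_demo
#include "llvm/AsmParser/Parser.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>

int main() {
    llvm::LLVMContext Ctx;
    llvm::SMDiagnostic Err;

    // A tiny module written directly in LLVM's intermediate representation.
    std::unique_ptr<llvm::Module> M = llvm::parseAssemblyString(
        "define i32 @square(i32 %x) {\n"
        "entry:\n"
        "  %r = mul nsw i32 %x, %x\n"
        "  ret i32 %r\n"
        "}\n",
        Err, Ctx);
    if (!M) {
        Err.print("modular_demo", llvm::errs());
        return 1;
    }

    // Inspect the parsed module: list the functions it contains.
    for (const llvm::Function &F : *M)
        llvm::outs() << "function: " << F.getName() << "\n";
    return 0;
}
```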

JIT or "on-the-fly" compilation

In the traditional compilation process, the entire conversion takes place before the program is executed, producing definitive machine code that can no longer be modified.

With LLVM’s JIT (Just-In-Time) compilation, the source code written by the developer is converted into executable machine code at the very moment of execution.

JIT compilation takes place just before the execution of a sequence of machine code. This is known as “selective compilation”. Optimization takes place dynamically (in real time), according to the program’s behavior.

LLVM is not the only tool to use JIT compilation. The same is true of Microsoft’s .NET and many Java implementations.
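
To give an idea of what this looks like in practice, here is a sketch using LLVM's ORC "LLJIT" C++ API (the exact headers and the final lookup step vary between LLVM releases, so treat this as illustrative rather than definitive; file and function names are made up for the example). It builds a small function in memory and only compiles it to native machine code when the function is first requested:

```cpp
// jit_demo.cpp: build a function in memory, then JIT-compile and call it.
// Possible build command (flags vary by installation):
//   clang++ jit_demo.cpp $(llvm-config --cxxflags --ldflags --libs orcjit native) -o jit_demo
#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/ExecutionEngine/Orc/ThreadSafeModule.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/TargetSelect.h"
#include <cstdio>
#include <memory>

using namespace llvm;
using namespace llvm::orc;

int main() {
    // The JIT emits code for the machine the program is currently running on.
    InitializeNativeTarget();
    InitializeNativeTargetAsmPrinter();

    // Build, in memory, the equivalent of: int add(int a, int b) { return a + b; }
    auto Ctx = std::make_unique<LLVMContext>();
    auto M = std::make_unique<Module>("jit_demo", *Ctx);
    IRBuilder<> B(*Ctx);
    Type *I32 = B.getInt32Ty();
    auto *FnTy = FunctionType::get(I32, {I32, I32}, /*isVarArg=*/false);
    auto *Fn = Function::Create(FnTy, Function::ExternalLinkage, "add", M.get());
    B.SetInsertPoint(BasicBlock::Create(*Ctx, "entry", Fn));
    B.CreateRet(B.CreateAdd(Fn->getArg(0), Fn->getArg(1)));

    // Hand the module to the JIT: nothing has been compiled yet.
    auto JIT = cantFail(LLJITBuilder().create());
    cantFail(JIT->addIRModule(ThreadSafeModule(std::move(M), std::move(Ctx))));

    // Looking the symbol up triggers compilation to native code "just in time".
    // Note: on older LLVM releases, lookup() returns a symbol object and the
    // conversion to a function pointer is written differently.
    auto AddAddr = cantFail(JIT->lookup("add"));
    int (*Add)(int, int) = AddAddr.toPtr<int(int, int)>();
    std::printf("add(2, 3) = %d\n", Add(2, 3));
    return 0;
}
```

The key point is the last step: machine code for the add function only comes into existence at run time, at the moment it is actually needed.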

IR or intermediate representation

As we saw above, LLVM’s name refers to “low-level virtual machines”. And here we have another key to LLVM: it doesn’t generate code adapted to a particular microprocessor, but rather intermediate code, close to machine language but independent of any particular processor.

This “intermediate representation” (IR) can be adapted to all kinds of processors, from conventional CPUs to graphics processors. Once again, this approach has a number of advantages:

  • IR can be optimized independently of the final, processor-specific code.
  • IR code is easier to analyze and debug than machine language.
  • When a new processor is released, support for it can be added by writing just the final stage that translates IR into that processor’s machine language.
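
To make this concrete, here is a small example (the file name and the exact IR shown are illustrative; the real output depends on the Clang version and optimization level). Compiling it with `clang++ -S -emit-llvm -O1 square.cpp` produces a textual IR file, square.ll, rather than machine code:

```cpp
// square.cpp: a trivial function and, in the comment below, the kind of
// LLVM IR that `clang++ -S -emit-llvm -O1 square.cpp` emits for it.
//
//   define i32 @square(i32 %x) {
//   entry:
//     %mul = mul nsw i32 %x, %x
//     ret i32 %mul
//   }
//
// This IR is neither C++ nor x86/ARM machine code: it is the processor-
// independent form that LLVM analyzes and optimizes before a back end
// turns it into instructions for one specific chip.
extern "C" int square(int x) {  // extern "C" keeps the symbol named @square
    return x * x;
}
```

From that .ll file, LLVM's other modules take over: the optimizer improves the IR, and a back end finally produces code for the chosen processor.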

A rich ecosystem

LLVM also benefits from a vast ecosystem of tools and extensions capable of extending its capabilities. This is due in part to its high-quality user and developer communities. For example, the Clang compiler (based on LLVM for languages such as C, C++ and Objective-C) is actively supported by employees from Google, Apple, Mozilla and ARM. Similarly, recent high-performance languages such as Rust and Swift, whose compilers are also built on LLVM, have large, active communities.
