Program Loading and Memory Mapping in Linux

August 6, 2021

This is a summary of program loading, dynamic paging, signal handling, and memory mapping in Linux.

execve Syscall

One of an operating system’s basic services is loading programs into memory for execution. Programs rely on the execve syscall to have the OS load a program into memory and start executing it as a process. The kernel version we used for testing is 5.4.0. A quick search in Elixir gives us:

SYSCALL_DEFINE3(execve,
        const char __user *, filename,
        const char __user *const __user *, argv,
        const char __user *const __user *, envp)
{
    return do_execve(getname(filename), argv, envp);
}
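From user space, the same syscall is reached through the execve wrapper in libc. The following is a minimal sketch of the usual fork-then-exec pattern; run_program is a hypothetical helper name chosen for illustration, and the exit code 127 for a failed exec is a shell convention adopted here as an assumption.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run `path` in a child process and return its exit status,
 * or -1 on error. run_program is a hypothetical helper, not a libc call.
 */
int run_program(const char *path, char *const argv[], char *const envp[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: on success execve never returns, because the kernel
         * replaces this process image with the new program. */
        execve(path, argv, envp);
        _exit(127);   /* reached only if execve failed */
    }

    /* Parent: wait for the child and report how it exited. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_program("/bin/sh", argv, envp) with argv set to { "sh", "-c", "exit 7", NULL } should return the shell's exit status.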

Following the function calls, we eventually reach __do_execve_file. Its comment says “sys_execve() executes a new program”, which is pretty straightforward. This function first checks the filename pointer. Then it checks the flags of the current process to make sure the limit on the number of running processes (RLIMIT_NPROC) is not exceeded:

if (IS_ERR(filename))
    return PTR_ERR(filename);

/*
 * We move the actual failure in case of RLIMIT_NPROC excess from
 * set*uid() to execve() because too many poorly written programs
 * don't check setuid() return code.  Here we additionally recheck
 * whether NPROC limit is still exceeded.
 */
if ((current->flags & PF_NPROC_EXCEEDED) &&
    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
    retval = -EAGAIN;
    goto out_ret;
}

/* We're below the limit (still or again), so we don't want to make
 * further execve() calls fail. */
current->flags &= ~PF_NPROC_EXCEEDED;
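The limit that the kernel compares against here can be inspected from user space with getrlimit. This is a small sketch for illustration only; print_nproc_limit is a hypothetical helper name.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print one limit value, handling the "unlimited" sentinel. */
static void print_limit(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        printf("%s: unlimited\n", label);
    else
        printf("%s: %llu\n", label, (unsigned long long)value);
}

/*
 * Print the soft and hard RLIMIT_NPROC values for the calling process.
 * Returns 0 on success and -1 if getrlimit fails.
 */
int print_nproc_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl) != 0)
        return -1;
    print_limit("soft RLIMIT_NPROC", rl.rlim_cur);
    print_limit("hard RLIMIT_NPROC", rl.rlim_max);
    return 0;
}
```

The kernel check above uses the soft limit, which is what rlimit(RLIMIT_NPROC) returns inside the kernel.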

The next important task is to allocate the struct linux_binprm structure, defined in include/linux/binfmts.h. This structure holds the arguments that are used when loading binaries.

bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
    goto out_files;

Next, the function performs a series of tasks to prepare the bprm struct. Refer to the linux-insides book for more information on exactly how the bprm structure is filled in.

The most important function reached from __do_execve_file (via exec_binprm) is search_binary_handler. According to its comment, this function cycles through the list of binary format handlers until one recognizes the image. The core of the loop, which runs under binfmt_lock, is:

list_for_each_entry(fmt, &formats, lh) {
    if (!try_module_get(fmt->module))
        continue;
    read_unlock(&binfmt_lock);

    bprm->recursion_depth++;
    retval = fmt->load_binary(bprm);
    bprm->recursion_depth--;

    read_lock(&binfmt_lock);
    put_binfmt(fmt);
    if (retval < 0 && !bprm->mm) {
        /* we got to flush_old_exec() and failed after it */
        read_unlock(&binfmt_lock);
        force_sigsegv(SIGSEGV);
        return retval;
    }
    if (retval != -ENOEXEC || !bprm->file) {
        read_unlock(&binfmt_lock);
        return retval;
    }
}

We can see it calls into load_binary:

retval = fmt->load_binary(bprm);

Here, load_binary is a function pointer in the linux_binfmt struct. For the ELF format, it is set as follows:

static struct linux_binfmt elf_format = {
	.module		= THIS_MODULE,
	.load_binary	= load_elf_binary,
	.load_shlib	= load_elf_library,
	.core_dump	= elf_core_dump,
	.min_coredump	= ELF_EXEC_PAGESIZE,
};
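The dispatch pattern used by search_binary_handler can be imitated in user space. The sketch below is a simplified stand-in, not the kernel code: struct binfmt, find_handler, and the two toy handlers are hypothetical names, and each handler only inspects the first bytes of the image. As in the kernel loop, a handler returns -ENOEXEC to say "not my format", and the loop moves on to the next one.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct linux_binfmt: a name plus a callback. */
struct binfmt {
    const char *name;
    int (*load_binary)(const unsigned char *buf, size_t len);
};

/* Accept images that start with the ELF magic bytes. */
static int load_elf(const unsigned char *buf, size_t len)
{
    if (len < 4 || memcmp(buf, "\177ELF", 4) != 0)
        return -ENOEXEC;
    return 0;
}

/* Accept images that start with "#!". */
static int load_script(const unsigned char *buf, size_t len)
{
    if (len < 2 || buf[0] != '#' || buf[1] != '!')
        return -ENOEXEC;
    return 0;
}

static const struct binfmt formats[] = {
    { "elf", load_elf },
    { "script", load_script },
};

/*
 * Try each handler in turn and stop at the first result that is not
 * -ENOEXEC. Returns the name of the handler that accepted the image,
 * or NULL if none did.
 */
const char *find_handler(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < sizeof(formats) / sizeof(formats[0]); i++) {
        int retval = formats[i].load_binary(buf, len);
        if (retval != -ENOEXEC)
            return retval == 0 ? formats[i].name : NULL;
    }
    return NULL;
}
```

The function-pointer table is the design point: adding a new format means adding one entry, and the loop itself never changes.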

The load_elf_binary function is defined in fs/binfmt_elf.c. It first checks the magic number in the ELF file header. The ELF format is described on Wikipedia. For both 32-bit and 64-bit systems, the e_ident field must begin with the magic number for ELF files.

/* Get the exec-header */
loc->elf_ex = *((struct elfhdr *)bprm->buf);

retval = -ENOEXEC;
/* First of all, some simple consistency checks */
if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
    goto out;
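The same magic-number check can be done from user space with the ELFMAG and SELFMAG constants from <elf.h>. This is a minimal sketch; is_elf is a hypothetical helper name.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if the file starts with the ELF magic ("\177ELF"),
 * 0 if it does not, and -1 if the file cannot be opened.
 */
int is_elf(const char *path)
{
    unsigned char ident[SELFMAG];
    FILE *f = fopen(path, "rb");

    if (!f)
        return -1;
    size_t n = fread(ident, 1, SELFMAG, f);
    fclose(f);
    if (n != SELFMAG)
        return 0;   /* too short to hold the magic */
    return memcmp(ident, ELFMAG, SELFMAG) == 0;
}
```

On Linux, is_elf("/proc/self/exe") checks the running program's own binary.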

Then, load_elf_binary does some further preparation for the executable file. After that, it tries to load the program header table:

elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
if (!elf_phdata)
    goto out;

Then it traverses the program header table looking for a PT_INTERP segment, which names the interpreter (the dynamic linker, such as ld.so) responsible for loading shared libraries and performing relocations before the program itself starts. Once the interpreter is found, the function performs simple consistency checks on it and loads its program headers:

/* Load the interpreter program headers */
interp_elf_phdata = load_elf_phdrs(&loc->interp_elf_ex,
                    interpreter);
if (!interp_elf_phdata)
    goto out_free_dentry;
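To see what the kernel is looking for, the program header walk can be reproduced in user space. The sketch below assumes a 64-bit ELF file and ignores the extended e_phnum encoding; find_interp is a hypothetical helper name.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the PT_INTERP path of a 64-bit ELF file into `out`.
 * Returns 0 on success, 1 if there is no PT_INTERP segment
 * (for example a static binary), and -1 on any error.
 */
int find_interp(const char *path, char *out, size_t outlen)
{
    Elf64_Ehdr eh;
    Elf64_Phdr ph;
    int rc = 1;
    FILE *f = fopen(path, "rb");

    if (!f)
        return -1;
    if (fread(&eh, sizeof(eh), 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_phentsize != sizeof(ph)) {
        fclose(f);
        return -1;
    }

    /* Walk the program header table, one entry at a time. */
    for (unsigned i = 0; i < eh.e_phnum; i++) {
        long pos = (long)(eh.e_phoff + (Elf64_Off)i * eh.e_phentsize);

        if (fseek(f, pos, SEEK_SET) != 0 ||
            fread(&ph, sizeof(ph), 1, f) != 1) {
            rc = -1;
            break;
        }
        if (ph.p_type != PT_INTERP)
            continue;

        /* The segment holds the interpreter path as a C string. */
        if (ph.p_filesz == 0 || ph.p_filesz > outlen ||
            fseek(f, (long)ph.p_offset, SEEK_SET) != 0 ||
            fread(out, 1, ph.p_filesz, f) != ph.p_filesz) {
            rc = -1;
        } else {
            out[ph.p_filesz - 1] = '\0';
            rc = 0;
        }
        break;
    }
    fclose(f);
    return rc;
}
```

For a dynamically linked binary this typically yields the path of the dynamic linker, which is the file the kernel opens as the interpreter.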

load_elf_binary then calls setup_arg_pages to finalize the stack vm_area_struct:

/* Do this so that we can load the interpreter, if need be.  We will
    change some of these later */
retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
                executable_stack);
if (retval < 0)
    goto out_free_dentry;

It also mmaps the ELF image into the correct location in memory, and sets up the bss and brk regions for the executable:

/* Now we do a little grungy work by mmapping the ELF image into
    the correct location in memory. */
for(i = 0, elf_ppnt = elf_phdata;
    i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
        
        ...

        /* There was a PT_LOAD segment with p_memsz > p_filesz
           before this one. Map anonymous pages, if needed,
           and clear the area.  */
        retval = set_brk(elf_bss + load_bias,
                    elf_brk + load_bias,
                    bss_prot);
        if (retval)
            goto out_free_dentry;
        nbyte = ELF_PAGEOFFSET(elf_bss);
        if (nbyte) {
            nbyte = ELF_MIN_ALIGN - nbyte;
            if (nbyte > elf_brk - elf_bss)
                nbyte = elf_brk - elf_bss;
            if (clear_user((void __user *)elf_bss +
                        load_bias, nbyte)) {
            }
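The nbyte arithmetic in this excerpt computes how much of the last file-backed page must be zeroed so that the bss reads as zero. The sketch below repeats that calculation in isolation; PAGE_SZ is an assumed 4096-byte page standing in for ELF_MIN_ALIGN, and bss_zero_bytes is a hypothetical helper name.

```c
#define PAGE_SZ 4096UL   /* assumed page size, stands in for ELF_MIN_ALIGN */

/*
 * Number of bytes to clear starting at elf_bss (the end of the
 * file-backed data): up to the end of that page, or up to elf_brk
 * if the bss ends earlier.
 */
unsigned long bss_zero_bytes(unsigned long elf_bss, unsigned long elf_brk)
{
    unsigned long nbyte = elf_bss & (PAGE_SZ - 1);   /* offset within the page */

    if (nbyte) {
        nbyte = PAGE_SZ - nbyte;                     /* bytes left in the page */
        if (nbyte > elf_brk - elf_bss)
            nbyte = elf_brk - elf_bss;               /* bss ends before page end */
    }
    return nbyte;
}
```

For example, with elf_bss at 0x601234 and elf_brk at 0x603000, the remainder of the page from 0x601234 to 0x602000 is cleared, which is 0xdcc bytes. Pages beyond that are fresh anonymous pages and are already zero.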

load_elf_binary also calls elf_map to map each PT_LOAD segment at [vaddr, vaddr + file size], rounded out to page boundaries, and then performs some checks on the result:

error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
				elf_prot, elf_flags, total_size);
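Because mmap works on whole pages, the segment's address, file offset, and size all have to be rounded before mapping. The following sketch shows that rounding under an assumed 4096-byte page size; struct map_request and segment_map_request are hypothetical names, and the code only computes the values rather than calling mmap.

```c
#define PAGE_SZ 4096UL   /* assumed page size */

/* Page-aligned parameters that would be handed to mmap. */
struct map_request {
    unsigned long addr;   /* start address, rounded down to a page */
    unsigned long off;    /* file offset, rounded down by the same amount */
    unsigned long size;   /* length, rounded up to whole pages */
};

/*
 * Compute the aligned mapping for a segment with virtual address
 * `vaddr`, file offset `p_offset`, and file size `p_filesz`.
 * ELF requires vaddr and p_offset to be congruent modulo the page
 * size, so subtracting the same in-page offset aligns both.
 */
struct map_request segment_map_request(unsigned long vaddr,
                                       unsigned long p_offset,
                                       unsigned long p_filesz)
{
    unsigned long page_off = vaddr & (PAGE_SZ - 1);
    struct map_request r;

    r.addr = vaddr - page_off;
    r.off = p_offset - page_off;
    r.size = (p_filesz + page_off + PAGE_SZ - 1) & ~(PAGE_SZ - 1);
    return r;
}
```

For a segment at vaddr 0x401234 with file offset 0x1234 and file size 0x2000, this gives a mapping at 0x401000 from file offset 0x1000 with length 0x3000.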

The interpreter is then loaded:

elf_entry = load_elf_interp(&loc->interp_elf_ex,
                interpreter,
                &interp_map_addr,
                load_bias, interp_elf_phdata);

Finally, the ELF tables are created:

retval = create_elf_tables(bprm, &loc->elf_ex,
            load_addr, interp_load_addr);

After everything is prepared, start_thread is called, which sets up the new task’s registers and segments for execution. It receives the set of registers for the new task, the address of its entry point, and the address of the top of its stack:

start_thread(regs, elf_entry, bprm->p);

A lot of the information here can also be found in the linux-insides book. I found it very helpful in clearing up my confusion.

In our own implementations, we do not call the loaded program’s main function directly. Instead, our loader transfers control to the entry point of the loaded program with a jmp instruction. This differs from calling main in two major ways:

Jumping to the entry point means the glibc startup functions run before main is called. This includes setting up thread-local storage. Calling main directly would simply jump into main with the loader’s TLS; no other setup would be involved.
jmp doesn’t push a return address onto the stack. When the loaded program finishes execution, it exits the loader process instead of giving control back to the caller.