BUSYBOX
BusyBox
About Documentation Get BusyBox Development

Links

Developer Pages

How to build a Busybox binary with reduced memory usage

Busybox is designed to be frugal with memory usage. However, it is still written in C. Compiler and linker are usually not written with a focus to strongly minimize memory usage.

Overview of RAM usage by Linux binaries

Executables have the following parts: read-only executable code and constants, also known as "text", read-write initialized data, and read-write non-initialized (zeroed on demand) data, also known as "bss".

At runtime, all text pages are mapped RO and executable. A RO mapping, like any other, can only cover whole pages, not parts of them. What happens to the last page of text? The entire page is mapped RO, including the part beyond the last text byte.

Data pages are mapped RW and they are file-backed (this includes a small portion of bss which may live in the last partial page of data). Pages which are fully in bss are mapped to anonymous memory: when a page fault requires kernel to allocate a real page, a free page is found and zero-filled. No reads from storage are needed.

File-backed RO pages are shared among all copies of running process. RW pages, both data and bss, exist as separate copies for each process. (They may be shared as long as they are not modified. However, well-designed programs don't put constant data into RW sections, therefore in practice almost all RW data or bss pages which were touched by the program are modified, and therefore not shared).

When process starts executing, two more mappings appear. One is the stack. It is usually located near the top of virtual address space, and contains program command line, environment and ELF auxiliary vector. This is an anonymouns mapping.

Another anonymouns mapping is created by malloc subsystem in libc. This mapping is initially empty, but it is a growable and shrinkable via the brk(2) system call. Normally it starts right after the last page of bss, but kernels configured with address space randomization can place it elsewhere. Moderately sized malloc requests are served from this area. pmap shows it as "[heap]" mapping.

You can see these mappings with the following command, where busybox shows its own memory map:

$ sh -c 'exec busybox pmap $$'
1628: busybox pmap 1628
08048000     900K r-xp  /bin/busybox  - text
08129000       4K rw-p  /bin/busybox  - data
0812a000      12K rw-p    [ anon ]    - bss
08854000       8K rw-p  [heap]        - malloc space
f76fe000       8K r--p  [vvar]
f7700000       8K r-xp  [vdso]
ff82a000     132K rw-p  [stack]       - stack

It is important to minimize the number of RW pages your program touches. Data, bss and stack pages are never freed, therefore for large, and especially for temporary allocations, it's best to use malloc() or mmap(). One way to use fewer RW pages is to have fewer RW pages in your binary. There are several scenarios when a binary may end up having more data+bss pages than necessary.

It is possible to have a Busybox binary with almost all applets and only 8 kilobytes of data+bss.

Using libc which is better at using RAM sparingly

Static builds use RAM significantly more economically than dynamic ones. Here is a comparison of uclibc-based builds, the only difference between them is CONFIG_STATIC=y:

$ busybox pmap $$
08048000     792K r-xp  /bin/busybox
0810e000       4K rw-p  /bin/busybox
0810f000       8K rw-p  [heap]
f7f9e000     260K r-xp  /lib/libuClibc-0.9.30.3.so
f7fdf000       4K r--p  /lib/libuClibc-0.9.30.3.so
f7fe0000       4K rw-p  /lib/libuClibc-0.9.30.3.so
f7fe1000      16K rw-p    [ anon ]
f7fe5000      44K r-xp  /lib/libm-0.9.30.3.so
f7ff0000       4K r--p  /lib/libm-0.9.30.3.so
f7ff1000       4K rw-p  /lib/libm-0.9.30.3.so
f7ff3000       4K rw-p    [ anon ]
f7ff8000      16K r-xp  /lib/ld-uClibc-0.9.30.3.so
f7ffc000       4K r--p  /lib/ld-uClibc-0.9.30.3.so
f7ffd000       4K rw-p  /lib/ld-uClibc-0.9.30.3.so
fffdd000     132K rw-p  [stack]
mapped: 1316K
$ busybox.static pmap $$
08048000     900K r-xp  /bin/busybox.static
08129000       4K rw-p  /bin/busybox.static
0812a000      20K rw-p  [heap]
fffdd000     132K rw-p  [stack]
mapped: 1072K

Static build has only 24K mapped RW (excluding stack). Dynamic build has 4K+8K+4K+16K+4K+4K+4K = 44K, almost twice as much.

Dynamic builds have benefits that they allow different binaries to reuse library's text, but in the case of Busybox, you already reuse text - because all running applets are the same binary.

Does it mean "always build static"? Unfortunately, not. Glibc is horrible for static builds, their developers declared that static builds will be supported, but will not be optimized for. As a result, binaries statically built against glibc are some 400K larger than dynamic ones - and they still may need some libraries for e.g. DNS name resolution to work. [TODO: how much data + bss glibc adds?]

As you see above, uclibc is much better in this regard. Static build has only ~100K larger text section. Musl also shows promise, although I did not switch to it yet.

Things to look for when you evaluate some libc for static building: it should have modest data and bss. Ideally, libs should not initialize "[heap]" before user program calls malloc(). Libc should be able to aggressively prune (return to OS) freed malloc space (glibc defaults are bad). Even though we don't need that in Busybox, there should be a method to prune used (dirtied) stack space after the use of deep recursion or large on-stack objects.

If you build your libc yourself, look into ways to tweak the build to be more space-efficient.

Optimizing start of data

A simple solution for RO/RW mappings for text and data would be to pad text to a full page. However, on some architectures pages are quite large and this would cause a significant growth of on-disk size of the binary. GNU linker employs a hack to avoid that: sometimes it starts data immediately after text, without padding. The last, "mixed" page, gets mapped twice: once as a RO page, and second time as a RW page (usually at the position of the very next page in the virtual address space).

This reduces image size on-disk, yes, but it wastes RAM: now there is an unused part of first data page which takes up RAM. This may push the opposite, tail end of the data/bss sections far enough that they now need one more page.

This can be fixed if we make GNU linker to use our own linker script instead of a built-in one. We will do this by modifying its default script. Every busybox build generates a "busybox_unstripped.out" file. Among other infromation, it contains a part which says "using internal linker script:", and the script follows.

Cut out the script and put it into a file named "busybox_ldscript". Now the build will use this script, you should see a message "Custom linker script 'busybox_ldscript' found, using it" next time.

To unconditionally align data to the next page boundary, find a part which looks similar to:

/* Adjust the address for the data segment.  We want to adjust up to
   the same address within the page on the next page up.  */
. = ALIGN (0x1000) - ((0x1000 - .) & (0x1000 - 1)); . = DATA_SEGMENT_ALIGN (0x1000, 0x1000);

and replace it with:

. = ALIGN (0x1000); . = DATA_SEGMENT_ALIGN (0x1000, 0x1000);

Reducing padding between data

Busybox is linked with "--sort-section alignment". This helps, but unfortunately ld is not doing good enough job. It's possible to help it.

Output sections in the executable are specified in linker script as follows:

.rodata: { *(.rodata .rodata.*) }

If in such rules bare section wildcards are replaced with "SORT_BY_ALIGNMENT(wildcard)", the result is often more compact. In practice, vast majority of sections which benefit from alignment sorting are .rodata, .data and .bss ones, thus it's enough to only changle three locations in the linker script:

   *(SORT_BY_ALIGNMENT(.rodata) SORT_BY_ALIGNMENT(.rodata.*) {the rest})
 ...
   *(SORT_BY_ALIGNMENT(.data) SORT_BY_ALIGNMENT(.data.*) {the rest})
 ...
   *(SORT_BY_ALIGNMENT(.bss) SORT_BY_ALIGNMENT(.bss.*) {the rest})

Unfortunately, a more optimal "SORT_BY_ALIGNMENT(.rodata .rodata.*)", which would sort .rodata and .rodata.foo sections by alignment as one group, not separately, is not a valid syntax.

GCC, even with -Os, compiles byte and 16-bit arrays into data object definitions which require word alignment (at least 4 bytes). Some versions even require 32-byte alignment for arrays of more than 32 bytes long. This includes explicitly declared string arrays.

As a result, these arrays often have padding added at the end when they are packed by linker into the binary. As of 2016, linker is not clever enough yet to use that padding for tiny data objects (e.g. bool variables).

The alignment can be relaxed by explicit ALIGN1 or ALIGN2 attributes on such arrays:

static const char extn[][5] ALIGN1 = { ".zip", ".ZIP" };

The above prevents 2 bytes of padding in the binary.

Busybox code is evolving, new arrays constantly pop up, coders forget to add ALIGNn on them. You can run "grep -F -B3 '*fill*' busybox_unstripped.map" to find all linker-added padding in your binary, and add forgotten ALIGNn's. Please send a patch if you do.

Converting bss to data

Busybox's data and bss sections are small already, some 4-12 kilobytes. But they still require two mappings (VMAs in kernel-speak) because they have different attributes: data mapping is file-backed, while bss is anonymous. We can save a bit of memory on the kernel side, for every process, by not requiring two VMAs. This can be achieved by changing linker script to place all formerly-bss variables to data:

  .data:
  {
    *(.data .data.* .gnu.linkonce.d.*)
    SORT(CONSTRUCTORS)
  }
  ......................
  .bss:
  {
    *(.dynbss)                       ---
    *(.bss .bss.* .gnu.linkonce.b.*) --- move these lines to .data {} block
    *(COMMON)                        ---
    . = ALIGN(. != 0 ? 64 / 8 : 1);
  }

The result of this change is visible in pmap as absense of "[ anon ]" mapping:

08048000     856K r-xp  /bin/busybox
0811e000       8K rw-p  /bin/busybox -- former bss is in here
08bbf000       4K rw-p  [heap]
f77a2000       8K r--p  [vvar]
f77a4000       8K r-xp  [vdso]
ffa18000     132K rw-p  [stack]

This has a drawback that on-disk binary contains a few zeroed pages and they will need to be read when formerly-bss variables are touched. IOW: this has a small speed penalty.

Use space at the end of bss: FEATURE_USE_BSS_TAIL

The end of bss is usually not page-aligned. There is an unused space in the last page. Linker marks that location with the _end symbol.

The FEATURE_USE_BSS_TAIL option attempts to use that space for bb_common_bufsiz1[] array. If it fits after _end, that space will be used, and COMMON_BUFSIZE will be enlarged from its guaranteed minimum size of 1 kbyte. This may require recompilation a second time, since the position of _end is known only after final link.



Copyright © 1999-2008 Erik Andersen
Mail all comments, insults, suggestions and bribes to
Denys Vlasenko vda.linux@googlemail.com
This site created with the vi editor This site is kindly hosted by OSL