3. The magic of the Elf

 

Any sufficiently advanced technology is indistinguishable from magic.

 Arthur C. Clarke

Building executables from C source code is a complex task. An innocent-looking call of gcc(1) will invoke a pre-processor, a multi-pass compiler, an assembler and finally a linker. Using all these tools to plant virus code into another executable makes the result either prohibitively large, or very dependent on the completeness of the target installation.

Real viruses approach the problem from the other end. They are aggressively optimized for code size and do only what's absolutely necessary. Basically they just copy one chunk of code and patch a few addresses at hard-coded offsets.

However, this has drastic effects:

There are ways to circumvent these limitations. But they are complicated and make the virus more likely to fail.

3.1. Executable and linkable format

Another natural limitation of viruses is their rigid dependency on the file format of target executables. These formats differ a lot, even on the same hardware architecture or under the same operating system. Furthermore, executables are not designed with post-link-time modifications in mind. It's rare for a virus to support more than one infection method. This document is about the format used on recent versions of Linux, FreeBSD and Solaris.

From the Portable Formats Specification, Version 1.1:

The Executable and Linking Format was originally developed and published by UNIX System Laboratories (USL) as part of the Application Binary Interface (ABI). The Tool Interface Standards committee (TIS) has selected the evolving ELF standard as a portable object file format that works on 32-bit Intel Architecture environments for a variety of operating systems.

Actually ELF is defined for a variety of both 32 bit and 64 bit architectures. This document tries to cover multiple platforms through conditional compilation. There is a configure.pl that determines the host type and sets up a Makefile. The Makefile uses individual sub-directories for each platform and exports the name of these directories (and some other platform specific values) as environment variables. Most of the shell scripts invoked by make(1) are shown here. The following table should help to understand them.

Table 1. Environment variables exported by Makefile

Variable      Value on this platform
${ARCH}       i386
${CFLAGS}     -Wall -O2 -march=i586
${ELFBASE}    0x08048000
${OUT}        out/i386
${TMP}        tmp/i386

3.2. The language of mortals

For the first example I'll present the simplest piece of code that still gives sufficient feedback. Our aim is to implant it into /bin/sh. On practically every recent installation of Linux/i386 the following code will emit three magic letters instead of just dumping core.

Source: out/i386/magic_elf/magic_elf.c
#include <unistd.h>
int main() { write(1, (void*)0x08048001, 3); return 0; }

It is not an error that a file called magic_elf.c is located in a directory called out/i386. The Makefile building this document did trivial pre-processing on the original source file. ELF is used on many architectures. And each has a different magic value.

Command: src/magic_elf/cc.sh
#!/bin/sh
gcc ${CFLAGS} ${OUT}/magic_elf/magic_elf.c \
	-o ${TMP}/magic_elf/magic_elf \
&& ${TMP}/magic_elf/magic_elf

Output: out/i386/magic_elf/magic_elf
ELF

3.3. How it works

3.3.1. Digested answer

The three letters are part of the signature of ELF files. Executables created by ld(1) are always mapped into the same memory region. That's why the program can find its own header at a predictable virtual address.
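The signature test itself fits into a few lines of C. This is only a sketch: the name is_elf is made up for this illustration and is not part of any library. It looks at nothing but the first four bytes of a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf starts with the ELF signature. */
int is_elf(const unsigned char *buf, size_t len)
{
  /* 0x7f followed by the three letters 'E', 'L', 'F' */
  static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

  assert(buf != NULL);
  return len >= sizeof(magic) && memcmp(buf, magic, sizeof(magic)) == 0;
}
```

Feeding it the first bytes of any file read with fread(3) is enough to tell ELF files from shell scripts and other impostors.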

3.3.2. Short answer

RTFM.

The raw details are in /usr/include/elf.h. The canonical document describing the ELF file format for Intel-386 architectures can be found at ftp://tsx.mit.edu/pub/linux/packages/GCC/ELF.doc.tar.gz. A flat-text version is http://www.muppetlabs.com/~breadbox/software/ELF.txt. And finally http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html humorously describes how far you can bend the rules to reach minimal size.

3.3.3. Sort of an answer

0x8048000 is not a natural constant, but happens to be the default base address of ELF executables produced by ld(1). As of version 2.11 of binutils it should be possible to change that with options -Ttext ORG and --section-start SECTIONNAME=ORG, but I didn't get it working. Anyway, the layout of executables produced by ld(1) is straightforward.

  1. One ELF header - Elf32_Ehdr

  2. Program headers - Elf32_Phdr

  3. Program interpreter (not if statically linked)

  4. Code

  5. Data

  6. Section headers - Elf32_Shdr

Everything from the start of the file to the last byte of code is mapped into one segment (named "code" or "text") that begins at the base address. There is a whole chapter called readelf describing a command to view all these details. In the meantime I will show fancy ways to get by without.
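The ELF header tells where the other pieces of that layout live. The following sketch pulls the relevant fields out of the first 52 bytes of a 32-bit little-endian file. The struct elf32_layout and the function names are my own invention for this illustration; the field offsets (e_phoff at 28, e_shoff at 32, e_phnum at 44, e_shnum at 48) follow the declaration of Elf32_Ehdr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read little-endian values from a byte buffer (i386 ELF files are LSB). */
uint16_t le16(const unsigned char *p)
{
  return (uint16_t)(p[0] | (p[1] << 8));
}

uint32_t le32(const unsigned char *p)
{
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
       | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical summary of where the header tables live in the file. */
struct elf32_layout {
  uint32_t phoff;  /* file offset of program headers (e_phoff) */
  uint16_t phnum;  /* number of program headers      (e_phnum) */
  uint32_t shoff;  /* file offset of section headers (e_shoff) */
  uint16_t shnum;  /* number of section headers      (e_shnum) */
};

/* Fill *out from the start of a 32-bit ELF file; returns 0 on success. */
int elf32_layout_read(const unsigned char *hdr, size_t len,
                      struct elf32_layout *out)
{
  assert(hdr != NULL && out != NULL);
  if (len < 52)                /* sizeof(Elf32_Ehdr) is 52 */
    return -1;
  out->phoff = le32(hdr + 28);
  out->shoff = le32(hdr + 32);
  out->phnum = le16(hdr + 44);
  out->shnum = le16(hdr + 48);
  return 0;
}
```

Reading the fields byte by byte avoids any dependency on the byte order of the machine running the tool. On a matching host a plain cast to Elf32_Ehdr from /usr/include/elf.h would do as well.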

3.4. Showing off some tools

What would you do if you knew nothing about ELF and just asked yourself how that example works? How can you make sure that the executable file really contains those three letters?

A good start for finding text in binary files is strings(1).

Command: src/magic_elf/strings.sh
#!/bin/sh
# without "-a -n 3" we don't get any output
strings -a -n 3 ${TMP}/magic_elf/magic_elf | grep -n ELF

Output: out/i386/magic_elf/strings
1:ELF

The leading 1: is written by grep(1) and tells us that our three-letter word is the first string found. This hints at where to look for it in a hex dump. It is difficult to search for strings in such a dump because of line breaks. Interactive tools like hexedit(1) might be useful.
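What strings(1) does can be approximated in C. The sketch below is not the real implementation, just a scan for the first run of printable characters of a given minimum length; first_string is a name I made up. GNU strings uses a default minimum of four characters, which is why a three-letter word needs -n 3.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: find the first run of at least min_len printable
 * characters. Returns the offset of the run, or -1 if there is none.
 */
long first_string(const unsigned char *buf, size_t len, size_t min_len)
{
  size_t run = 0;

  assert(buf != NULL && min_len > 0);
  for (size_t i = 0; i < len; i++) {
    if (isprint(buf[i])) {
      if (++run == min_len)
        return (long)(i + 1 - min_len);
    } else {
      run = 0;                 /* a non-printable byte ends the run */
    }
  }
  return -1;
}
```

Applied to the first bytes of an ELF file with a minimum length of 3 this returns offset 1, the byte right after the leading 0x7f.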

Command: src/magic_elf/od.sh
#!/bin/sh

# select ASCII characters or backslash escapes (octal)
od -N 16 -c ${TMP}/magic_elf/magic_elf | head -1

# named characters (ASCII)
od -N 16 -a ${TMP}/magic_elf/magic_elf | head -1

# plain bytewise hex
od -N 16 -t x1 ${TMP}/magic_elf/magic_elf | head -1

Output: out/i386/magic_elf/od
0000000 177   E   L   F 001 001 001  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000000 del   E   L   F soh soh soh nul nul nul nul nul nul nul nul nul
0000000 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00

At this point we can guess that file offset 1 and 0x8048000 + 1 are not coincidental. A test program might help.
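The three 001 bytes following the letters are not random either. According to the ELF specification they are the class, data encoding and version fields of e_ident. A small sketch to decode the first two; the function names are hypothetical, the numeric values are those of ELFCLASS32, ELFCLASS64, ELFDATA2LSB and ELFDATA2MSB.

```c
#include <assert.h>
#include <stddef.h>

/* Indices into e_ident, as defined in the ELF specification. */
enum { IDX_CLASS = 4, IDX_DATA = 5, IDX_VERSION = 6 };

/* Hypothetical helper: word size announced by the header. */
const char *ident_class(const unsigned char *e_ident)
{
  assert(e_ident != NULL);
  switch (e_ident[IDX_CLASS]) {
    case 1:  return "32-bit";          /* ELFCLASS32 */
    case 2:  return "64-bit";          /* ELFCLASS64 */
    default: return "invalid";
  }
}

/* Hypothetical helper: byte order announced by the header. */
const char *ident_data(const unsigned char *e_ident)
{
  assert(e_ident != NULL);
  switch (e_ident[IDX_DATA]) {
    case 1:  return "little-endian";   /* ELFDATA2LSB */
    case 2:  return "big-endian";      /* ELFDATA2MSB */
    default: return "invalid";
  }
}
```

For the dump above both functions give the expected answer for i386: a 32-bit, little-endian file.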

3.5. The address of main

Source: out/i386/magic_elf/addr_of_main.c
#include <stdio.h>

int main()
{
  printf("# 0x08048000=%#02x\n", *(unsigned char*)0x08048000);
  printf("# 0x08048001=%.3s\n", (char*)0x08048001);
  printf("main=%p\n", main);
  printf("ofs=%lu\n", (unsigned long)main - 0x08048000);
  return 0;
}

Output: out/i386/magic_elf/addr_of_main
# 0x08048000=0x7f
# 0x08048001=ELF
main=0x8048460
ofs=1120

Looks good. The byte at address 0x8048000 + 0 is equal to that at file offset 0. And 0x8048460 is a plausible address of function main.
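The arithmetic behind ofs is trivial, but it will be needed again. A minimal sketch, assuming the default base address from Table 1 and an address that lies in the first segment (the one mapped from file offset 0); vaddr_to_offset is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed default base address of i386 executables, ${ELFBASE} in Table 1. */
#define ELFBASE 0x08048000u

/*
 * Translate a virtual address inside the first segment to a file offset.
 * Only valid while the address lies in the segment mapped from offset 0.
 */
uint32_t vaddr_to_offset(uint32_t vaddr)
{
  assert(vaddr >= ELFBASE);
  return vaddr - ELFBASE;
}
```

With the values printed above, vaddr_to_offset(0x08048460) yields 1120 and vaddr_to_offset(0x08048001) yields 1.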

The fancy output format was chosen for a reason. It is valid input for /bin/sh.

Command: src/magic_elf/ndisasm.sh
#!/bin/sh
. ${OUT}/magic_elf/addr_of_main
ndisasm -e ${ofs} -o ${main} -U ${TMP}/magic_elf/magic_elf \
| sed -e '/ret/q'

If the listing below makes you feel uncomfortable you can have a look at http://linuxassembly.org/. If that does not help, probably nothing will. The Assembly-HOWTO gives a description of tools and sites. Robin Miyagi's Linux Programming features a tutorial and some interesting links. Advanced topics are at Assembly resources. And then there is the IA-32 Intel Architecture Software Developer's Manual.

Output: out/i386/magic_elf/ndisasm
08048460  55                push ebp
08048461  89E5              mov ebp,esp
08048463  83EC0C            sub esp,byte +0xc
08048466  6A03              push byte +0x3
08048468  6801800408        push dword 0x8048001
0804846D  6A01              push byte +0x1
0804846F  E8A4FEFFFF        call 0x8048318
08048474  31C0              xor eax,eax
08048476  89EC              mov esp,ebp
08048478  5D                pop ebp
08048479  C3                ret

Both programs have main at the same file offset. Unfortunately a brief look through /bin proves this to be pure chance.

Instead of a real system call for write we see a call to a strange negative address (check the opcode). ndisasm(1) resolves this address to a location in glibc. However, during development I encountered a configuration of my system where ndisasm(1) failed to do so. The rest of the story is still interesting, though. Yet another way to do it.

Command: src/magic_elf/gdb.sh
#!/bin/sh
file=${1:-tmp/i386/magic_elf/magic_elf}
func=${2:-main}
gdb ${file} -q <<EOT | sed -n -e '/:/p' -e '/ret *$/q' -e '/hlt *$/q'
	set disassembly-flavor intel
	disassemble ${func}
EOT

Output: out/i386/magic_elf/gdb
(gdb) (gdb) Dump of assembler code for function main:
0x8048460 <main>:	push   ebp
0x8048461 <main+1>:	mov    ebp,esp
0x8048463 <main+3>:	sub    esp,0xc
0x8048466 <main+6>:	push   0x3
0x8048468 <main+8>:	push   0x8048001
0x804846d <main+13>:	push   0x1
0x804846f <main+15>:	call   0x8048318 <write>
0x8048474 <main+20>:	xor    eax,eax
0x8048476 <main+22>:	mov    esp,ebp
0x8048478 <main+24>:	pop    ebp
0x8048479 <main+25>:	ret    

That strange negative address resolves to a function in a shared library. Not shown is a pathetic attempt to single-step to the actual code of write.

3.6. In doubt use force

We can now search for a fine manual explaining how to debug shared libraries. Or just compile the bugger static.

Command: src/magic_elf/cc_static.sh
#!/bin/sh
gcc ${CFLAGS} -static ${OUT}/magic_elf/magic_elf.c \
	-o ${TMP}/magic_elf/magic_elf_static \
&& ls -l ${TMP}/magic_elf \
&& ${TMP}/magic_elf/magic_elf_static

Output: out/i386/magic_elf/magic_elf_static
total 1684
-rwxrwxr-x    1 alba     alba        13839 Jun 23 00:50 addr_of_main
-rwxrwxr-x    1 alba     alba        13711 Jun 23 00:50 magic_elf
-rwxrwxr-x    1 alba     alba      1687693 Jun 23 00:50 magic_elf_static
ELF

Seems we found an easy way to fill up the hard disk. Anyway, what has gdb(1) to say about it?

Output: out/i386/magic_elf/static_main.gdb
(gdb) (gdb) Dump of assembler code for function main:
0x80481e0 <main>:	push   ebp
0x80481e1 <main+1>:	mov    ebp,esp
0x80481e3 <main+3>:	sub    esp,0xc
0x80481e6 <main+6>:	push   0x3
0x80481e8 <main+8>:	push   0x8048001
0x80481ed <main+13>:	push   0x1
0x80481ef <main+15>:	call   0x804cc60 <__libc_write>
0x80481f4 <main+20>:	xor    eax,eax
0x80481f6 <main+22>:	mov    esp,ebp
0x80481f8 <main+24>:	pop    ebp
0x80481f9 <main+25>:	ret    

The name of the function changed for no apparent reason. But it is reachable for disassembly now.

Output: out/i386/magic_elf/static_write.gdb
(gdb) (gdb) Dump of assembler code for function __libc_write:
0x804cc60 <__libc_write>:	push   ebx
0x804cc61 <__libc_write+1>:	mov    edx,DWORD PTR [esp+16]
0x804cc65 <__libc_write+5>:	mov    ecx,DWORD PTR [esp+12]
0x804cc69 <__libc_write+9>:	mov    ebx,DWORD PTR [esp+8]
0x804cc6d <__libc_write+13>:	mov    eax,0x4
0x804cc72 <__libc_write+18>:	int    0x80
0x804cc74 <__libc_write+20>:	pop    ebx
0x804cc75 <__libc_write+21>:	cmp    eax,0xfffff001
0x804cc7a <__libc_write+26>:	jae    0x8052bb0 <__syscall_error>
0x804cc80 <__libc_write+32>:	ret    

There are two man pages giving some overview of system calls, intro(2) and syscalls(2). The statement mov eax,4 corresponds to the value of __NR_write in /usr/include/asm/unistd.h.
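The same system call can be made from C without going through the wrapper in libc. This sketch assumes Linux with glibc, where syscall(2) is available and SYS_write from sys/syscall.h is the portable spelling of __NR_write. The function raw_write is my own name.

```c
#define _GNU_SOURCE          /* for the declaration of syscall(2) in glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>     /* SYS_write, another name for __NR_write */
#include <unistd.h>

/*
 * Call write(2) by number instead of through the libc wrapper.
 * SYS_write is 4 on Linux/i386; other platforms use other numbers.
 */
long raw_write(int fd, const void *buf, size_t count)
{
  assert(buf != NULL);
  return syscall(SYS_write, fd, buf, count);
}
```

Using the symbolic name instead of a literal 4 keeps the code working on platforms where write has a different number.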

3.7. The language of evil

The code generated by gcc(1) is not suitable for a virus. So here comes hand-crafted code optimized for size (twenty-three is the perfect number of bytes). I prefer nasm to GNU as.

Source: src/evil_magic/evil_magic.asm
		global	_start
_start:		push	byte 4
		pop	eax		; eax = 4 = write(2)
		xor	ebx,ebx
		inc	ebx		; ebx = 1 = stdout
		mov	ecx,0x08048001	; ecx = magic address
		push	byte 3
		pop	edx		; edx = 3 = three characters
		int	0x80

		xor	eax,eax
		inc	eax		; eax = 1 = exit(2)
		xor	ebx,ebx		; ebx = 0 = return code
		int	0x80

Command: src/evil_magic/nasm.sh
#!/bin/sh
nasm -f elf -o ${TMP}/evil_magic/nasm.o \
	src/evil_magic/evil_magic.asm \
&& ld -o ${TMP}/evil_magic/nasm ${TMP}/evil_magic/nasm.o \
&& ${TMP}/evil_magic/nasm

Output: out/i386/evil_magic/nasm
ELF

Output is good. But how do we get the resulting machine code? We can't just add a call to printf(3) to the assembly code. The example above is not linked with glibc; it does not even have a function called main.

3.7.1. Enter evil

On the other hand things became a lot easier. There is no initialization code that gets executed before _start, so the address of _start is really the ELF entry point of the executable. A look into /usr/include/elf.h shows that Elf32_Ehdr::e_entry is at file offset 24.

Command: src/evil_magic/od.sh
#!/bin/sh
od -j24 -An -tx4 -N4 ${TMP}/evil_magic/nasm \
| sed 's/^[[:space:]]/0x/'

Output: out/i386/evil_magic/od
0x08048080

The entry point is specified as a virtual address in memory. By subtracting the base address we get the file offset:

0x8048080 - 0x8048000 = 0x80 = 128
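For those who prefer C to od(1), here is a sketch that extracts the same field from a buffer holding the start of the file. It assumes a 32-bit little-endian ELF file; elf32_entry and E_ENTRY_OFS are names chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { E_ENTRY_OFS = 24 };   /* file offset of Elf32_Ehdr::e_entry */

/*
 * Hypothetical helper: extract the entry point from the start of a
 * little-endian 32-bit ELF file. Returns 0 if the buffer is too short.
 */
uint32_t elf32_entry(const unsigned char *hdr, size_t len)
{
  assert(hdr != NULL);
  if (len < (size_t)E_ENTRY_OFS + 4)
    return 0;

  const unsigned char *p = hdr + E_ENTRY_OFS;
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
       | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Subtracting the base address 0x08048000 from the result gives the file offset of the first instruction, 128 for the executable built above.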

3.7.2. Evil magic revealed

Command: out/i386/evil_magic/ndisasm.sh
#!/bin/sh
ndisasm -e 128 -o 0x08048080  -U tmp/i386/evil_magic/nasm | head -12

Output: out/i386/evil_magic/evil_magic.asm
08048080  6A04              push byte +0x4
08048082  58                pop eax
08048083  31DB              xor ebx,ebx
08048085  43                inc ebx
08048086  B901800408        mov ecx,0x8048001
0804808B  6A03              push byte +0x3
0804808D  5A                pop edx
0804808E  CD80              int 0x80
08048090  31C0              xor eax,eax
08048092  40                inc eax
08048093  31DB              xor ebx,ebx
08048095  CD80              int 0x80

3.7.3. Dressing up binary code

There is still one thing left: dressing up the hex dump as C source. A small filter written in perl(1) would do. However, because this tool is used throughout the document, it provides quite a few features.

The __attribute__ clause is explained in A section called .text. It is not required at this point.

Initializing the array with string literals (looking like \xDE\xAD\xBE\xEF) is easier. However, the terminating zero would not work with Doing it in C. On the other hand, a list of hexadecimal numbers introduces separating commas, which require special treatment of the last line.
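The difference between the two notations is exactly one byte. A minimal sketch to show it; the array and function names are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* A string literal silently appends a terminating zero byte ... */
const unsigned char as_string[] = "\xDE\xAD\xBE\xEF";

/* ... while a list of numbers contains exactly what is written. */
const unsigned char as_list[] = { 0xDE, 0xAD, 0xBE, 0xEF };

/* Aborts if the compiler disagrees with the sizes claimed above. */
void check_sizes(void)
{
  assert(sizeof(as_string) == 5);   /* four bytes plus the zero */
  assert(sizeof(as_list) == 4);
}
```

For a plain text string the extra byte is welcome. For machine code of a carefully counted size it is not.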

If the command line option -last_line_is_ofs is passed to the program, then the last line of the disassembly is meant to specify an offset into the code. Actually it's just the last byte of that line. You are free to use any dummy operation, like push byte 1. See Target::infection for an example.
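The heart of the filter is the conversion of the hex column of ndisasm(1) into a comma-separated list. The Perl source below does that with substr. For comparison, here is a sketch of the same step in C. It is not part of the tool set of this document, and hex_to_list is a name I made up.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical C counterpart of the inner loop of ndisasm.pl: turn the
 * hex column, e.g. "B901800408", into "0xB9,0x01,0x80,0x04,0x08".
 * Returns the number of bytes converted, or -1 if the input has an odd
 * number of digits or the output buffer is too small.
 */
int hex_to_list(const char *code, char *out, size_t out_size)
{
  assert(code != NULL && out != NULL);

  size_t n = strlen(code);
  size_t used = 0;

  if (n % 2 != 0 || out_size == 0)
    return -1;
  out[0] = '\0';
  for (size_t i = 0; i < n; i += 2) {
    /* every item but the first is preceded by a comma */
    int w = snprintf(out + used, out_size - used, "%s0x%c%c",
                     i ? "," : "", code[i], code[i + 1]);
    if (w < 0 || (size_t)w >= out_size - used)
      return -1;               /* output buffer too small */
    used += (size_t)w;
  }
  return (int)(n / 2);
}
```

The comma between lines of the listing is a separate matter. The filter adds it to every line but the last one of the array.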

Source: src/evil_magic/ndisasm.pl
#!/usr/bin/perl -sw
use strict;
use constant LINE => "  %-30s /* %-30s */\n";

$::identfier = 'main' if (!defined($::identfier));
$::size = '' if (!defined($::size));
$::align = '8' if (!defined($::align));

printf "const unsigned char %s[%s]\n", $::identfier, $::size;
print "__attribute__ (( aligned($::align), section(\".text\") )) =\n";
print "{\n";

my @line;
while(<>)
{
  s/\s+$//;
  my $code = (split())[1];
  my $dump = '0x' . substr($code, 0, 2);
  for(my $i = 2; $i < length($code); $i += 2)
  {
    $dump .= ',0x' . substr($code, $i, 2);
  }
  s/\s+[^\s]*\s+/: /;
  push @line, [ $_, $code, $dump ]
}

my $nr = 0;
my $max = $#line;
$max -= 1 if (defined($::last_line_is_ofs));
while($nr < $max)
{
  printf LINE, $line[$nr][2] . ',', $line[$nr][0];
  $nr++;
}
printf(LINE . "};\n", $line[$nr][2], $line[$nr][0]);
if (defined($::last_line_is_ofs))
{
  my $ofs = substr($line[$nr + 1][1], -2, 2);
  printf "enum { ENTRY_POINT_OFS = 0x%x };\n", hex($ofs);
}

Output: out/i386/evil_magic/evil_magic.c
const unsigned char main[]
__attribute__ (( aligned(8), section(".text") )) =
{
  0x6A,0x04,                     /* 08048080: push byte +0x4       */
  0x58,                          /* 08048082: pop eax              */
  0x31,0xDB,                     /* 08048083: xor ebx,ebx          */
  0x43,                          /* 08048085: inc ebx              */
  0xB9,0x01,0x80,0x04,0x08,      /* 08048086: mov ecx,0x8048001    */
  0x6A,0x03,                     /* 0804808B: push byte +0x3       */
  0x5A,                          /* 0804808D: pop edx              */
  0xCD,0x80,                     /* 0804808E: int 0x80             */
  0x31,0xC0,                     /* 08048090: xor eax,eax          */
  0x40,                          /* 08048092: inc eax              */
  0x31,0xDB,                     /* 08048093: xor ebx,ebx          */
  0xCD,0x80                      /* 08048095: int 0x80             */
};

Calling the string constant main is not a mistake. The output above is a complete and valid C program.

Command: src/evil_magic/cc.sh
#!/bin/sh
gcc -Wall -O2 ${OUT}/evil_magic/evil_magic.c \
	-o ${TMP}/evil_magic/cc \
&& ${TMP}/evil_magic/cc

Output: out/i386/evil_magic/cc
out/i386/evil_magic/evil_magic.c:2: warning: `main' is usually a function
ELF

3.8. Other roads to ELF

Source: src/other_magic/perl.pl
#!/usr/bin/perl -w
syscall 4, 1, 0x8048001, 3

Output: out/i386/other_magic/perl
ELF

Command: src/other_magic/mem.sh
#!/bin/sh
dd if=/proc/self/mem bs=1 skip=134512641 count=3 2>/dev/null

Output: out/i386/other_magic/mem
ELF

Command: src/other_magic/exe.sh
#!/bin/sh
dd if=/proc/self/exe bs=1 skip=1 count=3 2>/dev/null

Output: out/i386/other_magic/exe
ELF
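And for completeness, the last road can be walked in C as well. The sketch below reads the three letters at file offset 1 of any file. Passing /proc/self/exe as path assumes a Linux system with /proc mounted; read_magic is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: copy the three letters at file offset 1 of path
 * into out (which must hold 4 bytes). Returns 0 on success, -1 on error.
 */
int read_magic(const char *path, char out[4])
{
  assert(path != NULL && out != NULL);

  FILE *f = fopen(path, "rb");
  if (f == NULL)
    return -1;

  /* skip the leading 0x7f, then read exactly three bytes */
  int ok = fseek(f, 1, SEEK_SET) == 0 && fread(out, 1, 3, f) == 3;
  fclose(f);
  out[3] = '\0';
  return ok ? 0 : -1;
}
```

A call like read_magic("/proc/self/exe", buf) followed by puts(buf) should print the same three letters as the shell scripts above, provided the program itself is an ELF executable.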