The magic of the Elf

 

Any sufficiently advanced technology is indistinguishable from magic.

 Arthur C. Clarke

What exactly is a virus?

The main difference between worms and viruses is persistence and speed. A virus modifies files, and such modifications are usually permanent, i.e. they survive a reboot. On the other hand, a virus attached to a host can become active only when that host program is started. A worm takes immediate control of a running process and thus can propagate very fast.

Usually these techniques are combined to effectively cause mischief. Viruses can become resident, i.e. attach themselves to a part of the system that runs independently of the infected executable. Worms can modify system files to leave permanent back doors. And tricking the user into executing the very first infector is a lot easier than finding and exploiting buffer overflows.

A small step for mankind

Building executables from C source code is a complex task. An innocent-looking call of gcc will invoke a pre-processor, a multi-pass compiler, an assembler and finally a linker. Using all these tools to plant virus code into another executable makes the result either prohibitively large, or very dependent on the completeness of the target installation.

Real viruses approach the problem from the other end. They are aggressively optimized for code size and do only what's absolutely necessary. Basically they just copy one chunk of code and patch a few addresses at hard-coded offsets.

However, this has drastic effects:

There are ways to circumvent these limitations. But they are complicated and make the virus more likely to fail.

For the first example I'll present the simplest piece of code that still gives sufficient feedback. Our aim is to implant it into /bin/sh. On practically every recent installation of Linux/i386 the following code will emit three magic letters instead of just dumping core.

In the language of mortals

Source.

#include <unistd.h>
int main() { write(1, (void*)0x08048001, 3); return 0; }

Command.

#!/bin/sh
gcc -Wall -O2 src/magic_elf/magic_elf.c -o tmp/magic_elf/magic_elf \
&& tmp/magic_elf/magic_elf

Output.

ELF

How it works

Digested answer

The three letters are part of the signature of ELF files. Executables created by ld are always mapped into the same memory region. That's why the program can find its own header at a predictable virtual address.

Short answer

RTFM.

The raw details are in /usr/include/elf.h. The canonical document describing the ELF file format for Intel-386 architectures can be found at ftp://tsx.mit.edu/pub/linux/packages/GCC/ELF.doc.tar.gz. A flat-text version is http://www.muppetlabs.com/~breadbox/software/ELF.txt. And finally http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html humorously describes how far you can bend the rules to reach minimal size.

Sort of an answer

0x8048000 is not a natural constant, but happens to be the default base address of ELF executables produced by ld. As of version 2.11 of binutils it should be possible to change that with options -Ttext ORG and --section-start SECTIONNAME=ORG, but I didn't get it working. Anyway, the layout of executables produced by ld is straightforward.

  1. One ELF header - Elf32_Ehdr

  2. Program headers - Elf32_Phdr

  3. Program interpreter (not if statically linked)

  4. Code

  5. Data

  6. Section headers - Elf32_Shdr

Everything from the start of the file to the last byte of code is loaded into one segment (called "code" or "text") that begins at the base address. There is a whole section called readelf describing a command to view all these details. In the meantime I will show fancy ways to get by without it.

Showing off some tools

What would you do if you knew nothing about ELF and just asked yourself how that example works? How can you be sure that the executable file really contains those three letters?

A good start for finding text in binary files is strings.

Command.

#!/bin/sh
# without "-a -n 3" we don't get any output
strings -a -n 3 tmp/magic_elf/magic_elf | grep -n ELF

Output.

1:ELF

The leading 1: is written by grep and tells us that our three-letter word is the first string found. The option -n 3 is needed because strings by default reports only sequences of at least four characters. The position gives a hint where to find the word in a hex dump. Searching for strings in such a dump is difficult because of the line breaks. Interactive tools like hexedit might be useful.

Command.

#!/bin/sh

# select ASCII characters or backslash escapes (octal)
od -N 16 -c tmp/magic_elf/magic_elf | head -1

# named characters (ASCII)
od -N 16 -a tmp/magic_elf/magic_elf | head -1

# plain bytewise hex
od -N 16 -t x1 tmp/magic_elf/magic_elf | head -1

Output.

0000000 177   E   L   F 001 001 001  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000000 del   E   L   F soh soh soh nul nul nul nul nul nul nul nul nul
0000000 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
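
The three bytes 01 01 01 that follow the magic in the dump are documented in elf.h as well: they are the EI_CLASS, EI_DATA and EI_VERSION entries of e_ident. As a small sketch, the following shell function decodes them for any file. The name describe_ident is made up for this example, and /bin/sh is used only as a convenient test subject.

```shell
#!/bin/sh
# Decode bytes 4..6 of e_ident (EI_CLASS, EI_DATA, EI_VERSION).
# describe_ident is a hypothetical helper, not part of any tool.
describe_ident() {
    # od prints the three bytes as unsigned decimal numbers
    set -- $(od -An -tu1 -j4 -N3 "$1")
    case $1 in
        1) class=ELFCLASS32 ;;
        2) class=ELFCLASS64 ;;
        *) class=unknown ;;
    esac
    case $2 in
        1) data=ELFDATA2LSB ;;
        2) data=ELFDATA2MSB ;;
        *) data=unknown ;;
    esac
    echo "class=$class data=$data version=$3"
}

describe_ident /bin/sh
```

For the executable dumped above this should report ELFCLASS32 and ELFDATA2LSB, i.e. a 32-bit little-endian file. On a 64-bit installation /bin/sh will report ELFCLASS64 instead.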

At this point we can guess that it is no coincidence that the text sits at file offset 1 and is printed from address 0x8048000 + 1. A test program might help.

Source.

#include <stdio.h>

int main()
{
  printf("0x08048000=%#02x\n", *(unsigned char*)0x08048000);
  printf("0x08048001=%.3s\n", (char*)0x08048001);
  printf("main=%p\n", (void*)main);
  return 0;
}

Output.

0x08048000=0x7f
0x08048001=ELF
main=0x8048460

Looks good. The byte at address 0x8048000 + 0 is equal to that at file offset 0. And the address of function main is plausible: subtracting the base address puts it at file offset 0x460.

Command.

#!/bin/sh
ndisasm -e 0x460 -U tmp/magic_elf/magic_elf | sed -ne '1,/ret/p'

Output.

00000000  55                push ebp
00000001  89E5              mov ebp,esp
00000003  83EC0C            sub esp,byte +0xc
00000006  6A03              push byte +0x3
00000008  6801800408        push dword 0x8048001
0000000D  6A01              push byte +0x1
0000000F  E8A4FEFFFF        call 0xfffffeb8
00000014  31C0              xor eax,eax
00000016  C9                leave
00000017  C3                ret

Both programs have main at the same file offset. Unfortunately a brief look through /bin proves this to be pure chance. The really bad news is the generated code, however. Instead of a real system call for write we see a strange negative address. Let's have another try.

Command.

#!/bin/sh
gdb tmp/magic_elf/magic_elf -q <<EOT | sed -ne '/:$/,/ret *$/p'
	break main
	run
	disassemble
EOT

Output.

(gdb) Dump of assembler code for function main:
0x8048460 <main>:	push   %ebp
0x8048461 <main+1>:	mov    %esp,%ebp
0x8048463 <main+3>:	sub    $0xc,%esp
0x8048466 <main+6>:	push   $0x3
0x8048468 <main+8>:	push   $0x8048001
0x804846d <main+13>:	push   $0x1
0x804846f <main+15>:	call   0x8048318 <write>
0x8048474 <main+20>:	xor    %eax,%eax
0x8048476 <main+22>:	leave  
0x8048477 <main+23>:	ret    

That strange negative address resolves to a function in a shared library. Not shown is a pathetic attempt to single-step to the actual code of write.
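
The negative address is an artifact of where ndisasm started counting. The operand of opcode E8 is a 32-bit displacement relative to the address of the next instruction, so the computed target depends on the address the code is assumed to live at. A minimal sketch of the arithmetic follows. The helper name call_target is hypothetical, and the shell is assumed to calculate with more than 32 bits.

```shell
#!/bin/sh
# Target of a "call rel32" instruction: address of the next instruction
# plus the displacement, truncated to 32 bits.
# call_target is a hypothetical helper name.
call_target() {
    printf '0x%x\n' $(( ($1 + $2) & 0xffffffff ))
}

# The bytes A4 FE FF FF read as a little-endian number are 0xfffffea4.
call_target 0x00000014 0xfffffea4    # ndisasm counted from 0: 0xfffffeb8
call_target 0x08048474 0xfffffea4    # real address of main+20: 0x8048318
```

Both disassemblies thus describe the same instruction. Only gdb knows the load address and can map the result to a symbol.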

In the language of evil

The code generated by gcc is not suitable for a virus. So here comes hand-crafted code optimized for size. I prefer nasm to GNU as.

Source.

		global	_start
_start:		push	byte 4
		pop	eax		; eax = 4 = write(2)
		xor	ebx,ebx
		inc	ebx		; ebx = 1 = stdout
		mov	ecx,0x08048001	; ecx = magic address
		push	byte 3
		pop	edx		; edx = 3 = three characters
		int	0x80

		xor	eax,eax
		inc	eax		; eax = 1 = exit(2)
		xor	ebx,ebx		; ebx = 0 = return code
		int	0x80

Command.

#!/bin/sh
nasm -f elf -o tmp/evil_magic/nasm.o src/evil_magic/evil_magic.asm \
&& ld -o tmp/evil_magic/nasm tmp/evil_magic/nasm.o \
&& tmp/evil_magic/nasm

Output.

ELF

The output is good. But how do we get the resulting machine code? We can't just add a call to printf(3) to the assembly code. The example above is not linked with glibc; it does not even have a function called main.

Entry point

On the other hand things have become a lot easier. There is no initialization code that gets executed before _start, so the address of _start is really the ELF entry point of the executable. A look into /usr/include/elf.h shows that Elf32_Ehdr::e_entry is at file offset 24.

Command.

#!/bin/sh
od -Ad -j24 -w4 -tx4 tmp/evil_magic/nasm | head -1

Output.

0000024 08048080

The entry point is specified as a virtual address in memory. By subtracting the base address we get the file offset:

0x08048080 - 0x8048000 = 0x80
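
The same subtraction works for any address inside the first segment, for example the address of main found earlier. The following sketch reads e_entry straight from the file and converts it. The names elf32_entry and vaddr_to_offset are invented here, and the code assumes a 32-bit little-endian executable linked at the default base address.

```shell
#!/bin/sh
# Assumption: default base address used by ld on Linux/i386.
BASE=0x08048000

# Print Elf32_Ehdr::e_entry of a 32-bit little-endian ELF file as hex.
elf32_entry() {
    # offset 24 = e_ident (16) + e_type (2) + e_machine (2) + e_version (4)
    set -- $(od -An -tx1 -j24 -N4 "$1")
    # the four bytes are stored least significant first
    echo "0x$4$3$2$1"
}

# Convert a virtual address inside the first segment to a file offset.
vaddr_to_offset() {
    printf '0x%x\n' $(( $1 - $BASE ))
}

vaddr_to_offset 0x08048080    # entry point from above: 0x80
vaddr_to_offset 0x08048460    # address of main from above: 0x460
```

Combined, vaddr_to_offset "$(elf32_entry tmp/evil_magic/nasm)" should yield the value to pass to ndisasm -e.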

Resulting code

Command.

#!/bin/sh
ndisasm -e 0x80 -U tmp/evil_magic/nasm | head -12

Output.

00000000  6A04              push byte +0x4
00000002  58                pop eax
00000003  31DB              xor ebx,ebx
00000005  43                inc ebx
00000006  B901800408        mov ecx,0x8048001
0000000B  6A03              push byte +0x3
0000000D  5A                pop edx
0000000E  CD80              int 0x80
00000010  31C0              xor eax,eax
00000012  40                inc eax
00000013  31DB              xor ebx,ebx
00000015  CD80              int 0x80

That's the code we need. There is just one thing left: Dressing up the hex dump as C source. A filter written in perl will do.

Filter.

#!/usr/bin/perl -sw
use strict;

$::identfier = 'main' if (!defined($::identfier));
$::size = '' if (!defined($::size));

printf "const unsigned char %s[%s] =\n", $::identfier, $::size;
while(<>)
{
  chomp;
  my @word = split;
  my $code = $word[1];

  my $escape = '"';
  for(my $i = 0; $i < length($code); $i += 2)
  {
    $escape = $escape . '\\x' . substr($code, $i, 2);
  }
  $escape .= '"';
  s/\s+[^\s]*\s+/: /;
  printf "  %-24s /* %-30s */\n", $escape, $_;
}
print "  ;\n";

Output.

const unsigned char main[] =
  "\x6A\x04"               /* 00000000: push byte +0x4       */
  "\x58"                   /* 00000002: pop eax              */
  "\x31\xDB"               /* 00000003: xor ebx,ebx          */
  "\x43"                   /* 00000005: inc ebx              */
  "\xB9\x01\x80\x04\x08"   /* 00000006: mov ecx,0x8048001    */
  "\x6A\x03"               /* 0000000B: push byte +0x3       */
  "\x5A"                   /* 0000000D: pop edx              */
  "\xCD\x80"               /* 0000000E: int 0x80             */
  "\x31\xC0"               /* 00000010: xor eax,eax          */
  "\x40"                   /* 00000012: inc eax              */
  "\x31\xDB"               /* 00000013: xor ebx,ebx          */
  "\xCD\x80"               /* 00000015: int 0x80             */
  ;

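Calling the string constant main is not a mistake. The output above is a complete and valid C program: the startup code calls whatever the symbol main refers to, and here that is an array holding machine code.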

Command.

#!/bin/sh
gcc -Wall -O2 out/evil_magic/evil_magic.c -o tmp/evil_magic/cc \
&& tmp/evil_magic/cc

Output.

out/evil_magic/evil_magic.c:1: warning: `main' is usually a function
ELF

Other roads to ELF

Source.

#!/usr/bin/perl -w
# system call 4 is write(2) on Linux/i386, 1 is stdout
syscall 4, 1, 0x8048001, 3

Output.

ELF

Command.

#!/bin/sh
dd if=/proc/self/mem bs=1 skip=134512641 count=3 2>/dev/null

Output.

ELF
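
The skip value in the dd command is simply the magic address written in decimal. A short sketch of the conversion, with offset_to_vaddr as a hypothetical helper and the default base address of ld assumed:

```shell
#!/bin/sh
# Virtual address of the byte at a given file offset, printed in decimal
# so that it can be passed to dd as skip=...
# Assumption: the file is mapped at the default base address 0x08048000.
offset_to_vaddr() {
    echo $(( 0x08048000 + $1 ))
}

offset_to_vaddr 1    # prints 134512641
```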

Command.

#!/bin/sh
dd if=/proc/self/exe bs=1 skip=1 count=3 2>/dev/null

Output.

ELF