Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke
Building executables from C source code is a complex task. An innocent-looking call of gcc(1) will invoke a pre-processor, a multi-pass compiler, an assembler and finally a linker. Using all these tools to plant virus code into another executable makes the result either prohibitively large, or very dependent on the completeness of the target installation.
Real viruses approach the problem from the other end. They are aggressively optimized for code size and do only what's absolutely necessary. Basically they just copy one chunk of code and patch a few addresses at hard-coded offsets.
However, this has drastic effects:
Since we directly copy binary code, the virus is restricted to a particular hardware architecture.
Code must be position independent.
Code cannot use shared libraries; not even the C runtime library.
We cannot allocate global variables in the data segment.
There are ways to circumvent these limitations. But they are complicated and make the virus more likely to fail.
For the first example I'll present the simplest piece of code that still gives sufficient feedback. Our aim is to implant it into /bin/sh. On practically every recent installation of Linux/i386 the following code will emit three magic letters instead of just dumping core.
Source.
#include <unistd.h>
int main() { write(1, (void*)0x08048001, 3); return 0; }
Command.
#!/bin/sh
gcc -Wall -O2 src/magic_elf/magic_elf.c -o tmp/magic_elf/magic_elf \
&& tmp/magic_elf/magic_elf
Output.
ELF
The three letters are part of the signature of ELF files. By default, executables created by ld(1) are mapped at the same base address. That's why the program can find its own header at a predictable virtual address.
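The signature check itself is easy to express in C. What follows is only a minimal sketch: it assumes the full four-byte magic that elf.h defines as ELFMAG (a 0x7f byte followed by the three letters), and has_elf_magic is a hypothetical helper name, not something from the original sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mirrors ELFMAG/SELFMAG from elf.h: 0x7f followed by "ELF". */
static const unsigned char elf_magic[4] = { 0x7f, 'E', 'L', 'F' };

/* Hypothetical helper: returns 1 if buf starts with the ELF signature. */
int has_elf_magic(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < sizeof elf_magic)
        return 0;               /* too short to hold the signature */
    return memcmp(buf, elf_magic, sizeof elf_magic) == 0;
}
```

Passing the first few bytes of any file read with read(2) or fread(3) is enough; the function never looks past the fourth byte.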
RTFM.
The raw details are in /usr/include/elf.h. The canonical document describing the ELF file format for Intel-386 architectures can be found at ftp://tsx.mit.edu/pub/linux/packages/GCC/ELF.doc.tar.gz. A flat-text version is http://www.muppetlabs.com/~breadbox/software/ELF.txt. And finally http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html humorously describes how far you can bend the rules to reach minimal size.
0x8048000 is not a natural constant, but happens to be the default base address of ELF executables produced by ld(1). As of version 2.11 of binutils it should be possible to change that with options -Ttext ORG and --section-start SECTIONNAME=ORG, but I didn't get it working. Anyway, the layout of executables produced by ld(1) is straightforward.
One ELF header - Elf32_Ehdr
Program headers - Elf32_Phdr
Program interpreter (not if statically linked)
Code
Data
Section headers - Elf32_Shdr
Everything from the start of the file to the last byte of code is mapped into one segment (named "code" or "text") that begins at the base address. There is a whole section called readelf describing a command to view all these details. In the meantime I will show fancy ways to get by without it.
What would you do if you knew nothing about ELF and just asked yourself how that example works? How can you be sure that the executable file really contains those three letters?
A good start for finding text in binary files is strings(1).
Command.
#!/bin/sh
# without "-a -n 3" we don't get any output
strings -a -n 3 tmp/magic_elf/magic_elf | grep -n ELF
Output.
1:ELF
The leading 1: is written by grep(1) and tells us that our three-letter word is the first string found. This hints at where to look for it in a hex dump. Searching for strings in such a dump is difficult because of line breaks. Interactive tools like hexedit(1) might be useful.
Command.
#!/bin/sh
# select ASCII characters or backslash escapes (octal)
od -N 16 -c tmp/magic_elf/magic_elf | head -1
# named characters (ASCII)
od -N 16 -a tmp/magic_elf/magic_elf | head -1
# plain bytewise hex
od -N 16 -t x1 tmp/magic_elf/magic_elf | head -1
Output.
0000000 177 E L F 001 001 001 \0 \0 \0 \0 \0 \0 \0 \0 \0
0000000 del E L F soh soh soh nul nul nul nul nul nul nul nul nul
0000000 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
At this point we can guess that the match between file offset 1 and address 0x8048000 + 1 is not a coincidence. A test program might help.
Source - addr_of_main.c.
#include <stdio.h>
int main()
{
printf("0x08048000=%#02x\n", *(unsigned char*)0x08048000);
printf("0x08048001=%.3s\n", (char*)0x08048001);
printf("main=%p\n", (void*)main);
return 0;
}
Output.
0x08048000=0x7f
0x08048001=ELF
main=0x8048460
Looks good. The byte at address 0x8048000 + 0 is equal to that at file offset 0. And the address of function main is plausible.
Command.
#!/bin/sh
ndisasm -e 0x460 -U tmp/magic_elf/magic_elf | sed -e '/ret/q'
Output.
00000000 55 push ebp
00000001 89E5 mov ebp,esp
00000003 83EC0C sub esp,byte +0xc
00000006 6A03 push byte +0x3
00000008 6801800408 push dword 0x8048001
0000000D 6A01 push byte +0x1
0000000F E8A4FEFFFF call 0xfffffeb8
00000014 31C0 xor eax,eax
00000016 C9 leave
00000017 C3 ret
Both programs have main at the same file offset. Unfortunately a brief look through /bin proves this to be pure chance. The really bad news is the generated code, however. Instead of a real system call for write we see a strange negative address. Let's have another try.
Command.
#!/bin/sh
file=${1:-tmp/magic_elf/magic_elf}
func=${2:-main}
gdb ${file} -q <<EOT | sed -ne '/:$/,/ret *$/p'
set disassembly-flavor intel
disassemble ${func}
EOT
Output.
(gdb) (gdb) Dump of assembler code for function main:
0x8048460 <main>: push ebp
0x8048461 <main+1>: mov ebp,esp
0x8048463 <main+3>: sub esp,0xc
0x8048466 <main+6>: push 0x3
0x8048468 <main+8>: push 0x8048001
0x804846d <main+13>: push 0x1
0x804846f <main+15>: call 0x8048318 <write>
0x8048474 <main+20>: xor eax,eax
0x8048476 <main+22>: leave
0x8048477 <main+23>: ret
That strange negative address resolves to a function in a shared library. Not shown is a pathetic attempt to single-step to the actual code of write.
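Why the two disassemblers disagree can be worked out by hand. The operand of the call instruction (opcode E8) is a 32-bit displacement relative to the following instruction. ndisasm(1) started counting at address 0, so the target wrapped around to a huge value; gdb(1) knew the real address. The sketch below is only an illustration of that arithmetic, and call_rel32_target is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the target of a 5-byte i386 "call rel32"
 * instruction (opcode 0xE8) located at address insn_addr.
 */
uint32_t call_rel32_target(uint32_t insn_addr, const unsigned char insn[5])
{
    assert(insn[0] == 0xE8);
    /* The displacement is stored little-endian after the opcode. */
    uint32_t rel = (uint32_t)insn[1]
                 | (uint32_t)insn[2] << 8
                 | (uint32_t)insn[3] << 16
                 | (uint32_t)insn[4] << 24;
    /* It is relative to the address of the next instruction. Unsigned
       arithmetic wraps modulo 2^32, which handles negative displacements. */
    return insn_addr + 5u + rel;
}
```

With the bytes E8 A4 FE FF FF from the dump, an instruction address of 0x0F yields 0xfffffeb8 (the ndisasm view), while 0x804846f yields 0x8048318 (the gdb view).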
We can now search for a fine manual explaining how to debug shared libraries. Or just compile the bugger static.
Command.
#!/bin/sh
gcc -Wall -O2 -static src/magic_elf/magic_elf.c \
-o tmp/magic_elf/magic_elf_static \
&& ls -l tmp/magic_elf \
&& tmp/magic_elf/magic_elf_static
Output.
total 1668
-rwxrwxr-x 1 alba alba 13711 Apr 18 21:18 magic_elf
-rwxrwxr-x 1 alba alba 1687693 Apr 18 21:18 magic_elf_static
ELF
Seems we found an easy way to fill up the hard disk. Anyway, what has gdb(1) to say about it?
Output.
(gdb) (gdb) Dump of assembler code for function main:
0x80481e0 <main>: push ebp
0x80481e1 <main+1>: mov ebp,esp
0x80481e3 <main+3>: sub esp,0xc
0x80481e6 <main+6>: push 0x3
0x80481e8 <main+8>: push 0x8048001
0x80481ed <main+13>: push 0x1
0x80481ef <main+15>: call 0x804cc60 <__libc_write>
0x80481f4 <main+20>: xor eax,eax
0x80481f6 <main+22>: leave
0x80481f7 <main+23>: ret
The name of the function changed for no apparent reason. But it is reachable for disassembly now.
Output.
(gdb) (gdb) Dump of assembler code for function __libc_write:
0x804cc60 <__libc_write>: push ebx
0x804cc61 <__libc_write+1>: mov edx,DWORD PTR [esp+16]
0x804cc65 <__libc_write+5>: mov ecx,DWORD PTR [esp+12]
0x804cc69 <__libc_write+9>: mov ebx,DWORD PTR [esp+8]
0x804cc6d <__libc_write+13>: mov eax,0x4
0x804cc72 <__libc_write+18>: int 0x80
0x804cc74 <__libc_write+20>: pop ebx
0x804cc75 <__libc_write+21>: cmp eax,0xfffff001
0x804cc7a <__libc_write+26>: jae 0x8052bb0 <__syscall_error>
0x804cc80 <__libc_write+32>: ret
There are two man pages giving some overview of system calls, intro(2) and syscalls(2). The statement mov eax,4 corresponds to __NR_write in /usr/include/asm/unistd.h.
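The same idea, calling the kernel by number instead of through the write() wrapper, can be tried from C without any assembly. This is a rough sketch only: it relies on glibc's syscall(2) function and the SYS_write constant from <sys/syscall.h>, and raw_write is a hypothetical helper name. Note that SYS_write is 4 on i386 but differs on other architectures (it is 1 on x86-64), which is why the constant is used rather than a literal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: invoke write(2) by system call number, bypassing
 * the write() wrapper. syscall(2) loads the arguments into registers and
 * traps into the kernel; on i386 that is what __libc_write does by hand.
 */
ssize_t raw_write(int fd, const void *buf, size_t count)
{
    assert(buf != NULL);
    return (ssize_t)syscall(SYS_write, fd, buf, count);
}
```

A call such as raw_write(1, "ELF\n", 4) should behave like the corresponding write(2) call and return the number of bytes written.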
The code generated by gcc(1) is not suitable for a virus. So here comes hand-crafted code optimized for size (twenty-three is the perfect number of bytes). I prefer nasm to GNU as.
Source.
global _start
_start: push byte 4
pop eax ; eax = 4 = write(2)
xor ebx,ebx
inc ebx ; ebx = 1 = stdout
mov ecx,0x08048001 ; ecx = magic address
push byte 3
pop edx ; edx = 3 = three characters
int 0x80
xor eax,eax
inc eax ; eax = 1 = exit(2)
xor ebx,ebx ; ebx = 0 = return code
int 0x80
Command.
#!/bin/sh
nasm -f elf -o tmp/evil_magic/nasm.o src/evil_magic/evil_magic.asm \
&& ld -o tmp/evil_magic/nasm tmp/evil_magic/nasm.o \
&& tmp/evil_magic/nasm
Output.
ELF
Output is good. But how do we get the resulting machine code? We can't just add a call to printf(3) to the assembly code. The example above is not linked with glibc; it does not even have a function called main.
On the other hand things became a lot easier. There is no initialization code that gets executed before _start, so the address of _start is really the ELF entry point of the executable. A look into /usr/include/elf.h shows that Elf32_Ehdr::e_entry is at file offset 24.
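The offset can be checked by adding up the fields that precede e_entry in Elf32_Ehdr: 16 bytes of e_ident, then e_type (2), e_machine (2) and e_version (4). Below is a small sketch that reads the field from a memory image of the file. It assumes a little-endian ELF32 file, as on i386, and elf32_entry and E_ENTRY_OFFSET are hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of Elf32_Ehdr::e_entry: 16 bytes e_ident, then 2 + 2 + 4 bytes
   for e_type, e_machine and e_version. */
#define E_ENTRY_OFFSET 24u

/*
 * Hypothetical helper: extract the entry point from the first bytes of a
 * little-endian ELF32 file image. Returns 0 if the buffer is too short.
 */
uint32_t elf32_entry(const unsigned char *image, size_t len)
{
    assert(image != NULL);
    if (len < E_ENTRY_OFFSET + 4u)
        return 0;
    const unsigned char *p = image + E_ENTRY_OFFSET;
    /* Assemble the four bytes least significant first. */
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```

Decoding byte by byte keeps the result independent of the byte order of the machine running the check.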
Command.
#!/bin/sh
od -Ad -j24 -w4 -tx4 tmp/evil_magic/nasm | head -1
Output.
0000024 08048080
The entry point is specified as a virtual address in memory. By subtracting the base address we get the file offset:
0x08048080 - 0x8048000 = 0x80
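The subtraction works because the text segment starts at file offset 0 and is mapped at the base address. A slightly more general form, sketched here under the assumption that the segment's p_vaddr and p_offset values from its program header are known, looks like this (vaddr_to_offset is a hypothetical helper name):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: translate a virtual address inside a loaded segment
 * to a file offset. seg_vaddr and seg_offset correspond to p_vaddr and
 * p_offset of the segment's program header.
 */
uint32_t vaddr_to_offset(uint32_t vaddr, uint32_t seg_vaddr, uint32_t seg_offset)
{
    assert(vaddr >= seg_vaddr);   /* address must lie inside the segment */
    return vaddr - seg_vaddr + seg_offset;
}
```

For the text segment, vaddr_to_offset(0x08048080, 0x08048000, 0) gives 0x80, and the address of main printed earlier, 0x8048460, gives 0x460.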
Command.
#!/bin/sh
ndisasm -e 0x80 -U tmp/evil_magic/nasm | head -12
Output.
00000000 6A04 push byte +0x4
00000002 58 pop eax
00000003 31DB xor ebx,ebx
00000005 43 inc ebx
00000006 B901800408 mov ecx,0x8048001
0000000B 6A03 push byte +0x3
0000000D 5A pop edx
0000000E CD80 int 0x80
00000010 31C0 xor eax,eax
00000012 40 inc eax
00000013 31DB xor ebx,ebx
00000015 CD80 int 0x80
There is still one thing left: Dressing up the hex dump as C source. A filter written in perl(1) will do.
Filter.
#!/usr/bin/perl -sw
use strict;
$::identfier = 'main' if (!defined($::identfier));
$::size = '' if (!defined($::size));
printf "const unsigned char %s[%s]\n", $::identfier, $::size;
print "__attribute__ (( aligned(16), section(\".text\") )) =\n";
my $last_line = "{\n";
while(<>)
{
print $last_line;
chomp;
my $code = (split())[1];
my $dump = '0x' . substr($code, 0, 2);
for(my $i = 2; $i < length($code); $i += 2)
{
$dump .= ',0x' . substr($code, $i, 2);
}
$dump .= ',';
s/\s+[^\s]*\s+/: /;
$last_line = sprintf(" %-28s /* %-30s */\n", $dump, $_);
}
$last_line =~ s/, / /;
print $last_line . "};\n";
Output.
const unsigned char main[]
__attribute__ (( aligned(16), section(".text") )) =
{
0x6A,0x04, /* 00000000: push byte +0x4 */
0x58, /* 00000002: pop eax */
0x31,0xDB, /* 00000003: xor ebx,ebx */
0x43, /* 00000005: inc ebx */
0xB9,0x01,0x80,0x04,0x08, /* 00000006: mov ecx,0x8048001 */
0x6A,0x03, /* 0000000B: push byte +0x3 */
0x5A, /* 0000000D: pop edx */
0xCD,0x80, /* 0000000E: int 0x80 */
0x31,0xC0, /* 00000010: xor eax,eax */
0x40, /* 00000012: inc eax */
0x31,0xDB, /* 00000013: xor ebx,ebx */
0xCD,0x80 /* 00000015: int 0x80 */
};
The __attribute__ clause is explained in A section called .text. It is not required at this point.
Calling the string constant main is not a mistake. The output above is a complete and valid C program.
Command.
#!/bin/sh
gcc -Wall -O2 out/evil_magic/evil_magic.c -o tmp/evil_magic/cc \
&& tmp/evil_magic/cc
Output.
out/evil_magic/evil_magic.c:2: warning: `main' is usually a function
ELF
Source.
#!/usr/bin/perl -w
syscall 4, 1, 0x8048001, 3
Output.
ELF
Command.
#!/bin/sh
dd if=/proc/self/mem bs=1 skip=134512641 count=3 2>/dev/null
Output.
ELF
Command.
#!/bin/sh
dd if=/proc/self/exe bs=1 skip=1 count=3 2>/dev/null
Output.
ELF