You cannot have a science without measurement.
R. W. Hamming
Building executables from C source code is a complex task. An innocent-looking call of gcc(1) will invoke a pre-processor, a multi-pass compiler, an assembler and finally a linker. Using all these tools to plant virus code into another executable makes the result either prohibitively large, or very dependent on the completeness of the target installation.
Real viruses approach the problem from the other end. They are aggressively optimized for code size and do only what's absolutely necessary. Basically they just copy one chunk of code and patch a few addresses at hard-coded offsets.
However, this has drastic effects:
Since we directly copy binary code, the virus is restricted to a particular hardware architecture.
Code must be position independent.
Code cannot use shared libraries; not even the C runtime library.
We cannot allocate global variables in the data segment.
There are ways to circumvent these limitations, but they are complicated and make the virus more likely to fail.
Another natural limitation of viruses is a rigid dependency on the file format of target executables. These formats differ a lot, even on the same hardware architecture and under the same operating system. Furthermore, executables are not designed with post-link-time modifications in mind. It's rare for a virus to support more than one infection method. This document is about the format used on recent versions of Linux, FreeBSD and Solaris.
This format is well documented. Some public resources:
Source code of Linux and FreeBSD. Admittedly not for the faint of heart. [1]
/usr/include/elf.h [2]
Portable Formats Specification, Version 1.1. [3]
Linux Standard Base Specification [4]
Creating Really Teensy ELF Executables for Linux [5]
A quote from the Portable Formats Specification:
The Executable and Linking Format was originally developed and published by UNIX System Laboratories (USL) as part of the Application Binary Interface (ABI). The Tool Interface Standards committee (TIS) has selected the evolving ELF standard as a portable object file format that works on 32-bit Intel Architecture environments for a variety of operating systems.
Actually ELF covers object files (.o), shared libraries (.so) and executable files. The Linux kernel [6] is also a valid ELF file.
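The distinction between these kinds of ELF file is recorded in the e_type field of the ELF header. The following is a minimal sketch, assuming a system that provides /usr/include/elf.h; the function name elf_kind is hypothetical and chosen only for illustration. It inspects the first bytes of a file image and reports which kind of ELF file they describe.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Return a short name for the ELF file type stored in the first
 * bytes of a file, or NULL if the buffer is not an ELF header.
 * elf_kind is a hypothetical helper, not part of any library. */
const char *elf_kind(const unsigned char *buf, size_t len)
{
    unsigned type;

    /* e_ident occupies EI_NIDENT (16) bytes; the 16-bit e_type follows
     * directly, in both the 32-bit and the 64-bit header layout. */
    if (len < EI_NIDENT + 2 || memcmp(buf, ELFMAG, SELFMAG) != 0)
        return NULL;

    /* Honour the byte order declared in the header itself. */
    if (buf[EI_DATA] == ELFDATA2MSB)
        type = (buf[EI_NIDENT] << 8) | buf[EI_NIDENT + 1];
    else
        type = buf[EI_NIDENT] | (buf[EI_NIDENT + 1] << 8);

    switch (type) {
    case ET_REL:  return "relocatable object (.o)";
    case ET_EXEC: return "executable";
    case ET_DYN:  return "shared object (.so)";
    case ET_CORE: return "core dump";
    default:      return "other";
    }
}
```

A caller would read at least the first 18 bytes of a file into a buffer and pass it to elf_kind. Note that position independent executables built by current toolchains also carry the type ET_DYN, so the label for that case is a simplification.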
ELF is used for a variety of both 32-bit and 64-bit architectures. Obviously you need to handle assembly language for each platform. Good starting points are "Linux Assembly" [7] and "Assembly Language Related Web Sites". [8]
alpha specific sites:
Assembly Language Programmer's Guide [9]
i386 specific sites:
Assembly-HOWTO. [10] Description of tools and sites for Linux.
FAQ of comp.lang.asm.x86 [11]
"Robin Miyagi's Linux Programming" [12] features a tutorial and interesting links.
"Assembly resources" [13] covers advanced topics.
IA-32 Intel Architecture Software Developer's Manual [14]
"The Place on the Net to Learn Assembly Language Programming" [15]
The Art of Assembly Language. 32-bit Linux Edition Featuring HLA. [16]
X86 Architecture, low-level programming, freeware [17]
Dr. Dobb's Microprocessor Resources [18]
sparc specific sites:
A debugger lets you see what is going on "inside" another program while it executes. gdb can also show a plain disassembly of the code, and can do so without executing a single instruction. This listing does not include a hex dump of opcodes, however. On the other hand pure disassemblers take shortcuts; they don't have a complete picture of the target executable.
objdump is part of GNU binutils. It is advertised as a means to display information from object files, but it also works on executables and provides the option --disassemble. Since it does not resolve function names in shared libraries, it cannot fully replace gdb, though.
GNU binutils includes readelf since version 2.10. It is described as displaying the contents of ELF format files, regardless of target machine. The functionality of readelf and objdump overlaps, but I think the output of the former is nicer. However, the tool is missing on old distributions like SuSE 6.0. [21]
Of course all GNU disassembly tools adhere to the syntax of the GNU assembler. Veterans of i386 programming consider this style repulsive, however. And while gdb provides statement set disassembly-flavor intel to lower the contrast, objdump has nothing similar. On platform i386 I will therefore use nasm and ndisasm where possible. This tool has absolutely no understanding of ELF (or any other file format). But for the scope of this document this is a feature. The calculations necessary to get at the interesting bytes are interesting themselves.
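One such calculation is the translation of a virtual address into a position in the file, which a format-agnostic disassembler needs before it can be pointed at the right bytes. The sketch below assumes the 32-bit program header type Elf32_Phdr from elf.h; the helper name vaddr_to_offset is hypothetical. It is an illustration of the arithmetic, not the method used by the scripts in this document.

```c
#include <elf.h>
#include <stddef.h>

/* Translate a virtual address into a file offset using the program
 * header table. Returns -1 if no PT_LOAD segment contains the address.
 * vaddr_to_offset is a hypothetical helper for illustration. */
long vaddr_to_offset(const Elf32_Phdr *phdr, size_t count, Elf32_Addr vaddr)
{
    for (size_t i = 0; i < count; i++) {
        const Elf32_Phdr *p = &phdr[i];

        if (p->p_type != PT_LOAD)
            continue;
        /* Only bytes backed by the file (p_filesz) have an offset. */
        if (vaddr >= p->p_vaddr && vaddr - p->p_vaddr < p->p_filesz)
            return (long)(p->p_offset + (vaddr - p->p_vaddr));
    }
    return -1;
}
```

For a segment loaded at 0x08048000 from file offset 0, the address 0x08048080 maps to file offset 0x80.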
The primary quality of this document is reproducibility. Every tiny bit of information should be proved by a working example. Since I don't trust myself, all output files are rebuilt for every release. All sections titled "Output" are the real product of source code and shell scripts included in this document. Most numbers and calculations are processed by a Perl script parsing these output files.
The document itself is written in DocBook, [22] an XML document type definition. [23] Conversion to HTML is the last step of a Makefile that builds and runs all examples. However, this means that I can't provide one document comparing two platforms. Instead I set up everything for conditional compilation. I then build one consistent variation of the document on a single system.
You are now reading the platform independent part. The links below lead to the actual examples. This part continues with general topics and larger chunks of source code.
This script is used throughout the document to convert binary files into valid C code, i.e. definition of a byte array. This could have been a small filter written in perl(1), but we actually need a lot of features.
We need to process the output of both ndisasm and objdump, on multiple platforms. Two examples of valid input:
08048080  6A04              push byte +0x4
   10074:   82 10 20 04     mov  4, %g1
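To make the difference between the two input styles concrete, here is a minimal sketch in C of the same idea the Perl script below implements: skip the address column, then collect the opcode bytes. It assumes, as the script's split on runs of white space does, that the opcode column is followed by at least two white space characters. The name parse_opcodes is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of a single hexadecimal digit; the caller checks isxdigit. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Extract the opcode bytes from one line of ndisasm or objdump output.
 * Stores at most cap bytes in out and returns the number stored.
 * parse_opcodes is a hypothetical name used for illustration only. */
size_t parse_opcodes(const char *s, unsigned char *out, size_t cap)
{
    size_t n = 0;

    while (isspace((unsigned char)*s)) s++;   /* leading white space */
    while (isxdigit((unsigned char)*s)) s++;  /* address column */
    if (*s == ':') s++;                       /* objdump style */
    while (isspace((unsigned char)*s)) s++;

    while (n < cap && isxdigit((unsigned char)s[0])
                   && isxdigit((unsigned char)s[1])) {
        out[n++] = (unsigned char)(hexval(s[0]) << 4 | hexval(s[1]));
        s += 2;
        /* A single blank separates objdump bytes; two or more white
         * space characters end the opcode column. */
        if (s[0] == ' ' && !isspace((unsigned char)s[1]))
            s++;
        else if (isspace((unsigned char)s[0]))
            break;
    }
    return n;
}
```

Applied to the two sample lines, the function yields the bytes 6A 04 and 82 10 20 04 respectively, provided the column spacing of the original tool output is intact.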
The __attribute__ clause is explained in A section called .text.
Initializing the array with string literals (looking like \xDE\xAD\xBE\xEF) would be easier. However, the terminating zero would not work with "Doing it in C". On the other hand, a list of hexadecimal numbers introduces separating commas, which require special treatment of the last line.
If the command line option -last_line_is_ofs is passed to the program, the last line of the disassembly is taken to specify an offset into the code. Actually it's just the last byte of that line. You are free to use any dummy operation, like push byte 1. See Target::infection for an example.
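The effect of the terminating zero is easy to demonstrate. The following sketch uses hypothetical array names and the four bytes from the example above; it shows that an array of unspecified size initialized from a string literal is one byte longer than the same array initialized from a list.

```c
#include <stddef.h>

/* A string literal initializer appends a terminating zero byte ... */
const unsigned char from_string[] = "\xDE\xAD\xBE\xEF";

/* ... while a brace-enclosed list contains exactly the bytes written. */
const unsigned char from_list[] = { 0xDE, 0xAD, 0xBE, 0xEF };

/* Sizes as computed by the compiler: 5 and 4 bytes respectively. */
const size_t from_string_size = sizeof from_string;
const size_t from_list_size = sizeof from_list;
```

Any code that relies on sizeof to determine the length of the generated array would therefore be off by one with the string literal form.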
Source: src/platform/disasm.pl
#!/usr/bin/perl -sw
use strict;

# Output format: one initializer line followed by the disassembly as comment.
my $LINE = " %-30s /* %-32s */\n";

# Switches are set through perl -s, e.g. -identfier=NAME (note the spelling).
$::identfier = 'main' if (!defined($::identfier));
$::size = '' if (!defined($::size));
$::align = '8' if (!defined($::align));

printf "const unsigned char %s[%s]\n", $::identfier, $::size;
print "__attribute__ (( aligned($::align), section(\".text\") )) =\n";
print "{\n";

# Each entry of @line is [ comment text, raw hex string, C initializer ].
my @line;
while(<>)
{
  s/^\s+//; # trim leading white space
  s/\s+$//; # trim trailing white space
  s/\s+[!;].*//; # trim trailing comments

  my $addr = (split(/[:\s]+/))[0];
  s/[[:xdigit:]]+:?\s+//; # remove the address column

  # columns are separated by at least two white space characters
  my @code = split(/\s\s+/);
  my $code = $code[0];
  $code =~ s/\s//g; # make objdump look like ndisasm

  # turn "6A04" into "0x6A,0x04"
  my $dump = '0x' . substr($code, 0, 2);
  for(my $i = 2; $i < length($code); $i += 2)
  {
    $dump .= ',0x' . substr($code, $i, 2);
  }
  push @line, [ $addr . ': ' . join(' ', @code[1..$#code]), $code, $dump ]
}

# With -last_line_is_ofs the final line is not part of the array.
my $nr = 0;
my $max = $#line;
$max -= 1 if (defined($::last_line_is_ofs));

# All lines but the last one end with a separating comma.
while($nr < $max)
{
  printf $LINE, $line[$nr][2] . ',', $line[$nr][0];
  $nr++;
}
printf($LINE . "};\n", $line[$nr][2], $line[$nr][0]);

if (defined($::last_line_is_ofs))
{
  # the last byte of the extra line is the offset
  my $ofs = substr($line[$nr + 1][1], -2, 2);
  printf "enum { ENTRY_POINT_OFS = 0x%x };\n", hex($ofs);
}
[1] A nice introduction for the uninitiated is http://www.tldp.org/LDP/tlk/kernel/processes.html#tth_sEc4.8
[2] Present on Linux (part of glibc), FreeBSD and SunOS.
[3]
[4]
[5] http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html
[6] This means file vmlinux. vmlinuz is compressed and prefixed with a boot-sector. See http://www.tldp.org/LDP/tlk/kernel/processes.html#tth_sEc4.8
[7]
[8]
[9] http://www.tru64unix.compaq.com/docs/base_doc/DOCUMENTATION/HTML/AA-PS31D-TET1_html/TITLE.html
[10]
[11]
[12]
[13]
[14] http://developer.intel.com/design/pentium4/manuals/245470.htm
[15]
[16]
[17]
[18]
[19]
[20] http://www.cs.unm.edu/~maccabe/classes/341/labman/labman.html
[21] This might be the reason that Silvio Cesare does not mention readelf anywhere in his classic works.
[22]
[23]