You cannot have a science without measurement.
R. W. Hamming
Building executables from C source code is a complex task. An innocent-looking call of gcc will invoke a pre-processor, a multi-pass compiler, an assembler and finally a linker. Using all these tools to plant virus code into another executable makes the result either prohibitively large or very dependent on the completeness of the target installation.
Real viruses approach the problem from the other end. They are aggressively optimized for code size and do only what's absolutely necessary. Basically they just copy one chunk of code and patch a few addresses at hard-coded offsets.
However, this has drastic effects:
Since we directly copy binary code, the virus is restricted to a particular hardware architecture.
Code must be position independent.
Code cannot use shared libraries; not even the C runtime library.
We cannot allocate global variables in the data segment.
There are ways to circumvent these limitations, but they are complicated and make the virus more likely to fail.
Another natural limitation of viruses is their rigid dependency on the file format of target executables. These formats differ a lot, even on the same hardware architecture and under the same operating system. Furthermore, executables are not designed with post-link-time modifications in mind. It's rare for a virus to support more than one infection method. This document is about the format used on recent versions of Linux, FreeBSD and Solaris. [1]
This format is well documented. Some public resources:
Source code of Linux and FreeBSD. Admittedly not for the faint of heart. [2]
/usr/include/elf.h [3]
Portable Formats Specification, Version 1.1. [4]
Linux Standard Base Specification [5]
NetBSD ELF FAQ [6]
Creating Really Teensy ELF Executables for Linux [7]
A quote from the Portable Formats Specification:
The Executable and Linking Format was originally developed and published by UNIX System Laboratories (USL) as part of the Application Binary Interface (ABI). The Tool Interface Standards committee (TIS) has selected the evolving ELF standard as a portable object file format that works on 32-bit Intel Architecture environments for a variety of operating systems.
Actually ELF covers object files (.o), shared libraries (.so) and executable files. The Linux kernel [8] is also a valid ELF file.
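The distinction between these kinds of file is recorded in a single header field, e_type. The following is a minimal sketch of reading it, assuming the glibc version of elf.h; the function name elf_kind and its return strings are made up for illustration and do not appear elsewhere in this document.

```c
#include <assert.h>
#include <elf.h>      /* glibc's elf.h; other systems may lay this out differently */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: name the kind of ELF file whose first bytes are in buf. */
const char *elf_kind(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    /* e_type directly follows the 16-byte e_ident array in both the
       32-bit and the 64-bit header, so the Elf32_Ehdr offset serves both. */
    size_t off = offsetof(Elf32_Ehdr, e_type);
    if (len < off + 2 || memcmp(buf, ELFMAG, SELFMAG) != 0)
        return "not ELF";

    /* Multi-byte fields use the byte order named in e_ident[EI_DATA]. */
    unsigned type;
    if (buf[EI_DATA] == ELFDATA2LSB)
        type = buf[off] | (buf[off + 1] << 8);
    else if (buf[EI_DATA] == ELFDATA2MSB)
        type = (buf[off] << 8) | buf[off + 1];
    else
        return "ELF, unknown byte order";

    switch (type) {
    case ET_REL:  return "relocatable object file";
    case ET_EXEC: return "executable file";
    case ET_DYN:  return "shared object";  /* also used by position independent executables */
    case ET_CORE: return "core file";
    default:      return "other ELF type";
    }
}
```

Reading the first 18 bytes of a file into a buffer and passing them to elf_kind is enough, since the function looks at nothing beyond e_type.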
GNU binutils provides two utilities to view ELF headers, objdump and readelf. [9] The functionality of both tools overlaps, but I think the output of readelf is nicer. On Solaris the native tools for this purpose are called dump and elfdump.
ELF is used for a variety of both 32-bit and 64-bit architectures. Obviously you need to handle assembly language for each platform. Good starting points are "Linux Assembly" [10] and "Assembly Language Related Web Sites". [11]
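To give a feel for what such tools read from the start of a file, here is a rough sketch that decodes three header fields: word size, byte order and target machine. It assumes glibc's elf.h for the constants. The function elf_summary and its output format are invented for this example and are not modeled on the actual output of readelf or objdump.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write a one-line summary of an ELF header into out.
   Returns 0 on success, -1 if buf does not hold a recognizable ELF header. */
int elf_summary(const unsigned char *buf, size_t len, char *out, size_t outlen)
{
    assert(buf != NULL && out != NULL);

    size_t off = offsetof(Elf32_Ehdr, e_machine);  /* same offset in Elf64_Ehdr */
    if (len < off + 2 || memcmp(buf, ELFMAG, SELFMAG) != 0)
        return -1;

    const char *cls = buf[EI_CLASS] == ELFCLASS32 ? "32-bit"
                    : buf[EI_CLASS] == ELFCLASS64 ? "64-bit" : NULL;
    int lsb = buf[EI_DATA] == ELFDATA2LSB;
    if (cls == NULL || (!lsb && buf[EI_DATA] != ELFDATA2MSB))
        return -1;

    /* e_machine is a 16-bit value stored in the file's own byte order. */
    unsigned machine = lsb ? buf[off] | (buf[off + 1] << 8)
                           : (buf[off] << 8) | buf[off + 1];

    const char *name;
    switch (machine) {          /* only a handful of machines, for illustration */
    case EM_386:    name = "Intel 80386"; break;
    case EM_SPARC:  name = "SPARC";       break;
    case EM_ALPHA:  name = "Alpha";       break;
    case EM_X86_64: name = "x86-64";      break;
    default:        name = "other";       break;
    }

    snprintf(out, outlen, "%s, %s endian, machine %u (%s)",
             cls, lsb ? "little" : "big", machine, name);
    return 0;
}
```

The first 20 bytes of a file are sufficient input. The real tools decode far more fields, so this is best read as an orientation aid and not as a substitute for them.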
Introduction to Alpha [12]
Alpha Assembly Language Guide [13]
Assembly Language Programmer's Guide [14]
Assembly-HOWTO. [15] Description of tools and sites for Linux.
FAQ of comp.lang.asm.x86 [16]
"Robin Miyagi's Linux Programming" [17] features a tutorial and interesting links.
"Assembly resources" [18] covers advanced topics.
IA-32 Intel Architecture Software Developer's Manual [19]
"The Place on the Net to Learn Assembly Language Programming" [20]
The Art of Assembly Language. 32-bit Linux Edition Featuring HLA. [21]
X86 Architecture, low-level programming, freeware [22]
Dr. Dobb's Microprocessor Resources [23]
FreeBSD Assembly Language Tutorial [24]
A debugger lets you see what is going on "inside" another program while it executes. gdb can also show a plain disassembly of the code, and can do so without executing a single instruction. This listing does not include a hex dump of opcodes, however. On the other hand pure disassemblers take shortcuts; they don't have a complete picture of the target executable.
objdump is part of GNU binutils. It is advertised as a means to display information from object files, but it also works on executables and provides the option --disassemble. Since it does not resolve function names in shared libraries it cannot fully replace gdb, though.
By default all GNU disassembly tools adhere to the syntax of the GNU assembler. Veterans of i386 programming consider this style repulsive, however. gdb provides statement set disassembly-flavor intel to lower the contrast. And objdump has option -Mintel for similar effect. Still I prefer ndisasm [29] on i386 and will use it where possible. This tool has absolutely no understanding of ELF (or any other file format). But for the scope of this document this is a feature. The calculations necessary to get at the interesting bytes are interesting themselves.
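One calculation of this kind is translating a virtual address, such as the entry point, into an offset in the file. The sketch below is an assumption about how that can be done from the program header table, not code taken from this document: it uses the Elf32_Phdr type from glibc's elf.h, and the name vaddr_to_offset is hypothetical.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/* Hypothetical helper: translate a virtual address into a file offset by
   searching the PT_LOAD entries of a 32-bit program header table.
   Returns 1 and stores the result in *offset, or returns 0 if vaddr is
   not inside the file-backed part of any loadable segment. */
int vaddr_to_offset(const Elf32_Phdr *phdr, size_t count,
                    Elf32_Addr vaddr, Elf32_Off *offset)
{
    assert(phdr != NULL && offset != NULL);

    for (size_t i = 0; i < count; i++) {
        const Elf32_Phdr *p = &phdr[i];
        if (p->p_type != PT_LOAD)
            continue;
        /* Only the first p_filesz bytes of a segment exist in the file;
           the remainder up to p_memsz is zero-filled at load time. */
        if (vaddr >= p->p_vaddr && vaddr - p->p_vaddr < p->p_filesz) {
            *offset = p->p_offset + (vaddr - p->p_vaddr);
            return 1;
        }
    }
    return 0;
}
```

With illustrative values, a segment that maps file offset 0 to address 0x08048000 puts address 0x08048080 at file offset 0x80. That number is what ndisasm expects for its -e option (bytes to skip), while -o sets the address shown in the listing.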
In this document input for assemblers (including nasm) is stored in .S files. Traditional cc implementations treat such a file as "assembler code which must be preprocessed by cpp". This is required on the Alpha platform, where symbolic names for registers are not part of the assembly language. Output of disassemblers ends up as .asm.
The primary quality of this document is reproducibility. Every tiny bit of information should be proved by a working example. Since I don't trust myself, all output files are rebuilt for every release. All sections titled "Output" are the real product of source code and shell scripts included in this document. Most numbers and calculations are processed by a Perl script parsing these output files.
The document itself is written in DocBook, [30] an XML document type definition. [31] Conversion to HTML is the last step of a Makefile that builds and runs all examples. However, this means that I can't provide one document comparing two platforms. Instead I set up everything for conditional compilation. I then build one consistent variation of the document on a single system.
You are now reading the platform independent part. The links below lead to actual examples, and the actual story of constantly improving technique. This part continues with general topics and larger chunks of source code. It is a bit like a huge appendix, since the platform parts frequently refer to chapters here.