██████╗ ██████╗ ██╗ ██╗ █████╗ ╚════██╗██╔═══██╗██║ ██║██╔══██╗ █████╔╝██║ ██║███████║███████║ ╚═══██╗██║ ██║██╔══██║██╔══██║ ██████╔╝╚██████╔╝██║ ██║██║ ██║ ╚═════╝ ╚═════╝ ╚═╝ ╚═╝╚═╝ ╚═╝
Welcome to 3OHA, a place for random notes, thoughts, and factoids that I want to share or remember
3 August 2024
Two of the most important impossibility results in classic computer virology are the undecidability of virus decection and the existence of undetectable viruses.
In his 1987 paper Computer Viruses: Theory and Experiments, Fred Cohen gives the first formal account of computer viruses. He defines a computer virus as:
a program that can 'infect' other programs by modifying them to include a possibly evolved copy of itself.Interestingly, the focus of this definition is not on the virus payload (i.e., what the virus does on the infected host), but on the infection property itself. The key defining attribute of a computer virus is thus its ability to propagate throughout a computer system or network to infect other programs, which in turn may also act as a virus.
Cohen examines the potential for detection of computer viruses through a decision procedure D, which returns true if the input is a virus and false otherwise. The strategy to determine if a program P is a virus is to identify its key defining property, i.e., whether P infects other programs or not. Using a classic diagonal argument, it is trivial to prove that this problem is undecidable: P can invoke D and infect other programs if and only if D determines that P is not a virus. Cohen calls this construction a contradictory-virus (CV), and is practically equivalent to Turing's proof of the undecidability of the halting problem:
program contradictory-virus: if D(contradictory-virus) is false then infect(); else do something else;
An important remark about this proof is that the detector D is assumed to be static (Cohen uses the wording "precise determination of a virus by its appearance"). If D implements dynamic detection that requires running the program ("detection by behavior," using Cohen's words), the argument above is not valid but nonetheless the overal result holds. The proof is now as follows. Since the virus would only be triggered by particular inputs, behavioral detection needs to detect triggering. But in order to detect what the triggering is, one needs to detect the virus statically. This implies that precise dynamic detection requires precise static detection (of the inputs), which is undecidable. It then follows that precise dynamic detection is also undecidable.
Another relevant impossibility result from Cohen's paper is the undecidability of determining if two programs (computer viruses), P1 and P2, are evolutions of the same computer virus. The proof is, again, a classic diagonal argument where the decision procedure D presumably capable of finding their equivalence is invoked by P1 and P2. If P1 and P2 are both found to be equivalent, they perform different operations; and if they are found different, they act the same.
Cohen notes that the previous proofs do not imply that virus detection is impossible in particular cases, only in the general case. Interestingly, he discusses some of the key strategies that would later become prevalent in anti-virus technologies. He explicitly describes how to use a sequence of bytes as a signature (he does not use the term signature, though) to detect some of the examples discussed in the paper. He discusses how a virus can evade signature detection by mutating the parts of its code that can are used to build the signature. He also puts forward the idea of using "potentially illegitimate behavior in a decidable an deasily computable way" as the basis for dynamic detection, and anticipates that detection will have to deal with false positives and, especially, false negatives, if we are to detect a large number of viruses.
Chess and White published in 2000 a paper titled An Undetectable Computer Virus where they introduce the notion of a viral set:
Consider the set of programs which produce one or more programs as output. For any pair of programs p and q, we say that p eventually produces q if and only if p produces q either directly or through some steps. A viral set V is a maximal set of programs such that for every pair of programs p and q in V, p eventually produces q, and q eventually produces p.
The notion of a viral set is somewhat equivalent to the idea of computer virus as defined by Cohen. The simplest virus is a viral set that contains exactly one program, which produces itself. Larger viral sets model polymorphic viruses, which produce evolved copies of itself.
Chess and White's key result is complementary to Cohen's detection impossibility result. Cohen states that for every detection algorithm there is some virus that it does not detect. Chess and White prove that there is a virus (in the viral set sense) that no algorithm perfectly detects. Such an undetectable virus is named W, and the following is an instance r:
if subroutine_one(r) then exit; else { replace the text of subroutine_one with a random program; spread; exit; } subroutine_one (arg): return false;
Assume that C is a detection algorithm for W. For any such C there is an instance s in W like this:
if subroutine_one(s) then exit; else { replace the text of subroutine_one with a random program; spread; exit; } subroutine_one (arg): return C(arg);
Note that C does not result the correct result for s. If C(s) identifies s as a virus and returns true, then s exits, so it's not an instance of W or any other viral set. Conversely, if C(s) returns false, then s spreads as an instance of W.
In Chess and White's own words:
the main practical impact of the current result is to dispel the notion that it is always possible to create a detector for a given virus that has no false positives, even if you have a copy of the virus in hand.