Welcome to 3OHA, a place for random notes, thoughts, and factoids that I want to share or remember
4 October 2021
In 2018 we did a study on the evolution of malware from 1975 to 2016 from a software engineering perspective. We analyzed the source code of 456 malware samples from 428 unique families. We collected this dataset over two years (2015-2016) from different online malware repositories and some historical vx forums. Our dataset is certainly limited and biased, but so is the availability of malware source code. Still, we believe this is the largest dataset of malware source code analyzed so far.
We analyzed malware as a software product, using several software engineering metrics that quantify different aspects of software artifacts. They are grouped into three main categories: size, development effort, and code quality.
Our results suggest an exponential increase of nearly one order of magnitude per decade in aspects such as size and estimated effort, with code quality metrics similar to those of benign software. One example is the number of source code files. The code of viruses and worms developed in the early 2000s is generally distributed across a small number (<10) of files, while some botnets and RATs from 2005 on comprise substantially more. For instance, Back Orifice 2000, GhostRAT, and Zeus, all from 2007, contain 206, 201, and 249 source code files, respectively. After 2010, no sample comprises a single file. Examples of large projects from these years include KINS (2011), SpyNet (2014), and the RIG exploit kit (2014), with 267, 324, and 838 files, respectively.
The growth in terms of source lines of code (SLOC) is also exponential. Up to the mid-1990s, viruses and early worms rarely exceeded 1,000 SLOC. Between 1997 and 2005, most samples contain several thousand SLOC, with a few exceptions above that figure, e.g., Simile (10,917 SLOC) or Troodon (14,729 SLOC). After 2007, many samples have SLOC counts in the range of tens of thousands, for instance GhostRAT (33,170), Zeus (61,752), KINS (89,460), Pony2 (89,758), or SpyNet (179,682). Most such samples correspond to moderately complex malware containing more than just one executable. Typical examples include botnets or RATs featuring a web-based C&C server, support libraries, and various types of bots/trojans. We observe a similar behavior in the case of the function point (FP) count.
The evolution of development effort metrics is also exponential, with values again growing approximately one order of magnitude each decade. While in the 1990s most samples required around one man-month, this value rapidly escalates to 10–20 man-months in the mid-2000s, and to hundreds for a few samples in the most recent years. Overall, effort grows by approximately 11% per year or, equivalently, doubles about every 6.5 years.
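The link between a yearly growth rate and a doubling time follows from compound growth. A minimal sketch, where the function name is illustrative and not taken from the study:

```python
import math


def doubling_time(annual_growth: float) -> float:
    """Years needed for a quantity to double at a constant compound annual growth rate.

    `annual_growth` is a fraction, e.g. 0.11 for 11% per year.
    """
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    return math.log(2) / math.log(1 + annual_growth)


if __name__ == "__main__":
    # 11% per year is the effort growth rate quoted above.
    print(f"{doubling_time(0.11):.1f} years")
```

For an 11% yearly rate this gives a little over six and a half years, in line with the rounded figure quoted above.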
The estimated time to develop the malware samples shows a linear increase up to 2010, rising from 2-3 months in the 1990s to 7-10 months in the most recent years.
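This post does not name the model behind the effort and time estimates. Purely as an illustration, and assuming the basic COCOMO model with its standard "organic" coefficients (an assumption, not something stated here), effort, schedule, and team size can be derived from a SLOC count as follows. The function name and defaults are hypothetical.

```python
def cocomo_basic(sloc: int, a: float = 2.4, b: float = 1.05,
                 c: float = 2.5, d: float = 0.38) -> tuple[float, float, float]:
    """Basic COCOMO estimate (organic-mode coefficients by default; assumed, not from the post).

    Returns (effort in person-months, development time in months, average team size).
    """
    if sloc <= 0:
        raise ValueError("sloc must be positive")
    kloc = sloc / 1000.0
    effort = a * kloc ** b      # person-months
    time = c * effort ** d      # calendar months
    team = effort / time        # average number of developers
    return effort, time, team


if __name__ == "__main__":
    # SLOC counts for two samples mentioned above.
    for name, sloc in [("GhostRAT", 33_170), ("Zeus", 61_752)]:
        effort, time, team = cocomo_basic(sloc)
        print(f"{name}: {effort:.1f} person-months, {time:.1f} months, {team:.1f} people")
```

Because effort grows slightly faster than linearly with size while time grows with a small power of effort, a tenfold increase in SLOC raises effort by more than ten times but lengthens the schedule far less.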
We also studied the extent to which code reuse is present in our dataset. One challenge in detecting code clones is the diversity of programming languages used by the samples. We detected a significant number of code clones across malware families that can be grouped into four main categories:
W32.Remhead and W32.Rovnix
The W32.Dopebot botnet contains shellcode to exploit the CVE-2003-0533 vulnerability, and the same shellcode is found in the W32.Sasser worm. Another good example of this practice is the network sniffer shared by W32.NullBot and W32.LoexBot.
W32.Rbot and W32.LoexBot. Another example is the list of strings found in W32.Hunatchab and W32.Branko, containing the process names associated with different commercial AV software, which both bots try to disable. Furthermore, some samples also share strings containing IP addresses, for example the Sasser worm and the Dopebot botnet.
The W32.Cairuh worm and the W32.Hexbot botnet share a 22,709-line packer. This is the biggest clone we found in our dataset. Another remarkable example is the metamorphic engine shared by the Simile and Metaphor.1d viruses, consisting of more than 10,900 lines of assembly code. Other examples of reused anti-analysis modules are found in W32.Antares and W32.Vampiro, which share the same polymorphic engine, and also in W95.Babyloni and W32.Ramlide, which share the same packing engine. We also found a number of reused instances of code to kill running AV processes, such as the clone found in Hunatchab.c and Branko.c.
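How the clones were detected is not detailed in this post. As a rough sketch of one simple, language-agnostic idea (not necessarily the technique used in the study), the snippet below normalizes source lines and reports runs of consecutive lines that appear in two files. The function names and the window size are illustrative assumptions.

```python
import hashlib


def normalize(line: str) -> str:
    """Collapse whitespace and lowercase so trivial edits do not hide a clone."""
    return " ".join(line.split()).lower()


def window_hashes(source: str, window: int = 5) -> dict[str, int]:
    """Hash every run of `window` consecutive non-blank normalized lines.

    Maps each hash to the index of its first occurrence, where indices
    count non-blank lines only.
    """
    lines = [n for n in (normalize(raw) for raw in source.splitlines()) if n]
    hashes: dict[str, int] = {}
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
        hashes.setdefault(digest, i)
    return hashes


def shared_windows(src_a: str, src_b: str, window: int = 5) -> list[tuple[int, int]]:
    """Return sorted (index_in_a, index_in_b) pairs for windows present in both sources."""
    a = window_hashes(src_a, window)
    b = window_hashes(src_b, window)
    return sorted((a[h], b[h]) for h in a.keys() & b.keys())
```

Overlapping hits can then be merged into longer clone regions. Exact line hashing only catches copies that differ in whitespace or case, so renamed identifiers or reordered statements would need a token-based or tree-based comparison instead.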
The complete study is available in our paper.
A preliminary version (reduced dataset and no code reuse analysis) was presented at RAID 2016.
The dataset and results are available in a GitHub repository.