
██████╗  ██████╗ ██╗  ██╗ █████╗ 
╚════██╗██╔═══██╗██║  ██║██╔══██╗
 █████╔╝██║   ██║███████║███████║
 ╚═══██╗██║   ██║██╔══██║██╔══██║
██████╔╝╚██████╔╝██║  ██║██║  ██║
╚═════╝  ╚═════╝ ╚═╝  ╚═╝╚═╝  ╚═╝

Welcome to 3OHA, a place for random notes, thoughts, and factoids that I want to share or remember


4 October 2021

The malsource dataset

In 2018 we did a study on the evolution of malware from 1975 to 2016 from a software engineering perspective. We analyzed the source code of 456 malware samples from 428 unique families. We collected this dataset over two years (2015-2016) from different online malware repositories and some historical vx forums. Our dataset is certainly limited and biased, but so is the availability of malware source code. Still, we believe this is the largest dataset of malware source code analyzed so far.


Evolution

We analyzed malware as a software product, applying several software engineering metrics that quantify different aspects of software artifacts. The metrics fall into three main categories:

Measures of size:
- Number of source lines of code (SLOC)
- Number of source files
- Number of different programming languages used
- Number of function points (FP)
Estimates of the cost of developing the sample:
- Effort (man-months)
- Required time
- Number of programmers
Measures of code quality:
- Comment-to-code ratio
- Complexity of the control flow logic
- Maintainability of the code

Our results suggest an exponential increase of nearly one order of magnitude per decade in aspects such as size and estimated effort, with code quality metrics similar to those of benign software. One example is the number of source code files. The code of viruses and worms developed in the early 2000s is generally distributed across a small number (<10) of files, while some botnets and RATs from 2005 on comprise substantially more. For instance, Back Orifice 2000, GhostRAT, and Zeus, all from 2007, contain 206, 201, and 249 source code files, respectively. After 2010, no sample comprises a single file. Examples of large projects from these years include KINS (2011), SpyNet (2014), and the RIG exploit kit (2014) with 267, 324, and 838 files, respectively.

The growth in terms of SLOC is also exponential. Up to the mid 1990s, viruses and early worms rarely exceeded 1,000 SLOC. Between 1997 and 2005 most samples contain several thousand SLOC, with a few exceptions above that figure, e.g., Simile (10,917 SLOC) or Troodon (14,729 SLOC). After 2007, many samples have SLOC counts in the range of tens of thousands, for instance GhostRAT (33,170), Zeus (61,752), KINS (89,460), Pony2 (89,758), or SpyNet (179,682). Most of these samples correspond to moderately complex malware containing more than just one executable. Typical examples include botnets or RATs featuring a web-based C&C server, support libraries, and various types of bots/trojans. We observe a similar behavior in the case of the FP count.

The evolution of development effort metrics is also exponential, with values again growing approximately one order of magnitude each decade. While in the 1990s most samples required around one man-month, this value rapidly escalates to 10–20 man-months in the mid 2000s, and to hundreds for a few samples in the most recent years. Overall, the effort growth ratio per year is approximately 11%; or, equivalently, it doubles every 6.5 years.
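The link between a yearly growth rate and a doubling time is plain compound-growth arithmetic. The short sketch below (function names are illustrative, not taken from the paper) converts between the two; under a constant 11% yearly rate the doubling time comes out at roughly 6.6 years, in line with the rounded figure above.

```python
import math

def doubling_time(rate: float) -> float:
    """Years needed to double at a constant yearly growth rate (0.11 means 11%)."""
    return math.log(2) / math.log(1 + rate)

def growth_rate_for(doubling_years: float) -> float:
    """Constant yearly growth rate that doubles a quantity in `doubling_years`."""
    return 2 ** (1 / doubling_years) - 1

print(f"11% per year doubles in {doubling_time(0.11):.2f} years")
print(f"Doubling in 6.5 years needs {growth_rate_for(6.5):.1%} per year")
```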

The estimated time to develop the malware samples shows a linear increase up to 2010, rising from 2-3 months in the 1990s to 7-10 months in the most recent years.
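This post does not name the cost model behind the effort, time, and team-size estimates. As a purely illustrative sketch, the snippet below assumes the basic COCOMO model in organic mode, with Boehm's published coefficients, and derives all three estimates from a SLOC count. The function name `cocomo_basic` is hypothetical, and the numbers it prints should not be expected to match the ones reported in the paper.

```python
# Basic COCOMO, "organic" mode coefficients (Boehm, 1981).
# Assumption: the cost model and mode are chosen for illustration only;
# the post does not say which model produced its estimates.
A, B = 2.4, 1.05   # effort   = A * KSLOC ** B   (person-months)
C, D = 2.5, 0.38   # schedule = C * effort ** D  (months)

def cocomo_basic(sloc: int) -> tuple[float, float, float]:
    """Return (effort in person-months, schedule in months, average team size)."""
    ksloc = sloc / 1000.0
    effort = A * ksloc ** B
    schedule = C * effort ** D
    team = effort / schedule
    return effort, schedule, team

# SLOC counts quoted in the text above.
for name, sloc in [("GhostRAT", 33_170), ("Zeus", 61_752), ("SpyNet", 179_682)]:
    effort, schedule, team = cocomo_basic(sloc)
    print(f"{name}: {effort:.1f} person-months, {schedule:.1f} months, {team:.1f} people")
```

Because the effort exponent is slightly above 1 and the schedule exponent is well below 1, effort grows a bit faster than size while calendar time grows much more slowly. This is one way an exponential trend in effort can coexist with a far flatter trend in development time.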


Code reuse

We also studied the extent to which code reuse is present in our dataset. One challenge in detecting code clones is the diversity of programming languages used by the samples. We detected a significant number of code clones across malware families that can be grouped into four main categories:

- Operational data structures and functions. These are libraries to manipulate system and networking artifacts, such as executable file formats (PE and ELF), communication protocols (TCP, HTTP), and services (SMTP, DNS). We also observe a number of clones consisting of headers for several API functions needed to interact with the Windows kernel, such as the 3,054-line clone shared by W32.Remhead and W32.Rovnix.
- Code implementing malicious artifacts, such as infection, spreading, or actions on the victim. For instance, the W32.Dopebot botnet contains shellcode to exploit the CVE-2003-0533 vulnerability, and the same shellcode is found in the W32.Sasser worm. Another good example of this practice is the network sniffer shared by W32.NullBot and W32.LoexBot.
- Data clones. Some of the clones are not code, but data structures that appear in multiple samples. An example is the array of frequent passwords present in both W32.Rbot and W32.LoexBot. Another example is the list of strings found in W32.Hunatchab and W32.Branko, containing the process names associated with different commercial AV software, which both bots try to disable. Furthermore, some samples also share strings containing IP addresses, for example the Sasser worm and the Dopebot botnet.
- Anti-analysis capabilities. The W32.Cairuh worm and the W32.Hexbot botnet share a 22,709-line packer. This is the biggest clone we found in our dataset. Another remarkable example is the metamorphic engine shared by the Simile and Metaphor.1d viruses, consisting of more than 10,900 lines of assembly code. Other examples of reused anti-analysis modules are found in W32.Antares and W32.Vampiro, which share the same polymorphic engine, and also in W95.Babyloni and W32.Ramlide, which share the same packing engine. We also found a number of reused instances of code to kill running AV processes, such as the clone found in Hunatchab.c and Branko.c.
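To give a feel for what a clone between two samples looks like in practice, here is a minimal, language-agnostic sketch that reports runs of identical lines shared by two source files. It is not the detection pipeline used in the study: it relies only on Python's standard `difflib`, ignores renamed identifiers, and the helper names and toy inputs are made up for illustration.

```python
from difflib import SequenceMatcher

def normalize(source: str) -> list[str]:
    """Strip whitespace and drop blank lines so formatting changes do not hide clones."""
    return [line.strip() for line in source.splitlines() if line.strip()]

def shared_blocks(src_a: str, src_b: str, min_lines: int = 3) -> list[tuple[int, int, int]]:
    """Return (start_a, start_b, length) for runs of identical normalized lines."""
    a, b = normalize(src_a), normalize(src_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size >= min_lines]

# Toy inputs: two unrelated files that embed the same data structure.
sample_a = """
int x;
char *passwords[] = {
    "admin",
    "root",
    "123456",
};
void scan(void);
"""
sample_b = """
#include <stdio.h>
char *passwords[] = {
  "admin",
  "root",
  "123456",
};
int main(void) { return 0; }
"""

print(shared_blocks(sample_a, sample_b))  # [(1, 1, 5)]
```

The reported indices refer to the normalized line lists, not to the original files. Exact line matching like this only catches verbatim copies such as shared data arrays or headers; clones with renamed variables or reordered statements need token-based or syntax-aware techniques.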

Full report and dataset

The complete study is available in our paper:

A. Calleja, J. Tapiador, and J. Caballero. The MalSource Dataset: Quantifying Complexity and Code Reuse in Malware Development. IEEE Transactions on Information Forensics and Security, Vol 14, No. 12, pp. 3175-3190, 2018.

A preliminary version (reduced dataset and no code reuse analysis) was presented at RAID 2016:

A. Calleja, J. Tapiador, and J. Caballero. A look into 30 years of malware development from a software metrics perspective. RAID 2016.

The dataset and results are available in a GitHub repository.