Accurate Identification of Malware by Program Running on Bridges-AI
Convolutional Neural Network Promises Leap Forward in Antivirus Tech
by Ken Chiacchia
Computer experts are good at identifying malware—computer viruses, worms and the like—and writing code to protect us. But even using powerful computing tools, it’s hard to keep up with the small changes in code that malware designers use to make their products escape detection. A team from the University of North Georgia took a new approach: they converted malware into visual images, which a convolutional neural network running on PSC’s Bridges-AI supercomputer then used to identify the malware accurately. The method could make antivirus efforts far faster.
The recurrence plot images created from the malware families (named at the top of each column of images) show similarities between members of each family (the three images under each family name). Scientists aren’t sure whether or how these similarities are significant, but the images allowed an AI running on PSC’s Bridges-AI system to accurately identify malware.
Yong Wei, University of North Georgia
Sara Sartoli, University of North Georgia
Why It’s Important
In 2018, malware cost the global economy a staggering $1 trillion—more than 5 percent of the U.S. GDP. And it only gets worse. The people who design malware are skilled and inventive, and always looking for the next vulnerability. “White hat” hackers who protect us from malware rely on expertise and the fact that malware has to do certain things to get into a computer system and do harm. These patterns in malware code leave a kind of signature that can be spotted by a trained eye—with help from computers. This approach underlies the antivirus protection on our computers, and the antivirus updates that we should be making regularly.
“A signature is a model of known malicious behavior such as a sequence of instructions in the malware code. Creating malware signatures and loading them into databases needs some manual work. And this is often a time-consuming process … On the other hand, Malware developers mutate malwares so that they can not be detected by signature-based antiviruses. They change it over time, and they are very fast. We need new approaches … to cope with the high volume of malware.”—Sara Sartoli, University of North Georgia
The problem is that the people who make malware are as fast as the people who fight it. Faster, since it only takes a few changes to make malware no longer fit its signature. Sara Sartoli and Yong Wei of the University of North Georgia (UNG) wondered whether they could use artificial intelligence (AI)—in particular, a type of AI called a convolutional neural network (CNN)—to leap ahead of the malware makers. They turned to the AI-specialized hardware of PSC’s Bridges-AI supercomputer.
How PSC Helped
Humans are smart, but slow compared with computers. A CNN is far faster, and it has proved itself at identifying visual images accurately. If the UNG scientists could boil malware code down to a simplified visual image, they might be able to use a CNN to automate malware identification and spot new malware faster than current methods, which rely on humans.
Previous attempts to convert malware directly into images had run into a problem, though. Simply changing the size of an image converted in this way fundamentally changes how the image looks, which makes the identification task much harder. So the UNG team used another approach, borrowed from engineers outside the computer security field. Called recurrence plots (RP), this method leverages the tasks within a given piece of software to create similar images when the tasks are similar, regardless of the size of the images. They used RP to create a series of images of common malware families, and then trained a CNN to identify them.
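As a rough illustration of the recurrence-plot idea, and not the UNG team’s implementation, the sketch below treats a file’s bytes as a sequence, resamples it to a fixed length, and marks cell (i, j) whenever the values at positions i and j are within a threshold of each other. The function names, the resampling step, and the threshold eps are all assumptions made for this example; the article does not describe how the team built their plots.

```python
def resample(data, n):
    """Pick n evenly spaced samples so every input yields an n x n plot.
    (Assumed preprocessing step, chosen only for illustration.)"""
    if not data or n <= 0:
        raise ValueError("need non-empty data and a positive size")
    return [data[k * len(data) // n] for k in range(n)]


def recurrence_plot(series, eps=0):
    """Return an n x n binary matrix R where R[i][j] = 1 when
    |series[i] - series[j]| <= eps, i.e. the value 'recurs'."""
    n = len(series)
    return [[1 if abs(series[i] - series[j]) <= eps else 0
             for j in range(n)]
            for i in range(n)]


def bytes_to_rp(blob, size=8, eps=0):
    """Hypothetical helper: raw bytes -> fixed-size recurrence plot."""
    return recurrence_plot(resample(list(blob), size), eps)


# Toy input: a four-byte pattern repeated twice (not real malware).
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x4D, 0x5A, 0x90, 0x00])
rp = bytes_to_rp(sample, size=8, eps=0)

for row in rp:
    print("".join("#" if cell else "." for cell in row))
```

Because the toy input repeats every four bytes, the printed plot shows the main diagonal plus two parallel diagonals offset by four positions. The plot size is set by the size argument rather than by the file length, which is the property the paragraph above describes.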
“PSC and Bridges-AI played a critical and invaluable role in the project. University of North Georgia is a primarily undergraduate institution. [We] do not have access to high performance computing resources needed to train and test the convolutional neural network malware classifier … The powerful NVIDIA V100 GPUs enabled the implementation of the algorithms used in the paper and made the training of our machine-learning models fast.”—Yong Wei, University of North Georgia
A CNN proceeds in training and testing steps. In the UNG team’s training step, they gave the AI a set of images labeled by human experts as coming from either malware or harmless software. The CNN is built from layers of interconnected processing units, in which each layer picks out a particular characteristic of the malware RP. In the training process, the AI adjusts its parameters and the weights of the connections between the layers. By trial and error, the AI learns the patterns of the malware RP. Once training is finished, the AI can identify the class of a new piece of malware from its RP almost instantly.
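To make the idea of a layer picking out a characteristic more concrete, here is a minimal sketch of a single convolutional layer’s forward pass in plain Python: a convolution, a ReLU activation, and max pooling. It is not the team’s network. The toy image, the hand-set kernel, and the function names are assumptions for illustration, and in a real CNN the kernel values would be learned during training rather than written by hand.

```python
def conv2d(image, kernel):
    """'Valid' 2D cross-correlation (no padding, stride 1), the
    operation CNN libraries usually call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(fmap):
    """Keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping size x size max pooling."""
    return [[max(fmap[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


# Toy 5 x 5 image containing one diagonal line (illustrative only).
image = [[1 if r == c else 0 for c in range(5)] for r in range(5)]

# Hand-set kernel that responds to diagonal structure; training
# would normally learn values like these from labeled images.
diag_kernel = [[1, -1],
               [-1, 1]]

features = max_pool(relu(conv2d(image, diag_kernel)))
print(features)
```

The resulting feature map is large where the diagonal pattern appears and zero elsewhere. Stacking several such layers, each with many learned kernels, is what lets a CNN build up from simple patterns to the features it uses for classification.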
Bridges-AI was ideal for the task because of its advanced graphics processing units (GPUs). Originally designed to make video-game images more realistic, GPUs also have the ability to accelerate the training of CNNs dramatically. In particular, they fueled an explosive improvement of AI technology in the early 2010s. The NVIDIA DGX-2 system at the heart of the National Science Foundation-funded Bridges-AI and its pathbreaking Volta V100 GPUs represented a revolutionary new way of combining such elements into a supercomputer specifically designed to make AI faster and more powerful.
“We … transferred the malware into the images and fed those images to the convolutional neural network algorithm, and the algorithm could classify the malware without any expert knowledge. What we are doing here—what distinguishes our approach—is that we don’t know anything about the [malware’s] behaviors other than that the features are recognized by the [algorithm].”—Sara Sartoli, University of North Georgia
The UNG team trained their CNN on both RP images and direct-conversion images of nine common types of malware. They then tested it on new members of these families, comparing the RP and direct-conversion results. The RP-trained CNN’s average performance was better: it accurately identified malware 96.8 percent of the time, compared with 95.7 percent for direct conversion. Even more importantly, its performance was more consistent than direct conversion’s. These results convinced the scientists that their preliminary approach is well worth developing further. They reported these findings in a peer-reviewed presentation at the International Conference on Machine Learning and Applications in December 2020.
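For readers who want to see how an average accuracy and its consistency can be computed, the sketch below scores several test runs and reports the mean and the spread. The labels are made-up toy values, not results from the study, and using the standard deviation across runs as the measure of consistency is an assumption; the article does not say how the team measured it.

```python
from statistics import mean, pstdev


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted family matches the true one."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up toy runs with hypothetical family names "A" and "B".
# Each entry is (true labels, predicted labels) for one test run.
runs = [
    (["A", "B", "A", "B"], ["A", "B", "A", "B"]),
    (["A", "B", "A", "B"], ["A", "B", "B", "B"]),
    (["A", "A", "B", "B"], ["A", "A", "B", "A"]),
]

scores = [accuracy(truth, predicted) for truth, predicted in runs]
average = mean(scores)   # average performance across runs
spread = pstdev(scores)  # smaller spread means more consistent results

print(f"per-run accuracy: {scores}")
print(f"average: {average:.3f}, spread: {spread:.3f}")
```

Two classifiers with similar averages can differ a great deal in spread, which is why a consistency measure is worth reporting alongside the average.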
One avenue of future research, possibly with PSC’s new Bridges-2 system, will be to find out whether these common visual signatures match up to the code signatures. The team would like to know exactly how the CNN recognizes the malware, which will be an important step in validating that its performance is reliable and predictable.