A Scalable, Ensemble Approach for Building and Visualizing Deep Code-Sharing Networks


By Josh Saxe

The millions of unique malicious binaries gathered in today's white-hat malware repositories are connected through a dense web of hidden code-sharing relationships. If we could recover this shared-code network, we could provide much-needed context for and insight into newly observed malware. For example, our analysis could leverage previous reverse engineering work performed on a new malware sample's older "relatives," giving important context and accelerating the reverse engineering process.

Various approaches have been proposed to see through malware packing and obfuscation to identify code sharing. A significant limitation of these existing approaches, however, is that they are either scalable but easily defeated, or complex but unable to scale to millions of malware samples. A further issue is that even the more complex approaches described in the research literature tend to exploit only one "feature domain," be it malware instruction sequences, call graph structure, application binary interface metadata, or dynamic API call traces, leaving these methods open to defeat by intelligent adversaries.

How, then, do we assess malware similarity and "newness" in a way that both scales to millions of samples and is resilient to the zoo of obfuscation techniques that malware authors employ? In this talk, I propose an answer: an obfuscation-resilient ensemble similarity analysis approach that addresses polymorphism, packing, and obfuscation by estimating code sharing in multiple static and dynamic technical domains at once, such that it is very difficult for a malware author to defeat all of the estimation functions simultaneously. To make this algorithm scale, we use an approximate feature counting technique and a feature-hashing trick drawn from the machine-learning domain, allowing fast feature extraction and fast retrieval of sample "near neighbors" even when handling millions of binaries.
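The two ideas in the paragraph above, feature hashing and combining similarity estimates from several feature domains, can be illustrated with a small Python sketch. This is not the implementation presented in the talk. The domain names, the bucket count, the use of cosine similarity, the plain averaging of per-domain scores, and the brute-force neighbor search are all assumptions chosen for illustration.

```python
import hashlib
import math

NUM_BUCKETS = 1024  # hypothetical size of the hashed feature vector


def hash_features(features, num_buckets=NUM_BUCKETS):
    """Feature hashing: map arbitrary string features to a fixed-length count vector."""
    vec = [0] * num_buckets
    for feat in features:
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % num_buckets
        vec[index] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ensemble_similarity(sample_a, sample_b):
    """Average the per-domain similarities over domains both samples have.

    Each sample is a dict mapping a domain name to a collection of string
    features. Plain averaging is an assumption; a real system might weight
    the domains differently.
    """
    domains = sample_a.keys() & sample_b.keys()
    if not domains:
        return 0.0
    scores = [
        cosine(hash_features(sample_a[d]), hash_features(sample_b[d]))
        for d in domains
    ]
    return sum(scores) / len(scores)


def nearest_neighbors(query, corpus, k=3):
    """Brute-force top-k search; returns (name, score) pairs, best first."""
    scored = [(name, ensemble_similarity(query, s)) for name, s in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical samples with two feature domains each.
sample_a = {
    "instruction_ngrams": ["push ebp|mov ebp,esp", "xor eax,eax|ret"],
    "api_calls": ["CreateFileA", "WriteFile"],
}
sample_b = {
    "instruction_ngrams": ["push ebp|mov ebp,esp", "call eax|nop"],
    "api_calls": ["CreateFileA", "RegSetValueExA"],
}
print(nearest_neighbors(sample_a, {"a_copy": sample_a, "b": sample_b}, k=2))
```

Because every sample becomes a vector of the same fixed length regardless of how many distinct features exist, memory stays bounded as the corpus grows. The linear scan in `nearest_neighbors` would not scale to millions of binaries; fast retrieval at that size would need an index, which this sketch leaves out.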

Our algorithm was developed over the course of three years and has been evaluated both internally and by an independent test team at MIT Lincoln Laboratory. We scored the highest on these tests against four competing malware cluster recognition techniques, and we believe this was because of our unique "ensemble" approach. In the presentation, I will give details on how to implement the algorithm and will go over the results in a series of large-scale interactive malware visualizations. As part of the algorithm description, I will walk through a Python machine learning library, to be released in the conference material, which allows users to detect feature frequencies over billions of items on commodity hardware.
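The abstract does not say how the library estimates feature frequencies. One well-known way to count items approximately in fixed memory is a count-min sketch, shown below as a minimal Python illustration. The class name, the table dimensions, and the choice of SHA-256 as the hash are assumptions, and this is not the released library's API.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using a fixed-size table.

    Estimates never undercount; hash collisions can only inflate them.
    """

    def __init__(self, width=2048, depth=4):
        self.width = width    # counters per row (hypothetical default)
        self.depth = depth    # number of independent hash rows
        self.total = 0        # total count added so far
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Prefix the row number so each row behaves like a different hash.
        data = f"{row}:{item}".encode("utf-8")
        digest = hashlib.sha256(data).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        """Record `count` occurrences of `item`."""
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Return an upper-bound estimate of how often `item` was added."""
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


# Hypothetical usage: count API-call features seen across many samples.
sketch = CountMinSketch(width=64, depth=3)
for _ in range(5):
    sketch.add("CreateFileA")
sketch.add("WriteFile", 2)
print(sketch.estimate("CreateFileA"), sketch.estimate("WriteFile"))
```

Memory use depends only on `width` and `depth`, not on the number of distinct features, which is what makes this family of techniques practical on commodity hardware. Widening the table reduces the overcount caused by collisions.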

Black Hat USA 2014 - Reverse Engineering: A Scalable, Ensemble Approach for Building and Visualizing
