Malware and Machine Learning - Computerphile

Do antivirus programs use machine learning? Dr Fabio Pierazzi looks at the trends and challenges.




This video was filmed and edited by Sean Riley.


Comments

I've spent many years in the industry, and the biggest hurdle I've seen to having more dynamic identification is false positives. More specifically, stopping users from their day-to-day activities because something has been determined to be malicious. Users are MUCH more forgiving of false negatives (actual infections) than false positives.

ClifBratcher
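
To make the false-positive versus false-negative trade-off in the comment above concrete, here is a minimal sketch of how the two error rates are computed. The labels and predictions are hypothetical toy data, not results from any real scanner, and `error_rates` is an illustrative helper name.

```python
def error_rates(labels, predictions):
    """Return (false_positive_rate, false_negative_rate).

    Convention assumed here: 1 means malicious, 0 means benign.
    """
    pairs = list(zip(labels, predictions))
    false_pos = sum(1 for y, p in pairs if y == 0 and p == 1)
    false_neg = sum(1 for y, p in pairs if y == 1 and p == 0)
    benign = sum(1 for y in labels if y == 0)
    malicious = sum(1 for y in labels if y == 1)
    fpr = false_pos / benign if benign else 0.0
    fnr = false_neg / malicious if malicious else 0.0
    return fpr, fnr


# Hypothetical toy data: 8 benign files followed by 2 malicious ones.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

fpr, fnr = error_rates(labels, predictions)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

Because benign files vastly outnumber malicious ones in practice, even a small false positive rate can translate into many interrupted users, which is one way to read the commenter's point.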

I did machine learning for ransomware detection as part of my thesis. The problem I had was obtaining data for the newest variants. The model needed consistent retraining to keep up with the new malware.

KamiKze

In my experience, the biggest hurdle I faced while using ML for malware detection or behavior detection was choosing and extracting the features. Often the selected features overlap between malicious and benign software (e.g. sequences of API calls). Unlike static and dynamic detection, which work on heuristics written by an experienced analyst, ML models learn these heuristics on their own during training. And most of the time the heuristics learned by the ML model do not actually make sense. At the end of the day, ML models work on pattern detection. It is really difficult to make the model learn the actual features that are responsible for behavior rather than some random recurring features in the dataset. As a result, we end up getting a high false positive rate.

saasthavasan
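
The feature-overlap problem described above can be illustrated with a small sketch that turns API call traces into n-gram features and measures how much two traces share. Both traces are hypothetical examples made up for illustration (the names are real Windows API functions, but the sequences are not taken from any real program), and `api_ngrams` is an assumed helper name.

```python
def api_ngrams(calls, n=2):
    """Return the set of length-n windows over a sequence of API call names."""
    return {tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)}


# Hypothetical traces: a file copy and a file-encrypting routine.
benign_trace = ["CreateFileW", "ReadFile", "WriteFile", "CloseHandle"]
suspicious_trace = ["CreateFileW", "ReadFile", "CryptEncrypt",
                    "WriteFile", "CloseHandle"]

benign_features = api_ngrams(benign_trace)
suspicious_features = api_ngrams(suspicious_trace)

shared = benign_features & suspicious_features
jaccard = len(shared) / len(benign_features | suspicious_features)
print(f"shared bigrams: {len(shared)}, Jaccard similarity: {jaccard:.2f}")
```

Even in this toy case, two of the five distinct bigrams appear in both traces, so a model can easily latch onto features that do not separate the classes.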

I think there might be a mistake in the diagram at 17:10. The red slice should be test data and the remaining slices should be used for training.
In any case, great video once again.

PHF
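
The split PHF describes is standard k-fold cross-validation: in each round one slice is held out for testing and the remaining slices are used for training. A minimal sketch in plain Python, assuming contiguous folds and no shuffling (`k_fold_splits` is an illustrative name, not a function from the video):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                # the held-out slice
        train = indices[:start] + indices[start + size:]  # everything else
        yield train, test
        start += size


splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every sample lands in the test slice exactly once across the five rounds, and it is never in the training set of the round that tests on it.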

Can we talk about those flawless freehand bell curves?!

samcooke

I would argue Machine Learning is already very prevalent in industry. As someone who has worked in Malware detection for both Microsoft and Amazon, we leverage large tree models and even large language models for detection.

kenbobcorn

I could talk to that guy over a pint for like three hours. He's oversimplifying here for a general viewer but this topic is fascinating. Thanks for the video.

trymoto

There is another way to think about this issue, one that is not talked about as much: separating data systems, and the data itself, into public and private. Because of the increase in online usability and transparency, much of the data is exposed to all these forms of attack. The monetisation of data and proprietary IP also creates a reason to profit from it on both sides of the data fence. If you cannot access it directly, it is less likely to be stolen; if the stored information is not valuable, it becomes pointless to steal it. If the identity requirements are removed or reduced, the identity is less valuable. Everything is a trade-off. Pattern-matching machine algorithms (ML and AI) are limited by the algorithm's parameters.

GaryParris

I'm on a team that releases a free open source app. For a while, every time we released a new version we would get a handful of false positive reports from users whose virus scanners tripped on it. Seems like some of the companies just give up and flag everything that isn't in their whitelist when faced with an essentially unsolvable task.

HebaruSan

Prevalence data and diversity of behaviour are two important criteria. It's difficult to mount an adversarial attack on models that are behaviour dependent. These modern ML approaches to cyber security use static and dynamic behaviour encoding to stop malware. Cylance ML models are an example of this.

shiladityasircar

Using ML to group different types of malicious applications into different families makes the process of malware detection more adaptive, yet we are still getting zero days where a malicious application succeeds by appearing benign.

In the medical sciences, there have been many problems, discovered later, where the features used by ML did not accurately predict on new data. This is because researchers let the ML program determine its own features, and the ML program lacked domain expertise. This has resulted in many new companies heavily investing in PhD researchers to prepare the data and relevant features to then run in the model.

In cybersecurity, we will still need the human element for similar reasons.

christersmith

I assumed this was going to be about malware that uses machine learning. Terrifying.

GenaTrius

Actually, there are ways to implement this safely: use it as a trigger value and not as the decision engine. Drilling down into the actual detection tree, there are many different ways of compromise, but they can be handled and they are still limited. In short, keeping track of execution, persistence and escalation is the first step, with this as a possible helper.

"EDR/XDR" can be quite sufficient in spanning a larger chain of "observant" behaviour; i.e., the detection engine itself does not have to use it, but acting on the data and piecing it together does benefit from this field.

I do however agree that taking on the whole chain of compromise gets really tricky.

Static and/or dynamic binary analysis is such a small portion of the whole indicator chain, but if you train something on the actual portions, be it a buffer overflow etc., it can be used, in my opinion.

ewookiis

It doesn't help that a lot of false positives are generated by detectors actively equating software piracy with malware. In many cases the techniques are similar, so the issue cannot entirely be dismissed, but even when the techniques are exclusive to piracy, detectors often have a high motivating factor to keep identifying piracy techniques as false positives for "malware", particularly those companies which write both detectors and high-profile commercial software such as Microsoft itself, or who are incentivized by them.

delusionnnnn

I feel like many areas of modern ML, including this one, either do or could benefit greatly from continual learning (which, from my understanding, is synonymous with iterative online learning; if they're different, I'd appreciate an explanation of how!). Now, if only we could make that practically efficient on the massive networks of hundreds of billions of parameters or more 😁

IceMetalPunk
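
As a rough illustration of the online learning mentioned above, here is a minimal sketch of a perceptron that updates on one labelled sample at a time instead of retraining from scratch. The two-feature stream is hypothetical toy data, `online_update` is an assumed name, and a perceptron is only one simple stand-in for the much larger models the comment has in mind.

```python
def online_update(weights, bias, x, y, learning_rate=1.0):
    """Apply one perceptron step; y is +1 (malicious) or -1 (benign)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    if y * score <= 0:  # misclassified or on the boundary: adjust the model
        weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
        bias = bias + learning_rate * y
    return weights, bias


weights, bias = [0.0, 0.0], 0.0

# Hypothetical stream of (features, label) pairs arriving over time.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.0], 1)]
for x, y in stream:
    weights, bias = online_update(weights, bias, x, y)

print(weights, bias)
```

The model only changes when a new sample is misclassified, which is what makes the per-sample cost small compared with full retraining.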

ML evaluates malware as adversarial code execution that is identified as malicious. That detection relies on behavior, which is itself a signature: a representation unique enough to recognize that it has been deployed. How is a behavior signature not like a fingerprint?

CodingTrades

So machine learning models, such as classifiers, require a labeled dataset for supervised training.
So are there datasets of malware? Maybe like the vx underground vault?

Veptis

It's not quite over-fitting, it's just trained for different threats. The problem is that the patterns would change, as if a panda suddenly didn't mean panda but dog, and the ML system cannot adapt to that.

Maybe a more fitting image would be this: you had a few images of pandas in your training data, and the ML system recognizes them as pandas very well, but now the context has changed and dogs are now also pandas. So it should recognize dogs as pandas but it doesn't, as it has either been trained to recognize dogs as dogs, or not trained on them at all, and the images look so different that it has no way of linking the dog to the panda.

celivalg
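
The shift described above can be sketched with a deliberately tiny model: a single numeric feature and a threshold fitted on old data, then evaluated on newer malware whose feature values have moved toward the benign range. All numbers are hypothetical, and `fit_threshold` and `detection_rate` are illustrative names.

```python
def fit_threshold(benign, malicious):
    """Place the decision threshold midway between the two class means."""
    mean_benign = sum(benign) / len(benign)
    mean_malicious = sum(malicious) / len(malicious)
    return (mean_benign + mean_malicious) / 2


def detection_rate(malicious_samples, threshold):
    """Fraction of malicious samples scoring above the threshold."""
    flagged = sum(1 for x in malicious_samples if x > threshold)
    return flagged / len(malicious_samples)


# Hypothetical training data collected in an earlier period.
old_benign = [1, 2, 3]
old_malware = [8, 9, 10]
threshold = fit_threshold(old_benign, old_malware)

# Hypothetical newer malware that has drifted toward benign-looking values.
new_malware = [4, 5, 9]

rate_old = detection_rate(old_malware, threshold)
rate_new = detection_rate(new_malware, threshold)
print(f"threshold={threshold}, old={rate_old:.2f}, new={rate_new:.2f}")
```

The model fits the old data perfectly, yet detection drops on the new samples because the data changed, not because the model memorized noise.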

It seems to me that the hunt for bells, whistles and bling in applications leads to an enhanced attack surface which allows malware.
I wrote a secure interface (a long time ago). It was doable because the range of API calls I had to intercept was very limited, and I could parse all possible legit parameters and reject the rest. The code was documented and could be checked by my peers.
Move to a GUI based environment with more levels of abstraction and the operating system being invoked the whole time for sound or video or malice - no chance.
Security starts from the operating system (disclaimer - Windows user - I do hope the antivirus people know their stuff).

andrewharrison

Heard 20 seconds of the video, and… yes, he’s Italian like me.
Stepping aside from this inside joke, great content!

FrancescoBazzani