Detecting new crypto-mining attack targeting Kubeflow and TensorFlow
Published 07/23/2021
This blog was originally published by Sysdig here
Written by Stefano Chierici, Security Researcher, Sysdig
Microsoft has discovered a new large-scale attack targeting Kubeflow instances to deploy malicious TensorFlow pods, using them to mine Monero cryptocurrency in Kubernetes cluster environments.
Kubeflow is a popular open-source framework often used for running machine learning tasks in Kubernetes. TensorFlow, on the other hand, is an open-source machine learning platform used for implementing machine learning in a Kubernetes environment.
Machine learning workloads require huge amounts of processing power, and they often have access to GPUs. As those are the same requirements for a crypto mining rig, clusters running Kubeflow and TensorFlow are the perfect targets for these attacks.
One would think that with the value of crypto mining crashing, crypto mining attacks should be less frequent. However, not only do we keep seeing these attacks, but we are also discovering new variants specifically targeting machine learning workloads.
Since the attack is very recent and may still be ongoing, any new Kubernetes cluster that runs Kubeflow could already be compromised.
Affected users may notice degraded performance and an increase in their infrastructure bills due to the intensive crypto mining process.
In this article, we'll introduce you to Kubeflow and TensorFlow, track how this attack works, and cover steps to detect whether you are affected and mitigate its effects.
What are Kubeflow and TensorFlow?
Kubeflow is an end-to-end Machine Learning (ML) project that provides infrastructure components for each stage in the ML lifecycle. It is built on top of Kubernetes to reuse k8s core components, adapting them for specific machine learning use cases.
Kubeflow provides different components to support the machine learning deployment in a Kubernetes cluster. In particular, the component Kubeflow Pipelines is used to create a complete machine learning workflow.
Image: Demystifying Kubeflow pipelines (Ubuntu Blog)
A pipeline is a series of steps, each of which runs as an independent container. Each step is a piece of code, packaged as an image, that takes inputs, produces outputs, and performs a specific machine learning task. Together, the steps form the ML workflow.
Using Kubeflow Pipelines, it’s possible to deploy TensorFlow models in Kubernetes environments.
TensorFlow (TF) is a general-purpose, graph-based computation engine. TF offers an abstraction for building and training machine learning models. It also provides a full production ML pipeline to simplify model release, and it supports GPU hardware acceleration.
Now that we have a brief overview of Kubeflow and TensorFlow, we can move on to the attack scenario and the security issue.
The attack scenario
Microsoft observed a consistent increase in TensorFlow pods deployed on various Kubernetes clusters. Unlike in many similar attacks, the pods ran legitimate TensorFlow images from the official Docker Hub account.
The two different TensorFlow images used were:
- tensorflow/tensorflow:latest: Latest version of TensorFlow.
- tensorflow/tensorflow:latest-gpu: Latest version of TensorFlow with GPU support.
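To see where these two images run in a cluster, one option is to inventory pods by container image. The sketch below is a minimal illustration, assuming the input has the shape of `kubectl get pods --all-namespaces -o json` output; the function names and sample data are hypothetical. Because these are legitimate images, a match is only a starting point for review, not evidence of compromise.

```python
# Image references named in the report.
WATCHED_IMAGES = {
    "tensorflow/tensorflow:latest",
    "tensorflow/tensorflow:latest-gpu",
}


def normalize(image):
    """Drop an explicit Docker Hub prefix and default a missing tag to :latest."""
    for prefix in ("docker.io/", "index.docker.io/"):
        if image.startswith(prefix):
            image = image[len(prefix):]
    last_component = image.rsplit("/", 1)[-1]
    if ":" not in last_component and "@" not in last_component:
        image += ":latest"
    return image


def find_watched_pods(pod_list):
    """Return (namespace, pod name, image) for every container using a watched image."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        spec = pod.get("spec", {})
        containers = spec.get("containers", []) + spec.get("initContainers", [])
        for container in containers:
            image = container.get("image", "")
            if normalize(image) in WATCHED_IMAGES:
                hits.append((meta.get("namespace"), meta.get("name"), image))
    return hits
```

Comparing the result against the pipelines your team actually owns is what turns this list into a signal.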
In the attack scenario reported by Microsoft, the attackers were able to:
- Get access to the Kubeflow UI dashboard.
- Create a new deployment pipeline in the infrastructure.
- Use that pipeline to deploy the TensorFlow images and execute additional malicious code.
The malicious code downloaded and executed a Monero miner in the Kubernetes cluster and started mining.
The issue
The Kubeflow dashboard is exposed by the Istio ingress gateway and, by default, is accessible only from inside the cluster. Users can reach it by port-forwarding, which tunnels the traffic through the Kubernetes API server.
Users often prefer a more direct way to access the service. Setting the Istio Service type to LoadBalancer exposes the service on the Internet and lets users reach the UI directly. This convenience can open a serious security hole, since the dashboard is then reachable from the Internet without any restrictions.
Combining this configuration with weak credentials may lead to a compromise of the entire environment.
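A quick way to check for this misconfiguration is to list Services and flag any whose type is LoadBalancer. The following is a rough sketch, assuming input shaped like `kubectl get services --all-namespaces -o json`; the function name is hypothetical, and the `istio-ingressgateway` Service in the `istio-system` namespace is used in the example only as an assumption about a typical Kubeflow installation.

```python
def find_public_services(service_list, exposed_types=("LoadBalancer",)):
    """Return (namespace, name, type) for Services whose type exposes them externally."""
    findings = []
    for svc in service_list.get("items", []):
        meta = svc.get("metadata", {})
        # Kubernetes defaults an unset Service type to ClusterIP (internal only).
        svc_type = svc.get("spec", {}).get("type", "ClusterIP")
        if svc_type in exposed_types:
            findings.append((meta.get("namespace"), meta.get("name"), svc_type))
    return findings


# Hypothetical sample data for illustration.
services = {
    "items": [
        {
            "metadata": {"namespace": "istio-system", "name": "istio-ingressgateway"},
            "spec": {"type": "LoadBalancer"},
        },
        {
            "metadata": {"namespace": "kubeflow", "name": "ml-pipeline"},
            "spec": {},
        },
    ]
}

for namespace, name, svc_type in find_public_services(services):
    print(f"{namespace}/{name} is exposed as {svc_type}")
```

Passing `exposed_types=("LoadBalancer", "NodePort")` widens the check to NodePort Services, which are also reachable from outside the cluster.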
This is a fairly common pattern: attackers run their malicious code on legitimate images. If the company already uses TensorFlow images in the cluster for machine learning, the new pipeline might look entirely legitimate.
Because the adversaries in the reported attack used legitimate images, the malicious deployment can slip past security mechanisms that are in place. In addition, the TensorFlow images are optimized for machine learning tasks that require a huge amount of computational resources.
This is exactly what attackers are looking for. The chance to compromise this kind of infrastructure and image is absolutely priceless for adversaries, allowing the attacker to maximize the mining gains from the host.
Mitigating the attack
In this specific scenario, the attack was possible because the configuration was changed to expose the Kubeflow centralized dashboard publicly on the Internet, which is a serious security issue.
To mitigate this issue:
- All the management dashboards and consoles should be kept internal only and reachable using port forwarding or another mechanism.
- If the service must remain public, strong authentication should be in place, such as complex passwords or MFA.
Detecting the attack
Microsoft's report underscores a trend we have seen in recent years: cryptocurrency mining is still one of the most common attacks. Blue teams need efficient and powerful detection mechanisms to identify these kinds of compromises and malicious behaviors as soon as possible.
Luckily, the detection of this specific attack scenario can be done in different ways and with different tools. Since crypto miners follow very distinctive patterns, we can use their behavior to create a strong detection.
One way is using infrastructure monitoring tools, scanning the pod resources used, and alerting on high CPU or GPU usage.
The other option is the “security” way, by detecting suspicious connections or malicious binary executions inside the pod.
Let’s dive into both approaches.
Runtime detection
A runtime threat detection tool for containers and Kubernetes can be used to detect behaviour closely related to crypto-miners. Most of these tools will provide out-of-the-box rules for this scenario.
Make sure that you check for the following:
- DNS requests sent to resolve miner pool domains.
- Crypto miners using the Stratum protocol. You can detect related URLs by looking for processes containing stratum+tcp in their command-line arguments.
- Container drift, caused by chmod being used to create new executables.
- Outbound connections to C2 servers.
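The first two checks can be sketched in a few lines. This is only an illustration of the matching logic, not how any particular runtime security tool implements it: the event dictionaries, the `triage` function, and the pool domains (placeholders under the reserved `.example` TLD) are all hypothetical.

```python
# Hypothetical placeholder domains; a real list would come from threat intelligence.
SUSPICIOUS_POOL_DOMAINS = {"pool.example", "xmr-pool.example"}


def uses_stratum(cmdline):
    """True if a process command line contains a stratum+tcp URL scheme."""
    return "stratum+tcp" in cmdline.lower()


def is_pool_domain(qname, pool_domains=SUSPICIOUS_POOL_DOMAINS):
    """True if a DNS query name is a listed pool domain or one of its subdomains."""
    name = qname.rstrip(".").lower()
    return any(name == d or name.endswith("." + d) for d in pool_domains)


def triage(events):
    """Return (pod, reason) for each event that matches one of the checks."""
    findings = []
    for ev in events:
        if ev.get("type") == "exec" and uses_stratum(ev.get("cmdline", "")):
            findings.append((ev.get("pod"), "stratum URL in command line"))
        elif ev.get("type") == "dns" and is_pool_domain(ev.get("qname", "")):
            findings.append((ev.get("pod"), "DNS lookup of a listed pool domain"))
    return findings


# Hypothetical events for illustration.
events = [
    {"type": "exec", "pod": "pipeline-abc", "cmdline": "./miner -o stratum+tcp://xmr-pool.example:3333"},
    {"type": "dns", "pod": "pipeline-abc", "qname": "eu.xmr-pool.example."},
    {"type": "exec", "pod": "training-1", "cmdline": "python train.py --epochs 10"},
]
print(triage(events))
```

Matching on subdomains rather than exact names keeps regional pool endpoints from slipping through, at the cost of needing a well-maintained domain list.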
Resource monitoring
Since the adversaries' main goal is to run a crypto miner in your infrastructure and use your resources to mine cryptocurrency, you can use any monitoring tool to alert you when something is using a high percentage of CPU in your infrastructure.
To discover a significant increase in pods, we can compare the number of pods currently in each namespace with the number that existed one day ago at the same hour. This metric can be noisy, especially if Kubeflow is creating and destroying pods. To reduce the noise, we can average over the last hour.
We could query such data with PromQL:
100 * avg_over_time(sum by (namespace) (kube_pod_info)[1h:1h]) / avg_over_time(sum by (namespace) (kube_pod_info)[1h:1h] offset 1d)
The result of this query shows, for each namespace, the number of pods currently running (averaged over the last hour) as a percentage of the number running yesterday at the same hour. A value of 100 means no change.
Image of an increase of 140% of pods in two of the namespaces of the monitored cluster
You can set an alert with this PromQL query by adding a threshold, for example 125. With that threshold, you'll receive an alert when a namespace has over 25% more pods than yesterday.
100 * avg_over_time(sum by (namespace) (kube_pod_info)[1h:1h]) / avg_over_time(sum by (namespace) (kube_pod_info)[1h:1h] offset 1d) > 125
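The arithmetic behind the pod-count query and its threshold can be worked through in plain Python. This is just an illustration of the ratio, with hypothetical function names and numbers; it does not query Prometheus.

```python
def percent_of_baseline(current_avg, baseline_avg):
    """Current value as a percentage of the baseline; None if there is no baseline."""
    if baseline_avg <= 0:
        return None
    return 100.0 * current_avg / baseline_avg


def should_alert(current_avg, baseline_avg, threshold=125.0):
    """Mirror the '> 125' comparison used in the alerting query."""
    pct = percent_of_baseline(current_avg, baseline_avg)
    return pct is not None and pct > threshold


# Hypothetical numbers: 10 pods yesterday, 24 pods now.
print(percent_of_baseline(24, 10))  # 240.0, i.e. an increase of 140%
print(should_alert(24, 10))         # True
print(should_alert(12, 10))         # False: 120 is below the 125 threshold
```

One caveat worth keeping in mind: a namespace that did not exist yesterday has no series in the denominator, so the PromQL division returns no result for it and the alert will not fire. A separate check for new namespaces may be needed.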
We can also do something similar with the amount of CPU used by namespace. The query would be:
100 * avg_over_time(sum by (namespace)(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m]))[1h:1h]) / avg_over_time(sum by (namespace)(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m]))[1h:1h] offset 1d)
An increase of 200% and 300% of CPU in two of the namespaces of the monitored cluster
In a similar way, we can set alerts to detect that the CPU per namespace is 50% over the CPU used yesterday at the same hour of the day:
100 * avg_over_time(sum by (namespace)(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m]))[1h:1h]) / avg_over_time(sum by (namespace)(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m]))[1h:1h] offset 1d) > 150
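If you want to evaluate the CPU ratio from a script rather than an alerting rule, Prometheus exposes instant queries over its HTTP API at `/api/v1/query`. The sketch below builds the request URL and filters a response; it assumes a hypothetical server address (`http://prometheus.example:9090`), the helper names are made up for illustration, and the actual HTTP call is left out so the example runs without a server.

```python
from urllib.parse import urlencode

# The CPU ratio query from above, without the alert threshold.
CPU_RATIO_QUERY = (
    '100 * avg_over_time(sum by (namespace)(rate(container_cpu_usage_seconds_total'
    '{container!="POD",container!=""}[5m]))[1h:1h]) / '
    'avg_over_time(sum by (namespace)(rate(container_cpu_usage_seconds_total'
    '{container!="POD",container!=""}[5m]))[1h:1h] offset 1d)'
)


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def namespaces_over(response_json, threshold=150.0):
    """Return {namespace: value} for instant-vector results above the threshold."""
    flagged = {}
    for series in response_json["data"]["result"]:
        namespace = series["metric"].get("namespace", "")
        value = float(series["value"][1])  # values arrive as [timestamp, "string"]
        if value > threshold:
            flagged[namespace] = value
    return flagged


# Hypothetical response body, shaped like a Prometheus instant-vector result.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"namespace": "kubeflow"}, "value": [1626998400, "300"]},
            {"metric": {"namespace": "default"}, "value": [1626998400, "98.5"]},
        ],
    },
}

print(instant_query_url("http://prometheus.example:9090", CPU_RATIO_QUERY))
print(namespaces_over(sample_response))
```

Filtering on the client side keeps the unthresholded values available, which helps when tuning the 150 threshold for your own workloads.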
Conclusion
Despite the huge drop in value of cryptocurrency, your infrastructure is still free money for attackers. It’s even more so if your cluster is powerful enough to run machine learning workloads.
Security best practices, like keeping your management tools private or using strong authentication mechanisms, will help you mitigate such attacks.
Fortunately, these attacks are easy to detect. Runtime security tools can detect crypto mining processes and connections to mining pools, and your monitoring solution can help you detect a rise in resource usage.