Eureka

Horwitz, Eliahu; Cavia, Bar; Kahana, Jonathan; Hoshen, Yedid

Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction

Hangxuan Li^2,†, Renjun Jia^3,†, Xuezhang Wu¹, Yunjie Qian¹, Zeqi Zheng¹, Xianling Zhang⁴

¹Alibaba Cloud Computing Co. Ltd, Hangzhou, China,
²School of Computer Science, Fudan University, Shanghai, China,
³School of Computer Science and Technology, Tongji University, Shanghai, China,
⁴Independent Researcher, United States

Abstract

The rapid growth of foundation models (LLMs, VLMs) has surged enterprise cloud AI demand, where GPU resources must be dynamically allocated to meet evolving workloads. Accurate demand prediction is critical for efficient and reliable real-world deployment, yet traditional forecasting systems struggle due to sparse historical data and highly volatile workload behaviors. We present Eureka, an LLM-driven agentic framework that automates feature engineering. Our approach has three main components: a domain knowledge-driven Expert Agent that encodes cloud resource expertise to evaluate feature quality, an Automated Feature Generator that explores new feature spaces, lastly a RL (Reinforcement Learning) Feedback Loop connects the two components and enables continuous learning. Deployed on Alibaba Cloud and evaluated on its production data, Eureka improves demand fulfillment rate by 16%, and reduces computing resource migration rates by 33%. This work introduces a novel intelligent system for cloud resource prediction and AI supply chain management, advancing the efficiency, scalability, and deployability of foundation models in production environments.

About Eureka

Eureka is an agentic feature engineering framework that discovers high-quality, interpretable features—even under data scarcity—across LLM and VLM applications. It consists of three components:

Eureka Expert (Planning Stage) encodes domain knowledge as heuristic constraints and design templates to guide exploration of feature interactions and cross-resource dependencies. It steers toward domain-meaningful candidates, filters spurious correlations, and identifies high-order feature dependencies that reflect system dynamics.
LLM Feature Factory (Execution Stage) translates design plans into concrete features via conditioned code generation. Guided by domain heuristics, it explores the feature space and outputs executable implementations that maximize information signal from limited historical data.
Self-Evolving Alignment Engine (Refinement Stage) improves feature quality using dual-channel reward feedback from real-world deployment outcomes. By linking offline optimization with online performance, it adapts features to irregular workload patterns while preventing overfitting.

Directional Weight Score

Generate high-quality features with Eureka

Sizes of model trees

To show Eureka’s real-world value, consider the production case shown in the figure above: it successfully caught a GPU demand surge that traditional monitoring systems completely missed. Here’s how it worked. At first glance, two raw signals—gpu_anomaly_acceleration_squared and scaled_vcpu_gpu_ratio—looked weak and uninformative on their own, so a standard pipeline would’ve thrown them out. But Eureka’s domain expert module saw something deeper. It knows that large AI training jobs don’t just start out of nowhere—they’re usually preceded by a “prep phase,” where customers quietly spin up vCPUs, set up data pipelines, and provision storage before firing up GPUs. That means a temporarily high vCPU-to-GPU ratio isn’t a bug—it’s actually a sign of preparation!
Eureka reinterpreted this as a “preparatory imbalance”—a subtle but reliable early warning. Guided by this insight, its LLM-powered feature factory generated new candidates, and after validation by the self-evolving engine, it zeroed in on a smart new feature: scaled_vcpu_gpu_volatility_ratio. This feature essentially divides the prep signal (high vCPU/GPU ratio) by a near-zero GPU acceleration value (meaning GPUs are still idle), which amplifies the “calm before the storm.” In practice, this new metric spikes days before actual GPU demand peaks—while all conventional metrics stay flat—giving operators enough lead time to act. Thanks to this, Eureka didn’t just predict the burst; it explained why it was coming.

Experimental Results

To evaluate Eureka’s effectiveness, we test it on multiple public classification datasets as well as an internal enterprise dataset from Alibaba Cloud. The results are summarized below.

Performance Comparison Across Different Methods and Datasets

Method	Adult	Bank	Blood	Credit	Diabetes	Heart	Myocardial	EGS
DFS	0.915	0.897	0.637	0.707	0.882	0.921	0.654	0.680
AutoFe	0.872	0.865	0.735	0.676	0.837	0.903	0.674	0.666
OpenFe	0.920	0.923	0.626	0.701	0.817	0.897	0.697	0.658
TabPFN	0.900	0.904	0.715	0.787	0.888	0.938	0.676	0.694
CAAFE	0.901	0.905	0.713	0.797	0.886	0.941	0.686	0.692
FeatLLM	0.894	0.851	0.678	0.743	0.811	0.881	0.663	0.675
Eureka-Qwen	0.926	0.932	0.738	0.794	0.873	0.926	0.737	0.679
Eureka-32B	0.926	0.932	0.716	0.824	0.884	0.936	0.711	0.688
Eureka-gpt4o	0.928	0.934	0.745	0.838	0.881	0.938	0.699	0.699
Eureka-grok2	0.928	0.934	0.735	0.815	0.886	0.940	0.719	0.686

The results reveal several key insights. The proposed Eureka framework consistently achieves strong and superior performance across multiple datasets, particularly excelling on feature-rich data such as Myocardial. This advantage stems from Eureka’s ability to semantically interpret original features and generate domain-informed feature combinations, which is especially beneficial in complex, high-dimensional settings. Eureka-Qwen, Eureka-grok2, and Eureka-gpt4o utilize corresponding proprietary APIs, while Eureka-32B is a fine-tuned open-source variant based on Qwen3-32B. Among them, Eureka-gpt4o attains the highest accuracy on four datasets (Adult, Bank, Blood, and Credit-g) and remains competitive elsewhere. Eureka-Qwen leads on Myocardial (0.737), highlighting its strength in handling feature-rich data, while Eureka-grok2 performs best on Heart and shows strong results on Adult and Bank. Overall, the consistent superiority of Eureka demonstrates the framework’s effectiveness and generalization ability in automated feature engineering.

BibTeX

@inproceedings{
li2025eureka,
title={Eureka: Intelligent Feature Engineering for Enterprise {AI} Cloud Resource Demand Prediction},
author={Hangxuan Li and Renjun Jia and Xuezhang Wu and zeqi zheng and Yunjie Qian and Xianling Zhang},
booktitle={1st Workshop on VLM4RWD @ NeurIPS 2025},
year={2025},
url={https://openreview.net/forum?id=zeTzKZezXY}
}

More Works from Our Lab

We Should Chart an Atlas of All the World's Models

Recovering the Pre-Fine-Tuning Weights of Generative Models

Unsupervised Model Tree Heritage Recovery

Deep Linear Probe Generators for Weight Space Learning

Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights