New Methods Aim to Make AI Planning More Realistic and Reliable
Two research papers published this week on arXiv address a common challenge in applied AI: making adaptive decision-making systems work reliably when real-world conditions are messier than standard assumptions allow.
Smarter Peer-Referral Recruitment for Public Health
A team led by Lingkai Kong, in work associated with Milind Tambe (a prominent researcher in AI for social good), has proposed a new planning framework for respondent-driven sampling (RDS), a method public health agencies use to study and intervene among populations affected by infectious diseases, such as intravenous drug users or unhoused individuals.
These "hidden populations" are difficult to reach through traditional surveys. RDS works by asking known participants to recruit peers from their own social networks, spreading outreach organically. However, allocating limited referral resources efficiently across multiple recruitment rounds is a complex planning problem.
Prior approaches simplified the problem by assuming new recruits were drawn randomly from a uniform population — an assumption the researchers argue misrepresents how real peer recruitment works. In practice, people tend to refer others who are similar to themselves, a phenomenon known as homophily.
The team's proposed method, Generative Frontier Planning (GFP), uses learned generative models to better anticipate who future recruits are likely to be, based on the characteristics of current participants. A key algorithmic innovation allows the planner to avoid computationally expensive Monte Carlo simulations by substituting a mathematically structured surrogate value function. The approach also benefits from a "diminishing returns" property that enables a computationally efficient greedy allocation strategy, one proven to achieve at least 63% of the theoretical optimum in each round. That figure matches the classic 1 - 1/e (about 0.632) guarantee for greedy maximisation of functions with diminishing returns.
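To make the greedy idea concrete, here is a minimal sketch of budgeted allocation under diminishing returns. It is not the paper's algorithm: the function names, the toy value function, the weights and the referral probability are all hypothetical, and it assumes each participant's value is separable and concave in the number of referral coupons they receive.

```python
import heapq


def greedy_allocate(marginal_gain, n_participants, budget):
    """Hand out `budget` referral coupons one at a time, each to the
    participant whose next coupon has the largest marginal gain."""
    alloc = [0] * n_participants
    # heapq is a min-heap, so gains are negated to pop the largest first.
    heap = [(-marginal_gain(i, 0), i) for i in range(n_participants)]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break  # nobody to allocate to
        neg_gain, i = heapq.heappop(heap)
        if -neg_gain <= 0:
            break  # no remaining coupon adds value
        alloc[i] += 1
        heapq.heappush(heap, (-marginal_gain(i, alloc[i]), i))
    return alloc


# Hypothetical toy surrogate: participant i holding k coupons is worth
# WEIGHTS[i] * (1 - (1 - P) ** k), which is concave in k.
WEIGHTS = [1.0, 0.6, 0.2]
P = 0.5


def toy_gain(i, k):
    """Extra value from giving participant i one more coupon when they hold k."""
    return WEIGHTS[i] * P * (1 - P) ** k


allocation = greedy_allocate(toy_gain, len(WEIGHTS), budget=4)
print(allocation)  # [2, 2, 0]
```

In this toy case the third participant receives nothing, because a second coupon for either of the first two is still worth more than a first coupon for the third. Diminishing returns is what makes this one-coupon-at-a-time rule safe to use.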
Tested on simulations calibrated to a real RDS dataset, GFP outperformed several baselines including reinforcement learning and random allocation approaches.
Noise-Resistant Reinforcement Learning for Product Recommendations
The second paper, from Kewei Xu and colleagues at an e-commerce research group, addresses a problem in applying reinforcement learning (RL) to generative recommendation systems, the AI models that suggest products to shoppers.
RL is appealing for recommendation because it can optimise for user outcomes beyond simple imitation of past behaviour. But it depends heavily on a trustworthy reward signal — typically a "ranker" model trained on historical user interaction data. The researchers found that such rankers, trained on biased historical logs, produce unreliable signals for a significant fraction of training examples.
Their analysis found that RL-based training is genuinely helpful only when two conditions hold simultaneously: the recommendation model is uncertain about the best answer, and the ranker can clearly distinguish correct from incorrect recommendations. When either condition fails, forcing RL-style updates can actively harm performance.
Their proposed framework, AdaGRPO, addresses this by treating RL updates as selective rather than universal. For each training example, the system checks whether both conditions are met; if not, it falls back to standard supervised learning. In large-scale A/B testing on a production e-commerce platform, AdaGRPO delivered statistically significant improvements in click-through rate and user dwell time compared to fixed-blend training approaches.
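The per-example check can be pictured with a small sketch. The article does not say how AdaGRPO measures model uncertainty or ranker reliability, so everything here is an assumption made for illustration: uncertainty is taken as the normalised entropy of the model's probabilities over candidate items, ranker reliability as the score margin between the logged positive item and the best-scoring negative, and both thresholds are arbitrary.

```python
import math


def normalized_entropy(probs):
    """Entropy of a distribution scaled to [0, 1]; 1 means maximally uncertain."""
    total = sum(probs)
    ps = [p / total for p in probs]
    if len(ps) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in ps if p > 0)
    return entropy / math.log(len(ps))


def choose_update(policy_probs, ranker_score_pos, ranker_scores_neg,
                  entropy_min=0.5, margin_min=0.2):
    """Return "rl" only when both gating conditions hold, else "sft".

    entropy_min and margin_min are hypothetical thresholds.
    """
    # Condition 1: the recommendation model is unsure which item is best.
    uncertain = normalized_entropy(policy_probs) >= entropy_min
    # Condition 2: the ranker clearly separates the positive from the negatives.
    margin = ranker_score_pos - max(ranker_scores_neg)
    discriminative = margin >= margin_min
    return "rl" if uncertain and discriminative else "sft"


# Uncertain model, clear ranker signal: use the RL update.
print(choose_update([0.25, 0.25, 0.25, 0.25], 0.9, [0.3, 0.4, 0.2]))  # rl
# Confident model: fall back to supervised learning.
print(choose_update([0.97, 0.01, 0.01, 0.01], 0.9, [0.3, 0.4, 0.2]))  # sft
# Ranker cannot separate positive from negative: fall back as well.
print(choose_update([0.25, 0.25, 0.25, 0.25], 0.5, [0.45, 0.4, 0.2]))  # sft
```

The point of the gate is that the fallback is the default. An RL update is applied only when there is both something left to learn and a reward signal worth trusting.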