Researchers Expose Security Gaps in AI Code Sandboxes and Demonstrate LLM-Powered Scientific Code Migration

Two new studies highlight both the risks and the transformative potential of deploying AI in computational infrastructure


By Zotpaper

Published: 10 June 2026

Read time: 3 min

Sources: 2 outlets

A pair of studies published this week on arXiv examine how artificial intelligence intersects with foundational computing infrastructure — one auditing the security of the sandboxed environments used to run AI-generated code, the other demonstrating how large language models can automate the painstaking migration of decades-old scientific software into modern, high-performance frameworks.

AI Sandbox Security: A Mixed Picture

A new comparative security study by George Andronchik and Pavel Lokhmakov puts five commercial AI code sandbox products under the microscope, measuring them across six engine-level dimensions: host attack surface, information leakage, stackability, CVE history, patch cadence, and fuzzing.

The study, titled AI Code Sandboxes: A Comparative Security Study, draws a sharp distinction between the architectural classes of sandboxing technology (microVMs, userspace kernels, and OCI containers) and the specific products built on top of them. The three engine classes behave predictably and distinctly from one another on every measured axis, but products within the same class vary considerably, so the choice of underlying technology is not sufficient on its own to guarantee security.

One of the study's most pointed findings concerns patch policy. Vendors that participate in coordinated vulnerability disclosure programmes typically release fixes within days of a CVE being published, yet downstream customers can wait anywhere from zero days to more than 471 days for those fixes to reach the product they use, and in some cases cannot tell whether patches have been applied at all. The authors describe product pin policy, meaning which engine version a product locks to and when it moves to a newer one, as "the dominant operator-facing variable" in real-world security posture.
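The downstream delay described here can be read as the gap between the day an upstream fix is released and the day a product actually picks it up. The Python sketch below illustrates that arithmetic only; the function name `patch_lag_days` and all dates are hypothetical and do not come from the study.

```python
from datetime import date
from typing import Optional


def patch_lag_days(fix_released: date, product_updated: Optional[date], as_of: date) -> int:
    """Days a product ran without an upstream fix (hypothetical helper).

    If the product has not picked up the fix yet (product_updated is None),
    the lag is still growing, so it is measured up to `as_of`.
    """
    end = product_updated if product_updated is not None else as_of
    # A product that shipped the fix on or before release day has zero lag.
    return max((end - fix_released).days, 0)


# Hypothetical dates, chosen only to illustrate the calculation.
fix = date(2025, 1, 10)
same_day = patch_lag_days(fix, date(2025, 1, 10), as_of=date(2026, 6, 10))
still_pinned = patch_lag_days(fix, None, as_of=date(2026, 6, 10))
```

In this sketch a product that never updates has no fixed lag: the number simply keeps rising with the observation date, which is one way to picture why pin policy matters so much to operators.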

Perhaps most strikingly, the study finds that none of the products examined occupies the combination that is most robust in theory: a microVM engine paired with a continuous public fuzzing programme. Several products sit at a concerning intersection of zero published CVEs, no upstream fuzzer, and no independent academic study, a combination the authors describe as "structurally unmeasured" rather than reassuringly clean.
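The "structurally unmeasured" label amounts to a simple rule: an empty CVE record says little if nobody has been looking. A minimal Python sketch of that rule follows; the `Evidence` record, its field names, and the example values are illustrative assumptions, not the study's data model or findings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical record of the public scrutiny a sandbox engine has received."""
    published_cves: int
    has_upstream_fuzzer: bool
    has_independent_study: bool


def is_structurally_unmeasured(e: Evidence) -> bool:
    # Zero CVEs is only reassuring if some form of scrutiny exists.
    return (
        e.published_cves == 0
        and not e.has_upstream_fuzzer
        and not e.has_independent_study
    )


# Made-up examples: same CVE count, very different meaning.
quiet = Evidence(published_cves=0, has_upstream_fuzzer=False, has_independent_study=False)
scrutinised = Evidence(published_cves=0, has_upstream_fuzzer=True, has_independent_study=False)
```

Here `quiet` is flagged and `scrutinised` is not, even though both report zero CVEs.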

The researchers stop short of ranking the products overall, instead providing per-product security portraits and a threat-model qualification matrix. A companion code repository has been released under an Apache 2.0 licence.

LLMs Automate Legacy Scientific Code Migration

Separately, researchers at Columbia University have published results showing that a large language model pipeline can successfully translate a 19,000-line Fortran land surface model into JAX, Google's high-performance differentiable computing library — a feat that would typically require months of expert engineering effort.

The team, led by Aya Lahlou, Linnia Hawkins, and Pierre Gentine, developed a five-phase agentic pipeline that first analyses the dependency structure of the legacy codebase, then iteratively compiles and repairs translation errors, and finally validates numerical accuracy against the original Fortran code before accepting any module into the translated version.
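Two of the ideas in this description, translating modules in dependency order and accepting a module only after it matches the original numerically, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's pipeline: the module names, the tolerance, and the stand-in functions are all hypothetical, and the LLM translation and repair steps are left out entirely.

```python
import math
from graphlib import TopologicalSorter
from typing import Callable, Sequence


def translation_order(deps: dict[str, set[str]]) -> list[str]:
    """Order modules so each one comes after the modules it depends on."""
    return list(TopologicalSorter(deps).static_order())


def matches_reference(
    candidate: Callable[[float], float],
    reference: Callable[[float], float],
    inputs: Sequence[float],
    rel_tol: float = 1e-9,  # assumed tolerance, not taken from the paper
) -> bool:
    """Accept a translated routine only if it agrees with the original on every input."""
    return all(math.isclose(candidate(x), reference(x), rel_tol=rel_tol) for x in inputs)


# Hypothetical dependency graph: each module maps to the modules it needs.
deps = {
    "canopy_flux": {"leaf_physiology"},
    "leaf_physiology": {"constants"},
    "constants": set(),
}
order = translation_order(deps)


# Stand-ins for an original routine and two candidate translations.
def original(x: float) -> float:
    return 0.5 * x + 1.0


def good_translation(x: float) -> float:
    return 0.5 * x + 1.0


def bad_translation(x: float) -> float:
    return 0.5 * x + 1.001  # small numerical drift that the gate should reject


test_inputs = [0.0, 1.0, 10.0]
accepted = matches_reference(good_translation, original, test_inputs)
rejected = not matches_reference(bad_translation, original, test_inputs)
```

Ordering by dependencies means a module is only checked once everything it calls has already passed the same numerical gate.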

The payoff is substantial. The translated model — based on CLM-ml-v2, a widely used climate modelling component — can compute a full Jacobian in a single backward pass, recovers physical parameters in eight times fewer optimisation steps than gradient-free methods, and runs 24 times faster than the original Fortran when processing ensembles of 2,048 simulations.

The researchers are releasing both the translated model and the pipeline infrastructure as an open framework, with the explicit aim of enabling other Earth system modelling teams to modernise their own legacy codebases.

Together, the two papers reflect a maturing conversation in the AI research community about infrastructure: not just what AI can do, but how safely and reliably it can be deployed and integrated into existing systems.

Analysis

Why This Matters

Security implications for AI deployment: As organisations increasingly rely on sandboxed environments to safely execute AI-generated code, the audit's finding that patch delays can exceed 471 days suggests that many deployments may carry unacknowledged risk, particularly in enterprise and government settings.
Scientific computing transformation: The successful automated translation of a major climate model from Fortran to JAX signals that decades of legacy scientific code — in climate science, physics, and engineering — could be modernised at scale, potentially accelerating research across disciplines.
Infrastructure gap: Both papers, read together, highlight that the foundational layer beneath AI applications — sandboxes, legacy codebases, deployment pipelines — has received far less scrutiny than AI models themselves.

Background

AI code execution environments have grown rapidly in importance as tools like ChatGPT's Code Interpreter and competing platforms allow users to run arbitrary code generated by language models. These sandboxes must prevent malicious or erroneous code from escaping a contained environment and accessing the host system or sensitive data — a problem that draws on decades of operating systems and virtualisation research.

The security community has long debated the relative merits of different isolation technologies. MicroVMs, popularised by Amazon's Firecracker (used in AWS Lambda), give each workload its own guest kernel behind a hardware virtualisation boundary. Userspace kernels like gVisor intercept system calls before they reach the host kernel. OCI containers, the technology underpinning Docker and Kubernetes, share the host kernel, which makes them lighter but potentially riskier. Comparative empirical studies of how commercial AI products implement these technologies have been scarce.

On the scientific computing side, a vast quantity of foundational simulation code — particularly in climate science, computational fluid dynamics, and physics — was written in Fortran during the 1970s through 1990s and has never been fully modernised. Differentiable programming frameworks like JAX and PyTorch now offer capabilities, such as automatic differentiation and GPU acceleration, that could transform how these models are calibrated and used, but manual translation has historically been prohibitively expensive.

Key Perspectives

Security researchers (Andronchik & Lokhmakov): The authors argue that no single security metric is sufficient for evaluating AI sandboxes, and that the industry's current fuzzing and patching practices leave meaningful blind spots. They deliberately avoid producing an overall ranking, emphasising that threat-model fit matters more than abstract scores.

Scientific computing community (Lahlou, Hawkins & Gentine): The Columbia team positions LLM-driven migration not as a replacement for domain expertise, but as an automation layer that handles mechanical translation while human researchers validate numerical correctness. They frame open-sourcing the pipeline as a way to democratise access to differentiable modelling for under-resourced research groups.

Critics/Skeptics: Security critics may question whether the sandbox audit's six axes capture all relevant threat vectors, particularly side-channel attacks and supply-chain risks not covered by CVE databases. On the code migration side, sceptics will note that the pipeline was validated on a single model; generalisation to other codebases — particularly those with less regular structure or poorer documentation — remains to be demonstrated.

What to Watch

Patch cadence disclosures: Whether AI platform vendors respond to the audit by publishing more transparent patch timelines and pin policies for their sandbox infrastructure.
Part 2 of the sandbox study: The authors indicate a second paper is forthcoming; it is likely to examine workload-level and operator-level security properties not covered in Part 1.
Adoption of the JAX migration framework: Whether Earth system modelling teams — including those working on NOAA, ECMWF, or NASA climate models — begin using or contributing to the open-source pipeline as a path to GPU-accelerated differentiable modelling.

Sources

AI Code Sandboxes: A Comparative Security Study. Part 1 of 2 -- Engine-Level Properties (Attack Surface, Leakage, Stackability, CVE History, Patch Cadence, Fuzzing) — cs.AI updates on arXiv.org
Systematic LLM Translation of Legacy Scientific Code to Differentiable Frameworks: Application to a Land Surface Model — cs.AI updates on arXiv.org