AI Safety

2026


Apart Research AI Control Hackathon

Ana Carolina Erthal, Ria Deane, Juan Belieni, Gustavo Ewbank Rodrigues Danon

Project developed during Apart Research’s AI Control Hackathon. We studied whether AI control monitors remain reliable when attackers exploit unfamiliar vulnerabilities. Using the ControlArena Bash setting, we augmented a Docker environment with synthetic vulnerabilities and evaluated monitor performance under fully informed, partially informed, and uninformed knowledge conditions. We found that partial knowledge provides little improvement over complete ignorance, while full information substantially reduces attack success. This suggests that monitors may rely on explicit prior knowledge of attack surfaces rather than robust general reasoning about suspicious behavior.

RSI @ ICLR 2026

Juan Belieni, Ana Carolina Erthal, Eliezer de Souza da Silva, Diego Mesquita

Machine unlearning enables the removal of specific knowledge from trained models without full retraining. While effective methods exist for single deletion requests, handling sequential requests in large language models (LLMs) remains underexplored. In this setting, we observe that gradient interference between successive unlearning steps degrades prior objectives. We propose ONPO (Orthogonal Negative Preference Optimization), which projects each step’s update onto the orthogonal complement of a low-dimensional subspace spanned by cached gradients from previous unlearning requests. This preserves prior unlearning objectives with minimal per-step overhead. On the TOFU benchmark, ONPO achieves a better trade-off between forgetting quality and model utility than existing methods.
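The projection step mentioned above can be illustrated with a minimal sketch. This is not the ONPO implementation: the abstract does not say how the low-dimensional subspace is built or how the update is computed, so the sketch assumes plain Gram-Schmidt over the cached gradients and uses toy three-dimensional vectors. The helper names `orthonormal_basis` and `project_out` are hypothetical.

```python
# Illustrative sketch only: remove from a new update every component that
# lies in the span of previously cached gradients. Names and toy data are
# assumptions, not taken from the paper.

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update, cached_gradients):
    """Project `update` onto the orthogonal complement of the cached span."""
    result = list(update)
    for q in orthonormal_basis(cached_gradients):
        c = dot(result, q)
        result = [ri - c * qi for ri, qi in zip(result, q)]
    return result


# Toy example: two cached gradients from earlier (hypothetical) requests.
cached = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_update = [3.0, 4.0, 5.0]
safe_update = project_out(new_update, cached)
```

In this toy case the projected update has zero inner product with every cached gradient, so to first order a step along it leaves the earlier objectives unchanged. A real model would apply the same idea to flattened parameter gradients, likely with a matrix library rather than Python lists.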

2025


Capstone project developed during the final week of the AI Security Bootcamp: a small-scale replication of the paper “Watermark Stealing Attacks on Large Language Models”, which demonstrates that statistical text watermarking schemes can be extracted and circumvented by low-budget adversaries.

ML4Good Colombia 2025

Bogotá, Colombia

ML4Good is a bootcamp focused on AI Safety upskilling, including workshops on interpretability, alignment and governance of artificial intelligence. In this edition, I participated as a teaching assistant.

Exploratory analysis of multilingual SAE features

Recent research from Anthropic suggests that Sparse Autoencoder (SAE) features can be multilingual, activating for the same concept across multiple languages. However, if multilingual features are scarce and of lower quality than monolingual ones, the robustness of SAEs could be undermined, leaving them vulnerable to failures and adversarial attacks in languages that are not well represented by the model. In this post, I present findings from an exploratory analysis of how multilingual SAE features actually are.

2024