AI Safety

2026


RSI @ ICLR 2026

Juan Belieni, Ana Carolina Erthal, Eliezer de Souza da Silva, Diego Mesquita

Machine unlearning enables the removal of specific knowledge from trained models without full retraining. While effective methods exist for single deletion requests, handling sequential requests in large language models (LLMs) remains underexplored. In this setting, we observe that gradient interference between successive unlearning steps degrades prior objectives. We propose ONPO (Orthogonal Negative Preference Optimization), which projects each step’s update onto the orthogonal complement of a low-dimensional subspace spanned by cached gradients from previous unlearning requests. This preserves prior unlearning objectives with minimal per-step overhead. On the TOFU benchmark, ONPO achieves a better trade-off between forgetting quality and model utility than existing methods.
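The core projection step described above can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions (dimensions, variable names, and the QR-based basis construction are mine, not taken from the paper): we cache gradients from previous unlearning requests, build an orthonormal basis for the subspace they span, and project the current step's gradient onto its orthogonal complement.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 8  # illustrative sizes: parameter dimension, number of cached gradients

# Cached (flattened) gradients from previous unlearning requests, stacked as rows.
G = rng.standard_normal((k, d))

# Orthonormal basis Q (d x k) for the low-dimensional subspace spanned by them.
Q, _ = np.linalg.qr(G.T)

def project_orthogonal(g, Q):
    """Remove the component of g that lies in the cached-gradient subspace."""
    return g - Q @ (Q.T @ g)

g_new = rng.standard_normal(d)         # current step's raw unlearning gradient
g_proj = project_orthogonal(g_new, Q)  # update actually applied to the model

# The projected update is (numerically) orthogonal to every cached gradient,
# so it cannot directly interfere with the previous unlearning objectives.
print(np.max(np.abs(G @ g_proj)))
```

The per-step overhead is just the two matrix-vector products with `Q`, which is cheap when the number of cached requests `k` is small relative to `d`.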

2025


Capstone project developed during the last week of the AI Security Bootcamp, in which I carried out a small-scale replication of the paper “Watermark Stealing Attacks on Large Language Models”, which demonstrates that statistical text watermarking schemes can be extracted and circumvented by low-budget adversaries.

Exploratory analysis of multilingual SAE features

Recent research from Anthropic suggests that Sparse Autoencoder (SAE) features can be multilingual, activating for the same concept across multiple languages. However, if multilingual features are scarce and lower-quality than monolingual ones, this could undermine the robustness of SAEs, leaving them vulnerable to failures and adversarial attacks in languages the model does not represent well. In this post, I present findings from an exploratory analysis conducted to assess the degree of multilingualism in SAE features.
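One simple way to operationalize "degree of multilingualism" is to run an SAE on paired prompts expressing the same concepts in two languages and measure how often a feature fires for both. The sketch below uses synthetic activations purely for illustration; the sizes, sparsity level, and the shared-activation metric are my own assumptions, not the methodology of the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, n_features = 100, 1024  # toy sizes, for illustration only

# Hypothetical SAE feature activations for the same concepts expressed in two
# languages (rows: concepts, cols: features). Real activations would come from
# encoding paired multilingual prompts through a trained SAE.
sparsity = 0.05
acts_en = rng.random((n_concepts, n_features)) * (rng.random((n_concepts, n_features)) < sparsity)
acts_pt = rng.random((n_concepts, n_features)) * (rng.random((n_concepts, n_features)) < sparsity)

active_en = acts_en > 0
active_pt = acts_pt > 0

# A (concept, feature) pair counts as multilingual if the feature fires for
# that concept in both languages; we report the shared fraction of all firings.
shared = active_en & active_pt
union = active_en | active_pt
multilingual_rate = shared.sum() / max(union.sum(), 1)
print(f"fraction of active (concept, feature) pairs shared across languages: {multilingual_rate:.3f}")
```

With independent random activations this rate is near zero; genuinely multilingual features would push it well above that baseline.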

2024