AI Safety

2026

Workshop on Securing AI Research in Latin America

Santiago, Chile·2026-05-12

The Workshop on Securing AI Research in Latin America was a 3-day workshop held in Santiago, Chile, from May 12 to 14, 2026, and organized by Deloitte in partnership with the U.S. Department of State.

Monitors are Fragile under Information Asymmetry

Apart Research AI Control Hackathon·2026-03-23

AI Safety AI Control

Ana Carolina Erthal, Ria Deane, Juan Belieni, Gustavo Ewbank Rodrigues Danon

Project developed during Apart Research’s AI Control Hackathon. We studied whether AI control monitors remain reliable when attackers exploit unfamiliar vulnerabilities. Using the ControlArena Bash setting, we augmented a Docker environment with synthetic vulnerabilities and evaluated monitor performance under fully informed, partially informed, and uninformed knowledge conditions. We found that partial knowledge provides little improvement over complete ignorance, while full information substantially reduces attack success. This suggests that monitors may rely on explicit prior knowledge of attack surfaces rather than robust general reasoning about suspicious behavior.

Orthogonal Gradient Projection for Continual LLM Unlearning

RSI @ ICLR 2026·2026-03-05

Publication Machine Learning AI Safety

Juan Belieni, Ana Carolina Erthal, Eliezer de Souza da Silva, Diego Mesquita

Machine unlearning enables the removal of specific knowledge from trained models without full retraining. While effective methods exist for single deletion requests, handling sequential requests in large language models (LLMs) remains underexplored. In this setting, we observe that gradient interference between successive unlearning steps degrades prior objectives. We propose ONPO (Orthogonal Negative Preference Optimization), which projects each step’s update onto the orthogonal complement of a low-dimensional subspace spanned by cached gradients from previous unlearning requests. This preserves prior unlearning objectives with minimal per-step overhead. On the TOFU benchmark, ONPO achieves a better trade-off between forgetting quality and model utility than existing methods.
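The projection step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the subspace is spanned directly by a small list of cached gradient vectors, and the helper names `orthonormal_basis` and `project_out` are hypothetical. How ONPO actually builds or compresses its low-dimensional subspace is not specified in this summary.

```python
from typing import List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors: Sequence[Sequence[float]], tol: float = 1e-10) -> List[Vector]:
    """Build an orthonormal basis for span(vectors) with modified Gram-Schmidt.

    Vectors that are (numerically) linearly dependent on earlier ones are skipped.
    """
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update: Sequence[float], cached_gradients: Sequence[Sequence[float]]) -> Vector:
    """Project `update` onto the orthogonal complement of span(cached_gradients).

    Assumption: the cached gradients themselves span the protected subspace.
    """
    result = list(update)
    for b in orthonormal_basis(cached_gradients):
        c = dot(result, b)
        result = [ri - c * bi for ri, bi in zip(result, b)]
    return result


# Toy example: two cached gradients from earlier (hypothetical) unlearning requests.
cached = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_update = [1.0, 2.0, 3.0]
projected = project_out(new_update, cached)
```

In this toy case the projected update has zero inner product with every cached gradient, so to first order it does not move the parameters along directions that earlier unlearning steps relied on.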

2025

Replication of "Watermark Stealing"

2025-08-24

Replication AI Safety Visualization

Capstone project developed during the last week of the AI Security Bootcamp: a small-scale replication of the paper “Watermark Stealing Attacks on Large Language Models”, which demonstrates that statistical text watermarking schemes can be extracted and circumvented by low-budget adversaries.

AI Security Bootcamp

London, United Kingdom·2025-08-04

AI AI Safety

AISB is a 4-week intensive program that brings researchers and engineers up to speed on security fundamentals for AI systems. During this program, I developed a replication of the watermark stealing attack.

ML4Good Colombia 2025

Bogotá, Colombia·2025-04-11

AI Safety AI Machine Learning

ML4Good is a bootcamp focused on AI Safety upskilling, including workshops on interpretability, alignment and governance of artificial intelligence. In this edition, I participated as a teaching assistant.

Exploratory analysis of multilingual SAE features

2025-02-01

AI Safety Visualization

Recent research from Anthropic suggests that Sparse Autoencoder (SAE) features can be multilingual, activating for the same concept across multiple languages. However, if multilingual features are scarce and not as good as monolingual ones, SAEs could have their robustness undermined, leaving them vulnerable to failures and adversarial attacks in languages not well-represented by the model. In this post, I present findings from an exploratory analysis conducted to assess the degree of multilingualism in SAE features.

2024

Mechanistic Interpretability Course

2024-08-23

AI Safety Mechanistic Interpretability

Four-session short course taught in Portuguese at FGV EMAp, introducing the field of Mechanistic Interpretability for Large Language Models (LLMs).

ML4Good Brazil 2024

São Paulo, Brazil·2024-07-08

AI Safety AI Machine Learning

ML4Good is a bootcamp focused on AI Safety upskilling, including workshops on interpretability, alignment and governance of artificial intelligence. In this edition, I developed a Mechanistic Interpretability Course.

Replication of "Towards Automated Circuit Discovery for Mechanistic Interpretability"

2024-06-21

Replication AI Safety Mechanistic Interpretability

Replication of the paper “Towards Automated Circuit Discovery for Mechanistic Interpretability” by Arthur Conmy et al., carried out by Juan Belieni and Ana Carolina Erthal as part of their upskilling in Mechanistic Interpretability and funded by Condor Initiative.