AI Safety

2026


RSI @ ICLR 2026

Juan Belieni, Ana Carolina Erthal, Eliezer de Souza da Silva, Diego Mesquita

Machine unlearning enables the removal of specific knowledge from trained models without full retraining. While effective methods exist for single deletion requests, handling sequential requests in large language models (LLMs) remains underexplored. In this setting, we observe that gradient interference between successive unlearning steps degrades prior objectives. We propose ONPO (Orthogonal Negative Preference Optimization), which projects each step’s update onto the orthogonal complement of a low-dimensional subspace spanned by cached gradients from previous unlearning requests. This preserves prior unlearning objectives with minimal per-step overhead. On the TOFU benchmark, ONPO achieves a better trade-off between forgetting quality and model utility than existing methods.
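The core projection step described above can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions (dimensions, variable names, and the QR-based basis construction are mine, not taken from the paper): we cache gradients from previous unlearning requests, build an orthonormal basis for the subspace they span, and project the current step's gradient onto its orthogonal complement.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 8  # illustrative sizes: parameter dimension, number of cached gradients

# Cached (flattened) gradients from previous unlearning requests, stacked as rows.
G = rng.standard_normal((k, d))

# Orthonormal basis Q (d x k) for the low-dimensional subspace spanned by them.
Q, _ = np.linalg.qr(G.T)

def project_orthogonal(g, Q):
    """Remove the component of g that lies in the cached-gradient subspace."""
    return g - Q @ (Q.T @ g)

g_new = rng.standard_normal(d)         # current step's raw unlearning gradient
g_proj = project_orthogonal(g_new, Q)  # update actually applied to the model

# The projected update is (numerically) orthogonal to every cached gradient,
# so it cannot directly interfere with the previous unlearning objectives.
print(np.max(np.abs(G @ g_proj)))
```

The per-step overhead is just the two matrix-vector products with `Q`, which is cheap when the number of cached requests `k` is small relative to `d`.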

2025


Capstone project developed during the last week of the AI Security Bootcamp, in which I carried out a small-scale replication of the paper “Watermark Stealing Attacks on Large Language Models”, which demonstrates that statistical text watermarking schemes can be extracted and circumvented by low-budget adversaries.

Exploratory analysis of multilingual SAE features

Recent research from Anthropic suggests that Sparse Autoencoder (SAE) features can be multilingual, activating for the same concept across multiple languages. However, if multilingual features are scarce and lower-quality than monolingual ones, this could undermine the robustness of SAEs, leaving them vulnerable to failures and adversarial attacks in languages the model does not represent well. In this post, I present findings from an exploratory analysis conducted to assess the degree of multilingualism in SAE features.
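One simple way to operationalize "degree of multilingualism" is to run an SAE on paired prompts expressing the same concepts in two languages and measure how often a feature fires for both. The sketch below uses synthetic activations purely for illustration; the sizes, sparsity level, and the shared-activation metric are my own assumptions, not the methodology of the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, n_features = 100, 1024  # toy sizes, for illustration only

# Hypothetical SAE feature activations for the same concepts expressed in two
# languages (rows: concepts, cols: features). Real activations would come from
# encoding paired multilingual prompts through a trained SAE.
sparsity = 0.05
acts_en = rng.random((n_concepts, n_features)) * (rng.random((n_concepts, n_features)) < sparsity)
acts_pt = rng.random((n_concepts, n_features)) * (rng.random((n_concepts, n_features)) < sparsity)

active_en = acts_en > 0
active_pt = acts_pt > 0

# A (concept, feature) pair counts as multilingual if the feature fires for
# that concept in both languages; we report the shared fraction of all firings.
shared = active_en & active_pt
union = active_en | active_pt
multilingual_rate = shared.sum() / max(union.sum(), 1)
print(f"fraction of active (concept, feature) pairs shared across languages: {multilingual_rate:.3f}")
```

With independent random activations this rate is near zero; genuinely multilingual features would push it well above that baseline.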

2024