AI Safety

2025


Exploratory Analysis of Multilingual SAE Features

·2245 words
Recent research from Anthropic suggests that Sparse Autoencoder (SAE) features can be multilingual, activating for the same concept across multiple languages. However, if multilingual features are scarce and of lower quality than monolingual ones, the robustness of SAEs could be undermined, leaving them vulnerable to failures and adversarial attacks in languages that are not well represented in the model. In this post, I present findings from an exploratory analysis that assesses the degree of multilingualism in SAE features.

2024


Condor Camp

·57 words

Condor Camp was an amazing AI safety event held in Mexico City. There, I learned about and discussed topics in AI governance and technical AI safety. I was also introduced to the philosophy of effective altruism. It was probably also the best experience I have had for career planning.