OthelloSAE

CS194-196 Course Project

1 Dependencies

Install neel-plotly first: clone https://github.com/neelnanda-io/neel-plotly.git, then run cd neel-plotly && pip install -e .

SAE-related files: Download the datasets and trained models from https://drive.google.com/drive/folders/1xMkEctaqAUjoPXGY-9dBu-pE3SJjKx2K and place them in / (the repository root folder).

Download the trimmed, trained linear probe weights from https://drive.google.com/drive/folders/1hYbOP4tzHeRmnxmu2rTO6v-qNaGMsO5Q?usp=sharing and place them in /probes.

Download our self-trained OthelloGPT weights from https://drive.google.com/drive/folders/1LBu8BivQX1fO2yEV1OZdpNwTEoTUJLlm?usp=sharing and place them in / (the repository root folder).

2 Training

Much of our code is adapted from this research report: https://www.lesswrong.com/posts/BduCMgmjJnCtc7jKc/research-report-sparse-autoencoders-find-only-9-180-board

2.1 Train OthelloGPT

We adapted the OthelloGPT training code from an existing code base. See the function full_scale_training in model_training.py. Change the num_layers value in the code to train either the original OthelloGPT (8 layers) or the larger OthelloGPT used in our experiments (12 layers).

2.2 Train Linear Probes

We adapted and modified the linear probe training code from an existing code base. See the function full_probe_run in model_training.py. We reduced the size of the saved linear probes by removing the OthelloGPT weights that the original implementation stored inside them.

3 Experiments

3.1 Linear probe related experiments

Every code block in experiment.ipynb is accompanied by an explanation of how to use it. See the notebook for details.

3.2 Cosine Similarity

Every code block in cosine_similarity.ipynb is accompanied by an explanation of how to use it. See the notebook for details.
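For readers unfamiliar with the metric, here is a minimal sketch of cosine similarity between direction vectors. It is not taken from the notebook: the function names, the toy vectors, and the idea of matching a set of feature directions against a single probe direction are assumptions made for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def best_match(directions, target):
    """Return (index, similarity) of the direction most aligned with target."""
    sims = [cosine_similarity(d, target) for d in directions]
    idx = max(range(len(sims)), key=lambda i: sims[i])
    return idx, sims[idx]

# Toy stand-ins (hypothetical): three feature directions and one probe direction.
feature_directions = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, -2.0]]
probe_direction = [3.0, 4.0, 0.0]
index, similarity = best_match(feature_directions, probe_direction)
print(index, round(similarity, 6))
```

Because the metric is normalized by both vector lengths, it compares directions only; scaling either vector leaves the result unchanged.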

3.3 SAE stability

You can find all functions related to SAE stability in stability.py.

Use the all_activations and all_boards produced by a previous run of save_activations_boards_and_legal_moves in analysis.py.

compute_stability_maps: Computes the stability maps used for stability classification from all_boards.

evaluate_all_stability_classification: Computes the AUROCs for a specific layer and seed from the activations and the stability maps.

compare_top_features_stability: Visualizes the top features for predicting stability frequency, given a specific layer, seed, and AUROC threshold.

Our usage: First, run save_activations_boards_and_legal_moves to get activations for your chosen layer, if you have not already done so. Pass all_boards to compute_stability_maps to get stability_maps. To evaluate a single seed, run evaluate_all_stability_classification with activations, stability_maps, layer, and seed. Finally, run compare_top_features_stability with a specific layer, seed, and AUROC threshold to get a visualization.
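To make the AUROC numbers easier to interpret, here is a minimal sketch of the metric when a single feature's activation is used as a score for a binary "is this square stable" label. It does not reproduce the code in stability.py, which may compute AUROC differently (for example with a library routine); the function name and the toy data are hypothetical.

```python
def auroc(scores, labels):
    """AUROC as the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: one feature's activation on six samples,
# with label 1 where the square is stable and 0 otherwise.
activations = [0.9, 0.1, 0.7, 0.0, 0.4, 0.4]
is_stable = [1, 0, 1, 0, 1, 0]
score = auroc(activations, is_stable)
print(round(score, 4))
```

A value of 0.5 means the feature is no better than chance at separating stable from unstable squares, and 1.0 means perfect separation. The pairwise loop is quadratic in the number of samples, so it is meant for illustration rather than for full activation sets.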
