This repository provides code and scripts to reproduce experiments from the paper:
“Using Shapley Interactions to Understand How Models Use Structure”
Divyansh Singhvi, Diganta Misra, Andrej Erkelens, Raghav Jain, Isabel Papadimitriou, Naomi Saphra
arXiv:2403.13106v2
Modern language and speech models learn rich hidden structures about syntax, semantics, and phonetics.
This repository shows how to use the Shapley-Taylor Interaction Index (STII) to quantify pairwise interactions:
- Text models: How do pairs of tokens interact beyond their individual effects?
- Speech models: How do acoustic frames interact near phoneme boundaries?
By doing so, you can test:
- How well models encode syntactic tree structures
- How they handle multiword expressions
- How speech models reflect phonetic coarticulation
✅ STII for Text (ExperimentRunner):
- Load tagged sentences with multiword expressions (MWEs) and syntactic trees
- For token pairs, compute logits for 4 contexts: AB, A, B, φ (none)
- Interaction = (AB - A - B + φ), normalized by the norm of φ
- Analyze how interaction varies with:
- Linear distance
- Syntactic distance
- Whether tokens belong to a strong or weak MWE
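The interaction formula in the list above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code in ExperimentRunner: the function name `pairwise_interaction`, the choice of the L2 norm, and reducing the residual to a single scalar are all assumptions. It also assumes AB means both tokens are kept, A and B mean only that token is kept, and φ means both are ablated.

```python
import numpy as np


def pairwise_interaction(logits_ab, logits_a, logits_b, logits_empty):
    """Return ||AB - A - B + phi|| / ||phi|| for one token pair.

    Hypothetical helper: the L2 norm and the scalar reduction are
    assumptions made for illustration, not taken from the repository.
    """
    ab, a, b, empty = (
        np.asarray(x, dtype=float)
        for x in (logits_ab, logits_a, logits_b, logits_empty)
    )
    # Whatever is left after subtracting the individual effects of A and B.
    residual = ab - a - b + empty
    denom = np.linalg.norm(empty)
    if denom == 0.0:
        raise ValueError("phi logits have zero norm; cannot normalize")
    return float(np.linalg.norm(residual) / denom)


# Toy logits over a 3-word vocabulary (made-up numbers, for illustration only).
# Purely additive effects: the residual vanishes.
additive = pairwise_interaction(
    [3.0, 1.0, 0.0], [2.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 0.0, 0.0]
)
# The pair together shifts the first logit more than the two tokens do alone.
interacting = pairwise_interaction(
    [4.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 0.0, 0.0]
)
```

A value near zero means the two tokens contribute roughly independently; larger values mean the pair's joint effect is not explained by the individual effects.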
✅ STII for Speech (SpeechSTIIExperimentRunner):
- Load audio and phoneme time alignments
- Mask 20 ms waveform slices to simulate ablations
- Compare interaction:
- Consonant-vowel vs consonant-consonant
- By manner of articulation (how vowel-like a consonant is)
- The methodology is the same for both speech and text
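The 20 ms masking step can be sketched as follows. This is a minimal illustration, not the code in SpeechSTIIExperimentRunner: the helper names `mask_slice` and `four_contexts`, masking by zeroing samples, and the 16 kHz sample rate in the example are assumptions. It builds the same four contexts (AB, A, B, φ) used for text, with two waveform slices in place of two tokens.

```python
import numpy as np


def mask_slice(waveform, sample_rate, start_s, duration_s=0.02):
    """Return a copy of the waveform with one slice set to zero.

    Hypothetical helper: zeroing is one possible way to ablate a slice
    and is assumed here for illustration.
    """
    masked = np.array(waveform, dtype=float, copy=True)
    start = int(round(start_s * sample_rate))
    stop = start + int(round(duration_s * sample_rate))
    masked[max(start, 0):max(stop, 0)] = 0.0
    return masked


def four_contexts(waveform, sample_rate, start_a, start_b):
    """Build the AB, A, B and empty (phi) inputs for slices A and B."""
    only_a = mask_slice(waveform, sample_rate, start_b)  # B ablated
    only_b = mask_slice(waveform, sample_rate, start_a)  # A ablated
    return {
        "AB": np.array(waveform, dtype=float, copy=True),  # nothing ablated
        "A": only_a,
        "B": only_b,
        "empty": mask_slice(only_a, sample_rate, start_a),  # both ablated
    }


# One second of dummy audio at an assumed 16 kHz sample rate.
sample_rate = 16000
wave = np.ones(sample_rate)
contexts = four_contexts(wave, sample_rate, start_a=0.10, start_b=0.12)
```

Each of the four waveforms would then be passed through the speech model, and the resulting outputs combined as AB - A - B + φ in the same way as for token pairs.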
Create the environment using conda.yaml:
conda env create -f conda.yaml -n shapley_llm
conda activate shapley_llm
- Generate Language Data
cd language/mwe_tagger
./run_sr_pipeline.sh <model_name> <model_name>
Supported values for <model_name>: gpt2, bert.
The MWE tagger outputs (bert_bert.pkl_* or gpt_gpt.pkl_*) are placed in language/mwe_tagger/.
Then run:
python language/language_runner.py
Add audio files to speech_data/mfa_inp_new/ and phoneme CSVs to speech_data/extracted_phonemes/. Results are written to speech_data/stii_outputs_fix/.
To generate the audio files and CSVs, use the scripts in speech/data_utils/. Then run:
python speech/speech_runner.py