This repository contains code used in the paper "Efficient uniform sampling explains non-uniform memory of narrative stories" by Jianing Mu, Alison R. Preston and Alexander G. Huth.
## Raw Gorilla data to transcripts and segmentation; get metrics from recall coding
- Generate the participant_info spreadsheet by running `python parse_output_spreadsheets.py`. It contains the columns prolific_id, gorilla_id, story, and audio_task. See the script's args for details; the data directory needs to be changed for each new experiment.
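  A minimal sketch of loading the resulting spreadsheet with pandas (the filename here is an assumption; check the script's args for the actual output path):

  ```python
  import pandas as pd

  # participant_info maps each participant's Prolific/Gorilla IDs to the
  # story they heard and the audio task they completed (filename assumed).
  info = pd.read_csv("participant_info.csv")
  print(info[["prolific_id", "gorilla_id", "story", "audio_task"]].head())
  ```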
- For tasks with multiple stories, run `python group_audio_files.py` to copy the audio files into their respective story directories. Saves under `behavior_data/recall_audio/story`.
- Transcribe the recall audio using `transcribe_audio.py`.
- Extract segmentation, and calculate segmentation consensus and comprehension accuracy, with `python parse_behavioral_data.py`. Use the `--exclude` flag to exclude a pre-determined set of subjects. For pie man, `Behavioral_data.ipynb` computes the consensus segmentation, the subjects' segmentation file, and the comprehension stats after exclusion.
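  The actual consensus computation lives in `parse_behavioral_data.py` and `Behavioral_data.ipynb`; purely as a hedged illustration, one common approach bins each subject's boundary presses and keeps time bins where agreement crosses a threshold:

  ```python
  import numpy as np

  def consensus_boundaries(press_times, duration, bin_s=1.0, min_frac=0.5):
      """Illustrative consensus: bin each subject's boundary presses and keep
      bins marked by at least min_frac of subjects. All parameters assumed."""
      n_bins = int(np.ceil(duration / bin_s))
      counts = np.zeros(n_bins)
      for times in press_times:  # one array of press times (s) per subject
          bins = np.clip((np.asarray(times) // bin_s).astype(int), 0, n_bins - 1)
          counts[np.unique(bins)] += 1  # count each subject at most once per bin
      return np.where(counts / len(press_times) >= min_frac)[0] * bin_s
  ```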
- Run `parse_behavioral_data_combine.py` (even if the story only has one experiment).
- Exclusion criteria are documented in `Behavioral Results.ipynb` and in the Google spreadsheet "Behavioral data masterlist".
- Run `parse_behavioral_data_combine.py` again with the exclusion flag `--exclude`.
- Run `combine_recall_transcripts.py --story {story}` to collect all checked recall transcripts into a CSV file.
- To get the event number of each detail in the story coding files, use the first few cells in `recall coding metrics.ipynb`.
- Extract clean recall transcripts using `bash batch_check_transcripts.sh "story"`.
## Split the story into equal-duration chunks, with the same number of chunks as events (Fig. 2 uniform encoding, and supplemental boundary analysis)
- First run `run_split_story_by_even_duration.sh` locally to generate the unadjusted splits of the story. Output is under `behavior_data/story_split_timing`. Then manually adjust for phrase boundaries.
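  A hedged sketch of what the unadjusted split looks like (names and details assumed; the script is the reference): chunk edges are evenly spaced in time, then snapped to word onsets before the manual phrase-boundary pass:

  ```python
  import numpy as np

  def even_duration_edges(story_duration, n_events, word_onsets):
      """Evenly spaced time edges, snapped to the nearest word onset (assumed)."""
      edges = np.linspace(0.0, story_duration, n_events + 1)
      onsets = np.asarray(word_onsets)
      # snap each interior edge to the closest word onset
      inner = [onsets[np.abs(onsets - e).argmin()] for e in edges[1:-1]]
      return np.concatenate([[0.0], inner, [story_duration]])
  ```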
- Run `run_story_even_split_analysis.sh`, which packages the inference code and runs both the instruct and non-instruct models to get I(Xi;R) (`run_recall_explained_events`). It also calculates H(X) (`get_logits`) and I(Xi;Xj) (`run_pairwise_events`). The inference scripts called in this bash file use `--split_story_by_duration` to indicate the even-duration condition.
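  As a hedged illustration of the quantities these scripts estimate, the pointwise versions can be computed from LM log-probabilities of a chunk with and without conditioning text (the model and helper below are stand-ins, not the repo's code):

  ```python
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for Llama3-8b(-instruct)
  model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

  def chunk_logprob(chunk: str, context: str = "") -> float:
      """Total log p(chunk | context), summed over the chunk's tokens."""
      ids = tok(context + chunk, return_tensors="pt").input_ids
      n_ctx = tok(context, return_tensors="pt").input_ids.shape[1] if context else 0
      with torch.no_grad():
          logp = model(ids).logits.log_softmax(-1)
      # logits at position t predict token t+1; score only the chunk's tokens
      start = max(n_ctx - 1, 0)
      lp = logp[0, start:-1].gather(-1, ids[0, start + 1:, None])
      return lp.sum().item()

  # Pointwise estimates (signs/conventions may differ from the repo):
  # H(X_i)      ~ -chunk_logprob(x_i)
  # I(X_i; R)   ~  chunk_logprob(x_i, context=recall) - chunk_logprob(x_i)
  # I(X_i; X_j) ~  chunk_logprob(x_j, context=x_i)    - chunk_logprob(x_j)
  ```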
- Use `run_analyze_uniform_encoding.sh` to generate dataframes for plotting.
- Use `uniform encoding hypothesis combine stories-split story evenly by duration.ipynb` to generate the scatter plots.
- Use `uniform encoding hypothesis - split story evenly by duration - compare models.ipynb` to generate the bar plots of R^2.
- Use `Uniform encoding hypothesis - by subject prevalence-split story evenly.ipynb` to perform subject-level significance testing.
## Boundary analysis that splits the story into equal-duration or equal-token chunks (Fig. 3 and supplemental results)
- Split into equal-token chunks, with 1.5x the number of events:
  - Generate chunks with `split_story_by_tokens.py --story {story} --factor 1.5`. This outputs `'story_even_token_factor_%.1f.csv' % args.factor` in `behavior_data/story_split_timing`. Adjust for phrase boundaries manually, save the results as `'story_even_token_factor_%.1f_adjusted.csv' % args.factor`, and send them back to TACC.
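    A sketch of the equal-token split under assumed details: tokenize the transcript, then cut it into `round(factor * n_events)` chunks of near-equal token counts:

    ```python
    import numpy as np

    def even_token_chunks(tokens, n_events, factor=1.5):
        """Split a token list into round(factor * n_events) near-equal chunks."""
        n_chunks = int(round(factor * n_events))
        splits = np.array_split(np.arange(len(tokens)), n_chunks)
        return [[tokens[i] for i in idx] for idx in splits]
    ```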
  - Run:

    ```bash
    bash run_story_even_split_analysis.sh "Llama3-8b-instruct" ""pieman" "alternateithicatom" "odetostepfather" "legacy" "souls" "wheretheressmoke" "adventuresinsayingyes" "inamoment"" "true" "false" 1.5 "true"
    ```

    This calls `run_split_story_by_even_duration.sh` to align the adjusted chunks to the correct timing and tokens and to recalculate whether each chunk is a boundary or not, then runs the full LLM inference. Results are under `pairwise_event/{story}/'story_split_tokens_factor_%.1f_adjusted' % args.factor`.
  - Calculate CRUISE, surprisal-weighted sampling, and the controls using `uniform encoding hypothesis-split story evenly by tokens-split with factor.ipynb`.
  - Plot using `split story by tokens - cleaned for plotting.ipynb`.
- Split into equal-duration chunks, with 1.5x the number of events:
  - Generate chunks with:

    ```bash
    bash run_split_story_by_even_duration.sh "Llama3-8b-instruct" ""pieman" "alternateithicatom" "odetostepfather" "legacy" "souls" "wheretheressmoke" "adventuresinsayingyes" "inamoment"" "false" 1.5 "false"
    ```

    This outputs `'story_even_duration_factor_%.1f.csv' % args.factor` in `behavior_data/story_split_timing`. Adjust for phrase boundaries manually and save them as `'story_even_duration_factor_%.1f_adjusted.csv' % args.factor`.
  - Run:

    ```bash
    bash run_story_even_split_analysis.sh "Llama3-8b-instruct" ""pieman" "alternateithicatom" "odetostepfather" "legacy" "souls" "wheretheressmoke" "adventuresinsayingyes" "inamoment"" "true" "false" 1.5 "false"
    ```

    This calls `run_split_story_by_even_duration.sh` to align the adjusted chunks to the correct timing and tokens, then runs the full LLM inference. Results are under `pairwise_event/{story}/'story_split_timing_factor_%.1f_adjusted' % args.factor`.
  - Calculate CRUISE, surprisal-weighted sampling, and the controls using code similar to the equal-token split.
  - Plot using `split story evenly by duration - chunks with boundary vs. no boundary cleaned for plotting.ipynb`.
## Time courses of information properties around boundaries (Fig. 3j-l) and surprisal around boundaries vs. baseline (Supplement)
- `CE around event boundaries vs. random chunks.ipynb`
- `event_boundary_information.ipynb` generates count-balanced ablation stimuli in `ablation/{model_name}/sliding_window_ablation/moth_stories`.
- Send the stimuli to TACC for inference to obtain CE with `sliding_ablation_entropy.py`.
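  Purely as an illustration of the sliding-window ablation idea (all names here are invented; the real stimuli come from `event_boundary_information.ipynb`), each stimulus drops one window position from the story before CE is measured:

  ```python
  def make_sliding_ablation_stimuli(tokens, window=50, stride=10):
      """Hypothetical sketch: for each window position, drop that span."""
      stimuli = []
      for start in range(0, len(tokens) - window + 1, stride):
          stimuli.append({
              "ablate_start": start,  # where the dropped span begins
              "tokens": tokens[:start] + tokens[start + window:],
          })
      return stimuli
  ```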
- Analysis is in `event_boundary_information_cleaned.ipynb`.
## Model recall generation and rate-distortion by attention temperature (Fig. 5)
- On TACC, run:

  ```bash
  python generate_model_recall.py --story {story} --n 50 --temp 0.7 --att_to_story_start --prompt_number 1
  ```

  These are the parameters that all stories should have. Specify the desired attention temperature on line 131 of the script. This requires the transformer env with the custom Llama generation code, which implements the attention temperature manipulation (sketched below). Results are saved as CSV files in `generated/{model_name}/model_recall`. If you rerun `generate_model_recall.py` with different temps, it will concatenate the new generations onto existing ones that use the same parameters.
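  The manipulation itself lives in the custom Llama generation code; as a generic sketch (not that code), an attention temperature `tau` rescales the attention logits before the softmax, so `tau > 1` flattens where the model attends and `tau < 1` sharpens it:

  ```python
  import torch

  def attention_with_temperature(q, k, v, tau=1.0, mask=None):
      """Scaled dot-product attention with an extra temperature tau (sketch)."""
      scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
      if mask is not None:
          scores = scores.masked_fill(mask, float("-inf"))
      weights = torch.softmax(scores / tau, dim=-1)  # temperature applied here
      return weights @ v
  ```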
- Calculate how much the recall explains about the story: run everything on TACC using `model_recall_inference.sh`. Remember to change the stories you want to run inference on:

  ```bash
  python get_recall_tokens.py --story {story} --model_recall --temp 0.7 --att_to_story_start --prompt_number 1 --recall_only --recall_original_concat --original_recall_concat
  ```

  Tokens are saved in `{story}_temp0.70_prompt1_att_to_story_start_True`.
- Run inference and get logits:

  ```bash
  bash bash_files/model_recall_inference.sh
  ```

  This inference code appends the new inference results onto existing ones.
- Analyses are in `modify_llama_attention.ipynb`, which compares against attention entropy from human recalls. The rate-distortion analysis is in `rate distortion by attention scale-no annotations.ipynb`; this notebook saves dictionaries for plotting in `generated/llama3-8b-instruct/rate_distortion`. Rate-distortion plots for all stories are in `plot rate distortion_all stories together.ipynb`.
## Recall concatenation with original transcript (part of Fig. 5 rate distortion; gets rate and attentions)
- Packaged in `story_recall_inference.sh`.
- Generate stimuli and run the analysis in `verbatim recall simulation.ipynb`.
- Run `verbatim_recall_inference.sh`.
- Use `attention_try.ipynb` to generate repeating stimuli and run inference to measure the induction head score and the duplicate token head score. Results are saved in `generated/{model}/attention_head_test`. Dependency: TransformerLens 1.15.0 (`pip install transformer-lens==1.15.0`).
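  A hedged sketch of measuring an induction-head score on repeating stimuli with TransformerLens (the model and sequence length below are stand-ins, not the repo's settings):

  ```python
  import torch
  from transformer_lens import HookedTransformer

  model = HookedTransformer.from_pretrained("gpt2")  # stand-in model
  L = 50
  rand = torch.randint(1000, 10000, (1, L))
  tokens = torch.cat([rand, rand], dim=1)  # [A; A]: second half repeats the first

  _, cache = model.run_with_cache(tokens)
  for layer in range(model.cfg.n_layers):
      pattern = cache["pattern", layer]  # [batch, head, query, key]
      # Induction heads attend from each repeated token to the token *after*
      # its first occurrence, i.e. from position q back to q - (L - 1).
      diag = pattern.diagonal(offset=-(L - 1), dim1=-2, dim2=-1)
      print(f"layer {layer}: max induction score = {diag.mean(-1).max().item():.3f}")
  ```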