MSDS_Practicum_1

Doctor Who Logo Word Cloud

Introduction

This project analyzes Doctor Who fanfiction scraped from Archive of Our Own (AO3) and builds a topic model from the most popular published stories in the fandom. It was completed over the 8-week capstone term using Python and scikit-learn's LatentDirichletAllocation model.

Files Included

The following files are included as part of my submission:

Jupyter Notebook: This file contains all code used to clean, explore, and model the data. Available in ipynb and PDF formats.
CSV Files:
- TopicThemes.xlsx: The final themes assigned to the 30 topics generated by the model.
- topics_2021-06-05.csv: The 30 topics and their associated words.
- fic_lem_nounonly_tokens_2021-05-28.csv.zip: The lemmatized, noun-only tokens for the stories, zipped because the uncompressed file was too large.
Final_Presentation.pdf: Final presentation with insights and process. The file must be downloaded for the links to work.

Methods Used

The data was scraped from AO3 and cleaned until only lemmatized nouns remained. The model was tuned by varying the max_df, min_df, and n_components inputs and comparing the resulting perplexity scores. The final model uses 30 topics, with a min_df of 25 and a max_df of 70%. Additional information on tuning and the dataset is available in the presentation deck and presentation video.
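The setup described above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the function names `fit_topic_model` and `top_words`, the `seed` argument, and the assumption that each story arrives as one string of space-separated lemmatized nouns are all hypothetical. Only the parameter values (30 topics, min_df of 25, max_df of 70%) come from this README.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def fit_topic_model(docs, n_topics=30, min_df=25, max_df=0.70, seed=0):
    """Vectorize pre-cleaned stories and fit an LDA model (hypothetical helper).

    docs: list of strings, one per story, assumed to already contain
    only lemmatized nouns separated by spaces.
    """
    # min_df=25 (int) drops terms found in fewer than 25 stories;
    # max_df=0.70 (float) drops terms found in more than 70% of stories.
    vectorizer = CountVectorizer(min_df=min_df, max_df=max_df)
    counts = vectorizer.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    lda.fit(counts)
    return vectorizer, lda, counts


def top_words(vectorizer, lda, n_words=10):
    """Return the n_words highest-weighted terms for each topic."""
    # Rebuild the vocabulary in column order from the fitted vectorizer.
    vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return [
        [vocab[i] for i in weights.argsort()[::-1][:n_words]]
        for weights in lda.components_
    ]
```

To compare candidate settings, one option is to refit with different values of `n_topics`, `min_df`, and `max_df` and compare `lda.perplexity(counts)`, where lower values indicate a better fit. Scoring on held-out stories rather than the training matrix gives a less optimistic comparison.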

Stories in each Topic

Each story received a score for every topic and was assigned to the topic with its highest score. No story had Topic 3, 8, 9, 10, 15, or 25 as its dominant topic. This presents an opportunity to view the topics as a hierarchy. Topic 18 ended up being a general catch-all topic, representing 26% of all stories.
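The assignment rule described above amounts to taking the highest-scoring topic per story and counting stories per topic. The sketch below illustrates that idea under stated assumptions: `dominant_topics` and `topic_story_counts` are hypothetical helper names, and the input is assumed to be one row of per-topic scores per story, such as the rows returned by `LatentDirichletAllocation.transform`.

```python
from collections import Counter


def dominant_topics(doc_topic_scores):
    """Return the index of the highest-scoring topic for each story."""
    # Ties go to the lowest topic index.
    return [
        max(range(len(row)), key=row.__getitem__)
        for row in doc_topic_scores
    ]


def topic_story_counts(doc_topic_scores, n_topics):
    """Count how many stories have each topic as their dominant topic."""
    counts = Counter(dominant_topics(doc_topic_scores))
    # Topics that are never dominant get an explicit count of 0.
    return [counts.get(topic, 0) for topic in range(n_topics)]
```

A topic with a count of 0 is one that no story scored highest on, which is how topics can end up with no assigned stories even though every story has a score for them.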

Next Steps

Subtopic modeling to understand how closely each story resembles topics other than its dominant one.
Understanding the smaller topics. They seem to be crossovers with other fandoms, and the presence of fandom-specific words biased the creation of those topics.
More comprehensive cleaning, such as merging bigrams or trigrams.
Testing other topic modeling approaches.
Building a better dataset: full stories, no crossovers, etc.
- I initially used only first chapters because that let me pull in a greater variety of stories than a smaller dataset of full stories of varying lengths.

Conclusion

Creating interpretable topics was difficult.
Interpreting the topics required extensive domain knowledge of the chosen fandom.
- This means the approach is not as easily scalable to other fandoms as I initially hoped.
It was fun to create names for the themes!

References

Chen, Yanlin. (2018). How to generate an LDA topic model for text analysis. Retrieved from https://medium.com/@yanlinc/how-to-build-a-lda-topic-model-using-from-text-601cdcbfd3a6

Ganesan, K. (n.d.). 10+ examples for using CountVectorizer. Retrieved from https://kavita-ganesan.com/how-to-use-countvectorizer/#.YMJZGJNKgl4.

James. (n.d.). Topic modeling in python. Retrieved from https://ourcodingclub.github.io/tutorials/topic-modelling-python/#apply.

Li, Jingyi. (2016). AO3 scraper. Retrieved from https://github.com/radiolarian/AO3Scraper

scikit-learn. (n.d.). Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation. Retrieved from https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html.

sophros. (n.d.). LDA topic model gensim gives same set of topics. Retrieved from https://stackoverflow.com/questions/65817456/lda-topic-model-gensim-gives-same-set-of-topics

Marcel. (2017). python scikit learn, get documents per topic in LDA. Retrieved from https://stackoverflow.com/questions/45145368/python-scikit-learn-get-documents-per-topic-in-lda


Top 10 Topics

| Topic # | Theme | Story Count |
| --- | --- | --- |
| 18 | General Regeneration | 366 |
| 0 | 13th Doctor Regeneration | 176 |
| 14 | TenToo AU | 144 |
| 7 | Rose as Bad Wolf | 137 |
| 1 | River Song's Husband | 113 |
| 21 | Falling in Love | 77 |
| 13 | Heaven and Hell Collide | 71 |
| 16 | The Family Pond | 51 |
| 20 | Time Lords | 43 |
| 12 | Friends of the Doctor | 43 |
