VitalSource Supplemental Data Repository

This repository contains supporting datasets and analysis code for several of our papers evaluating the use of artificial intelligence to enhance electronic textbooks at scale. These projects include automatic question generation (AQG) as well as other generative AI–based features such as text simplification. All datasets are drawn from real student interactions in the VitalSource Bookshelf ereader platform.

Our earliest research focused on AQG as a method for adding formative practice to textbooks. Millions of automatically generated questions have been added to thousands of textbooks in Bookshelf as part of a free study feature called CoachMe. CoachMe is based on the Doer Effect, the learning science principle that students who practice as they read have better learning outcomes than those who only read. Our efforts have since expanded beyond AQG to include other generative AI–based interventions that support student learning and engagement. All of our published research papers can be found on our research site.

The datasets available are:

| Directory | Paper |
| --- | --- |
| l@s-2021 | Toward effective courseware at scale: Investigating automatically generated questions as formative practice |
| aied-2021-itextbooks | Transforming textbooks into learning by doing environments: An evaluation of textbook-based automatic question generation |
| l@s-2022 | Discrimination of automatically generated questions used as formative practice |
| ijaied-2024 | Automatic question generation for Spanish textbooks: Evaluating Spanish questions generated with the parallel construction method |
| ife-2024 | An expert evaluation of formative practice generated for Spanish textbooks using Artificial Intelligence |
| aied-2024-evallac | Exploring large language models for evaluating automatically generated questions |
| edm-2024 | Investigating student ratings with features of automatically generated questions: A large-scale analysis using data from natural learning contexts |
| jedm-2025 | Intrinsic and contextual factors impacting student ratings of automatically generated questions: A large-scale data analysis |
| edm-2025-causaledm | Improving automatically generated fill-in-the-blank answer selection with an LLM-based agreement filter (NEW) |
| l@s-2025 | Refining sentence selection for automatic cloze question generation with large language models (NEW) |
| aied-2025-evallac | Open-ended questions need personalized feedback: Analyzing LLM-enabled features with student data (NEW) |
| aied-2025-itextbooks | Improving textbook accessibility through AI simplification: Readability improvements and meaning preservation (NEW) |

Unless otherwise noted, our datasets are available under the Creative Commons Attribution 4.0 International License.

Contact Us

If you have questions, please feel free to email [email protected].
