Call for PhD: Simulating the Dead: Historical AI as Tools for the Humanities

Deadline: 20 May 2025 at 1pm.

Where to apply: nbaumard@gmail.com 

The PhD will be carried out within the framework of PR[AI]RIE-PSAI.

The following documents are required:

  • the candidate’s CV;
  • a one-page cover letter describing the candidate’s ambitions for the proposed subject and the fit between their profile and the subject description;
  • a copy of the candidate’s most recent diplomas.

There are two stages to the selection process.

1. Pre-selection. Supervisors (here Nicolas Baumard and Valentin Thouzeau) publish their subject and engage in an open, transparent and merit-based recruitment process (OTM-R).

2. Final selection. A committee, at least half of whose members are external, evaluates the applications against criteria of excellence and ranks the candidates. A main list and a supplementary list will be drawn up by 15 June 2025.

Non-discrimination, openness and transparency. All PR[AI]RIE-PSAI partners are committed to supporting and promoting equality, diversity and inclusion within their communities. We encourage applications from candidates of all backgrounds, who will be selected through an open and transparent recruitment process.

Scientific context

Humanities scholars have long sought to understand how past people thought, felt, and interpreted the world. Traditional methods — such as close reading, archival analysis, and philology — offer rich, interpretive insights, but they are often labor-intensive and limited in scope. Quantitative approaches like word frequency, topic modelling and word embedding analysis have expanded our methodological toolkit, but remain indirect proxies for psychological or cultural traits [1-4]. 

Recent advances in artificial intelligence (AI) have opened up novel avenues for understanding the human experience across time. Among the most intriguing frontiers is the development of Historical Large Language Models (HLLMs) — language models trained on corpora of historical texts [5]. These models offer the potential to simulate plausible psychological responses and cultural representations from individuals who lived in past societies, effectively creating populations of ‘virtual ancestors’. An HLLM trained on a specific corpus — say, 18th-century French political tracts, or Qing dynasty administrative documents — can respond to prompts with outputs that reflect the linguistic and conceptual patterns present in its training data. These simulated responses can be interrogated using psychological instruments or thematic surveys, generating data that, while artificial, may reveal the distribution of beliefs or values latent in a cultural moment. One could, for instance, estimate levels of authoritarianism, concern for purity, or belief in free will.

MonadGPT (https://huggingface.co/Pclanglais/MonadGPT) provides an example of what we have in mind. It is a fine-tuned version of the Mistral-Hermes 2 model, trained on a corpus of 11,000 early modern texts in English, French, and Latin, primarily sourced from Early English Books Online (EEBO) and Gallica. The model is designed to emulate the language and conceptual frameworks of the 17th century, offering insights into the discourse of that era.
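For illustration, such a model can be queried in a few lines of Python via the Hugging Face transformers library. The sketch below is indicative only; in practice, prompts should follow the chat template documented on the model card.

```python
# A minimal sketch: querying MonadGPT through the Hugging Face
# `transformers` text-generation pipeline. The prompt wording and
# sampling settings are illustrative, not prescriptive.
from transformers import pipeline

generator = pipeline("text-generation", model="Pclanglais/MonadGPT")

prompt = "What are the causes of the plague?"
output = generator(prompt, max_new_tokens=200, do_sample=True, temperature=0.9)
print(output[0]["generated_text"])
```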

Scientific goal

We envision a collaborative project — involving humanists, computer scientists, and social scientists — to build, evaluate, and apply HLLMs in the study of cultural history. This would involve curating high-quality historical corpora, developing evaluation protocols that integrate humanistic expertise with quantitative benchmarks, and deploying HLLMs in experimental designs drawn from behavioral science [6]. The goal is not to reconstruct individual psychology, but to simulate the probabilistic shape of a cultural worldview embedded in language — and to generate novel data that make those structures visible, measurable, and comparable across time.

Validation Strategy: Testing the Method Across Cultural Contexts

To assess the validity of HLLMs, we propose a three-tiered validation strategy, drawing on validated methods for studying LLMs’ ability to simulate the beliefs and values of specific human groups [7-12]:

1. Large-Corpus Cultures: For major linguistic and cultural areas (e.g., English, French, Mandarin, Arabic), we possess both large-scale textual corpora and extensive longitudinal data on human values and beliefs (e.g., World Values Survey, Pew Global Attitudes) [13]. This will allow us to benchmark the performance of HLLMs against independently collected empirical data on human attitudes, preferences, and moral intuitions (a schematic sketch of this benchmarking step follows the list below). For instance, we will assess whether HLLMs fine-tuned on texts from the 1970s and 1980s can reliably recover the cultural patterns and psychological dispositions of that era.

2. Small-Corpus Cultures: We will then extend validation to smaller or underrepresented cultures where textual data is limited but ethnographic or experimental psychology data exist. If LLMs trained on these sparse corpora generate responses aligned with known psychological patterns, it would support the generalizability of the method across data-poor settings such as historical periods.

3. Historical Psychological Reconstructions: Finally, we will test HLLMs against prior empirical reconstructions of historical psychology. Our team has published several large-scale studies estimating long-term changes in traits such as romantic love, fictiveness, and perceived trustworthiness, based on literary fiction, portraiture, and other sources. These results represent some of the best current approximations of past cultural preferences. We will assess whether HLLMs fine-tuned on historical corpora from corresponding regions and periods replicate these reconstructed patterns. For instance, do simulated Ming dynasty outputs reflect increased romantic themes? Do Early Modern European texts show rising fictiveness? These benchmarks enable validation even in the absence of direct historical survey data.
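To make the benchmarking step in (1) concrete, the following sketch compares model-simulated answers to survey items against real survey means. Here ask_model and human_means are hypothetical placeholders (a wrapper that queries an HLLM and parses a numeric answer, and the matching World Values Survey item means), and the items shown are illustrative.

```python
# Schematic benchmarking sketch. `ask_model(item)` is assumed to query
# an HLLM (e.g., one fine-tuned on 1970s-80s texts) and return a
# numeric answer on the item's scale; `human_means` holds the matching
# World Values Survey item means. Both are hypothetical placeholders.
from statistics import mean
from scipy.stats import pearsonr

ITEMS = [
    "Is divorce ever justifiable? (1 = never, 10 = always)",
    "How important is God in your life? (1 = not at all, 10 = very)",
    # ... further survey items
]

def simulated_mean(ask_model, item, n=200):
    """Average the answers of n 'virtual respondents' sampled from the model."""
    return mean(ask_model(item) for _ in range(n))

def benchmark(ask_model, human_means):
    """Correlate simulated item means with independently collected survey means."""
    simulated = [simulated_mean(ask_model, item) for item in ITEMS]
    return pearsonr(simulated, human_means)
```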

The Role of Transfer Learning: A Paradox of Generalization

One intriguing methodological question concerns the role of transfer learning in the performance of HLLMs. It might seem intuitive to assume that models trained exclusively on temporally and culturally specific corpora — such as MonadGPT, fine-tuned on early modern texts — would provide the most authentic simulations of historical discourse. However, there is an apparently paradoxical possibility worth exploring: models that have undergone general pretraining on vast, heterogeneous corpora spanning multiple languages, epochs, and genres may, in some contexts, outperform narrowly trained models in capturing the conceptual patterns of past discourse.

This advantage may stem from the fact that generalist LLMs are exposed to a broader range of linguistic structures, including syntax, argumentation styles, and abstract semantic associations. This deep and varied training allows them to internalize robust patterns that are also present, in different forms, in historical texts. Remarkably, even training on seemingly unrelated domains — such as computer code — has been shown to improve performance on core natural language tasks, including translation and reasoning. This suggests that exposure to highly structured linguistic systems enhances a model’s general capacity to parse and generate meaningful content. In this light, transfer learning — where a model pretrained on a large, heterogeneous corpus is fine-tuned on a historical subcorpus — may outperform models trained from scratch or solely on narrow historical datasets. A model’s ability to emulate 17th-century discourse, then, may not depend solely on exclusive exposure to 17th-century texts, but rather on having a richly patterned and adaptable linguistic foundation that is subsequently reoriented by fine-tuning.
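As a concrete illustration of this setup, the sketch below fine-tunes a generalist pretrained model on a historical subcorpus with the Hugging Face Trainer API. The base checkpoint, corpus file, and hyperparameters are illustrative placeholders rather than project decisions.

```python
# A minimal transfer-learning sketch: continue training a generalist
# causal language model on a historical subcorpus. All names below
# (checkpoint, corpus file, hyperparameters) are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"  # any generalist pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # needed for padding
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical corpus of period texts, one document per line.
corpus = load_dataset("text", data_files={"train": "historical_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hllm", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Parameter-efficient variants (e.g., LoRA) would make the same comparison between from-scratch and fine-tuned models feasible at a much lower computational cost.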

Generalist Models and the Validity of Extrapolation

If generalist LLMs — trained on vast contemporary corpora — are able to generate responses that plausibly reflect historical modes of reasoning, this challenges the assumption that historical authenticity requires exclusively historical data. On the contrary, it suggests that models trained on contemporary texts already contain part of what is needed to understand the past. The reason lies in the depth and diversity of the modern data: it captures not only present-day discourse, but also the enduring structures of human thought, recurrent moral intuitions, narrative patterns, and linguistic forms that stretch across time [6, 14].

Rather than seeing generalist LLMs as contaminated by modernity, we might see them as enriched by it. The ability of a contemporary LLM to simulate a Stoic philosopher or a 17th-century theologian is not necessarily a distortion — it may be evidence that these models have internalized deep patterns of human behavior that transcend historical boundaries. This perspective rests on an important assumption, supported by decades of research in psychology and behavioral science: that cultural variability is not random [1-4, 15-18]; it follows structured patterns shaped by universal psychological mechanisms — such as attachment, fairness, authority sensitivity, or intuitive ontology — whose expression varies predictably across time and space, in response to recurrent ecological changes [15-18]. This has important epistemological implications. It means that the extrapolations made by generalist models are not arbitrary, but constrained by the regularities of cultural expression and human psychology. In this view, transfer learning is not a source of contamination, but a source of power: it allows models to project into the past because they have absorbed the structured variability of the human mind across contexts.

Simulated Data as Cultural Traces

The generative capacity of HLLMs opens the possibility of treating their outputs as synthetic datasets — structured, probabilistically informed distributions of responses that approximate how a population might have answered key moral, political, or metaphysical questions in a given historical context. This allows for statistical analysis of cultural tendencies, experimental manipulation of prompts and framing, and cross-temporal comparisons that would be impossible using only surviving human informants.
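In practice, such synthetic data could be produced by posing the same question under systematically varied framings and tallying coded answers, as in the sketch below (ask_hllm and the yes/no coding rule are hypothetical placeholders):

```python
# Sketch: build response distributions from repeated sampling under
# different prompt framings. `ask_hllm` is an assumed wrapper that
# returns one free-text completion from a fine-tuned HLLM.
from collections import Counter

FRAMINGS = {
    "neutral":   "Is it acceptable to break an oath? Answer yes or no.",
    "religious": "Before God, is it acceptable to break an oath? Answer yes or no.",
}

def code_yes_no(text):
    """Crude illustrative coding rule for free-text answers."""
    return "yes" if "yes" in text.lower() else "no"

def response_distribution(ask_hllm, prompt, n=300):
    """Sample n completions and tally the coded answers."""
    return Counter(code_yes_no(ask_hllm(prompt)) for _ in range(n))

# With a real `ask_hllm` wrapper in place:
# for name, prompt in FRAMINGS.items():
#     print(name, response_distribution(ask_hllm, prompt))
```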

Limitations

Surviving historical texts tend to reflect elite viewpoints, and the literacy gap between historical and modern societies means HLLMs may over-represent educated perspectives [1-3]. Additionally, these models reflect textual culture — what was written, not necessarily what was thought. However, many tools in the humanities operate under similar constraints. With critical awareness, triangulation with other data (e.g., family records, economic behavior, art), and careful model construction, HLLMs can still yield powerful insights. 

References

[1] Baumard, N., Safra, L., Martins, M., & Chevallier, C. (2024). Cognitive fossils: Using cultural artifacts to reconstruct psychological changes throughout history. Trends in Cognitive Sciences, 28(2), 172-186.

[2] Atari, M., & Henrich, J. (2023). Historical psychology. Current Directions in Psychological Science, 32(2), 176-183.

[3] Baumard, N., Huillery, E., Hyafil, A., & Safra, L. (2022). The cultural evolution of love in literary history. Nature Human Behaviour, 6(4), 506-522.

[4] Martins, M. D. J. D., & Baumard, N. (2020). The rise of prosociality in fiction preceded democratic revolutions in Early Modern Europe. Proceedings of the National Academy of Sciences, 117(46), 28684-28691.

[5] Varnum, M. E., Baumard, N., Atari, M., & Gray, K. (2024). Large language models based on historical text could offer informative tools for behavioral science. Proceedings of the National Academy of Sciences, 121(42), e2407639121.

[6] Dillion, D., Tandon, N., Gu, Y., & Gray, K. (2023). Can AI language models replace human participants? Trends in Cognitive Sciences, 27(7), 597-600.

[7] Tao, Y., Viberg, O., Baker, R. S., & Kizilcec, R. F. (2024). Cultural bias and cultural alignment of large language models. PNAS Nexus, 3(9).

[8] Bisbee, J., Clinton, J. D., Dorff, C., Kenkel, B., & Larson, J. M. (2024). Synthetic replacements for human survey data? The perils of large language models. Political Analysis, 32(4), 401-416.

[9] Ramezani, A., & Xu, Y. (2023). Knowledge of cultural moral norms in large language models. arXiv preprint arXiv:2306.01857.

[10] Zhao, W., Mondal, D., Tandon, N., Dillion, D., Gray, K., & Gu, Y. (2024). WorldValuesBench: A large-scale benchmark dataset for multi-cultural value awareness of language models. arXiv preprint arXiv:2404.16308.

[11] Giuliani, N., Ma, C., Pradeep, P., & Ippolito, D. (2024). CAVA: A tool for cultural alignment visualization & analysis. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 153-161).

[12] Kharchenko, J., Roosta, T., Chadha, A., & Shah, C. (2024). How well do LLMs represent values across cultures? Empirical analysis of LLM responses based on Hofstede cultural dimensions. arXiv preprint arXiv:2406.14805.

[13] Inglehart, R. (2020). Modernization and Postmodernization: Cultural, Economic, and Political Change in 43 Societies. Princeton University Press.

[14] Buttrick, N. (2024). Studying large language models as compression algorithms for human culture. Trends in Cognitive Sciences, 28(3), 187-189.

[15] Varnum, M. E., & Grossmann, I. (2017). Cultural change: The how and the why. Perspectives on Psychological Science, 12(6), 956-972.

[16] Baumard, N. (2019). Psychological origins of the industrial revolution. Behavioral and Brain Sciences, 42, e189.

[17] Boon-Falleur, M., Baumard, N., & André, J. B. (2024). The effect of income and wealth on behavioral strategies, personality traits, and preferences. Perspectives on Psychological Science, 17456916231201512.

[18] Zhong, Y., Thouzeau, V., & Baumard, N. (2023). The evolution of romantic love in Chinese fiction in the very long run (618-2022): A quantitative approach. CEUR Workshop Proceedings (http://ceur-ws.org, ISSN 1613-0073).