Linguistic data and language comparison in light of the ‘quantitative turn’ and ‘big data’ – a workshop and symposium

May 7 to 9, 2025, at the Department of Linguistics, University of Bern
Organizer and contact: Sandra Auderset (sandra.auderset@unibe.ch)

Timeline

Application submissions due: November 15, 2024 (midnight CET)
Notifications sent out by: December 20, 2024
Workshop: May 7 to 9, 2025

 

Issues and themes

In recent years, linguistics has undergone a ‘quantitative turn’, that is, the introduction and spread of quantitative methods and models. The uptake of such approaches has been greater in some subfields than in others. In phonetics, sociolinguistics, and psycholinguistics, the use of statistics is well established and generally seen as uncontroversial. This is not so for subfields focusing on language description and comparison (both synchronic and diachronic), especially with respect to the integration of understudied and/or endangered languages. Quantitative methods applied in linguistic typology and historical linguistics often need relatively ‘big’ data sets (by linguistic standards). Accordingly, we have also witnessed a rise in large-scale databases, including extensive reference catalogs such as Glottolog (Hammarström et al. 2024), comparative typological databases such as the World Atlas of Language Structures (Dryer & Haspelmath 2013) and more recently Grambank (Skirgård et al. 2023), and large cognate-coded word lists such as IE-CoR (Heggarty et al. 2023), among many others. These resources are often used to make broad, universal claims about the interplay of language and cognition (Hahn 2020), language and social structure (Lupyan & Dale 2010, Shcherbakova et al. 2023), language and genetics (Dediu 2011), and language and climate (Everett et al. 2015, Everett et al. 2016), among others. ‘Big data’ sets all involve standardization, multiple levels of abstraction, and a view of language as composed of separable, domain-specific building blocks (cf. Lehmann 2004, Heath 2016, and Good in Berez-Kroeker 2022). The alternative view – of language as interaction and an interconnected system – has led to lower-level (regional, family-specific, etc.) but more detailed and less abstract micro-typologies (cf. Konoshenko & Shavarina 2019, Hildebrandt et al. 2023, among many others). Such studies reveal that there is considerable internal diversity within language families and subgroups, which is key to understanding diachronic processes.

The question of how to model diachronic processes has also been at the center of recent developments in historical linguistics. Bayesian phylogenetic methods, adapted from evolutionary biology, have found wider adoption in the past decade (cf. Auderset et al. 2023, Wu et al. 2022, Kaiping & Klamer 2022, among many others) but remain controversial. The main points of skepticism concern whether biological models of evolution are applicable to languages at all (e.g. Campbell 2024: 23) and the issue of relying solely on ‘lexical’ data. At the same time, classifications based on expert opinions and qualitative methods are often accepted without much scrutiny, even if the data on which the analysis is based remain inaccessible to other scholars. Thus, there has been a move towards open datasets in historical linguistics, often with considerable efforts to make the analytical choices, for example in cognate annotation, transparent (cf. Auderset & Campbell 2024, Arora et al. 2023, among others).

Finer-grained, family-internal, and truly bottom-up approaches and methodologies are easier to connect with language documentation and description efforts, which have increased over the past decades. However, the question of how to integrate these data into comparative studies, both qualitative and quantitative, is not resolved. This is especially pertinent for spoken language data, which form the bulk of language documentation but so far play only a minor role in typology and diachronic linguistics (but see the contributions in Schnell et al. 2021 and Epps et al. 2022 for recent examples). Cross-linguistic spoken language corpora focusing on diverse and mostly understudied languages, such as Multi-CAST (Haig & Schnell 2015) and DoReCo (Seifart et al. 2022), aim at addressing this latter issue. Since they need to rely on common annotation schemas, they also contribute to a wider debate on cross-linguistic comparability and build explicit bridges between raw and primary/secondary data.

In general, discussions on the notion and role of data with respect to analysis and theory often revolve around how language-specific data can be related to cross-linguistic definitions and concepts (see e.g. Alfieri et al. 2021). Much less attention is paid to the ontological underpinnings of what constitutes (primary/secondary) data and how the preparation and annotation of these data influence qualitative and quantitative theories and models (cf. Weigel 2013 for an explicit discussion of contrived data). This workshop provides a forum for in-depth discussion and exchange on theoretical and methodological issues related to linguistic data and language comparison by exploring the relationship of data gathering, analysis, and annotation practices in linguistics in light of the ‘quantitative turn’ and the advent of ‘big data’. A particular focus lies on synchronic and diachronic comparison and the role of understudied/endangered languages.

Potential talk and discussion topics include but are not limited to:

  • types of linguistic data and their relationship to qualitative and quantitative analyses
  • transparency and reproducibility in the context of primary and secondary data
  • the role of understudied and endangered languages in methodological and theoretical advancements
  • biases in annotation and analysis of linguistic data and how they can be addressed or mitigated
  • the connection of quantitative/computational methods and language documentation, especially how they can mutually benefit each other
  • models and methodologies for bottom-up language comparison
  • database design principles and their effect on linguistic theorizing
  • ‘best practices’ for quantitative and statistical methods drawing on a diverse set of data
  • models of collaboration between researchers focusing on different aspects of data management and analysis (e.g. recordings and annotation, statistical modeling, questionnaire development)

Format and target audience

The workshop/symposium consists of short talks by the participants, invited keynotes, and discussion sessions. It is aimed primarily at early-career researchers in linguistics or adjacent fields. Preference will be given to scholars working on endangered and/or understudied languages and/or on methods and tools that advance research on such languages.

Submission guidelines

Interested researchers should send an abstract of their proposed talk and a brief motivation letter including their general research interests as they relate to the topic of the symposium. It is not necessary to anonymize the documents; name and affiliation should be included. Please note that all participants are expected to attend the full workshop in person. Format:

  • Abstract: max. 400 words excluding references but including examples
  • Motivation letter: max. 400 words
  • Submission: e-mail a single PDF file named lastname_dataws.pdf to data_ws_unibe@gmx.ch

Travel grants

A limited number of small travel grants are available. Applications for the travel grants will be open to accepted participants who are based abroad and cannot secure funding otherwise. Details will be sent out with the notifications of acceptance.

This workshop is supported by the Fund for the Promotion of Young Researchers and the Department of Linguistics at the University of Bern.