The Global Lexicostatistical Database: Specifics

As of today, multiple websites, in one form or another, host various numbers of Swadesh and «quasi-Swadesh» wordlists; some of the most prominent examples include the Wiktionary Swadesh word list collection, the Rosetta Project, the Austronesian Basic Vocabulary Database, Isidore Dyen's Comparative Indo-European Database (now also revised and updated with extra data and features as The Indo-European Lexical Cognacy Database), and The Automated Similarity Judgement Program.


The GLD strives to take into account both the obvious advantages and the observed flaws of these resources, as well as the experience gained from more than two decades of computerized lexicostatistical studies by various members of the Moscow school of comparative linguistics. Its aim is a new, updated standard that will both increase the reliability and transparency of the data and allow researchers to try out new approaches and ideas concerning its manual and automatic analysis.


The principal features of the GLD, which together set it apart from most similar ventures, are as follows:


1. All of the data are compiled or, at least, thoroughly fact-checked by professional researchers with a solid background in general comparative-historical linguistics and, as a minimum requirement, a working knowledge of the material.

2. All of the data are accompanied by annotations that, as an absolute minimum, contain direct source references right down to the page number, so that any single entry may be easily verified by anyone with access to the respective sources.

3. All of the data are transliterated from the original sources into a single unified transcription system (UTS), based on the IPA with slight modifications (details may be found here). For some languages with established written traditions, the original orthography is included alongside the recoding. This makes it easy for users to compare data from languages they are unfamiliar with, and also facilitates various algorithms for automatic analysis.

4. All of the data, except for cases where the languages have not been studied in sufficient detail, are morphologically segmented, in order to facilitate manual and automatic analysis procedures and reduce the scope for errors of etymological judgement.

5. Specifically for the needs of the GLD, an updated and explicated list of Swadesh meanings has been introduced (details may be found here), facilitating a correct and uniform selection of the appropriate synonym for languages with sufficient data coverage.

6. All of the data are presented in at least three formats: (a) an on-line database, available for browsing or querying (including the possibility to search through multiple databases at once); (b) a print-ready, uneditable PDF version; (c) an editable Microsoft Excel table with all the data at the potential user's disposal (to view the Excel files correctly, it will be necessary to download and install Starling Serif, the default Unicode font for the GLD).

7. With the gradual addition of new data, the existing collections will be slowly integrated into a hierarchical structure capable of functioning as a unified basis for genetic classification. The GLD intends to go far beyond mere collection, recoding, and annotation of raw data, incorporating powerful historical tools for analysis of said data as well.
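
As a rough illustration of how features 2 to 4 could fit together in a single record, here is a minimal Python sketch. It is not the GLD's actual data model: the class names, field names, the hyphen as morpheme delimiter, and the sample values (an invented language, form, and source) are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SourceRef:
    """Hypothetical source reference, precise down to the page (feature 2)."""
    work: str
    page: int


@dataclass
class Entry:
    """Hypothetical wordlist entry; not the GLD's real schema."""
    language: str
    meaning: str                   # Swadesh meaning the form is listed under
    uts_form: str                  # unified transcription; "-" delimiter is assumed
    sources: List[SourceRef]
    original_form: Optional[str] = None  # original orthography, where one exists

    def morphemes(self) -> List[str]:
        """Split the transcribed form into its marked morphemes (feature 4)."""
        return [m for m in self.uts_form.split("-") if m]

    def is_verifiable(self) -> bool:
        """An entry can be checked only if it cites at least one source."""
        return len(self.sources) > 0


# Placeholder values only: no real language, form, or publication is implied.
entry = Entry(
    language="ExampleLang",
    meaning="water",
    uts_form="ak-wa",
    sources=[SourceRef(work="Placeholder 2000", page=42)],
)
print(entry.morphemes())
print(entry.is_verifiable())
```

Keeping the source reference as a structured field rather than free text is one way to make the page-level verification in feature 2 checkable by a program as well as by a reader.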




     © 2011-2016 George Starostin (site design, data input coordination)
    © 2011-2016 Phil Krylov (programming, technical support)