The lexicon of a natural language can be represented by networks of semantic units, associated according to their lexical relations to other entities. Free word association tasks shed light on how the meanings of these entities are stored and processed in the brain. For example, on reading or hearing a noun, one might immediately think of another noun related to it by synonymy, antonymy, or hyponymy, an adjective that derives from it or from a common root, or a verb for which the noun is a canonical agent. The goal of this project is to explore the relations that hold between words in the lexica of English speakers, as evidenced by a free word association task. Specifically, I would like to know:
My analysis of the data with respect to these questions will help me draw conclusions about how lexical items are stored by English speakers. The final result will be viewable here.
My data on participants and their responses comes from Small World of Words, a project dedicated to building models of the lexica of several of the world's languages. A participant is shown a series of lexical items (single words or multi-word units such as traffic light), and for each they must enter up to three words that come to mind right away. These associations are used to create a network that represents how those words are stored in native and/or fluent speakers' brains. Networks like the ones generated by Small World of Words help demonstrate how we intuitively understand entities in terms of other entities. Looking over the explore page, it is clear that the strongest associations with a word are not necessarily relations like synonymy (chair - seat); they may instead evoke descriptions or images (yellow - dandelion) or events (chocolate - melt). Because Small World of Words records multiple responses to cues along with participant demographics, it is a fitting data source for the questions I aim to address.
On their research page, Small World of Words has several datasets compiled from their results for English and Dutch. The dataset I have loaded here contains English-speaking participant data collected between 2011 and 2018, including the date they participated, their demographics, and their responses to the cue words. Each row/observation in the DataFrame corresponds to a person's response(s) to a specific cue, and 100 sets of responses are recorded for each cue word. Education level is coded by highest level of education: 1 = None, 2 = Elementary school, 3 = High School, 4 = College or University Bachelor, 5 = College or University Master.
I use WordNet to identify parts of speech and the lexical relations between the cues and responses. WordNet is a lexical database that organizes English nouns, verbs, adjectives, and adverbs into a network according to their lexical relations.
A similar application of WordNet to the results of free word association tasks can be seen in Gravino et al. (2012). Among their findings is that the most common relations are synonymy, hypernymy, and hyponymy. The analysis here will determine whether those findings are reflected in the Small World of Words data and whether there are differences along the various demographic parameters.
The data will be analyzed and displayed using the Pandas and Seaborn libraries.
Below I import the needed packages and read the file containing the participant data (demographics and responses to cues) into a DataFrame. The participant data is available for download on the research page linked above. I display the top of the original DataFrame.
# import Pandas
import pandas as pd
# import WordNet
!pip install nltk
import nltk
nltk.download('wordnet')  # download the WordNet data used below
from nltk.corpus import wordnet as wn
# import Regex
import re
# import NumPy
import numpy as np
# import seaborn
!pip install seaborn
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
# import itertools
import itertools
# Small world of words- participant data
part_df = pd.read_csv("English.csv")
part_df.head()
# an observation corresponds to one person's responses to a specific cue
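As a quick sanity check of the claim that 100 response sets are recorded per cue (a small sketch; a handful of cues may deviate slightly):
# number of response rows per cue; expected to cluster at 100
part_df.groupby("cue").size().value_counts().head()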
Now I will modify the DataFrame, first dropping the columns I won't need for my analysis. Because I am interested in comparing native speakers' responses to non-native speakers', I have to change the way participants' native languages are encoded. Participants in Small World of Words indicate whether they are native speakers; if not, they are prompted to select their native language from a list of common languages or "Other". Native English speakers are instead prompted to select their "native language" from a list of countries where English is spoken (not to be confused with the "country" column, which records where they were when they participated). Because of this inconsistency, I add a column that indicates whether each participant is a native speaker.
I also update the education column so the levels are represented as abbreviations rather than numbers.
Both native and non-native speakers had the "Other" option for native language. These are distinguished by the values "Other_English" and "Other_Foreign".
# Drop unneeded columns
part_df.drop(columns = ["Unnamed: 0", "id", "created_at"], inplace = True)
# SWOW's list of nations/regions of native English speakers
eng = ["Canada", "Puerto Rico", "United States", "Australia", "United Kingdom", "Ireland", "New Zealand",
"Papua New Guinea", "Jamaica", "Trinidad and Tobago", "Hong Kong", "India", "Pakistan", "Singapore",
"Philippines", "Cameroon", "Ghana", "Kenya", "Malawi", "Mauritius", "Nigeria", "Rwanda", "South Africa",
"Sudan", "Uganda", "Tanzania", "Zimbabwe", "Other_English"]
part_df["native_speaker"] = False
part_df.loc[part_df["nativeLanguage"].isin(eng), "native_speaker"] = True
# education levels as abbreviations
part_df["education"] = part_df["education"].map({
5.: "C/U Master",
4.: "C/U Bachelor",
3.: "HS",
2.: "ES",
1.: "None"
})
# order the education levels so they sort and plot correctly
part_df["education"] = pd.Categorical(part_df["education"], ["None", "ES", "HS", "C/U Bachelor", "C/U Master"], ordered = True)
# replace missing cues with empty strings
part_df["cue"] = part_df["cue"].replace(np.nan, '')
part_df.head()
Before exploring lexical relations, I want to know about the demographics represented in my data. To get an idea of the distribution of native vs. non-native speakers in the dataset, I plot a bar graph below and display what percentage are native speakers (about 86%).
The other bar graphs show the home countries/regions of the native English speakers, and the native languages and countries of the non-native speakers. It should be noted that the y-axis on these graphs corresponds to the number of responses to cues and not to a count of individuals. The size of the sample of speakers in the DataFrame is 83,864.
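Before the plots, a one-line check of the native-speaker share mentioned above (a quick sketch; it should come out around 0.86):
# proportion of response rows from native vs. non-native speakers
part_df["native_speaker"].value_counts(normalize = True)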
# Language stats
fig, ax = plt.subplots(4, 1, figsize = (7, 15))
plt.rcParams.update(plt.rcParamsDefault)
fig.suptitle("Summarizing Small World of Words native language data")
# Percent native speakers
part_df["native_speaker"].value_counts().plot.bar(title = "Distribution of native English speaker status", ax = ax[0])
# Native speakers' countries
ns_countries = ["United States", "United Kingdom", "Canada", "Australia", "New Zealand", "Ireland", "South Africa", "Singapore"]
part_df[(part_df["native_speaker"] == True) &
(part_df["country"].isin(ns_countries))]["country"].value_counts().plot.bar(
ax = ax[1], title ="Most common regions of native speakers")
# Non-native speakers' native languages
part_df[(part_df["native_speaker"] == False)]["nativeLanguage"].value_counts().plot.bar(ax = ax[2],
title = "Other native languages")
# and countries
# 20 most commonly occurring countries of non-native speakers
nns_countries = ["United States", "Germany", "United Kingdom", "Belgium", "Netherlands", "Finland", "Canada", "Spain", "Sweden",
"India", "Denmark", "France", "Italy", "Norway", "Australia", "Poland", "Romania", "Hungary", "Brazil",
"Switzerland"]
part_df[(part_df["native_speaker"] == False) & (part_df["country"].isin(ns_countries))]["country"].value_counts().plot.bar(ax = ax[3],
title = "Most common regions of non-native speakers")
plt.tight_layout()
Since most participants are native speakers, my analysis will be more generalizable to native speakers' lexical relations. Most native speakers in the sample are American, followed by speakers from the UK, Canada, and Australia.
A notable majority of non-native speakers indicated that their native language was one other than those listed as options. Most other responses in this group are from native speakers of Indo-European languages. In both groups, large portions of the English-speaking world turn out to be underrepresented; India, Pakistan, and Nigeria each have more English speakers than the UK and Canada combined.
I would also like to get a sense of what ages, genders, and levels of education are represented. First I compute a cross-tabulation of gender and education level, showing the proportion of observations that fall into each intersection (the table is displayed below the plots). Again, education is encoded as follows: 1 = None, 2 = Elementary school, 3 = High School, 4 = College or University Bachelor, 5 = College or University Master. The possible values for gender are exactly the options given by Small World of Words at the beginning of the free word association task, with 'X' corresponding either to non-binary gender identity or non-disclosure of gender. The heat map shows the joint distribution of gender and education level, and the pie chart shows the gender distribution.
The third plot shows the distribution of participants' ages for each gender. For this plot I drop the small slice of participants labelled as 'X' to make the graph more readable.
# Demographics stats
fig, ax = plt.subplots(3, 1, figsize = (10, 10))
fig.suptitle("Summarizing Small World of Words participant data")
# Education and gender - cross-tabulation
joint_dist = pd.crosstab(part_df.gender, part_df.education, normalize = True, margins = True)
sns.heatmap(joint_dist.T, ax = ax[0]).set_title("Joint distribution of gender and education level")
# Gender - pie chart
part_df.groupby("gender").size().plot.pie(title = "Gender distibution", autopct='%.2f%%', ax = ax[1],
ylabel = '', figsize = (10, 15))
# Age - histogram
# mask out the 'X' category so only the two binary genders appear in the histogram
part_df["bin_gen"] = part_df["gender"].replace("X", np.nan)
part_df.groupby("bin_gen", dropna = True)["age"].plot.hist(bins = 50, alpha = .5, ax = ax[2], density = True,
legend = True, title = "Age distribution by gender",)
joint_dist
The visualizations show that a majority of the sample is female and a majority is college educated. The ages are positively (right) skewed, with the highest concentration falling between 15 and 30 for both genders. A more detailed summary of the ages is displayed below. The table contains summary statistics for each gender value, and the list describes the ages overall.
These variables will be the parameters I use to investigate patterns in primary lexical relations for people of various backgrounds.
display(part_df.groupby('gender')["age"].describe())
part_df["age"].describe()
Here is an example to demonstrate how WordNet can identify the lexical relation between two words. Word senses are stored in objects called synsets, and each synset is identified by a name consisting of a word, a letter for its part of speech, and a number indicating which sense of the word is referenced. For example, here is the list of synsets for 'tree', followed by the definition of the first one.
tree_synsets = wn.synsets('tree')
print(tree_synsets)
tree_synsets[0].definition()
If I want to determine whether a given word, say, 'oak', is a hyponym (more specific instance) of 'tree', I can ask WordNet:
tree_hypo = [synset.name().split('.')[0] for synset in wn.synsets('tree')[0].hyponyms()] # kinds of tree- just the word
'oak' in tree_hypo
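Some relations, such as antonymy, are defined on lemmas (the individual word forms within a synset) rather than on synsets themselves, so those lookups go through lemma objects. A minimal illustration with the adjective sense of 'good':
# antonyms are attached to lemmas, not synsets
good_lemma = wn.synset('good.a.01').lemmas()[0]
good_lemma.antonyms()  # expected to contain the lemma for 'bad'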
Using this functionality of WordNet I will classify the responses in my dataset in terms of their relation to the cue word and observe overall patterns in the most common relations for each part of speech, as well as correlations with participants' demographics.
The function below takes two strings as parameters and determines whether the second string occurs in any of WordNet's lists of words related to the first. If the response is found, it updates a dictionary of frequencies for each of WordNet's relations and returns a list containing the name of each relation that was found.
# dictionary, relation--> frequency
relations_freq = {}
relations = ['hypernyms', 'instance_hypernyms', 'hyponyms', 'instance_hyponyms', 'member_holonyms',
'substance_holonyms', 'part_holonyms', 'member_meronyms', 'substance_meronyms', 'part_meronyms',
'topic_domains', 'region_domains', 'usage_domains', 'attributes', 'entailments', 'causes',
'also_sees', 'verb_groups', 'similar_tos', 'lemmas']
# these relations only defined for lemmas
lemma_relations = ['antonyms', 'pertainyms', 'derivationally_related_forms']
for r in relations:
relations_freq[r] = 0
for r in lemma_relations:
relations_freq[r] = 0
def count_relation(cue, response):
rel_list = []
relation_found = False
# for each sense of cue
for synset in wn.synsets(cue):
lemmas = synset.lemmas()
# check if response is related
for r in relations:
related = getattr(synset, r) # list of wordnet's related senses
words = [synset.name().split('.')[0] for synset in related()] # just the word
for i in range(len(words)):
words[i] = re.sub('_', ' ', words[i])
#print(r_words)
if response in words:
rel_list.append(r)
relation_found = True
#print('Relation found: ' + str(r) + '\n')
relations_freq[r] += 1
# same but for relations defined for lemmas
for lemma in lemmas:
for r in lemma_relations:
l_related = getattr(lemma, r)
l_words = [lem.name() for lem in l_related()]
for i in range(len(l_words)):
l_words[i] = re.sub('_', ' ', l_words[i])
#print(l_words)
if response in l_words:
rel_list.append(r)
relation_found = True
#print('Relation found: ' + str(r) + '\n')
relations_freq[r] += 1
return rel_list
# if none found, returns empty list
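Before applying the function to the whole dataset, a quick check on the cue-response pair from the earlier WordNet example (note that this call also increments the relations_freq tally, so it would need to be reset before the full run):
# expected to include 'hyponyms', since 'oak' appeared among the hyponyms of 'tree' above
count_relation('tree', 'oak')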
The commands below update my DataFrame to include the lists of relations produced by the count_relation function for each cue and its responses. For most pairs for which a relation is found, the list contains just one relation. If the response appeared in more than one list of words related to the cue, the list length is greater than 1 and the counts dictionary will have been updated for each occurrence; I count each relation in the overall tally.
# adding relations to the dataframe
part_df["relation1"] = part_df[["cue", "R1"]].apply(lambda x: count_relation(*x), axis = 1)
part_df["relation2"] = part_df[["cue", "R2"]].apply(lambda x: count_relation(*x), axis = 1)
part_df["relation3"] = part_df[["cue", "R3"]].apply(lambda x: count_relation(*x), axis = 1)
def join_lists(r1, r2, r3):
return list(set(r1)) + list(set(r2)) + list(set(r3))
part_df["relations_list"] = part_df.apply(lambda x: join_lists(x.relation1, x.relation2, x.relation3), axis = 1)
# create a dataframe of counts
counts = pd.DataFrame.from_dict(relations_freq, orient='index', columns=['count'])
# write the updated DataFrame and the counts to CSV for easy retrieval later
part_df.to_csv('rel_df', index = False)
counts.to_csv('counts')
# reload the saved DataFrame; the relation lists are read back as strings, so re-parse them into lists
part_df = pd.read_csv('rel_df')
part_df["relations_list"] = part_df["relations_list"].apply(lambda x: re.findall(r'\w+', x))
Now I can observe patterns in relation frequencies. First, I plot the distribution of the most common overall relations that appear in the dataset. The top 6 are lemmas (synonyms), hypernyms, derivationally related forms (semantically related cognates with a different part of speech), hyponyms, similar-to's (a WordNet relation for adjectives that targets semantically related synsets with a wider scope than lemmas), and antonyms. See documentation of WordNet or NLTK's WordNet interface for more information on relation definitions. Other relations account for less than 1% of pairs for which a relation was found.
counts = pd.read_csv('counts')
counts.rename(columns = {"Unnamed: 0": "relation"}, inplace = True)
counts = counts.set_index('relation')
counts = counts.sort_values('count', ascending = False)
# group the least frequent relations together under 'other'
# threshold chosen so that only the most common relations keep their own label
threshold = 26825
new_dic = {}
for key, group in itertools.groupby(relations_freq, lambda k: 'other' if (relations_freq[k] < threshold) else k):
    # accumulate rather than overwrite, since the 'other' keys are not necessarily consecutive
    new_dic[key] = new_dic.get(key, 0) + sum(relations_freq[k] for k in group)
grouped_counts = pd.DataFrame.from_dict(new_dic, orient = 'index', columns = ['count'])
grouped_counts.sort_values('count', ascending = False).plot.pie(y = 'count', figsize = (5, 5),
autopct = '%1.0f%%', title = "Overall counts of relations")
plt.legend(bbox_to_anchor=(-0.15, 1.), loc = 'center')
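For the exact shares behind the rounded pie-chart percentages, the counts can be normalized directly (a short sketch using the counts DataFrame loaded above):
# share of each relation among all relation occurrences found
(counts['count'] / counts['count'].sum()).round(4)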
Now let's break down the distribution of relations by demographics. I isolate the observations for which at least one relation was found, and explode the dataframe so each relation gets its own row. The bar graphs below represent counts of each relation separated by gender and by education respectively.
df_slice = part_df[(part_df["relations_list"].map(len) > 0)]
df_slice = df_slice.explode('relations_list', ignore_index = True)
fig, ax = plt.subplots(2, 1)
# Plotting relation count by gender, education
df_slice.reset_index().pivot_table(values = 'index', index = 'relations_list', columns = 'gender',
                                   aggfunc = 'count').plot.bar(figsize = (30, 20), width = 1, ax = ax[0],
                                   title = "Relations by gender")
df_slice.reset_index().pivot_table(values = 'index', index = 'relations_list',
                                   columns = 'education', aggfunc = 'count').plot.bar(figsize = (30, 20), ax = ax[1],
                                   title = "Relations by education level")
plt.rcParams.update({'font.size': 30})
plt.subplots_adjust(hspace = 1)
The patterns of relation frequencies look very similar across genders and education levels. The next figures show the proportions of labelled cue-response pairs that fall under each relation label for each 10-year age range.
# Plotting relations across age: stacked bar plot with age bins
plt.rcParams.update({'font.size': 20})
bins = pd.IntervalIndex.from_tuples([(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75), (75, 85), (85, 95), (95, 105)],
closed = 'left')
df_slice["age_int"] = pd.cut(df_slice["age"], bins = bins)
ct = pd.crosstab(df_slice["relations_list"], df_slice["age_int"], normalize = True)
(ct.T).plot.bar(stacked = True, figsize = (5, 5), title = "Relation frequencies given age group", colormap = 'tab20')
plt.legend(bbox_to_anchor=(-0.4, 1.), loc = 'right', fontsize = 10)
To make the bar graph a bit easier to interpret, I pare down the DataFrame so it only contains the most commonly occurring relations.
common = ['lemmas', 'hypernyms', 'derivationally_related_forms', 'hyponyms', 'antonyms', 'similar_tos', 'also_sees',
'part_holonyms', 'topic_domains', 'verb_groups']
df_common = df_slice[(df_slice["relations_list"].isin(common))]
ct2 = pd.crosstab(df_common["relations_list"], df_common["age_int"], normalize = True)
(ct2.T).plot.bar(stacked = True, figsize = (5, 5), title = "Relation frequencies given age group", colormap = 'tab10')
plt.legend(bbox_to_anchor=(-0.25, 1.), loc = 'right', fontsize = 10)
plt.rcParams.update({'font.size': 15})
ct2
No significant difference in the distribution of relations between age groups is visible. Now let's compare the relations by participants' native speaker status.
# Plotting relations by proficiency
ct3 = pd.crosstab(df_slice["relations_list"], df_slice["native_speaker"], normalize = True)
plt.rcParams.update({'font.size': 15})
fig, ax = plt.subplots(1, 2, figsize = (20, 10))
fig.suptitle("Relation frequencies by native speaker status")
ct3[True].plot.bar(ax = ax[0], title = "Native speakers")
ct3[False].plot.bar(ax = ax[1], title = "Non-native speakers")
ct3.T
The differences in the distributions with respect to native speaker status also appear to be negligible. Slight deviations are visible, such as non-native speakers having a higher proportion of relations classified as topic domains.
The stacked bar plots below are also separated by native speaker status, and show the relationship between the most common countries for each group and the corresponding frequencies of lexical relations.
# Plotting relations by country
plt.rcParams.update({'font.size':12})
fig, ax = plt.subplots(1, 2)
fig.suptitle("Relation frequencies by country")
nat_valuecounts = df_common[(df_common["native_speaker"] == True)]["country"].value_counts()[:9]
notnat_valuecounts = df_common[(df_common["native_speaker"] == False)]["country"].value_counts()[:9]
n_speaker = df_common[(df_common["native_speaker"] == True) & (df_common["country"].isin(nat_valuecounts.index))]
notn_speaker = df_common[(df_common["native_speaker"] == False) & (df_common["country"].isin(notnat_valuecounts.index))]
ct4 = pd.crosstab(n_speaker["country"], n_speaker["relations_list"], normalize = True)
(ct4).plot.bar(stacked = True, ax = ax[0], title = "Countries of native speakers", legend = False, fontsize = 10)
ct5 = pd.crosstab(notn_speaker["country"], notn_speaker["relations_list"], normalize = True)
ct5.plot.bar(stacked = True, ax = ax[1], title = "Countries of non-native speakers", legend = False, fontsize = 10)
plt.legend(bbox_to_anchor=(-2.3, 1.), prop = {'size': 6})
plt.subplots_adjust(wspace = 1)
display(ct4)
ct5
So far all divisions of the dataset along the participant demographics resemble the overall distributions of relations across the dataset. The last parameter to look at is non-English native languages.
# Plotting relations by other native languages
df_slice[(df_slice["native_speaker"] == False)]["nativeLanguage"].value_counts()[:16]
langs = ["Other_Foreign", "German", "Spanish", "Dutch_Netherlands", "French", "Dutch_Flanders", "Finnish", "Italian",
"Mandarin", "Swedish", "Russian", "Portuguese", "Danish", "Polish", "Norwegian", "Hindi"]
non_native = df_common[(df_common["native_speaker"] == False) & (df_common["nativeLanguage"].isin(langs))]
ct5 = pd.crosstab(non_native["nativeLanguage"], non_native["relations_list"], normalize = True)
ct5.plot.bar(stacked = True, title = "Relation frequencies by native language")
plt.legend(bbox_to_anchor=(-.2, 1.), prop = {'size': 6})
Overall, the proportions of relations in each group closely resemble the overall distribution in the dataset.
Applying WordNet's relations to the data from Small World of Words makes it possible to draw inferences about which relations primarily govern how lexical items are processed in English. This analysis found that:
Future work on this topic would benefit from a more precise method of labelling relations, such as asking participants after the word association task which meaning they intended by each response. This would potentially introduce human error, but might still be more precise than WordNet's classifications, which assigned more than one relation to some pairs in the absence of clarification about which sense the responder had in mind. In addition, most of the cue-response pairs were not labelled with any lexical semantic relation at all; other kinds of associations should be considered, such as phonological ones like rhyme. Finally, I treated all of an individual's responses to a cue with equal weight. A future analysis might separate these by the order in which they came to mind and examine the differences.
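As a rough illustration of that last point, the per-position relation columns created earlier (relation1, relation2, relation3) could be tallied separately instead of pooled. A minimal sketch, assuming those columns were written to and re-read from rel_df as strings, so they are parsed the same way as relations_list above:
# tally relations separately for first, second, and third responses
by_position = pd.DataFrame({
    pos: (part_df[col]
          .apply(lambda x: re.findall(r'\w+', str(x)))  # parse the stringified lists
          .explode()
          .value_counts())
    for pos, col in [("R1", "relation1"), ("R2", "relation2"), ("R3", "relation3")]
}).fillna(0).astype(int)
by_position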
De Deyne, S., Navarro, D. J., Perfors, A., Brysbaert, M., & Storms, G. (2018). The “Small World of Words” English word association norms for over 12,000 cue words. Behavior Research Methods. DOI 10.3758/s13428-018-1115-7.
Gravino, P., Servedio, V., Barrat, A., Loreto, V. (2012). Complex structures and semantics in free word association. Advances in Complex Systems, World Scientific, 15, pp.1250054. <hal-00701709>
Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
List of Countries by English-Speaking Population. (2021, November 14). Wikipedia, Wikimedia Foundation, from en.wikipedia.org/wiki/List_of_countries_by_English-speaking_population.
Perinan-Pascual, C., & Arcas-Tunez, F., (2004). Meaning postulates in a lexico-conceptual knowledge base. Proceedings. 15th International Workshop on Database and Expert Systems Applications, pp. 38-42, doi: 10.1109/DEXA.2004.1333446.
Tsunoda, T. (2013). 9. Typology of speakers. Language Endangerment and Language Revitalization: An Introduction, Berlin, Boston: De Gruyter Mouton, pp. 117-133. https://doi.org/10.1515/9783110896589.117