The lexicon of a natural language can be represented by networks of semantic units, associated according to their lexical relations to other entities. Free word association tasks shed light on how the meanings of these entities are stored and processed in the brain. For example, on reading or hearing a noun, one might immediately think of another noun related to it by synonymy, antonymy, or hyponymy, an adjective that derives from it or from a common root, or a verb for which the noun is a canonical agent. The goal of this project is to explore the relations that hold between words in the lexica of English speakers, as evidenced by a free word association task. Specifically, I would like to know:
My analysis of the data with respect to these questions will help me draw conclusions about how lexical items are stored by English speakers. The final result will be viewable here.
My data on participants and their responses comes from Small World of Words, a project dedicated to building models of the lexica of several of the world's languages. A participant is shown a series of lexical items (single words or multi-word units such as traffic light), and for each they must enter up to three words that come to mind right away. These associations are used to create a network that represents how those words are stored in native and/or fluent speakers' brains. Networks like the ones generated by Small World of Words help demonstrate how we intuitively understand entities in terms of other entities. Looking over the explore page, it is clear that the strongest associations with a word are not necessarily relations like synonymy (chair - seat); they may instead evoke descriptions or images (yellow - dandelion) or events (chocolate - melt). Because Small World of Words records multiple responses to cues along with participant demographics, it is a fitting data source for the questions I aim to address.
On their research page, Small World of Words has several datasets compiled from their results for English and Dutch. The dataset I have loaded here contains English-speaking participant data collected between 2011 and 2018, including the date they participated, their demographics, and their responses to the cue words. Each row/observation in the DataFrame corresponds to a person's response(s) to a specific cue, and 100 sets of responses are recorded for each cue word. Education level is coded by highest level of education: 1 = None, 2 = Elementary school, 3 = High School, 4 = College or University Bachelor, 5 = College or University Master.
I use WordNet to identify parts of speech and the lexical relations between the cues and responses. WordNet is a lexical database that organizes English nouns, verbs, adjectives, and adverbs into a network according to their lexical relations.
A similar application of WordNet to the results of free word association tasks can be seen in Gravino et al. (2012). Among their findings is that the most common relations are synonymy, hypernymy, and hyponymy. The analysis here will determine whether those findings are reflected in the Small World of Words data and whether there are differences along the various demographic parameters.
The data will be analyzed and displayed using the Pandas and Seaborn libraries.
Below I import the needed packages and read the file containing the participant data (demographics and responses to cues) into a DataFrame. The participant data is available for download on the research page linked above. I display the top of the original DataFrame.
# import Pandas
import pandas as pd
# import WordNet
!pip install nltk
import nltk
nltk.download('wordnet')  # download the WordNet data used below
from nltk.corpus import wordnet as wn
# import Regex
import re
# import NumPy
import numpy as np
# import seaborn
!pip install seaborn
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
# import itertools
import itertools
# Small world of words- participant data
part_df = pd.read_csv("English.csv")
part_df.head()
# an observation corresponds to one person's responses to a specific cue
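As a quick sanity check of the claim that 100 response sets are recorded per cue (a small sketch; a handful of cues may deviate slightly):
# number of response rows per cue; expected to cluster at 100
part_df.groupby("cue").size().value_counts().head()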
Now I will modify the DataFrame, first dropping the columns I won't need for my analysis. Because I am interested in comparing native speakers' responses to non-native speakers', I have to change the way participants' native languages are encoded. Participants in Small World of Words indicate whether they are native speakers; if not, they are prompted to select their native language from a list of common languages or "Other". Native English speakers are instead prompted to select their "native language" from a list of countries where English is spoken (not to be confused with the "country" column, which records where they were when they participated). Because of this inconsistency, I add a column that indicates whether each participant is a native speaker.
I also update the education column so the levels are represented as abbreviations rather than numbers.
Both native and non-native speakers had the "Other" option for native language. These are distinguished by the values "Other_English" and "Other_Foreign".
# Drop unneeded columns
part_df.drop(columns = ["Unnamed: 0", "id", "created_at"], inplace = True)
# SWOW's list of nations/regions of native English speakers
eng = ["Canada", "Puerto Rico", "United States", "Australia", "United Kingdom", "Ireland", "New Zealand",
"Papua New Guinea", "Jamaica", "Trinidad and Tobago", "Hong Kong", "India", "Pakistan", "Singapore",
"Philippines", "Cameroon", "Ghana", "Kenya", "Malawi", "Mauritius", "Nigeria", "Rwanda", "South Africa",
"Sudan", "Uganda", "Tanzania", "Zimbabwe", "Other_English"]
part_df["native_speaker"] = False
part_df.loc[part_df["nativeLanguage"].isin(eng), "native_speaker"] = True
# education levels as abbreviations
part_df["education"] = part_df["education"].map({
5.: "C/U Master",
4.: "C/U Bachelor",
3.: "HS",
2.: "ES",
1.: "None"
})
# order the education levels so they sort and plot correctly
part_df["education"] = pd.Categorical(part_df["education"], ["None", "ES", "HS", "C/U Bachelor", "C/U Master"], ordered = True)
# replace missing cues with empty strings
part_df["cue"] = part_df["cue"].replace(np.nan, '')
part_df.head()
Before exploring lexical relations, I want to know about the demographics represented in my data. To get an idea of the distribution of native vs. non-native speakers in the dataset, I plot a bar graph below and display what percentage are native speakers (about 86%).
The other bar graphs show the home countries/regions of the native English speakers, and the native languages and countries of the non-native speakers. It should be noted that the y-axis on these graphs corresponds to the number of responses to cues and not to a count of individuals. The size of the sample of speakers in the DataFrame is 83,864.
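Before the plots, a one-line check of the native-speaker share mentioned above (a quick sketch; it should come out around 0.86):
# proportion of response rows from native vs. non-native speakers
part_df["native_speaker"].value_counts(normalize = True)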
# Language stats
fig, ax = plt.subplots(4, 1, figsize = (7, 15))
plt.rcParams.update(plt.rcParamsDefault)
fig.suptitle("Summarizing Small World of Words native language data")
# Percent native speakers
part_df["native_speaker"].value_counts().plot.bar(title = "Distribution of native English speaker status", ax = ax[0])
# Native speakers' countries
ns_countries = ["United States", "United Kingdom", "Canada", "Australia", "New Zealand", "Ireland", "South Africa", "Singapore"]
part_df[(part_df["native_speaker"] == True) &
(part_df["country"].isin(ns_countries))]["country"].value_counts().plot.bar(
ax = ax[1], title ="Most common regions of native speakers")
# Non-native speakers' native languages
part_df[(part_df["native_speaker"] == False)]["nativeLanguage"].value_counts().plot.bar(ax = ax[2],
title = "Other native languages")
# and countries
# 20 most commonly occurring countries of non-native speakers
nns_countries = ["United States", "Germany", "United Kingdom", "Belgium", "Netherlands", "Finland", "Canada", "Spain", "Sweden",
"India", "Denmark", "France", "Italy", "Norway", "Australia", "Poland", "Romania", "Hungary", "Brazil",
"Switzerland"]
part_df[(part_df["native_speaker"] == False) & (part_df["country"].isin(ns_countries))]["country"].value_counts().plot.bar(ax = ax[3],
title = "Most common regions of non-native speakers")
plt.tight_layout()
Since most participants are native speakers, my analysis will be more generalizable to native speakers' lexical relations. Most native speakers in the sample are American, followed by speakers from the UK, Canada, and Australia.
A notable majority of non-native speakers indicated that their native language was one other than those listed as options. Most other responses in this group are from native speakers of Indo-European languages. In both groups, large portions of the English-speaking world turn out to be underrepresented; India, Pakistan, and Nigeria each have more English speakers than the UK and Canada combined.
I would also like to get a sense of what ages, genders, and levels of education are represented. First I compute a cross-tabulation of gender and education level, showing the proportion of observations that fall into each intersection (the table is displayed below the plots). Again, education is encoded as follows: 1 = None, 2 = Elementary school, 3 = High School, 4 = College or University Bachelor, 5 = College or University Master. The possible values for gender are exactly the options given by Small World of Words at the beginning of the free word association task, with 'X' corresponding either to non-binary gender identity or non-disclosure of gender. The heat map shows the joint distribution of gender and education level, and the pie chart shows the gender distribution.
The third plot shows the distribution of participants' ages for each gender. For this plot I drop the small slice of participants labelled as 'X' to make the graph more readable.
# Demographics stats
fig, ax = plt.subplots(3, 1, figsize = (10, 10))
fig.suptitle("Summarizing Small World of Words participant data")
# Education and gender - cross-tabulation
joint_dist = pd.crosstab(part_df.gender, part_df.education, normalize = True, margins = True)
sns.heatmap(joint_dist.T, ax = ax[0]).set_title("Joint distribution of gender and education level")
# Gender - pie chart
part_df.groupby("gender").size().plot.pie(title = "Gender distibution", autopct='%.2f%%', ax = ax[1],
ylabel = '', figsize = (10, 15))
# Age - histogram
# mask out the 'X' category so only the two binary genders appear in the histogram
part_df["bin_gen"] = part_df["gender"].replace("X", np.nan)
part_df.groupby("bin_gen", dropna = True)["age"].plot.hist(bins = 50, alpha = .5, ax = ax[2], density = True,
legend = True, title = "Age distribution by gender",)
joint_dist
The visualizations show that a majority of the sample is female and a majority is college educated. The ages are positively (right) skewed, with the highest concentration falling between 15 and 30 for both genders. A more detailed summary of the ages is displayed below. The table contains summary statistics for each gender value, and the list describes the ages overall.
These variables will be the parameters I use to investigate patterns in primary lexical relations for people of various backgrounds.
display(part_df.groupby('gender')["age"].describe())
part_df["age"].describe()
Here is an example to demonstrate how WordNet can identify the lexical relation between two words. Word senses are stored in objects called synsets, and each synset is identified by a name consisting of a word, a letter for its part of speech, and a number indicating which sense of the word is referenced. For example, here is the list of synsets for 'tree', followed by the definition of the first one.
tree_synsets = wn.synsets('tree')
print(tree_synsets)
tree_synsets[0].definition()
If I want to determine whether a given word, say, 'oak', is a hyponym (more specific instance) of 'tree', I can ask WordNet:
tree_hypo = [synset.name().split('.')[0] for synset in wn.synsets('tree')[0].hyponyms()] # kinds of tree- just the word
'oak' in tree_hypo
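Some relations, such as antonymy, are defined on lemmas (the individual word forms within a synset) rather than on synsets themselves, so those lookups go through lemma objects. A minimal illustration with the adjective sense of 'good':
# antonyms are attached to lemmas, not synsets
good_lemma = wn.synset('good.a.01').lemmas()[0]
good_lemma.antonyms()  # expected to contain the lemma for 'bad'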
Using this functionality of WordNet I will classify the responses in my dataset in terms of their relation to the cue word and observe overall patterns in the most common relations for each part of speech, as well as correlations with participants' demographics.
The function below takes two strings as parameters and determines whether the second string occurs in any of WordNet's lists of words related to the first. If the response is found, it updates a dictionary of frequencies for each of WordNet's relations and returns a list containing the name of each relation that was found.
# dictionary, relation--> frequency
relations_freq = {}
relations = ['hypernyms', 'instance_hypernyms', 'hyponyms', 'instance_hyponyms', 'member_holonyms',
'substance_holonyms', 'part_holonyms', 'member_meronyms', 'substance_meronyms', 'part_meronyms',
'topic_domains', 'region_domains', 'usage_domains', 'attributes', 'entailments', 'causes',
'also_sees', 'verb_groups', 'similar_tos', 'lemmas']
# these relations only defined for lemmas
lemma_relations = ['antonyms', 'pertainyms', 'derivationally_related_forms']
for r in relations:
relations_freq[r] = 0
for r in lemma_relations:
relations_freq[r] = 0
def count_relation(cue, response):
rel_list = []
relation_found = False
# for each sense of cue
for synset in wn.synsets(cue):
lemmas = synset.lemmas()
# check if response is related
for r in relations:
related = getattr(synset, r) # list of wordnet's related senses
words = [synset.name().split('.')[0] for synset in related()] # just the word
for i in range(len(words)):
words[i] = re.sub('_', ' ', words[i])
#print(r_words)
if response in words:
rel_list.append(r)
relation_found = True
#print('Relation found: ' + str(r) + '\n')
relations_freq[r] += 1
# same but for relations defined for lemmas
for lemma in lemmas:
for r in lemma_relations:
l_related = getattr(lemma, r)
l_words = [lem.name() for lem in l_related()]
for i in range(len(l_words)):
l_words[i] = re.sub('_', ' ', l_words[i])
#print(l_words)
if response in l_words:
rel_list.append(r)
relation_found = True
#print('Relation found: ' + str(r) + '\n')
relations_freq[r] += 1
return rel_list
# if none found, returns empty list
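Before applying the function to the whole dataset, a quick check on the cue-response pair from the earlier WordNet example (note that this call also increments the relations_freq tally, so it would need to be reset before the full run):
# expected to include 'hyponyms', since 'oak' appeared among the hyponyms of 'tree' above
count_relation('tree', 'oak')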
The commands below update my DataFrame to include the lists of relations produced by the count_relation function for each cue and its responses. For most pairs for which a relation is found, the list contains just one relation. If the response appeared in more than one list of words related to the cue, the list length is greater than 1 and the counts dictionary will have been updated for each occurrence; I count each relation in the overall tally.
# adding relations to the dataframe
part_df["relation1"] = part_df[["cue", "R1"]].apply(lambda x: count_relation(*x), axis = 1)
part_df["relation2"] = part_df[["cue", "R2"]].apply(lambda x: count_relation(*x), axis = 1)
part_df["relation3"] = part_df[["cue", "R3"]].apply(lambda x: count_relation(*x), axis = 1)
def join_lists(r1, r2, r3):
return list(set(r1)) + list(set(r2)) + list(set(r3))
part_df["relations_list"] = part_df.apply(lambda x: join_lists(x.relation1, x.relation2, x.relation3), axis = 1)
# create a dataframe of counts
counts = pd.DataFrame.from_dict(relations_freq, orient='index', columns=['count'])
# write the updated DataFrame and the counts to CSV for easy retrieval later
part_df.to_csv('rel_df', index = False)
counts.to_csv('counts')
# reload the saved DataFrame; the relation lists are read back as strings, so re-parse them into lists
part_df = pd.read_csv('rel_df')
part_df["relations_list"] = part_df["relations_list"].apply(lambda x: re.findall(r'\w+', x))
Now I can observe patterns in relation frequencies. First, I plot the distribution of the most common overall relations that appear in the dataset. The top 6 are lemmas (synonyms), hypernyms, derivationally related forms (semantically related cognates with a different part of speech), hyponyms, similar-to's (a WordNet relation for adjectives that targets semantically related synsets with a wider scope than lemmas), and antonyms. See documentation of WordNet or NLTK's WordNet interface for more information on relation definitions. Other relations account for less than 1% of pairs for which a relation was found.
counts = pd.read_csv('counts')
counts.rename(columns = {"Unnamed: 0": "relation"}, inplace = True)
counts = counts.set_index('relation')
counts = counts.sort_values('count', ascending = False)
# group the least frequent relations together under 'other'
# threshold chosen so that only the most common relations keep their own label
threshold = 26825
new_dic = {}
for key, group in itertools.groupby(relations_freq, lambda k: 'other' if (relations_freq[k] < threshold) else k):
    # accumulate rather than overwrite, since the 'other' keys are not necessarily consecutive
    new_dic[key] = new_dic.get(key, 0) + sum(relations_freq[k] for k in group)
grouped_counts = pd.DataFrame.from_dict(new_dic, orient = 'index', columns = ['count'])
grouped_counts.sort_values('count', ascending = False).plot.pie(y = 'count', figsize = (5, 5),
autopct = '%1.0f%%', title = "Overall counts of relations")
plt.legend(bbox_to_anchor=(-0.15, 1.), loc = 'center')
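For the exact shares behind the rounded pie-chart percentages, the counts can be normalized directly (a short sketch using the counts DataFrame loaded above):
# share of each relation among all relation occurrences found
(counts['count'] / counts['count'].sum()).round(4)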
Now let's break down the distribution of relations by demographics. I isolate the observations for which at least one relation was found, and explode the dataframe so each relation gets its own row. The bar graphs below represent counts of each relation separated by gender and by education respectively.
df_slice = part_df[(part_df["relations_list"].map(len) > 0)]
df_slice = df_slice.explode('relations_list', ignore_index = True)
fig, ax = plt.subplots(2, 1)
# Plotting relation count by gender, education
df_slice.reset_index().pivot_table(values = 'index', index = 'relations_list', columns = 'gender',
                                   aggfunc = 'count').plot.bar(figsize = (30, 20), width = 1, ax = ax[0],
                                   title = "Relations by gender")
df_slice.reset_index().pivot_table(values = 'index', index = 'relations_list',
                                   columns = 'education', aggfunc = 'count').plot.bar(figsize = (30, 20), ax = ax[1],
                                   title = "Relations by education level")
plt.rcParams.update({'font.size': 30})
plt.subplots_adjust(hspace = 1)
The patterns of relation frequencies look very similar across genders and education levels. The next figures show the proportions of labelled cue-response pairs that fall under each relation label for each 10-year age range.
# Plotting relations across age: stacked bar plot with age bins
plt.rcParams.update({'font.size': 20})
bins = pd.IntervalIndex.from_tuples([(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75), (75, 85), (85, 95), (95, 105)],
closed = 'left')
df_slice["age_int"] = pd.cut(df_slice["age"], bins = bins)
ct = pd.crosstab(df_slice["relations_list"], df_slice["age_int"], normalize = True)
(ct.T).plot.bar(stacked = True, figsize = (5, 5), title = "Relation frequencies given age group", colormap = 'tab20')
plt.legend(bbox_to_anchor=(-0.4, 1.), loc = 'right', fontsize = 10)
To make the bar graph a bit easier to interpret, I pare down the DataFrame so it only contains the most commonly occurring relations.
common = ['lemmas', 'hypernyms', 'derivationally_related_forms', 'hyponyms', 'antonyms', 'similar_tos', 'also_sees',
'part_holonyms', 'topic_domains', 'verb_groups']
df_common = df_slice[(df_slice["relations_list"].isin(common))]
ct2 = pd.crosstab(df_common["relations_list"], df_common["age_int"], normalize = True)
(ct2.T).plot.bar(stacked = True, figsize = (5, 5), title = "Relation frequencies given age group", colormap = 'tab10')
plt.legend(bbox_to_anchor=(-0.25, 1.), loc = 'right', fontsize = 10)
plt.rcParams.update({'font.size': 15})
ct2
No significant difference in the distribution of relations between age groups is visible. Now let's compare the relations by participants' native speaker status.
# Plotting relations by proficiency
ct3 = pd.crosstab(df_slice["relations_list"], df_slice["native_speaker"], normalize = True)
plt.rcParams.update({'font.size': 15})
fig, ax = plt.subplots(1, 2, figsize = (20, 10))
fig.suptitle("Relation frequencies by native speaker status")
ct3[True].plot.bar(ax = ax[0], title = "Native speakers")
ct3[False].plot.bar(ax = ax[1], title = "Non-native speakers")
ct3.T
The differences in the distributions with respect to native speaker status also appear to be negligible. Slight deviations are visible, such as non-native speakers having a higher proportion of relations classified as topic domains.
The stacked bar plots below are also separated by native speaker status, and show the relationship between the most common countries for each group and the corresponding frequencies of lexical relations.
# Plotting relations by country
plt.rcParams.update({'font.size':12})
fig, ax = plt.subplots(1, 2)
fig.suptitle("Relation frequencies by country")
nat_valuecounts = df_common[(df_common["native_speaker"] == True)]["country"].value_counts()[:9]
notnat_valuecounts = df_common[(df_common["native_speaker"] == False)]["country"].value_counts()[:9]
n_speaker = df_common[(df_common["native_speaker"] == True) & (df_common["country"].isin(nat_valuecounts.index))]
notn_speaker = df_common[(df_common["native_speaker"] == False) & (df_common["country"].isin(notnat_valuecounts.index))]
ct4 = pd.crosstab(n_speaker["country"], n_speaker["relations_list"], normalize = True)
(ct4).plot.bar(stacked = True, ax = ax[0], title = "Countries of native speakers", legend = False, fontsize = 10)
ct5 = pd.crosstab(notn_speaker["country"], notn_speaker["relations_list"], normalize = True)
ct5.plot.bar(stacked = True, ax = ax[1], title = "Countries of non-native speakers", legend = False, fontsize = 10)
plt.legend(bbox_to_anchor=(-2.3, 1.), prop = {'size': 6})
plt.subplots_adjust(wspace = 1)
display(ct4)
ct5
So far all divisions of the dataset along the participant demographics resemble the overall distributions of relations across the dataset. The last parameter to look at is non-English native languages.
# Plotting relations by other native languages
df_slice[(df_slice["native_speaker"] == False)]["nativeLanguage"].value_counts()[:16]
langs = ["Other_Foreign", "German", "Spanish", "Dutch_Netherlands", "French", "Dutch_Flanders", "Finnish", "Italian",
"Mandarin", "Swedish", "Russian", "Portuguese", "Danish", "Polish", "Norwegian", "Hindi"]
non_native = df_common[(df_common["native_speaker"] == False) & (df_common["nativeLanguage"].isin(langs))]
ct5 = pd.crosstab(non_native["nativeLanguage"], non_native["relations_list"], normalize = True)
ct5.plot.bar(stacked = True, title = "Relation frequencies by native language")
plt.legend(bbox_to_anchor=(-.2, 1.), prop = {'size': 6})
Overall, the proportions of relations in each group closely resemble the overall distribution in the dataset.
Applying WordNet's relations to the data from Small World of Words makes it possible to draw inferences about which relations primarily govern how lexical items are processed in English. This analysis found that:
Future work on this topic would benefit from a more precise method of labelling relations, such as asking participants after the word association task which meaning they intended by each response. This would potentially introduce human error, but might still be more precise than WordNet's classifications, which assigned more than one relation to some pairs in the absence of clarification about which sense the responder had in mind. In addition, most of the cue-response pairs were not labelled with any lexical semantic relation at all; other kinds of associations should be considered, such as phonological ones like rhyme. Finally, I treated all of an individual's responses to a cue with equal weight. A future analysis might separate these by the order in which they came to mind and examine the differences.
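As a rough illustration of that last point, the per-position relation columns created earlier (relation1, relation2, relation3) could be tallied separately instead of pooled. A minimal sketch, assuming those columns were written to and re-read from rel_df as strings, so they are parsed the same way as relations_list above:
# tally relations separately for first, second, and third responses
by_position = pd.DataFrame({
    pos: (part_df[col]
          .apply(lambda x: re.findall(r'\w+', str(x)))  # parse the stringified lists
          .explode()
          .value_counts())
    for pos, col in [("R1", "relation1"), ("R2", "relation2"), ("R3", "relation3")]
}).fillna(0).astype(int)
by_position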
De Deyne, S., Navarro, D. J., Perfors, A., Brysbaert, M., & Storms, G. (2018). The “Small World of Words” English word association norms for over 12,000 cue words. Behavior Research Methods. DOI 10.3758/s13428-018-1115-7.
Gravino, P., Servedio, V., Barrat, A., Loreto, V. (2012). Complex structures and semantics in free word association. Advances in Complex Systems, World Scientific, 15, pp.1250054. <hal-00701709>
Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
List of Countries by English-Speaking Population. (2021, November 14). Wikipedia, Wikimedia Foundation, from en.wikipedia.org/wiki/List_of_countries_by_English-speaking_population.
Perinan-Pascual, C., & Arcas-Tunez, F., (2004). Meaning postulates in a lexico-conceptual knowledge base. Proceedings. 15th International Workshop on Database and Expert Systems Applications, pp. 38-42, doi: 10.1109/DEXA.2004.1333446.
Tsunoda, T. (2013). 9. Typology of speakers. Language Endangerment and Language Revitalization: An Introduction, Berlin, Boston: De Gruyter Mouton, pp. 117-133. https://doi.org/10.1515/9783110896589.117