Natural Language Processing With Neo4j - Mining Paradigmatic Word Associations

Mining word associations from a body of text is often one of the first Natural Language Processing techniques used when mining text data. Word associations are useful for performing NLP tasks such as part of speech tagging, parsing, entity extraction, etc. We will take a brief look at one type of word association called paradigmatic association and show how we can use the Neo4j graph database to help model our text corpus as a graph and implement a simple paradigmatic relation mining algorithm.

Mining word associations

There are two common types of word associations in natural language processing, paradigmatic and systagmatic:

• Paradigmatic: words A and B are paradigmatically related if they can be substituted for each other. This indicates they belong in the same class, such as “Monday” and “Thursday” or “Cat” and “Dog.
• Syntagmatic: words that can be combined with each other, such as “cold” and “weather”.

For this post we will only concern ourselves with paradigmatic associations.

Computing paradigmatic similarity

The basic steps to compute paradigmatic relations:

1. Represent each word by its context
2. Compute context similarity
3. Words with high context similarity likely have paradigmatic relation

1) Represent each word by its context

Consider a simple text document that consists of sentences like this:

My cat eats fish on Saturday.
His dog eats turkey on Tuesday.
...


What words occur within similar context in each sentence? How could we answer this question? We need some way to represent the context for a given word. We do this by defining functions to compute pieces of the context of a given word:

Right1("cat") = {"eats", "ate", "is", "has", ...}
Left1("cat") = {"my", "his", "big", "a", "the", ...}


This allows us to represent word context as a series of sets of words.

2) Compute context similarity

To compute a measure of paradigmatic relation we take an aggregation of the similarity of relative contexts. For example:

Sim("Cat", "Dog") =
Sim(Left1("Cat"), Left1("Dog")) +
Sim(Right1("Cat"), Right1("Dog"))


We can use the Jaccard index as a measure of similarity.

3) Words with high context similarity likely have paradigmatic relation

Once we have computed this measure of similarity we can simply look for word pairs that have a high measure of similarity.

NOTE: this is a relatively simplistic approach. This method could be imporoved by expanding the size of the context window (for example, including Left2, Window2 word context sets), handling stop words, and term frequency based transformations to adjust similarity scores.

Modeling text data as an adjacency graph

We can model text data as an adjacency graph where each word is a node and an edge between two nodes indicates that the words appear next to each other in the text corpus. The data model we will use looks like this:

As an example, consider the following sentence:

My cat eats fish on Saturday.


Using our data model this would appear as follows in our adjacency graph:

If we add another sentence:

His dog eats turkey on Tuesday.


Our graph is updated like this:

We will use the Neo4j graph database to implement this data model. The text corpus used here is the CEEAUS corpus which is distributed with the MeTA NLP library.

We will first see how we can load this text corpus into Neo4j using a Python script. Then we will explore how we can query the data to mine paradigmatic relations using Cypher, the Neo4j query language.

Loading the data

The data is stored in a single file, with each sentence on a single line. We define a Python function to load this file, normalize the text a bit and execute a Cypher query for each line of text to insert into the graph as per the data model we have defined above. Here’s the Python code for data insertion:

Mining word relations

Now that we’ve inserted our data we need some way to query it to discover word relations.

A Cypher / Python Approach

We can write a simple Cypher query to find the Right1 and Left1 sets quite easily. We can also define some Python methods to handle these Cypher queries and perform some simple set operations to compute Jaccard similarity for a word pair:

This allows us to compute a measure of Paradigmatic similarity for arbitrary word pairs:

In [174]: paradigSimilarity("school", "university")
Out[174]: 0.2153846153846154


A Pure Cypher Approach

The Cypher / Python approach above doesn’t really allow us to identify the most paradigmatically similar words. For example, we really only can ask “what is the paradigmatic similarity between words x,y”. What we really want to ask is “what are the word pairs that are most paradigmatically similar?” To do this we will use a pure Cypher approach to compare word pairs and update the graph with a :RELATED_TO relationship. This relationship will store the similarity score. Once this query runs we can query for word relations using this new :RELATED_TO edge.

Here’s the query:

We iterate over word pairs, fetch the right1 and left1 word sets, then perform the equivilent of union and intersection operations that allow us to compute the Jaccard coefficient for each word pair. Finally, we update the graph with this coefficent in a new :RELATED_TO relationship.

Querying for word relations

Now that we’ve updated the graph with these word association scores we can write simple Cypher queries to discover word associations.

Paradigmatic relations for “school”

What are the words most strongly paradigmatically related to the word “school”?

neo4j-sh (?)$MATCH (s:Word {word: 'school'} )-[r:RELATED_TO]->(o) RETURN s.word,o.word,r.paradig as sim ORDER BY sim DESC LIMIT 25; +-----------------------------------------------+ | s.word | o.word | sim | +-----------------------------------------------+ | "school" | "university" | 0.35416666666666663 | | "school" | "studies" | 0.35279106858054227 | | "school" | "working" | 0.3497129735935706 | | "school" | "parents" | 0.33613445378151263 | | "school" | "college" | 0.33519813519813524 | | "school" | "society" | 0.3310029130253849 | | "school" | "parttime" | 0.3220510229856024 | | "school" | "study" | 0.3186426358772177 | | "school" | "money" | 0.31618037135278515 | | "school" | "eating" | 0.31335453100158983 | | "school" | "place" | 0.3115313895455739 | | "school" | "experience" | 0.30545975141537063 | | "school" | "work" | 0.3031372549019608 | | "school" | "life" | 0.30304354157565166 | | "school" | "smoker" | 0.29819694868238555 | | "school" | "jobs" | 0.2965686274509804 | | "school" | "living" | 0.29284676833696444 | | "school" | "future" | 0.2874499332443258 | | "school" | "hard" | 0.28233938346297893 | | "school" | "children" | 0.2812025508654722 | | "school" | "studying" | 0.27817343946376205 | | "school" | "food" | 0.27809596538118875 | | "school" | "restaurant" | 0.2743995912110373 | | "school" | "meal" | 0.2705982542048116 | | "school" | "friends" | 0.2692307692307693 | +-----------------------------------------------+ 25 rows 45 ms  Strong paradigmatic relation What are the strongest paradigmatically related words in the text corpus? neo4j-sh (?)$ MATCH (s)-[r:RELATED_TO]->(o) RETURN s.word,o.word,r.paradig AS sim ORDER BY sim DESC LIMIT 100;
+---------------------------------------------------+
| s.word      | o.word        | sim                 |
+---------------------------------------------------+
| "opinion"   | "idea"        | 0.39954526663910706 |
| "a"         | "the"         | 0.39918436823602943 |
| "students"  | "college"     | 0.393037294302489   |
| "students"  | "people"      | 0.3798992202990593  |
| "students"  | "smokers"     | 0.37755673858223593 |
| "in"        | "for"         | 0.37120642515379354 |
| "for"       | "in"          | 0.37120642515379354 |
| "important" | "hard"        | 0.37101588316810574 |
| "opinion"   | "area"        | 0.36708144796380093 |
| "opinion"   | "situation"   | 0.3667582417582418  |
| "important" | "necessary"   | 0.3656898656898657  |
| "important" | "bad"         | 0.35802139037433156 |
| "school"    | "university"  | 0.35416666666666663 |
| "students"  | "you"         | 0.35373408158860353 |
| "important" | "harmful"     | 0.3533992213237496  |
| "school"    | "studies"     | 0.35279106858054227 |
| "for"       | "of"          | 0.35274323661808615 |
| "to"        | "and"         | 0.35257738943248285 |
| "school"    | "working"     | 0.3497129735935706  |
| "students"  | "restaurants" | 0.3489169719469919  |
| "important" | "difficult"   | 0.3485156912637829  |
| "students"  | "student"     | 0.34606662818504164 |
| "for"       | "that"        | 0.34368031605589777 |
| "students"  | "parents"     | 0.33853521126760566 |
| "opinion"   | "place"       | 0.3381635025184037  |
| "for"       | "to"          | 0.33750185739002114 |
| "to"        | "for"         | 0.33750185739002114 |
| "students"  | "nonsmokers"  | 0.33671364061040177 |
| "school"    | "parents"     | 0.33613445378151263 |
| "school"    | "college"     | 0.33519813519813524 |
| "opinion"   | "fact"        | 0.3350840336134454  |
| "opinion"   | "reasons"     | 0.3310381355932204  |
| "school"    | "society"     | 0.3310029130253849  |
| "students"  | "money"       | 0.3286558659692988  |
| "have"      | "do"          | 0.3284722920647692  |
| "for"       | "but"         | 0.32701345186452413 |
| "important" | "reason"      | 0.3252759802055577  |
| "for"       | "by"          | 0.3252038356724162  |
| "my"        | "our"         | 0.3236480567977502  |
| "important" | "good"        | 0.3223516563523844  |
| "school"    | "parttime"    | 0.3220510229856024  |
| "for"       | "if"          | 0.32164442810636096 |
| "my"        | "their"       | 0.32032052127726696 |
| "opinion"   | "statement"   | 0.3194444444444444  |
| "opinion"   | "problem"     | 0.31904269972451793 |
| "opinion"   | "point"       | 0.31894736842105265 |
| "school"    | "study"       | 0.3186426358772177  |
| "students"  | "restaurant"  | 0.31855802425422675 |
| "students"  | "work"        | 0.3179591836734694  |
| "school"    | "money"       | 0.31618037135278515 |
| "opinion"   | "lives"       | 0.3159090909090909  |
| "opinion"   | "day"         | 0.3156288156288156  |
| "in"        | "and"         | 0.3154907057525759  |
| "in"        | "of"          | 0.3153160870310091  |
| "in"        | "at"          | 0.31521761006289306 |
| "for"       | "and"         | 0.3140333542277307  |
| "opinion"   | "case"        | 0.3134920634920635  |
| "school"    | "eating"      | 0.31335453100158983 |
| "have"      | "get"         | 0.312328440147989   |
| "school"    | "place"       | 0.3115313895455739  |
| "opinion"   | "country"     | 0.3101851851851852  |
| "to"        | "that"        | 0.30992732204777806 |
| "important" | "studying"    | 0.30947775628626695 |
| "students"  | "them"        | 0.3093731519810763  |
| "for"       | "so"          | 0.3077891508736062  |
| "students"  | "smoker"      | 0.30633078730904817 |
| "students"  | "us"          | 0.30616821341202966 |
| "important" | "well"        | 0.3059533175976921  |
| "school"    | "experience"  | 0.30545975141537063 |
| "opinion"   | "workplace"   | 0.3048855905998763  |
| "school"    | "work"        | 0.3031372549019608  |
| "school"    | "life"        | 0.30304354157565166 |
| "opinion"   | "reason"      | 0.30277349768875195 |
| "for"       | "from"        | 0.3025785639266867  |
| "have"      | "make"        | 0.3020345644760418  |
| "students"  | "working"     | 0.3019127064374789  |
| "students"  | "smoking"     | 0.3011066418324483  |
| "for"       | "on"          | 0.3011051106321491  |
| "for"       | "as"          | 0.30039366281771396 |
| "in"        | "to"          | 0.300318818247031   |
| "to"        | "in"          | 0.300318818247031   |
| "opinion"   | "space"       | 0.3003003003003003  |
| "students"  | "society"     | 0.300220818254309   |
| "opinion"   | "condition"   | 0.3001605136436597  |
| "for"       | "with"        | 0.29984269706562056 |
| "in"        | "but"         | 0.2997641267562301  |
| "school"    | "smoker"      | 0.29819694868238555 |
| "students"  | "time"        | 0.29796635538346405 |
| "opinion"   | "rule"        | 0.2975825672454886  |
| "opinion"   | "matter"      | 0.2975543478260869  |
| "important" | "precious"    | 0.2971576227390181  |
| "in"        | "that"        | 0.2970460614152203  |
| "important" | "place"       | 0.2969177779740351  |
| "important" | "dangerous"   | 0.29680644832159986 |
| "important" | "much"        | 0.2967315570706618  |
| "school"    | "jobs"        | 0.2965686274509804  |
| "have"      | "be"          | 0.296095994591854   |
| "opinion"   | "meal"        | 0.29527272727272724 |
| "opinion"   | "way"         | 0.2952157388777107  |
| "to"        | "of"          | 0.29506851814362445 |
+---------------------------------------------------+
100 rows
425 ms


All Python code is available here.

Final note: Many of the NLP ideas and examples here were borrowed from the Coursera course Text Mining and Analytics by Dr. ChengXiang Zhai.