How to tackle a real-world problem with GuidedLDA

Snapshot of interactive visualization of the topics identified by Guided LDA and the keywords in each topic (pyLDAvis)

Originally posted on Analytics Vidhya.


The prevalent use of online platforms for interaction, and the sheer volume of text generated by users, make digesting that data increasingly time-consuming. Sown to Grow is an online education company that aims to empower students by providing a platform for setting goals, reflecting on strategies, and interacting with their teachers. For the company to scale up across the US, automated parsing of reflections is necessary: it helps teachers customize feedback and channel limited resources to vulnerable kids.

Data

The company shared 180k student reflections that, based on the company’s rubric system, were considered high quality (i.e., containing one or more strategies). The actual data cannot be shown for privacy reasons, but my dataframe looked like this:
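Since the real data can't be shown, here is a hypothetical stand-in for the dataframe layout (the column names are my assumptions, not the company's actual schema):

```python
import pandas as pd

# Hypothetical layout; column names and values are made up for illustration
df = pd.DataFrame({
    "student_id": [101, 102, 103],
    "reflection": [
        "I studied my notes and practiced the past exams.",
        "I will ask my teacher when I get stuck.",
        "Finished my homework before the quiz.",
    ],
    "quality_score": [4, 3, 4],  # rubric-based quality rating
})
print(df.head())
```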

After cleaning the data, which included removing duplicates, non-related content, and non-English content, I ended up with 104k reflections that I used to identify the strategies. Below is the function I used to correct misspelled words:
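The original spelling-correction function is not reproduced here; a minimal stdlib sketch using difflib against a known vocabulary conveys the idea (a real pipeline would more likely use a dedicated library such as pyspellchecker, and the vocabulary below is a tiny stand-in):

```python
import difflib

# Tiny stand-in vocabulary; in practice this would cover the corpus
VOCAB = {"studied", "notes", "practiced", "exams", "teacher", "homework"}

def correct_spelling(text, vocab=VOCAB):
    """Replace each word with its closest vocabulary match, if one is close enough."""
    corrected = []
    for word in text.lower().split():
        if word in vocab:
            corrected.append(word)
        else:
            matches = difflib.get_close_matches(word, vocab, n=1, cutoff=0.75)
            corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(correct_spelling("I studed my notse"))  # → "i studied my notes"
```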

To remove the non-English content, I used langdetect to tag the language of each text and removed those that were not English. Langdetect is quite accurate when the input is a full sentence, but less so for a single word.

Initial strategies to solve the problem

Regular LDA

I began by modeling topics in the reflections with the Gensim topic-modeling package, using Latent Dirichlet Allocation (LDA). To prepare the data for topic modeling, I tokenized the text (splitting documents into sentences and sentences into words), removed punctuation, and lower-cased everything; words shorter than three characters were also removed. All of this was done with Gensim's simple preprocessing module. I then defined a function to change words from third person to first person, and verbs from past and future tense to present tense. Finally, words were reduced to their root form (stemming and lemmatization).

After importing the necessary packages and modules, it was time for some preprocessing as explained before:

The example below shows the result of preprocessing (I have used a hypothetical example):

Example:

To create a bag of words for the dataset, Gensim's Dictionary can be used. The bag of words is simply a mapping built from 'processed_docs' that records how many times each word appears (its word count) across the whole corpus.

Remove the tokens that appear in fewer than 15 documents or in more than half of the documents (a fraction of the corpus, not an absolute count). After that, keep the 100,000 most frequent tokens.

I created a dictionary that shows the words, and the number of times those words appear in each document, and saved them as bow_corpus:

Now the data is ready for the LDA topic model. I used Gensim's LDA implementation with support for running on multiple cores.

Check the words for each topic and their relative weights:

As you can see from the words in each topic, some are shared between topics and there is not a distinct topic that can be tagged for each group of words.

Part of Speech (POS) Tagging

After LDA, I decided to tag the part of speech (POS) of each reflection and extract the verbs. I assumed students reflect on what they did, so reflections with past-tense verbs could give me a clue about the topics of learning strategies (e.g. "I studied my notes and practiced the past exams"). I parsed the reflections and extracted all of the verbs via POS tagging. Then, I looked at verb tense to identify the relation between a reflection containing a learning strategy and the tense of the verbs used in it. However, I noticed that some reflections clearly contain learning strategies without being in the past tense.

Pipeline used for solving this problem

Guided LDA

This did not help me find distinct topics of learning strategies either. However, LDA and POS tagging together gave me the idea to use GuidedLDA (GitHub repo), a semi-supervised learning algorithm. The idea is to set seed words for topics that the user believes are representative of the underlying topics in the corpus, and to guide the model to converge around those terms. I used a Python implementation of the algorithm described by J. Jagarlamudi, H. Daume III and R. Udupa in "Incorporating Lexical Priors into Topic Models." The paper explains how priors (here, the seed words) can be set to steer the model in a certain direction.

In regular LDA, each word is first randomly assigned to a topic, controlled by Dirichlet priors via the alpha parameter (now you know where LDA gets its name). The next step is identifying which term belongs to which topic. LDA takes a very simple approach, finding the topic for one term at a time.

Let's assume we want to find the topic for the word 'study'. From the random initialization, every word in the corpus already has a topic, and for the moment LDA assumes that every assignment except the one for 'study' is correct. LDA then computes which words 'study' is frequently paired with, and which topic is most common among those terms, and assigns 'study' to that topic; 'study' will probably land near whichever topic 'textbook' and 'notes' are in. These three words are now closer to each other than they were before this step. The model then moves on to the next word, repeating the process as many times as needed to converge. With GuidedLDA, we explicitly want the model to converge in a way that puts 'study' and 'textbook' in one topic, so GuidedLDA gives 'study' and 'textbook' an extra boost toward a specific topic. How much of a boost a word receives is controlled by the seed_confidence parameter, which ranges between 0 and 1. With a seed_confidence of 0.1, you bias the seeded words 10% more toward the seeded topics.

To use the Python implementation of GuidedLDA, you can install it from PyPI or build it from source.
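The install commands were embedded as snippets in the original post; the usual two options are:

```shell
# Option 1: install the released package from PyPI
pip install guidedlda

# Option 2: build from the GitHub source
git clone https://github.com/vi3k6i5/GuidedLDA.git
cd GuidedLDA
sh build_dist.sh
python setup.py sdist
pip install -e .
```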

Start GuidedLDA by preprocessing the data, as you would with any NLP work. For that, I defined my own preprocessing functions:
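The author's exact functions are not reproduced here; a minimal stdlib sketch of a tokenizer with a crude lemmatizer stands in for them (a real pipeline would more likely use NLTK or spaCy for lemmatization):

```python
import re

def tokenize(text):
    """Lowercase and split on non-letters, dropping very short tokens."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if len(tok) > 2]

def lemmatize(token):
    """Very crude suffix stripping as a stand-in for real lemmatization."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) > 2:
            return token[: -len(suffix)]
    return token

def lemmatized_tokens(text):
    return [lemmatize(tok) for tok in tokenize(text)]

print(lemmatized_tokens("I studied my notes and practiced the past exams"))
```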

After defining all the necessary preprocessing functions, it is time to apply them to the target column of the dataframe (here, corrected_content) and save the result as a new column, 'lemmatized_tokens'.
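That step amounts to a pandas apply; the dataframe and preprocessor below are stand-ins, with only the column names taken from the text:

```python
import pandas as pd

# Hypothetical dataframe; 'corrected_content' matches the column named in the text
df = pd.DataFrame({"corrected_content": [
    "I studied my notes",
    "I practiced the past exams",
]})

# Stand-in preprocessor; the real one tokenizes and lemmatizes
def lemmatized_tokens(text):
    return text.lower().split()

df["lemmatized_tokens"] = df["corrected_content"].apply(lemmatized_tokens)
print(df["lemmatized_tokens"].iloc[0])
```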

Next, to generate the term-document matrix, I used the CountVectorizer class from the scikit-learn package:

First, we need to instantiate CountVectorizer. For the full list of parameters, you can refer to the scikit-learn website. I changed the tokenizer to the customized one I had previously defined, and the stop words to the list I had created from my own dataset. Here, I used an n-gram range of up to 4 words. Now, it is time to fit and transform the corpus to generate the term-document matrix:
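A sketch of that step with scikit-learn; the tokenizer, stop-word list, and documents below are stand-ins for the custom ones described above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I studied my notes and practiced the past exams",
    "I asked my teacher questions about the homework",
]

my_stop_words = ["i", "my", "the", "and", "about"]  # stand-in custom stop-word list

vectorizer = CountVectorizer(
    tokenizer=lambda text: text.lower().split(),  # stand-in custom tokenizer
    stop_words=my_stop_words,
    ngram_range=(1, 4),  # unigrams up to 4-grams
)

X = vectorizer.fit_transform(docs)  # term-document matrix
print(X.shape)
```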

To model the topics with GuidedLDA after importing the package, a dictionary of the terms is created.

I then provided a list of seed words for the model, drawing on the semantics of the text along with the initial keywords I got from LDA modeling and the dictionary of verbs from POS tagging. For that, I created a list of lists, in which each inner list holds the keywords I wanted grouped under a specific topic.

As you can see, I provided the model with seed words for 6 topics.
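The actual seed lists are derived from the company's data; a hypothetical version, mapped to term ids in the way the GuidedLDA README does, might look like this (the words and the vocabulary here are illustrative):

```python
# Hypothetical seed words for 6 learning-strategy topics
seed_topic_list = [
    ["homework", "finish", "assignment", "complete"],
    ["quiz", "question", "answer", "understand"],
    ["teacher", "ask", "talk", "attention"],
    ["read", "study", "note", "book"],
    ["answer", "question", "learn", "problem"],
    ["write", "story", "essay", "book"],
]

# word2id maps each vocabulary term to its column in the term-document matrix;
# here the vocabulary is built from the seeds themselves for illustration
vocab = sorted({w for topic in seed_topic_list for w in topic})
word2id = {w: i for i, w in enumerate(vocab)}

# GuidedLDA expects a {term_id: topic_id} dict
seed_topics = {}
for topic_id, words in enumerate(seed_topic_list):
    for word in words:
        if word in word2id:
            seed_topics[word2id[word]] = topic_id

print(len(seed_topics), "seeded terms across", len(seed_topic_list), "topics")
```

Note that a word appearing in two seed lists (like 'book' here) ends up assigned to the later topic, since the dict entry is overwritten.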

Checking the words for each topic:

The results look like:

To visualize the data, I used the pyLDAvis package's powerful interactive visualization; the result is below. The six topics are distinctly separated, and the theme of each can be summarized as:

1. Finish homework/complete assignment
2. Check past quizzes and questions/understand answers
3. Talking and asking teacher/pay attention
4. Read/study notes and books
5. Answering questions and learn the problems
6. Write stories, essay and book

Source code can be found on GitRepo. I look forward to hearing any feedback or questions.

Shahrzad Hosseini developed Reflectometer during her time as an Insight Data Science Fellow in 2019.

Are you interested in working on high-impact projects and transitioning to a career in data? Sign up to learn more about the Insight Fellows programs and start your application today.


Thanks to Insight
