
Case Study: Voyant

  • alexanderrpreston7
  • Nov 15, 2024
  • 4 min read

By Al Preston

 

Have you ever seen one of these?

 


 

Word clouds are very cool, but they don’t tend to mean much on the surface. Word clouds like this only show the frequency of particular words in a body of work (also called a corpus). You can clearly see that ‘know’ was used the most as it is the biggest word in the cloud.

Very cool to look at, but why does it matter?

Well, datamining a corpus for the frequency of words, the relationships between words, and the context of words can tell you a lot about a particular body of work. For example, Cameron Blevins used datamining software to analyze Martha Ballard’s Diary (link). For those who don’t know, Martha Ballard was a midwife in Maine who recorded her work and life daily in a diary. In 1991, Laurel Thatcher Ulrich took the diary, which historians up until that point had dismissed as useless information, and datamined it by hand.

She discovered that Ballard lost very few mothers and babies in her time as a midwife, and she could also track disease through a town from the moment it appeared to the last house it hit. All through one diary alone. Cameron Blevins was more interested in checking Ulrich’s work with digital datamining tools. He found that her findings were accurate, with a few errors here and there because the digital tools could not differentiate between words that were spelled differently across the diary.

He was able to follow weather patterns through the diary and a few more interesting pieces of information. However, he did not get the same results that Ulrich had when doing the datamining by hand. There are some things a computer just can’t see.

One of the programs that allows you to do this kind of work is Voyant. It comes preloaded with the works of Shakespeare and Jane Austen, but you can also upload your own corpus, all for free (fair warning: the bigger the corpus, the slower Voyant runs).

Once you upload something, Voyant does most of the rest, though there is some tweaking you have to do. Stopwords, a filter you can give Voyant to remove words like ‘the’ or ‘and’, allow you to narrow your word clouds and other visualizations.
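To make the idea of word frequency and stopwords concrete, here is a minimal Python sketch. It is not how Voyant itself is implemented; the function name, the sample sentence, and the tiny stopword list are all made up for illustration.

```python
import re
from collections import Counter

# Illustrative stopword list only; this is not Voyant's built-in list.
STOPWORDS = {"the", "and", "a", "of", "to", "i", "you"}

def word_frequencies(text, stopwords=frozenset()):
    """Count lowercase words in text, skipping anything in stopwords."""
    # Treat straight and curly apostrophes as part of a word so that
    # contractions such as "that's" stay in one piece.
    words = re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())
    return Counter(w for w in words if w not in stopwords)

# Made-up sample text, not taken from any of the transcripts.
sample = "You know the bar and I know the bar, you know?"
print(word_frequencies(sample).most_common(3))
print(word_frequencies(sample, STOPWORDS).most_common(3))
```

Without the stopword list, filler words like ‘the’ and ‘you’ compete with the words you actually care about. With it, only ‘know’ and ‘bar’ are left in this sample.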

The word cloud above came from four separate corpora that I have on hand: the transcripts from the Making Gay History podcast, the full transcriptions of the Chuck Honse and Scott Noxon interviews that The Holiday Pride have completed, and a written interview with Kate Hammil also completed by The Holiday Pride.

Word clouds aren’t the only thing that Voyant can produce. It can also make bubblelines like these:


Caption: the words ‘gay’ and ‘know’ and where they appear in all four corpora.

 

Caption: the words ‘gay’, ‘bar’, ‘lesbian’, and ‘woman’ and where they appear in all four corpora.

 

            I really like the bubblelines. They can show frequency and relationships between words in one concise way.
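For a rough sense of what a bubbleline is plotting, the sketch below splits a text into equal-sized segments and counts how often a word falls in each one. This is my own simplified approximation in Python, not Voyant's code; the function name, the segment count, and the sample text are assumptions.

```python
import re

def term_distribution(text, term, segments=10):
    """Count occurrences of term in each of `segments` equal slices of text."""
    words = re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())
    counts = [0] * segments
    for position, word in enumerate(words):
        if word == term:
            # Map the word's position to a segment index from 0 to segments - 1.
            counts[position * segments // len(words)] += 1
    return counts

# Made-up sample: twelve words, so four segments of three words each.
sample = "gay a a a b b b gay gay c c c"
print(term_distribution(sample, "gay", segments=4))
```

Each number in the returned list would correspond to one bubble along the line: a bigger count means a bigger bubble in that part of the document.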

            I didn’t just compare these four works. I also looked at just the three Pittsburgh interviews as well as just the two oral histories. Once again, with more bubblelines:


Caption: the word ‘transgender’ in only the corpus of The Holiday Pride.

 

Caption: the word ‘money’ in only the corpus of The Holiday Pride.


Caption: the words ‘Pittsburgh’ and ‘gay’ in only the corpus of The Holiday Pride.

 

            Voyant can also show common phrases. You can choose particular words and Voyant will show you how many times that word appears in combination with other words in the corpus, as you can see below:

 

Caption: phrases in reference to the word ‘know’ in all four corpora.

 

Caption: phrases in reference to the word ‘gay’ in all four corpora.

 

Caption: phrases in reference to the word ‘lesbian’ in all four corpora.
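The phrase counts described above can be approximated by counting every short run of words that contains the chosen word. The Python sketch below is a hedged illustration under that assumption, not a description of how Voyant finds its phrases; the function name and sample sentence are hypothetical.

```python
import re
from collections import Counter

def phrases_with(text, keyword, n=2):
    """Count every n-word phrase in text that contains keyword."""
    words = re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())
    # Slide a window of n words across the text, one word at a time.
    windows = (words[i:i + n] for i in range(len(words) - n + 1))
    return Counter(" ".join(w) for w in windows if keyword in w)

# Made-up sample text, not taken from any of the transcripts.
sample = "you know I know you know"
print(phrases_with(sample, "know").most_common())
```

In spoken transcripts, a filler phrase like ‘you know’ will dominate a count like this, which is part of why oral histories are tricky for this kind of tool.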

 

            You can also visualize words across individual corpora. The example below shows the word ‘gay’ across all four corpora, demonstrating which one used the word more than the others:



            I have also compared just the two oral histories from The Holiday Pride in this neat word tree that can also show the relationships between words in a corpus:

 



            So, having done all of this, what did I learn?

            Well…not a lot. Which is okay!

            Sometimes there is nothing to gain from datamining like this. Oral histories can also present an interesting challenge to tools like Voyant. Depending on who transcribed a history, misspellings, acknowledgement of accents, and odd turns of phrase that only appear when people speak can confuse Voyant.

            To get these results, I had to add so many stopwords that there wasn’t much left. While these are not large corpora, between all four there were about 100,000 words to sort through. I had to remove every shortened term, misspelled word, name, and contraction (EX: that’s) that Voyant has a hard time identifying on its own.

            Cleaning up a corpus of all of those terms can help Voyant do its thing. However, oral histories can be so full of those terms that removing them can fundamentally change how the oral history reads, removing important context.
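Hunting down contractions by hand is tedious, so a small script could at least collect the candidates for a stopword list. This is a sketch in Python under my own assumptions (the function name and sample text are made up), and it would still need a human to review the output before anything is removed.

```python
import re

def find_contractions(text):
    """Return the sorted, de-duplicated contraction-like words in text."""
    # Matches letters, one straight or curly apostrophe, then more letters.
    # Note this also catches possessives such as "Ballard's".
    found = re.findall(r"\b[a-z]+['’][a-z]+\b", text.lower())
    return sorted(set(found))

# Made-up sample text, not taken from any of the transcripts.
sample = "That's fine, I don’t know. That's it."
print(find_contractions(sample))
```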

            Voyant does have a problem, in general, with removing words from their context, which can entirely change their meaning.

            I chose these corpora to see if Voyant would be a useful tool in understanding oral histories of queer people. The Holiday Pride’s oral histories only have two fully finished transcripts, so utilizing the transcripts from Making Gay History was just to show proof of concept.

            Sometimes visualizations like this can help us understand the language of a time or population: what words could mean in different contexts, and why someone may have used certain words where they did. However, that is hard to translate to oral history. Perhaps with a larger corpus from Pittsburgh I could find something interesting, but this set of corpora didn’t tell me much.


