Exploratory Text-Data Analysis
A text analysis with tidytext
This is an exploratory data analysis (flesh out later)
Loading the dataset & Tokenization
Our dataset
Our initial dataset contains answers from 67 participants to the following questions:
Research Areas: Which research area do you identify with? Choose all that apply
Pressing questions: Articulate what are 2 of the most pressing questions that need to be addressed to further our understanding of wildfire impacts on biogeochemical cycling in watersheds.
The data file is organized in long format with three columns: a unique identifier ("ID"), "question", and "answers". The original Excel file was saved as CSV-UTF-8 for text-analysis purposes.
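A minimal loading sketch, assuming a hypothetical file name (`data/survey_answers.csv`) for the CSV export:

```r
# Packages used throughout the analysis
library(tidyverse)
library(tidytext)

# Hypothetical path; the actual export was saved as CSV-UTF-8
t_df <- read_csv("data/survey_answers.csv")
head(t_df)
```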
ID | Question type | Answers |
---|---|---|
IA0001 | pressing-q | better understanding of the temporal scales of the different impacts as well as their potential interactions movement of materials ash, pyrogenic carbon, eroded soil between compartments from hillslopes to the hydrological network as well as physical and biological alterations of those during the transport |
IA0002 | pressing-q | How are flow pathways affected by wildfires change of hydrological processes beyond vegetation removal? How long do changes in hydrology and biogeochemistry last? including importance of forest disturbance history |
IA0003 | pressing-q | How important are deeper groundwater flowpaths for prolonging elevated nutrient yields in burned watersheds? The vast majority of post-wildfire studies on biogeochemical cycling have focused on surface or shallow flowpaths, yet post-fire geochemical tracers indicate elevated contributions from deeper bedrock flowpaths during some hydroclimatic conditions Murphy 2018. What is the role of hyporheic flow through wildfire-deposited or reworked sediments in stream beds for enhancing or buffering elevated constituent loading in streams? Wildfires alter streambed sediments through aggradation, incision, and stabilization phases controlled by links between hydroclimatic conditions and geomorphic processes Moody, 2017. Despite the ubiquity of changes in streambed sediments, we know little about how these streambed changes and associated alteration of hyporheic flow changes biogeochemical cycling. |
IA0001 | research-a | Nutrient Cycling, Pyrogenic Organic Matter, Soil Stability, Quality, Mineralogy, Pyrogenic Organic Matter Transport, Fluxes, Water Quality, Watershed Management |
IA0002 | research-a | Water Routing, Hillslope Hydrology, Surface Water – Groundwater Interactions, Hydrological Modeling |
IA0003 | research-a | Water Routing, Hillslope Hydrology, Hydrological Modeling |
IA0004 | research-a | Pyrogenic Organic Matter, Pyrogenic Organic Matter Transport, Fluxes, Pyrogenic Organic Matter Method Development, Biogeochemical Modeling, Successional Dynamics, Ecological Resilience, Ecological Modeling |
Tokenization
Tokenization is the process of splitting each answer into its individual words, creating a new data file with as many rows as there are total words in the original dataset.
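As a quick illustration (a sketch, assuming `t_df` from the loading step above), `unnest_tokens` returns one row per word:

```r
# Each answer is split into lowercase word tokens, one per row
t_df %>%
  unnest_tokens(output = word, input = answers) %>%
  head()
```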
For this part we are going to separate the answers for the first two questions into two data sets. This is because the research-area answers are easier to analyze than those for the pressing questions: they are separated by commas and contain no filler words (due to the format used for the original question in the Google Form).
# For research area questions
ra_dat <- t_df %>% filter(question == "research-a")

# For pressing questions
pq_dat <- t_df %>% filter(question == "pressing-q")
Let’s use wordclouds to visualize the output of tokenizing the research area dataset. I had to adjust the size of the words using the scale parameter (maximum size = 4 by default); otherwise certain long words would not fit into the plot (see stackoverflow answer).
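A sketch of that wordcloud call, assuming `ra_dat` from above; splitting on commas and the particular `scale` values are choices made for this illustration:

```r
library(wordcloud)

# Research-area answers are comma-separated lists, so tokenize on commas
ra_tokens <- ra_dat %>%
  unnest_tokens(output = term, input = answers,
                token = "regex", pattern = ",\\s*") %>%
  count(term, sort = TRUE)

# Lower the maximum word size (default is 4) so long terms fit in the plot
wordcloud(words = ra_tokens$term, freq = ra_tokens$n,
          scale = c(2.5, 0.5), min.freq = 1)
```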
The pressing-question answers require an extra step. We need to remove “filler words”, a.k.a. stop words (e.g., a, the, this…). We also need to deal with plurals and singulars of the same word.
Stop word | Lexicon |
---|---|
a | SMART |
a's | SMART |
able | SMART |
about | SMART |
above | SMART |
according | SMART |
you | onix |
young | onix |
younger | onix |
youngest | onix |
your | onix |
yours | onix |
More info about lexicons and text categorization can be found here
Let’s use wordclouds again to see the difference between the raw answers and “cleaned” answers for the pressing questions topic.
pq_tokens <- pq_dat %>%
  unnest_tokens(output = word, input = answers) %>%
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[:alpha:]")) %>%
  distinct() %>%
  count(word, sort = FALSE) %>%
  mutate(length = nchar(word))

head(pq_tokens)
# A tibble: 6 × 3
word n length
<chr> <int> <int>
1 2017a 1 5
2 21st 1 4
3 ability 3 7
4 abney 1 5
5 absence 1 7
6 acccurately 1 11
Dealing with Plurals and singulars
From Stack Overflow:
“The best way to do this is to use a tool to tag your plural nouns, and then to convert these to singular. Unlike the stemmer solution, this will not stem words such as stemming to stem, or quickly to quick, etc. I recommend using the spacyr package for this, which integrates nicely with quanteda.”
Yet, there are a few steps required for this to work, as detailed in Running Python Chunks in RStudio and rmarkdown. After that, you could also watch this YouTube tutorial.
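A minimal sketch of the spacyr route, assuming spaCy and an English model are already installed (a one-time `spacyr::spacy_install()`):

```r
library(spacyr)
spacy_initialize(model = "en_core_web_sm")

# Lemmatization maps plurals to singular (e.g., "watersheds" -> "watershed")
# without over-stemming the way a rule-based stemmer would
parsed <- spacy_parse(pq_dat$answers, lemma = TRUE, pos = TRUE)
head(parsed[, c("token", "lemma", "pos")])

spacy_finalize()
```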
Raw Data
gls <- as.data.frame(pq_dat %>%
  unnest_tokens(output = word, input = answers) %>%
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[:alpha:]")) %>%
  distinct() %>%
  count(word, sort = TRUE))

head(gls)
word n
1 fire 44
2 wildfire 29
3 water 21
4 cycling 20
5 impacts 20
6 biogeochemical 18
“Clean” Data
We used the function `anti_join` to remove stop words. We also want to remove numbers (like publication years). To do so, we use the combination of `filter` and `str_detect` with the character class `"[:alpha:]"`, which keeps only tokens containing at least one alphabetic character. Finally, to make sure we are not double counting words, we use the function `distinct`.
Glossary
Besides identifying the most frequently used words in the answers, we could use our dataset to create a consistent glossary.
 | Word |
---|---|
1 | fire |
2 | wildfire |
3 | water |
4 | cycling |
5 | impacts |
6 | biogeochemical |
1073 | world |
1074 | wu |
1075 | wuis |
1076 | xanes |
1077 | xas |
1078 | yosemite |
To do so, we could use a frequency filter, keeping only words above a frequency threshold (e.g., n > 2).
gls_fw <- filter(gls, n > 2)
gls_fwp <- rbind(head(gls_fw), tail(gls_fw))
colnames(gls_fwp) <- c("Word", "Frequency")

knitr::kable(gls_fwp, format = "html") %>%
  html_table_width(c(200, 200))
 | Word |
---|---|
1 | fire |
2 | wildfire |
3 | water |
4 | cycling |
5 | impacts |
6 | biogeochemical |
201 | supply |
202 | traditional |
203 | transported |
204 | types |
205 | variation |
206 | vary |
We still have a long list of words that are not necessarily associated with wildfires. We can look into different associations to get collections of keywords. But before that, we will start with term frequencies.
Term Frequencies
Words with low frequency make up a large share of the text.
We observe deviations from Zipf’s law both for the higher-rank (rare) words and the lower-rank (more common) words: there are more words than expected at higher ranks, and fewer words than expected at lower ranks. This is perhaps the effect of abundant technical terms in the corpus analyzed here.
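For reference, a sketch of the unigram rank–frequency plot behind this observation, assuming `gls` from above (`gls` is already sorted by count, so the row number gives the rank):

```r
# Log-log plot of rank vs. term frequency; Zipf's law predicts a straight line
gls %>%
  mutate(rank = row_number(), t_freq = n / sum(n)) %>%
  ggplot(aes(rank, t_freq)) +
  geom_line(linewidth = 1.1, alpha = 0.8) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Rank (log scale)", y = "Term frequency (log scale)")
```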
Analyzing word frequencies using Zipf’s law seems to have some drawbacks; see: Zipf’s law holds for phrases, not words. However, the `unnest_tokens` function can also be applied to sentences.
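A sketch of that sentence-level tokenization, assuming `pq_dat` from above; the output is previewed in the table below:

```r
# One row per sentence instead of one row per word
pq_sentences <- pq_dat %>%
  unnest_tokens(output = sentences, input = answers, token = "sentences")

head(pq_sentences)
```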
ID | question | sentences |
---|---|---|
IA0001 | pressing-q | better understanding of the temporal scales of the different impacts as well as their potential interactions movement of materials ash, pyrogenic carbon, eroded soil between compartments from hillslopes to the hydrological network as well as physical and biological alterations of those during the transport |
IA0002 | pressing-q | how are flow pathways affected by wildfires change of hydrological processes beyond vegetation removal? |
IA0002 | pressing-q | how long do changes in hydrology and biogeochemistry last? |
IA0002 | pressing-q | including importance of forest disturbance history |
IA0003 | pressing-q | how important are deeper groundwater flowpaths for prolonging elevated nutrient yields in burned watersheds? |
IA0003 | pressing-q | the vast majority of post-wildfire studies on biogeochemical cycling have focused on surface or shallow flowpaths, yet post-fire geochemical tracers indicate elevated contributions from deeper bedrock flowpaths during some hydroclimatic conditions murphy 2018. |
The `bind_tf_idf` function
A function from the `tidytext` package that aims to identify the words most characteristic of a document by decreasing the weight of very common terms and increasing the weight of less common ones (i.e., a meet-in-the-middle scenario).
“The logic of tf-idf is that the words containing the greatest information about a particular document are the words that appear many times in that document, but in relatively few others.”
It is meant for comparing documents. Here we could use it across questions, abstracts, or even journal articles.
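A hedged sketch of how `bind_tf_idf` could be applied here, treating each participant’s answer (grouped by `ID`) as a document; that grouping choice is an assumption:

```r
# tf-idf highlights words frequent in one participant's answer but rare elsewhere
pq_tfidf <- pq_dat %>%
  unnest_tokens(word, answers) %>%
  count(ID, word) %>%
  bind_tf_idf(term = word, document = ID, n = n) %>%
  arrange(desc(tf_idf))

head(pq_tfidf)
```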
Text mining with multiple PDFs
Correlation analysis
Relationships between words: n-grams
pq_digrams <- pq_dat %>%
  filter(str_detect(answers, "[:alpha:]")) %>%
  unnest_tokens(bigram, answers, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE) %>%
  mutate(rank = row_number(),
         total = sum(n),
         t_freq = n / total)

head(pq_digrams)
# A tibble: 6 × 6
word1 word2 n rank total t_freq
<chr> <chr> <int> <int> <int> <dbl>
1 post fire 25 1 1444 0.0173
2 biogeochemical cycling 18 2 1444 0.0125
3 debris flows 12 3 1444 0.00831
4 organic matter 10 4 1444 0.00693
5 water quality 9 5 1444 0.00623
6 climate change 8 6 1444 0.00554
# Distribution of frequency values
pq_digrams %>%
  filter(rank < 26) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  ggplot(aes(t_freq, fct_reorder(bigram, t_freq), fill = t_freq)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Frequency", y = NULL)
# Zipf's law for survey answers
pq_digrams %>%
  ggplot(aes(rank, t_freq)) +
  # linewidth replaces the deprecated `size` aesthetic (ggplot2 >= 3.4.0)
  geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) +
  geom_abline(intercept = -0.62, slope = -1.1,
              color = "gray50", linetype = 2) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Rarity") +
  ylab("Frequency")
Visualizing a network of bigrams with ggraph
library(igraph)
library(graphlayouts)
library(qgraph)

bigram_graph <- pq_digrams %>%
  filter(rank < 101) %>%
  graph_from_data_frame()

bigram_graph
IGRAPH 65dcd75 DN-- 123 100 --
+ attr: name (v/c), n (e/n), rank (e/n), total (e/n), t_freq (e/n)
+ edges from 65dcd75 (vertex names):
[1] post ->fire biogeochemical->cycling
[3] debris ->flows organic ->matter
[5] water ->quality climate ->change
[7] pyrogenic ->carbon fire ->regimes
[9] wildfire ->impacts pyrogenic ->organic
[11] aquatic ->ecosystems black ->carbon
[13] fire ->severity microbial ->communities
[15] carbon ->storage fire ->impacts
+ ... omitted several edges
set.seed(2017)
# a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
#
# ggraph(bigram_graph, layout = "fr") +
# geom_edge_link(show.legend = FALSE,
# arrow = a, end_cap = circle(.07, 'inches')) +
# geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
# arrow = a, end_cap = circle(.035, 'inches')) +
# geom_node_point(color = "blue", size = 3) +
# geom_node_text(aes(label = name), vjust = 1, hjust = 1)+
# theme_void()
# V(bigram_graph)$size <- V(bigram_graph)$t_freq*10
l <- layout_with_fr(bigram_graph)                # Fruchterman-Reingold layout (igraph)
e <- get.edgelist(bigram_graph, names = FALSE)
m <- qgraph.layout.fruchtermanreingold(e, vcount = vcount(bigram_graph))
deg <- degree(bigram_graph, mode = "all")        # node degree, used below for bubble sizes
fsize <- degree(bigram_graph, mode = "all")      # and for label sizes
#png(filename=paste("assets/NetworkAnalysis_words_",Sys.Date(),".png", sep = ""), res = 100)
plot(bigram_graph,layout=m, edge.arrow.size =.05,vertex.color = "pink", vertex.size =500,vertex.frame.color="deeppink",vertex.label.color="black", vertex.label.cex=fsize/5,vertex.label.dist=0.8,edge.curve = 0.75,edge.color="skyblue",edge.label.family="Arial", rescale=F, axes = FALSE, ylim = c(-50,90), xlim = c(-55,120), asp =0)
#dev.off()
#png(filename=paste("assets/NetworkAnalysis_bubbles_",Sys.Date(),".png", sep = ""), res = 100)
plot(bigram_graph,layout=m, edge.arrow.size =.05,vertex.color = "pink", vertex.size =deg*150,vertex.frame.color="deeppink",vertex.label.color="black", vertex.label.cex=0.55,vertex.label.dist=0.8,edge.curve = 0.75,edge.color="skyblue",edge.label.family="Arial", rescale=F, axes = FALSE, ylim = c(-50,90), xlim = c(-55,120), asp =0)
#dev.off()
Editing network graphs: GitHub-Issue