Counting vowels in a word #2

deepayan · 2024-09-02T11:06:45Z

deepayan
Sep 2, 2024
Maintainer

This problem is motivated by a question in your Probability homework. Consider the random experiment of choosing a word uniformly at random from a collection of words, and obtaining the distribution of (a) the length (number of characters) and (b) the number of vowels in the chosen word.
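To make the experiment concrete, here is a minimal sketch on a tiny made-up collection (the five words below are an assumption for illustration, not survey data). Choosing a word uniformly at random gives each word probability 1/5, so the distribution of the length is the table of lengths divided by the number of words.

```r
# Hypothetical toy collection of words (not from the survey)
words <- c("i", "like", "love", "and", "new")

# Length of each word
len <- nchar(words)

# Proportion of words having each length, i.e. P(length = k)
len_dist <- prop.table(table(len))
len_dist
```

The same idea carries over to the number of vowels once there is a way to count them per word.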

Here, we will consider a larger collection of words, where manual inspection is not practical. Specifically, we will use the responses about hobbies in the survey data collected from the class. The responses can be obtained as a vector of character strings as follows.

survey <- read.csv("https://deepayan.github.io/BSDS/2024-01-DE/data/bsds-survey.csv")
hobbies <- survey$hobbies
s <- head(hobbies)
s

[1] "Mathematical Programming"                                                              
[2] "I like cricket "                                                                       
[3] "Learning new things "                                                                  
[4] "Playing outdoor sports"                                                                
[5] "I love playing football , also loves swimming , and also likes to cook delicious food."
[6] "Cricket and football"

To answer (a) and (b), we would need to do several things first:

remove unnecessary characters like punctuation marks,
turn all characters into lowercase, to make counting vowels simpler,
split sentences into words, and
remove surrounding spaces, if any.

Once we do this, we will also need some way to count the number of characters in a word.
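As a small illustration of why the cleaning steps matter, nchar() counts every character in a string, including spaces and punctuation. The strings below are chosen by hand for this sketch.

```r
# Hand-picked examples: a clean word, a word with a trailing period,
# and a whole response with a trailing space
x <- c("cricket", "food.", "I like cricket ")

# nchar() is vectorized: it returns one count per element
n <- nchar(x)
n
```

Here "food." is counted as five characters, and the full response as fifteen, so punctuation and spaces need to be dealt with before counting.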

R has several functions for string manipulations. Here are some of them (for details, see their documentation, e.g., ?nchar for the nchar() function).

`gsub()`

Replaces characters or patterns by a replacement string.

> gsub(",", "", s)
[1] "Mathematical Programming"                                                            
[2] "I like cricket "                                                                     
[3] "Learning new things "                                                                
[4] "Playing outdoor sports"                                                              
[5] "I love playing football  also loves swimming  and also likes to cook delicious food."
[6] "Cricket and football"
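The pattern can also be a regular expression. As a sketch (the input strings are made up for illustration), the character class [[:punct:]] matches any punctuation character, so commas and periods can be removed in one call.

```r
# Hypothetical input with a comma and a trailing period
x <- c("I love playing football , also loves swimming", "delicious food.")

# Remove every punctuation character
cleaned <- gsub("[[:punct:]]", "", x)
cleaned
```

Note that removing the comma leaves two consecutive spaces behind, which matters when splitting into words later.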

`tolower()`

Replaces characters by their lowercase versions.

> tolower(s)
[1] "mathematical programming"                                                              
[2] "i like cricket "                                                                       
[3] "learning new things "                                                                  
[4] "playing outdoor sports"                                                                
[5] "i love playing football , also loves swimming , and also likes to cook delicious food."
[6] "cricket and football"

`trimws()`

Removes white space at the beginning or end of a string.

> trimws(s)
[1] "Mathematical Programming"                                                              
[2] "I like cricket"                                                                        
[3] "Learning new things"                                                                   
[4] "Playing outdoor sports"                                                                
[5] "I love playing football , also loves swimming , and also likes to cook delicious food."
[6] "Cricket and football"

`paste()`

Combines multiple strings together.

> paste(1:6, s, sep = ". ")
[1] "1. Mathematical Programming"                                                              
[2] "2. I like cricket "                                                                       
[3] "3. Learning new things "                                                                  
[4] "4. Playing outdoor sports"                                                                
[5] "5. I love playing football , also loves swimming , and also likes to cook delicious food."
[6] "6. Cricket and football"                                                                  
> paste(s, collapse = "; ")
[1] "Mathematical Programming; I like cricket ; Learning new things ; Playing outdoor sports; I love playing football , also loves swimming , and also likes to cook delicious food.; Cricket and football"

`strsplit()`

Splits strings into parts by breaking at a pattern. This is a very useful function, but also a little complicated to use. A simple example is to split sentences on the space character as follows.

> strsplit(s, " ")
[[1]]
[1] "Mathematical" "Programming" 

[[2]]
[1] "I"       "like"    "cricket"

[[3]]
[1] "Learning" "new"      "things"  

[[4]]
[1] "Playing" "outdoor" "sports" 

[[5]]
 [1] "I"         "love"      "playing"   "football"  ","         "also"     
 [7] "loves"     "swimming"  ","         "and"       "also"      "likes"    
[13] "to"        "cook"      "delicious" "food."    

[[6]]
[1] "Cricket"  "and"      "football"

The result is a list, which is a structure we have not yet learned about. A list is also a vector, but its elements can be arbitrary objects. In this case, each list element is a character vector (possibly of different lengths). We will learn about lists in more detail later, but in this case, a function called unlist() is enough to turn this list of character vectors back into a single, longer character vector of all words.

> strsplit(s, " ") |> unlist()
 [1] "Mathematical" "Programming"  "I"            "like"         "cricket"     
 [6] "Learning"     "new"          "things"       "Playing"      "outdoor"     
[11] "sports"       "I"            "love"         "playing"      "football"    
[16] ","            "also"         "loves"        "swimming"     ","           
[21] "and"          "also"         "likes"        "to"           "cook"        
[26] "delicious"    "food."        "Cricket"      "and"          "football"
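One thing to watch for: splitting on a single space produces empty strings wherever two spaces occur in a row. A possible workaround, sketched here on made-up input, is to split on the regular expression " +" (one or more spaces). The function lengths() then gives the number of pieces in each list element.

```r
# Hypothetical input; the first string has a double space
x <- c("football  also loves", "cook delicious food")

# Splitting on a single space leaves an empty string in the result
single <- strsplit(x[1], " ")[[1]]
single

# Splitting on one or more spaces avoids this
parts <- strsplit(x, " +")
words <- unlist(parts)
words

# Number of words in each original string
lengths(parts)
```

Leading spaces would still produce an empty first element, so applying trimws() before splitting remains useful.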

Explore these functions, and any related functions you can find through the help system, and try to solve (a) and (b).

Feel free to seek any clarifications, or to share your ideas and solutions.

Anant-Agarwal-26 · 2024-09-05T12:23:09Z

Anant-Agarwal-26
Sep 5, 2024

survey <- read.csv("https://deepayan.github.io/BSDS/2024-01-DE/data/bsds-survey.csv")
hobbies <- survey$hobbies
s <- head(hobbies)
s <- tolower(s)
s <- trimws(s)
paste(1:6, s, sep = ". ")
paste(s, collapse = "; ")
strsplit(s, " ")
s <- strsplit(s, " ") |> unlist()
v <- 0
c <- 0
for (g in s) {
  for (i in strsplit(g, NULL)[[1]]) {
    if (i %in% strsplit("bcdfghjklmnpqrstvwxyz", NULL)[[1]]) {
      c <- c + 1
    } else if (i %in% strsplit("aeiou", NULL)[[1]]) {
      v <- v + 1
    }
  }
}
cat("no. of vowels are", v, "\n")
cat("no. of consonants are", c, "\n")
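The same totals can also be obtained without explicit loops. The following is only a sketch of one alternative, using a hypothetical short word vector in place of the survey words: split everything into single characters once, then count matches with %in%.

```r
# Hypothetical cleaned, lowercase words (stand-in for the survey words)
words <- c("cricket", "and", "football")

# All individual characters across all words
chars <- unlist(strsplit(words, ""))

vowels <- c("a", "e", "i", "o", "u")
consonants <- setdiff(letters, vowels)  # `letters` is built into R

# Summing a logical vector counts the TRUE values
n_vowels <- sum(chars %in% vowels)
n_consonants <- sum(chars %in% consonants)

cat("no. of vowels are", n_vowels, "\n")
cat("no. of consonants are", n_consonants, "\n")
```

Characters that are neither vowels nor consonants (digits, punctuation) are ignored by both counts, as in the loop version.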


Jazzifyyy · 2024-09-05T16:07:48Z

Jazzifyyy
Sep 5, 2024

Not sure if I understand the question correctly. Just fetching a random hobby from the vector and doing the calculations.

survey <- read.csv("https://deepayan.github.io/BSDS/2024-01-DE/data/bsds-survey.csv")
hobbies <- survey$hobbies
s <- head(hobbies)

random_hobby <- sample(s, 1)
print(random_hobby)

String_counter <- function(n) {
  str <- tolower(trimws(gsub("[[:punct:]]", "", n)))
  numberchar <- nchar(str)
  str <- gsub("[aeiou]", "", str)
  numberconsonants <- nchar(str)
  numbervowels <- numberchar - numberconsonants
  cat("Number of characters: ", numberchar, "\n")
  cat("Number of vowels: ", numbervowels, "\n")
}

String_counter(random_hobby)


TANISH-GHOSH · 2024-09-05T20:05:17Z

TANISH-GHOSH
Sep 5, 2024

# ***This gives the distribution over all words instead of a randomly chosen word.***
 
survey = read.csv("https://deepayan.github.io/BSDS/2024-01-DE/data/bsds-survey.csv")
hobbies = head(survey$hobbies)
c1 = gsub("[.,]", "", hobbies)  # uses [] to collectively remove .,
c2 = trimws(c1) |> tolower()
c3 = strsplit(c2, " ") |> unlist()
c4 = c3[c3 != ""]    # uses logical indexing
c4 # list of words

word_len = nchar(c4)   # no of characters assigned to each vec element
table(word_len) |> plot(ylab = "frequency") 

c5 = gsub("[bcdfghjklmnpqrstvwxyz]", "", c4)
vowel_len = nchar(c5)
table(vowel_len) |> plot(ylab = "frequency")

2 replies

deepayan Sep 8, 2024
Maintainer Author

Good work everyone.

Can we try to modify this to produce a single data frame with relevant information? This data frame could contain, say, four columns giving

word --- every word that appears in the text
frequency --- the number of times the word appears
nchar --- the number of characters
nvowels --- the number of vowels

TANISH-GHOSH Sep 15, 2024

My attempt

#> c4
#[1] "mathematical" "programming"  "i"            "like"        
#[5] "cricket"      "learning"     "new"          "things"      
#[9] "playing"      "outdoor"      "sports"       "i"           
#[13] "love"         "playing"      "football"     "also"        
#[17] "loves"        "swimming"     "and"          "also"        
#[21] "likes"        "to"           "cook"         "delicious"   
#[25] "food"         "cricket"      "and"          "football" 

df <- data.frame(word = unique(c4), nchar = nchar(unique(c4)))

c6 <- c()      # initial null vector
for (i in df$word){     # df$word is a chr vec
  v <- gsub("[^aeiou]", "",i) 
  n_v <- nchar(v)              
  c6 <- c(c6, n_v)             # append
}
df$nvowels <- c6               # add new column in df

c7 <- c()                      # ini null vec
for (i in df$word){
  freq <- sum(i == c4)         # i == c4 gives logical vec for each i and sum gives #(TRUE) for that i
  c7 <- c(c7, freq)
}
df$frequency <- c7
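For comparison, here is a loop-free sketch of the same data frame. It is not the only way to do it; the word vector below is a hypothetical stand-in for c4, and it relies on table() to count occurrences and on nchar() and gsub() being vectorized.

```r
# Hypothetical cleaned word vector standing in for c4
words <- c("i", "like", "cricket", "i", "love", "football",
           "cricket", "and", "football")

# table() counts occurrences of each distinct word (sorted by word)
freq <- table(words)

df <- data.frame(
  word = names(freq),
  frequency = as.vector(freq),
  stringsAsFactors = FALSE
)

# nchar() and gsub() work on whole vectors, so no loop is needed
df$nchar <- nchar(df$word)
df$nvowels <- nchar(gsub("[^aeiou]", "", df$word))
df
```

Because table() sorts the distinct words, the rows come out in alphabetical order rather than in order of first appearance as with unique().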

avasterinbloom · 2024-09-08T18:36:35Z

avasterinbloom
Sep 8, 2024

My answer for part (a) is

survey <- read.csv("https://deepayan.github.io/BSDS/2024-01-DE/data/bsds-survey.csv")
hobbies <- survey$hobbies
s <- head(hobbies)

t1 <- gsub(",", "", s) |> tolower() |> trimws()
t2 <- paste(1:6, t1, sep = ". ")
t3 <- paste(t2, collapse = "; ")
s <- strsplit(t3, " ") |> unlist()

countchars <- function(input_string) {
  # `letters` is R's built-in vector of the lowercase letters "a" to "z"
  count <- 0
  for (char in strsplit(input_string, "")[[1]]) {
    if (char %in% letters) {
      count <- count + 1
    }
  }
  return(count)
}

total <- 0
for (i in seq_along(s)) {
  total <- total + countchars(s[i])
}

paste("Number Of Character(s) in the sentence(s) is/are", total)

And my answer for part (b) is

survey <- read.csv("https://deepayan.github.io/BSDS/2024-01-DE/data/bsds-survey.csv")
hobbies <- survey$hobbies
s <- head(hobbies)

t1 <- gsub(",", "", s) |> tolower() |> trimws()
t2 <- paste(1:6, t1, sep = ". ")
t3 <- paste(t2, collapse = "; ")
s <- strsplit(t3, " ") |> unlist()

countVowels <- function(input_string) {
  vowels <- c("a", "e", "i", "o", "u")
  count <- 0
  for (char in strsplit(input_string, "")[[1]]) {
    if (char %in% vowels) {
      count <- count + 1
    }
  }
  return(count)
}

total <- 0
for (i in seq_along(s)) {
  total <- total + countVowels(s[i])
}

paste("Number Of Vowel(s) in the sentence(s) is/are", total)


Ansh-sarraf · 2024-09-08T18:52:03Z

Ansh-sarraf
Sep 8, 2024

I used a function called nchar(), taught in the tutorial class.

library(lattice) # provides barchart()

survey <- read.csv("https://deepayan.github.io/BSDS/2024-01-DE/data/bsds-survey.csv")
hobbies <- survey$hobbies

hobbies <- tolower(hobbies)

hobbies <- gsub("\\.", "", hobbies) # removing punctuation marks
hobbies <- gsub(",", "", hobbies) |> trimws()

hobbies <- strsplit(hobbies, " ") |> unlist() # splitting into individual words

len <- function(h) { # for the distribution of length
  h <- nchar(h) # counting number of characters in each element
  table(h) |> prop.table() |> barchart(horizontal = FALSE)
}

vow <- function(h) { # for the distribution of vowels
  h <- gsub("[^aeiou]", "", h) # keep only the vowels
  h <- nchar(h) # counting number of vowels in each element
  table(h) |> prop.table() |> barchart(horizontal = FALSE)
}

