Stopwords removal in scala


#1

I need to update the code below to remove stop words. Can anyone please help?

val rdd = sc.textFile("hdfs://localhost:9000/user/Stripped.txt")
rdd.map {
  _.split('\n').map { substrings =>
    substrings.trim.split(' ').
      map { _.replaceAll("""\W""", "").toLowerCase() }.
      sliding(2)
  }.
  flatMap { identity }.map { _.mkString(" ") }.
  groupBy { identity }.mapValues { _.size }
}.
flatMap { identity }.reduceByKey(_ + _).sortBy(_._1).
saveAsTextFile("hdfs://localhost:9000/user/spark/output/Test")


Words to be stripped:
val stop_words = List("the", "a", "http", "i", "me", "to", "what", "in", "rt").toSet
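One way to drop those stop words in the pipeline above is to filter the tokens before counting. Below is a minimal plain-Scala sketch of just the filtering step (no Spark; names like `removeStopWords` are illustrative, not from the original code):

```scala
// Stop words to strip, matched case-insensitively (illustrative sketch).
val stopWords = Set("the", "a", "http", "i", "me", "to", "what", "in", "rt")

// Drop any token whose lower-cased form is a stop word.
def removeStopWords(tokens: Seq[String]): Seq[String] =
  tokens.filterNot(t => stopWords.contains(t.toLowerCase))

val tokens = "What the Holmes said to me".split("""\s+""").toSeq
val kept   = removeStopWords(tokens)
// kept: Seq("Holmes", "said")
```

In the RDD version this would become a `filter` step applied right after the tokens are normalised, before `sliding(2)` builds the word pairs.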


#2

Sounds like a job for unit testing. I don’t think you are going to be able
to crowd-source a solution.

Brian Maso


#3

There is a StopWordsRemover in Spark. You'll have to convert your RDD into a Dataset, though.

See https://spark.apache.org/docs/2.2.0/ml-features.html#stopwordsremover

Cheers,
Dinko


#4

Hi Dinko,
Thanks. I took the sample from the above URL and changed my code as below, but I'm getting an error saying the file does not exist, even though it does. Any hints for me, please?

scala> object StopWordsRemoverExample {
| def main(args: Array[String]): Unit = {
| //val s = Source.fromFile("/home/hadoop/Desktop/CompleteSherlockHolmesStripped.txt").mkString
| //val outputFile = new File("/home/hadoop/Desktop/CompleteSherlockHolmesStopremoved.txt")
| //val writer = new BufferedWriter(new FileWriter(outputFile))
| val remover = new StopWordsRemover()
| .setInputCol("/home/hadoop/Desktop/CompleteSherlockHolmesStripped.txt")
| .setOutputCol("home/hadoop/Desktop/CompleteSherlockHolmesStopremoved.txt")
| val dataSet = spark.createDataFrame(Seq(
| (0, Seq("the","a","http","i","me","to","what","in","rt"))
| )).toDF("id", "raw")
| remover.transform(dataSet).show(false)
| spark.stop()
| }
| }
defined object StopWordsRemoverExample

scala> StopWordsRemoverExample.main(Array())
java.lang.IllegalArgumentException: Field "/home/hadoop/Desktop/CompleteSherlockHolmesStripped.txt" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)


#5

The methods setInputCol and setOutputCol expect column names of your dataset, not paths to files. The code would look approximately like this (adjust it to your needs):

import org.apache.spark.ml.feature.StopWordsRemover

// spark is instance of org.apache.spark.sql.SparkSession

val dataSet = spark
  .read.text("/path/to/SherlockHolmsFile.txt")
  .map(row => row.getString(0).split("""\s+""")) // transform String into Array[String]
  .toDF("words")
// dataSet has one column named "words", see by running dataSet.printSchema()

val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("removed")
  .setStopWords(Array("the","a","http","i","me","to","what","in","rt"))

val newDataSet = remover.transform(dataSet)
// newDataSet now has two columns, "words" and "removed"

newDataSet
  .select("removed") // if you're interested only in "clean" text
  .map(row => row.getSeq[String](0).mkString(" ")) // make Array[String] into String
  .write.text("/path/to/SherlockHolmsWithoutStopWords.txt")

If you run this code from spark-shell you’ll already have the spark instance provided, otherwise you’ll have to create it by yourself.
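For reference, a minimal sketch of creating the session yourself might look like this (the appName and local master are arbitrary choices for this example):

```scala
import org.apache.spark.sql.SparkSession

// Minimal local session; adjust appName/master to your setup.
val spark = SparkSession.builder()
  .appName("StopWordsExample")
  .master("local[*]")
  .getOrCreate()

// Needed for .map/.toDF on Datasets outside spark-shell,
// where these implicits are not pre-imported.
import spark.implicits._
```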

Cheers,
Dinko


#6

Thanks for your time and your explanation. I tried running the above in the spark-shell and I'm getting the error below:

scala> .read.text("/user/CompleteSherlockHolmesStripped.txt")
:1: error: illegal start of definition
.read.text("/user/CompleteSherlockHolmesStripped.txt")

scala> .toDF("words")
:1: error: illegal start of definition
.toDF("words")


#7

It should be:

scala> val dataSet = spark.read.text("...").map(...).toDF("...")

Notice that the code I wrote before was not typed into spark-shell. You can still paste it there (just change the paths first) by typing :paste and then pasting the code:

scala> :paste
// Entering paste mode (ctrl-D to finish)

cheers,
Dinko


#8

Thanks for your guidance… I am able to compile and see the output file generated inside the directory.
Next I searched the file with the exact-word-matching command
grep -w "in" part-r-00000-71e6bd53-99c3-40d3-a4fc-73e8a993be41.txt
but I can still see the stop words in that output file. Is what I am doing correct?


#9

Are you sure you selected the proper column? In my code I set the name of the output column to removed, so I first selected only that column and then wrote the content to disk:

scala> newDataSet.select("removed").map(/* converts array into string */).write.text("...")

You can compare the schemas of the original dataset and the new one and see that the new one has one extra column - the one without stopwords:

dataSet.printSchema()
newDataSet.printSchema()

cheers,
Dinko


#10

Below is the script I used.

import org.apache.spark.ml.feature.StopWordsRemover
val dataSet = spark.read.text("/user/CompleteSherlockHolmesStripped.txt").map(row => row.getString(0).split("""\s+""")).toDF("words")
val remover = new StopWordsRemover()
.setInputCol("words")
.setOutputCol("removed")
.setStopWords(Array("the","a","http","i","me","to","what","in","rt"))
val newDataSet = remover.transform(dataSet)
newDataSet
.select("removed").map(row => row.getSeq[String](0).mkString(" ")).write.text("/user/CompleteSherlockHolmesWithoutStopWords.txt")

Used below to verify
dataSet.printSchema()
newDataSet.printSchema()

Output:
root
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)

root
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)
|-- removed: array (nullable = true)
| |-- element: string (containsNull = true)

I used the command grep -w "the" and could see the word "the" in multiple lines, even though "the" is in the stop-word list.

Can you please advise?


#11

All I can say is: try a smaller example. Write yourself a few lines of text, run the code on that, and see whether the stop words get removed, to confirm that the code works.

In the code, lines of text are split on one or more whitespace characters into an array of words (that's what split("""\s+""") does). Perhaps that was not enough for some of those "the"s? What does the original text look like around some of the not-removed stop words?

For stop words to be removed, they must be singled out as separate entries in the array. I used \s+ simply because it was the easiest way to test the code on my toy example.
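To make that concrete, here is a small plain-Scala illustration (no Spark) of how a stop word glued to punctuation survives a \s+ split; the sample sentence is invented:

```scala
val stopWords = Set("the")

// split("""\s+""") only cuts on whitespace, so punctuation stays
// attached to its token.
val tokens = "over the hill, into the.woods".split("""\s+""").toSeq
// tokens: Seq("over", "the", "hill,", "into", "the.woods")

val kept = tokens.filterNot(stopWords.contains)
// "the.woods" survives the filter: it is not exactly equal to "the".
```

Yet grep -w "the" still matches "the.woods", because "." counts as a word boundary, which would explain seeing "the" in the output file.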

cheers,
Dinko