Hi Dinko,
Thanks. I took the sample from the above URL and changed mine as below, but I'm getting a "file not found" error even though the file exists. Any hints for me, please?
scala> object StopWordsRemoverExample {
| def main(args: Array[String]): Unit = {
| //val s = Source.fromFile("/home/hadoop/Desktop/CompleteSherlockHolmesStripped.txt").mkString
| //val outputFile = new File("/home/hadoop/Desktop/CompleteSherlockHolmesStopremoved.txt")
| //val writer = new BufferedWriter(new FileWriter(outputFile))
| val remover = new StopWordsRemover()
| .setInputCol("/home/hadoop/Desktop/CompleteSherlockHolmesStripped.txt")
| .setOutputCol("home/hadoop/Desktop/CompleteSherlockHolmesStopremoved.txt")
| val dataSet = spark.createDataFrame(Seq(
| (0, Seq("the","a","http","i","me","to","what","in","rt"))
| )).toDF("id", "raw")
| remover.transform(dataSet).show(false)
| spark.stop()
| }
| }
defined object StopWordsRemoverExample
scala> StopWordsRemoverExample.main(Array())
java.lang.IllegalArgumentException: Field "/home/hadoop/Desktop/CompleteSherlockHolmesStripped.txt" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
The methods setInputCol and setOutputCol expect column names from your dataset, not paths to files. The code would look approximately like this (adjust it to your needs):
import org.apache.spark.ml.feature.StopWordsRemover
// spark is instance of org.apache.spark.sql.SparkSession
val dataSet = spark
.read.text("/path/to/SherlockHolmsFile.txt")
.map(row => row.getString(0).split("""\s+""")) // transform String into Array[String]
.toDF("words")
// dataSet has one column named "words", see by running dataSet.printSchema()
val remover = new StopWordsRemover()
.setInputCol("words")
.setOutputCol("removed")
.setStopWords(Array("the","a","http","i","me","to","what","in","rt"))
val newDataSet = remover.transform(dataSet)
// newDataSet now has two columns, "words" and "removed"
newDataSet
.select("removed") // if you're interested only in "clean" text
.map(row => row.getSeq[String](0).mkString(" ")) // make Array[String] into String
.write.text("/path/to/SherlockHolmsWithoutStopWords.txt")
If you run this code from spark-shell you'll already have the spark instance provided; otherwise you'll have to create it yourself.
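For reference, a standalone program would build its own session with the SparkSession builder, along these lines (a minimal sketch; the app name and master setting are placeholders to adjust for your environment):

```scala
import org.apache.spark.sql.SparkSession

// spark-shell creates `spark` for you; a standalone program builds its own.
val spark = SparkSession.builder()
  .appName("StopWordsRemoverExample")
  .master("local[*]") // run locally on all cores; omit when using spark-submit
  .getOrCreate()

// Needed for .map on Datasets and for .toDF
import spark.implicits._
```

With that in place, the rest of the code above runs unchanged.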
scala> val dataSet = spark.read.text("...").map(...).toDF("...")
Notice that the code I wrote before was not typed in spark-shell. You can paste it there, though (just change the paths first), by typing :paste and then pasting the code:
scala> :paste
// Entering paste mode (ctrl-D to finish)
Thanks for your guidance… I am able to compile and can see the output file generated inside the directory.
Next I searched the file with the exact-word-matching command
grep -w "in" part-r-00000-71e6bd53-99c3-40d3-a4fc-73e8a993be41.txt
but I can still see the stop words in that output file. Is what I am doing correct?
Are you sure you selected the proper column? In my code I set the name of the output column to removed, so I first selected only that column and then wrote the content to disk:
scala> newDataSet.select("removed").map(/* converts array into string */).write.text("...")
You can compare the schemas of the original dataset and the new one and see that the new one has one extra column - the one without stop words:
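Given the column names used above, the comparison would look roughly like this (a sketch; the exact printSchema output may differ slightly between Spark versions):

```scala
dataSet.printSchema()
// root
//  |-- words: array (nullable = true)
//  |    |-- element: string (containsNull = true)

newDataSet.printSchema()
// root
//  |-- words: array (nullable = true)
//  |    |-- element: string (containsNull = true)
//  |-- removed: array (nullable = true)
//  |    |-- element: string (containsNull = true)
```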
Below is the script I used.
import org.apache.spark.ml.feature.StopWordsRemover
val dataSet = spark.read.text("/user/CompleteSherlockHolmesStripped.txt").map(row => row.getString(0).split("""\s+""")).toDF("words")
val remover = new StopWordsRemover()
.setInputCol("words")
.setOutputCol("removed")
.setStopWords(Array("the","a","http","i","me","to","what","in","rt"))
val newDataSet = remover.transform(dataSet)
newDataSet
.select("removed").map(row => row.getSeq[String](0).mkString(" ")).write.text("/user/CompleteSherlockHolmesWithoutStopWords.txt")
I used the following to verify:
dataSet.printSchema()
newDataSet.printSchema()
All I can say is: try with a smaller example. Write yourself a few lines of text, run the code on that, and see whether the stop words get removed, to make sure the code works.
In the code, lines of text are split over one or more whitespace characters into an array of words (that's what split("""\s+""") does). Perhaps that was not enough for some of those "the"s? What does the original text look like around some of the stop words that were not removed?
For stop words to be removed, they should be singled out as words in the array. I just used \s+ because it was the easiest way for me to test the code on my toy example.
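To see the difference concretely, here is a small standalone snippet (plain Scala, no Spark needed) comparing the whitespace split with a split on runs of non-word characters; \W+ is just one illustrative alternative tokenization:

```scala
object SplitDemo extends App {
  val line = "To be, or not to be."

  // Splitting on whitespace keeps punctuation glued to words:
  // "be," and "be." would NOT match a stop word "be".
  println(line.split("""\s+""").mkString("|"))
  // prints: To|be,|or|not|to|be.

  // Splitting on runs of non-word characters strips the punctuation,
  // so every token is a bare word the remover can match.
  println(line.split("""\W+""").mkString("|"))
  // prints: To|be|or|not|to|be
}
```

So if the not-removed stop words in your text carry punctuation, a tokenization that strips it (or Spark ML's RegexTokenizer) should fix the grep results.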