How to parse and retrieve a specific portion from a text file using Scala

Hi,
I am new to Scala…
I am using Scala Spark DataFrames and want to read a text file and retrieve specific portions based on prefix and suffix delimiters or strings.

I have sample.txt, and it contains:

76ydU First:
NAME=1
CLASS=2
MARK=3
;
7uuy6 SECOND:
NAME=1
CLASS=2
MARK=3
;
12ydU First:
NAME=1
CLASS=2
MARK=3
;

34ydU First:
NAME=1
CLASS=2
MARK=3
;

In the above file, I want to read only the FIRST-named values, i.e. read between "FIRST:" and ";".
If a block is SECOND:, skip it, i.e. do not read from "SECOND:" to ";".

Finally, I want to convert the FIRST values to a CSV file as
NAME,CLASS,MARK.

I am not able to read it properly…
Please help with this,

thanks,
Kumar

Hi Kumar,

unfortunately I don't have a Spark server running, but maybe a Scala-only implementation will give you a hint.

// imports needed for the snippet (java.io and scala.io)
import java.io.{ File, FileInputStream }
import scala.io.BufferedSource

// open the file (assuming test.txt is your filename)
val filein = new File("test.txt")
// open a BufferedSource (scala.io)
val reader = new BufferedSource(new FileInputStream(filein))
reader
  .getLines // read line by line
  .filter(_.trim.nonEmpty) // drop blank lines so the groups stay aligned
  .grouped(5) // always 5 lines in a group
  .map(_.toVector) // map the groups to a vector
  .filter( vec => vec(0).contains("First")) // and filter it
  .foreach( vec => println(vec(1)+","+vec(2)+","+vec(3)) ) // print it out

The output will be:
NAME=1,CLASS=2,MARK=3
NAME=1,CLASS=2,MARK=3
NAME=1,CLASS=2,MARK=3

Hope that helps. :wink:

Have fun
Chris

Hi Chris,
thank you for the reply…
I need to omit the part of the text file which starts with SECOND: and ends with ";".
So, if a line contains SECOND:, then remove everything from SECOND: to ;.

I need only the FIRST: members (all three FIRST: to ; occurrences).
How can I do that in Scala?

Regards,
Kumar

Hi,
how can I convert it to comma-separated values with "=" as the delimiter,
like NAME,1
CLASS,2… so all NAME values come in one column and all CLASS values come in a separate column?

Also, if there is whitespace or a tab inside NAME=1, how can I remove it…
I tried with trim, but it shows an error…
Thank you

Hi Kumar,

I don't know exactly what you want to do. You posted an input file example; can you post a matching output file example?
The next thing:

I tried with trim, but it shows an error…

What kind of error? An exception? Which one?
Can you post the code showing the error?

Greetings
Chris

Hello Chris,
Thank you for the reply.
FIRST:
val1=1
val2=2
val3=3
;

SECOND:
     val1=1
     val2=2
     val3=3
;

FIRST:
    val1=5
    val2=6
    val3=7
;

As a 1st step, I need to find the SECOND blocks via a regex string and remove their contents.
So, finally I want:
FIRST:
val1=1
val2=2
val3=3
;
FIRST:
val1=5
val2=6
val3=7
;
Whitespace before and after the values needs to be removed.
Finally, I need to convert the FIRST values so that all val1 entries go in one column, all val2 in one column, and all val3 in one column.
It can be a comma-separated (CSV) format like below:
FIRST:
val1,1,5
val2,2,6
val3,3,7
When I tried the above method, it separated whole rows like val1=1,val2=2,val3=3.
I want the comma-separated value format shown above.
How can I achieve it in Scala…
If a huge file comes, how can I read only the FIRST values from the text file and populate them?
Once I get the idea, I will try this with the Spark DataFrame method.
Thank you


Hi Kumar,

maybe the following code will work for you. It works with a Scala Stream to reduce the memory footprint when parsing huge files. The downside is that you will get 3 files which you have to concatenate afterwards, but I don't have an idea how to do that easily in one step.

package kumarraj

import java.io.File
import java.io.FileInputStream
import java.io.{ BufferedWriter, FileOutputStream, OutputStreamWriter }
import scala.collection.mutable.HashMap
import scala.io.BufferedSource

object Main extends App {
  val filein = new File("test.txt")
  val reader = new BufferedSource(new FileInputStream(filein))
  val sinkMap = HashMap.empty[String, BufferedWriter]

  reader
    .getLines // read line by line
    .toStream // map to a scala stream
    .map(_.trim) // remove whitespaces
    .filter(_.nonEmpty) // remove empty lines
    .grouped(5) // always 5 lines in a group
    .map(_.toVector) // map the groups to a vector
    // at this point you will have groups starting with FIRST or SECOND
    .filter( vec => vec(0).contains("FIRST")) // and filter it
    .map(vec => vec.tail.filter(_ != ";")) // Skip group name and ';'
    .foreach( vec => { // every group
      vec.foreach( vax => { // every item in a group
        val s = vax.split('=')
        val v = "," + s(1) // prepend ',' as separator char
        getWriter(sinkMap, s(0)) // get a writer
          .write(v, 0, v.length) // and print
      })
    })
  reader.close()
  sinkMap.foreach( t => t._2.close()) // close every writer

  def getWriter(map : HashMap[String, BufferedWriter], key : String) : BufferedWriter = {
    if(!map.contains(key)) { // if not existing
      val s = new BufferedWriter(
        new OutputStreamWriter(
          new FileOutputStream(key+".txt"))) // create a new
      s.write(key, 0, key.length) // and write the key first
      map += ( (key, s) )
    }
    map(key) // return the writer
  }
}
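With your FIRST/SECOND sample above (taking the second FIRST block as val1=5, val2=6, val3=7), this should produce three files: val1.txt containing val1,1,5, val2.txt containing val2,2,6, and val3.txt containing val3,3,7.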

Hello Chris,
thank you, that works perfectly…
I used the Java FileWriter append mechanism (set to true) to combine everything into a single file.
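Roughly like this (out.csv is just an example name):

import java.io.{ BufferedWriter, FileWriter }

// the second FileWriter argument `true` opens the file in append mode,
// so each write adds to the end instead of overwriting
val writer = new BufferedWriter(new FileWriter("out.csv", true))
writer.write("val1,1,5")
writer.newLine()
writer.close()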
I also implemented it with the Spark DataFrame mechanism…
Thank you…

Hi Chris,
I have some problem in the code…
I have a different file in which every line contains the string TEST,
and only the first line has A124 TEST…
It looks as follows:
A124 TEST STATUS UPDATE
TEST = 1
TEST_RESULT = 2
TEST_ID = 3
TEST_CONFIG =
TEST_PARAM1 = 12
TEST_PARAM1 = 20
With the above code, if I put the delimiter as "TEST", it always takes from TEST_PARAM;
the result is:
TEST_PARAM1,12,20
TEST_PARAM2,24,25 (this comes from the repeated TEST string in the file).

How can I take all values starting from A124 TEST STATUS UPDATE?
I need them from TEST = 1 onwards.
When I put the parse string as vec => vec(0).contains("A124"), it produces no output and terminates with Main$1.
Is the given grouped(5) a problem?
How can I solve this…

Thank you…

Hi Kumar,

if you only want to skip the first line, you can use the Stream's tail function. In the sample code above, I would insert it here:

  reader
    .getLines // read line by line
    .toStream // map to a scala stream
    .map(_.trim) // remove whitespaces
    .filter(_.nonEmpty) // remove empty lines
    .tail // skip first line containing "A124 TEST ..."
    .grouped(5) // always 5 lines in a group

If you want to handle the first 5 lines differently, you can use:

val header = reader
  .getLines
  .map(_.trim)
  .filter(_.nonEmpty)
  .take(5) // only take 5 lines
  .toVector // now you have a Vector of 5 lines

reader
  ... // like you already have

Have a nice evening
Chris

Hi Chris,
thank you very much…
The above example works fine with a small file containing simple strings, as I gave previously.
When I use a large log file containing the following lines, it throws an ArrayIndexOutOfBoundsException and writes nothing to the output file.
When I give the parse element based on A0500 (vec(0).contains("A0500")), it writes nothing to the file and never enters the getWriter method; only when I give "LOG" does it enter the write loop.

TIME_STAMP 2017-06-12 MON 12:20:38
   A0500 INFO LOG RESULT
   SENSOR_TYPE                 = A2MN
   SENSOR_INSTANCE              = TEMP_SENSE
   NOTIFY_NUM              = 1
   LOG_TIME                   = 2017-06-12 MON 12:20:38
   SENSOR_NN                    = 192.168.1.01
   AGENT_NO                   = 12
   LOG_INVOCATION_NUM           = -
   LOG_OUT                 = TEST_FAIL
   LOG_ADDITIONAL_INFO  = 
     SENSOR NUM           = 3
     SENSOR state           = disable_state
     LOG_Type   = FAIL_TYPE
     RES             = FAIL
     FAIL_Reason       = SENSOR_FAIL
;
TIME_STAMP 2017-06-12 MON 11:20:38
   B0500 INFO ERROR RESULT
   SENSOR_TYPE                 = A2MN
   SENSOR_INSTANCE              = TEMP_SENSE
   NOTIFY_NUM              = 1
   ERROR_TIME                   = 2017-06-12 MON 11:20:38
   SENSOR_NN                    = 192.168.1.10
   AGENT_NO                   = 12
   ERROR_INVOCATION_NUM           = -
   ERROR_ADDITIONAL_INFO  = 
     SENSOR NUM           = 5
     SENSOR state           = enable_state
     LOG_Type   = ERROR_TYPE
     RES             = FAIL
     Error_Reason       = SENSOR_ERROR
;

TIME_STAMP 2017-06-12 MON 13:12:48
   A0500 INFO LOG RESULT
   SENSOR_TYPE                 = B2MN
   SENSOR_INSTANCE              = TEMP_SENSE
   NOTIFY_NUM              = 0
   LOG_TIME                   = 2017-06-12 MON 13:12:48
   SENSOR_NN                    = 192.168.1.02
   AGENT_NO                   = 15
   LOG_INVOCATION_NUM           = -
   LOG_OUT                 = TEST_FAIL
   LOG_ADDITIONAL_INFO  = 
     SENSOR NUM           = 3
     SENSOR state           = disable_state
     LOG_Type   = FAIL_TYPE
     RES             = FAIL
     FAIL_Reason       = SENSOR_FAIL
;
TIME_STAMP 2017-06-12 MON 13:20:38
   B0500 INFO ERROR RESULT
   SENSOR_TYPE                 = A2MN
   SENSOR_INSTANCE              = TEMP_SENSE
   NOTIFY_NUM              = 1
   ERROR_TIME                   = 2017-06-12 MON 13:20:38
   SENSOR_NN                    = 192.168.2.17
   AGENT_NO                   = 01
   ERROR_INVOCATION_NUM           = -
   ERROR_ADDITIONAL_INFO  = 
     SENSOR NUM           = 9
     SENSOR state           = disable_state
     LOG_Type   = ERROR_TYPE
     RES             = FAIL
     Error_Reason       = SENSOR_ERROR
;
and so on

In the above, I need to remove or omit the ERROR RESULT parts (from "B0500 INFO ERROR RESULT" to ";") wherever they come in the log file.
I need to take only the LOG RESULT parts.
So I try to extract the portion between "A0500 INFO LOG RESULT" and ";" wherever it comes.

When I use tail, it takes only the first 4 lines of the result and does not write the remaining elements:
AGENT_NO = 15
LOG_INVOCATION_NUM = -
LOG_OUT = TEST_FAIL
LOG_ADDITIONAL_INFO =
SENSOR NUM = 3
SENSOR state = disable_state
LOG_Type = FAIL_TYPE
RES = FAIL

Do I need to change the grouped() element value, or make any other changes in the program?
Kindly advise on this.
Thank you…

Hi Kumar,

from "TIME_STAMP 2017 …" to ";" there are 17 lines, so you need to change the grouped() value to 17.
You only need the tail call for one-time header lines in the file; if there is no header, you don't need it.
The string "A0500" is in the second line of a group, so you need to change the filter to
.filter( vec => vec(1).contains("A0500"))
If you want to keep the line with the timestamp, you have to delete the line
.map(vec => vec.tail.filter(_ != ";")) // Skip group name and ';' (the group name is now in line 2)
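Putting those changes together, the pipeline would look roughly like this (a sketch; I changed the map so the timestamp line is kept but the ';' is still dropped):

import java.io.{ File, FileInputStream }
import scala.io.BufferedSource

val reader = new BufferedSource(new FileInputStream(new File("test.txt")))
reader
  .getLines                                   // read line by line
  .toStream                                   // map to a scala stream
  .map(_.trim)                                // remove whitespaces
  .filter(_.nonEmpty)                         // remove empty lines
  .grouped(17)                                // one log record spans 17 non-empty lines
  .map(_.toVector)                            // map the groups to a vector
  .filter(vec => vec(1).contains("A0500"))    // the marker sits in the 2nd line
  .map(vec => vec.filter(_ != ";"))           // keep the timestamp line, drop ';'
  .foreach(vec => println(vec.mkString(","))) // inspect the kept records
reader.close()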

Maybe you can use a small extract of the file in the Scala REPL to see intermediate results from your code.

Greetings
Chris

Hi Chris,
thank you…
Here, it always shows Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at testApp$$anonfun$3$$anonfun$apply$1.apply(testApp.scala:63)
at testApp$$anonfun$3$$anonfun$apply$1.apply(testApp.scala:61)
at testApp$$anonfun$3.apply(testApp.scala:61)
at testApp$$anonfun$3.apply(testApp.scala:60)
error.
I tried with grouped(17) and also changed its value.
I also gave vec(1) as the start delimiter.
But the above error still comes.
The error mostly points to
val v = "," + s(1) // prepend ',' as separator char
but if I change the index value it prints wrongly.
Do I need to change
.foreach( vec => { // every group
  vec.foreach( vax => { // every item in a group
    val s = vax.split('=')
    val v = "," + s(1) // prepend ',' as separator char
    getWriter(sinkMap, s(0)) // get a writer
      .write(v, 0, v.length) // and print
  })
})
What may be my problem? It also doesn't write anything to the output file…
Thank you…

Hi Kumar,

the line
val v = "," + s(1) // prepend ',' as separator char
uses the second entry of the split result. It seems there is a case where a line cannot be split into two parts. In the line above it, you are splitting each line into parts with the separator char '='. The lines
TIME_STAMP 2017-06-12 MON 12:20:38
A0500 INFO LOG RESULT
don't have a '=' char, so that may be the problem. Do you need those lines, or can they be filtered out?
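If they can be filtered out, a minimal guard looks like this (just a sketch, using a few lines from your log as test data):

val lines = Vector(
  "TIME_STAMP 2017-06-12 MON 12:20:38", // no '=', gets filtered out
  "A0500 INFO LOG RESULT",              // no '=', gets filtered out
  "SENSOR_TYPE                 = A2MN",
  "LOG_ADDITIONAL_INFO  = "             // has '=' but an empty value
)
lines
  .filter(_.contains("=")) // keep only key=value lines
  .foreach { line =>
    val parts = line.split('=').map(_.trim)
    val value = if (parts.length > 1) parts(1) else "" // guard against a missing value
    println(parts(0) + "," + value)
  }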
You posted some sample lines above. What do you expect as a result? Can you post those lines?
Maybe I can show you an example implementation in the next few days.

Greetings
Chris

Hello Chris,
thank you very much…
For a better view, here is the data in a text file:
TIME STAMP1
A1200 EVENT START
EVENT NAME = DOS
EVENT_INS = 1
EVENT_ID = 100
BUFFER = 233355
FORMAT = ATC
LOC = C:/User/data
;
TIME STAMP2
A1201 EVENT START
EVENT NAME = DOS
EVENT_INS = 0
EVENT_ID = 87
BUFFER = 773355
FORMAT = ETC
LOC = C:/User/data
;

How can I remove TIME STAMP2 based on A1201? I need to remove everything from A1201 to ; using Scala.
The A1201 sensor part repeats at different locations in the file… wherever it comes, I need to remove from A1201 to ;.
How can I do this with Scala/Spark? Afterwards I will convert the = to , with the mkString method.
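The conversion I mean is roughly this (just a sketch):

// e.g. "EVENT_ID = 100" becomes "EVENT_ID,100"
val csvLine = "EVENT_ID = 100".split('=').map(_.trim).mkString(",")

Thank you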

Hi Chris,
thank you.
Yes, I need the lines.
TIME_STAMP 2017-06-12 MON 12:20:38
A0500 INFO LOG RESULT

;
TIME_STAMP 2017-06-12 MON 11:20:38
B0500 INFO ERROR RESULT

;
Based on B0500 and ; I need to delete those parts (because B0500 INFO ERROR RESULT is standard and repeated multiple times).
Thank you

Hello Kumar,

maybe you got your code running in the meantime. I played a bit and came up with this code:

package kumarraj

import java.io.File
import java.io.FileInputStream
import java.io.{ BufferedWriter, FileWriter }
import scala.io.BufferedSource

object Main extends App {
  val filein = new File("test2.txt")
  val fileout = new File("out2.txt")
  val reader = new BufferedSource(new FileInputStream(filein))
  val writer = new BufferedWriter(new FileWriter(fileout))

  reader
    .getLines // read line by line
    .toStream // map to a scala stream
    .map(_.trim) // remove whitespaces
    .filter(_.nonEmpty) // remove empty lines
    .grouped(9) // always 9 lines in a group
    .map(_.toVector) // map the groups to a vector
    // at this point you will have groups with sensor data
    .filter( vec => vec(1).contains("A1200")) // keep when 2nd line contains A1200 - all others are deleted
    .map(vec => vec.filter(_ != ";")) // Skip ';'
    .foreach( vec => { // every group
        val line = vec.mkString(",") // create one csv line from the vector
        writer.write(line, 0, line.length) // and write it
        writer.newLine() // one record per line
    })
  reader.close()
  writer.close()
}

Hope this helps you

Have fun
Chris

Hi Chris,

I have implemented it using Scala file parsing and regex, together with Spark DataFrame methods.
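The regex removal part looks roughly like this (a sketch; sample.txt and the SECOND marker come from the earlier examples):

import scala.io.Source

// read the whole file and drop every SECOND block, including its prefix token
val text    = Source.fromFile("sample.txt").mkString
val cleaned = """(?s)\S*\s*SECOND:.*?;""".r.replaceAllIn(text, "")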
Your code is another nice method for parsing the fields.
Thank you very much for your good inputs and the methods.
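For reference, a rough RDD-based sketch of the Spark side (the file name, the A1200 marker, and the local master are my assumptions):

import org.apache.spark.sql.SparkSession

object SparkParse extends App {
  val spark = SparkSession.builder()
    .appName("parse-first")
    .master("local[*]")
    .getOrCreate()

  spark.sparkContext
    .wholeTextFiles("test2.txt")                          // read the file as (path, content)
    .flatMap { case (_, content) => content.split(";") }  // one element per record block
    .filter(_.contains("A1200"))                          // keep only the wanted blocks
    .map(_.split('\n').map(_.trim).filter(_.nonEmpty).mkString(",")) // block -> csv line
    .saveAsTextFile("out_csv")                            // writes part files under out_csv/

  spark.stop()
}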

Thank you.