Loading a Spark DataFrame from a text file

Raw data:

'101,Miller,4000,m,11,hyd'
'102,Blake,5000,m,12,pune'
'103,Sony,6000,f,14,pune'
'104,Sita,7000,f,25,Hyd'
from re import split
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext
sc =SparkContext.getOrCreate()
myspark=sc.textFile("emp1.txt")
# myspark.collect()

myspark1=myspark.map(lambda x:x.split(","))
myspark1.collect()
from pyspark.sql import Row

r3=myspark1.map(lambda x:Row(eid=int(x[0]),name=x[1],salary=x[2],Gender=x[3],dno=x[4],city=x[5]))
r4=spark.createDataFrame(r3)
r4.show()

I am getting ValueError: invalid literal for int() with base 10: "'101"
How do I deal with this kind of issue in Spark? Please suggest.

Welcome to the Scala community @ss_ary41
You need to remove the single quotes from your text file, like this:

101,Miller,4000,m,11,hyd
102,Blake,5000,m,12,pune
103,Sony,6000,f,14,pune
104,Sita,7000,f,25,Hyd
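If you would rather not edit the file by hand, you can also strip the quotes inside your map step before splitting. A minimal sketch in plain Python (no Spark needed to see the idea; in your program it would become `myspark.map(lambda x: x.strip("'").split(","))`):

```python
line = "'101,Miller,4000,m,11,hyd'"  # a raw line as it appears in your file

# Strip the wrapping single quotes before splitting,
# so int() sees "101" instead of "'101"
fields = line.strip("'").split(",")
eid = int(fields[0])
print(eid)  # prints 101
```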

Then it works:

========================== RESTART: /home/spam/a.py ==========================
+---+------+------+------+---+----+
|eid|  name|salary|Gender|dno|city|
+---+------+------+------+---+----+
|101|Miller|  4000|     m| 11| hyd|
|102| Blake|  5000|     m| 12|pune|
|103|  Sony|  6000|     f| 14|pune|
|104|  Sita|  7000|     f| 25| Hyd|
+---+------+------+------+---+----+

By the way, since the code is Python, you might be better off asking in a Python discussion board. Most folks here probably wouldn’t know too much about Python type errors. You are still welcome here of course :smiley:

Also, the code you posted is poorly styled and missing some parts (for example, the `spark` session used in `spark.createDataFrame(r3)` is never created). It looks like you are copy-pasting without understanding anything, and that's a very poor way to learn. Here is my code:

from pyspark import SparkContext
from pyspark.sql import SparkSession, Row

# Create (or reuse) the contexts; `spark` was missing from your snippet
sc = SparkContext.getOrCreate()
spark = SparkSession.builder \
    .master("local") \
    .appName("test program") \
    .getOrCreate()

# Read the file as an RDD of lines
myspark = sc.textFile("emp1.txt")

# Split each line on commas
myspark1 = myspark.map(lambda x: x.split(","))
myspark1.collect()  # only for inspection; not needed to build the DataFrame

# Build one Row per record; eid is converted to int
r3 = myspark1.map(lambda x:
                  Row(eid = int(x[0]),
                      name = x[1],
                      salary = x[2],
                      Gender = x[3],
                      dno = x[4],
                      city = x[5]))

r4 = spark.createDataFrame(data = r3)
r4.show()
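One more thing you may want to fix: only `eid` is converted to `int`, so `salary` and `dno` end up as string columns in the DataFrame. A small parsing helper makes the numeric fields real integers (a sketch in plain Python so it runs without Spark; in your program you would map it over the RDD and build rows with `Row(**parse_emp(line))`):

```python
def parse_emp(line):
    # Split one CSV line and convert the numeric fields,
    # so Spark infers integer columns for eid, salary and dno
    f = line.split(",")
    return {"eid": int(f[0]), "name": f[1], "salary": int(f[2]),
            "Gender": f[3], "dno": int(f[4]), "city": f[5]}

row = parse_emp("101,Miller,4000,m,11,hyd")
# row["salary"] is now the integer 4000, not the string "4000"
```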