How to get a DataFrame from arrays of multiple types when schema inference is not possible


#1

Hello everyone!
I’m new here, so I hope I’m following the forum guidelines and that my question is clear.
I have a question about Scala (I’m actually working with Spark in Scala).
I have data where each row is initially an array of strings.
To apply my schema, I need to convert the values at positions 5 to 18 to Double and the value at position 1 to Int, and leave the others as String.

val churn_app1 = churn_app.map(x=> {for(i<-0 to 19) yield { if (i==1) x(i).toInt else if ((5 to 18).toSet contains i) x(i).toDouble else x(i).toString }}.toArray)

That “works”, but now I have an Array[Any], and when I apply my schema the types are not recognized because every element of the array is typed as Any.
This is the case class I want to apply to every array. I always get a "not enough arguments" error, and I’m fairly sure it is related to the Any type.
case class Account(state: String, len: Integer, acode: String,
intlplan: String, vplan: String, numvmail: Double,
tdmins: Double, tdcalls: Double, tdcharge: Double,
temins: Double, tecalls: Double, techarge: Double,
tnmins: Double, tncalls: Double, tncharge: Double,
timins: Double, ticalls: Double, ticharge: Double,
numcs: Double, churn: String)

When I do it manually:
churn_app.map(line=> Account(line(0),line(1).toInt,line(2),line(3),line(4),line(5).toDouble,line(6).toDouble,line(7).toDouble,line(8).toDouble,line(9).toDouble,line(10).toDouble,line(11).toDouble,line(12).toDouble,line(13).toDouble,line(14).toDouble,line(15).toDouble,line(16).toDouble,line(17).toDouble,line(18).toDouble,line(19)))
it obviously works perfectly, but what happens when I have 1000 elements in my array?

My ultimate goal is to get a data frame in order to apply some algorithms.

Thank you in advance for your help.
If anything is unclear, I can give more details.


#2

I only know the term “data frame” from R, but I understand you want to
store a large table with possibly 1000 columns, where each column can have
its own type.

You probably want some object that represents the type of a value. If you
know for sure you are only going to have a few types (say, String and
primitives), you can easily roll your own. If you want to be more flexible,
you can use scala.reflect.runtime.universe.Type, although it’s a bit
unwieldy, because it’s a path-dependent type that depends on the universe.
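
As a rough illustration of the "roll your own" option, here is a minimal sketch that assumes only three column types are needed. The names ColumnType, StringCol, IntCol, DoubleCol and parse are hypothetical and do not come from any library.

```scala
import scala.util.Try

// Hypothetical hand-rolled description of a column's type.
sealed trait ColumnType {
  // Parse a raw string into a value of this column's type.
  // Returns None when the string does not parse.
  def parse(raw: String): Option[Any]
}

case object StringCol extends ColumnType {
  def parse(raw: String): Option[Any] = Some(raw)
}

case object IntCol extends ColumnType {
  def parse(raw: String): Option[Any] = Try(raw.trim.toInt).toOption
}

case object DoubleCol extends ColumnType {
  def parse(raw: String): Option[Any] = Try(raw.trim.toDouble).toOption
}
```

With these definitions, IntCol.parse("42") gives Some(42), while DoubleCol.parse("abc") gives None instead of throwing. Because the trait is sealed, the compiler can warn about a forgotten case whenever you pattern match on a ColumnType.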

You can wrap values together with their type in something like:

import scala.reflect.runtime.universe.Type

trait TypedValue {
  def value: Any
  def tpe: Type
  def asString: Option[String]
  def asDouble: Option[Double]
  def asInt: Option[Int]
}

If you have only a few types, you can instantiate the above for each type.
Otherwise, you can also write a more generic:

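import scala.reflect.runtime.universe._

case class TypedValue(value: Any, tpe: Type) {
  // Return the value as T only if the stored type is exactly T.
  def as[T: TypeTag]: Option[T] =
    if (tpe =:= typeOf[T]) Some(value.asInstanceOf[T]) else None
}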

If you have only a few types, you can hard-code a few options to convert
your original String into a TypedValue:

def toAny(valueString: String, tpe: Type): Any = {
  if (tpe =:= typeOf[String]) {
    valueString
  } else if (tpe =:= typeOf[Int]) {
    valueString.toInt
  } …
}

def toTypedValue(valueString: String, tpe: Type): TypedValue =
  TypedValue(toAny(valueString, tpe), tpe)

Otherwise, you can have a collection of converters.
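
One way to read "a collection of converters" is a sequence with one String => Any function per column position, applied to each row. The sketch below assumes the 20-column layout from the first post (position 1 is Int, positions 5 to 18 are Double, everything else stays String). RowConverters, Converter and convertRow are hypothetical names, and the result is still a sequence of Any, so it does not by itself produce an Account.

```scala
object RowConverters {
  // A converter turns the raw string of one column into a typed value.
  type Converter = String => Any

  val toStr: Converter = s => s
  val toInt: Converter = s => s.trim.toInt
  val toDbl: Converter = s => s.trim.toDouble

  // Assumed layout: position 1 is Int, 5 to 18 are Double, the rest String.
  // The converters are generated from the positions, so a wider table only
  // needs a different rule here, not one hand-written line per column.
  val converters: Vector[Converter] = Vector.tabulate(20) { i =>
    if (i == 1) toInt
    else if (i >= 5 && i <= 18) toDbl
    else toStr
  }

  // Apply the converter at each position to the raw value at that position.
  // Throws NumberFormatException if a numeric column does not parse.
  def convertRow(row: Array[String]): Vector[Any] = {
    require(
      row.length == converters.length,
      s"expected ${converters.length} fields, got ${row.length}"
    )
    converters.zip(row.toVector).map { case (convert, raw) => convert(raw) }
  }
}
```

Something like churn_app.map(RowConverters.convertRow) would then give one converted sequence per row, assuming churn_app holds arrays of strings as in the first post.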


#3

Hello Curoli!
Thank you for your time and your thorough answer.
To be more specific, I want an array that holds values of multiple types, so that I can apply a case class to it (Account in my example) and end up with an array of Account. After that I can convert it to a data frame. The problem with doing the type conversion first and storing the result is that I get an array of Any. When I apply Account directly to the converted values inside the map, there is no problem.
I can’t see the link between what you explained and what I want. Can you be more specific? (Sorry, the problem may be my understanding of the concepts.)


#4

HLists might help.


#5

In this case, I misunderstood you. I thought you wanted to avoid using a
row object because you found it unfeasible for 1000 columns.

I think in this case, you either use code generation, or use reflection, or
you have to put in all fields by hand in one way or another.

But creating an object with 1000 fields sounds like a terrible idea.