aws-blog-spark-parquet-conversion: java.lang.ClassCastException when transforming JSON into parquet #89
Comments
Hey fabioptoi, did you ever get anywhere with this?
Hey @alexwbai! No, actually I ended up not running ETL for my files for now, since it was taking too much time to resolve problems with EMR/Spark (like the one in this issue), and I ended up prioritizing other things. Is there anything you've found that could help with this issue?
Hey fabioptoi! No, I've basically run into the same issue as you. I've successfully converted a day's worth of data from JSON to parquet, but when I go for a whole month, the Spark job stalls on stage two. I had opened a case with AWS and they are pointing me to the error you described above (which I had been ignoring until this point, since I had success before). Still trying to figure it out... If I find anything I'll update here.
Hey @fabioptoi, OK, I've gotten this to work. It's been a journey of pain (but a great learning experience!). My goal here was to convert a month's worth of GZIP JSON data to SNAPPY PARQUET (similar to you, I think). I had about 190 GB of JSON data broken up into about 400k small files partitioned by day. I'll break this up into the errors I ran into and then how I got around them:
http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/
Below are the final settings I used on my job. I used 3 x r3.8xlarge instances with an m4.2xlarge master (overkill..?) and put the job process in the background in case my session got disconnected.
Deployment config parameters in the EMR console:
Spark-submit settings:
This completed on my data in about 1.3 hours, and I've since used s3-dist-cp to put it back in S3 and have loaded it into Redshift Spectrum and Athena successfully. Good luck!
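The actual console and spark-submit values are not reproduced above, so the snippet below is a purely illustrative sketch of the kind of executor sizing the linked cheatsheet's worksheet produces for 3 x r3.8xlarge core nodes (32 vCPU / 244 GiB each). Every number and the app name are assumptions, not the poster's real settings, and the same properties are normally passed as spark-submit flags or EMR configuration rather than set in code.

```python
# Illustrative only: cheatsheet-style executor sizing for 3 x r3.8xlarge core
# nodes. All values below are assumptions; rerun the cheatsheet worksheet for
# your own cluster instead of copying these numbers.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("json-to-parquet")                      # hypothetical app name
    .set("spark.dynamicAllocation.enabled", "false")    # fixed sizing below
    .set("spark.executor.instances", "17")              # ~6 per node, 1 slot left for the driver
    .set("spark.executor.cores", "5")                   # 5 cores per executor
    .set("spark.executor.memory", "35g")
    .set("spark.yarn.executor.memoryOverhead", "4096")  # MB of off-heap headroom
    .set("spark.default.parallelism", "170")            # ~2 tasks per executor core
)

sc = SparkContext(conf=conf)
```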
Hello,
I was trying to use the same script to transform GZIP JSON data into parquet, but I encountered the following error (the java.lang.ClassCastException in the title):
I'm using the EMR configuration (instance types, number, version, ...) shown in this example. I created my Hive table using ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'. Examples of the DDL scripts used to create and alter my table are attached: createtable1.hql, addpartitions1.py
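Those attachments are not reproduced here; the following is a rough sketch, via a HiveContext to stay with the PySpark used elsewhere in this thread, of what such DDL typically looks like with that SerDe. The table name, columns, and S3 paths are made-up placeholders, not the poster's actual schema.

```python
# Hypothetical stand-in for createtable1.hql / addpartitions1.py: an external
# table over gzipped JSON in S3, read through the hcatalog JsonSerDe and
# partitioned by day. All names and paths below are invented for illustration.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="create-json-table")
hive = HiveContext(sc)

hive.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events_json (
        event_id   STRING,
        event_time STRING,
        amount     DOUBLE
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 's3://my-bucket/events/json/'
""")

# One ADD PARTITION per day; a real add-partitions script would loop over dates.
hive.sql("""
    ALTER TABLE events_json ADD IF NOT EXISTS
    PARTITION (dt = '2017-01-01')
    LOCATION 's3://my-bucket/events/json/dt=2017-01-01/'
""")
```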
In order for Spark to be able to read my table, I had to make the following modification to the configuration file on my master node (adding the SerDe to both the driver's and the executor's classpath):
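The edited configuration itself is not shown above. The usual approach is to append the hive-hcatalog-core jar (which provides org.apache.hive.hcatalog.data.JsonSerDe) to spark.driver.extraClassPath and spark.executor.extraClassPath; below is a sketch of that expressed as an EMR spark-defaults classification. The jar path is where EMR usually installs it, so verify it on your cluster, and remember to append to, not replace, the existing classpath values.

```python
# Sketch of an EMR "spark-defaults" classification exposing the JsonSerDe to
# Spark. The jar location is the usual EMR path but should be verified, and in
# practice the jar must be APPENDED to the existing extraClassPath values
# (shown alone here only for brevity).
SERDE_JAR = "/usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar"

spark_defaults = {
    "Classification": "spark-defaults",
    "Properties": {
        "spark.driver.extraClassPath": SERDE_JAR,
        "spark.executor.extraClassPath": SERDE_JAR,
    },
}

# This dict can go in the Configurations list of boto3's emr run_job_flow(...)
# call, or the same two properties can be added by hand to
# /etc/spark/conf/spark-defaults.conf on the master node, as described above.
```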
The Python script is practically the same. I only added a few type casts to some columns in my dataframe before calling the write2parquet function, like:
rdf = rdf.withColumn('columnName', rdf['columnName'].cast(DoubleType()))
Nevertheless, I tried without those type casts and still got the same error. The following is the script used without type casting: convert2parquet1.py
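For reference, here is a minimal, self-contained sketch of that kind of cast in context; the table, column name, and output path are hypothetical, and the final write merely stands in for whatever write2parquet does in the blog script.

```python
# Hypothetical sketch: cast a JSON-backed column to an explicit type before
# writing Parquet, which is one way a schema/value mismatch that can trigger a
# ClassCastException may be worked around. Names and paths are invented.
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.types import DoubleType

sc = SparkContext(appName="json-to-parquet")
hive = HiveContext(sc)
hive.setConf("spark.sql.parquet.compression.codec", "snappy")

rdf = hive.table("events_json")

# Force the ambiguous column to DOUBLE, as in the withColumn call above.
rdf = rdf.withColumn("amount", rdf["amount"].cast(DoubleType()))

# Stand-in for the blog's write2parquet helper: write Snappy Parquet to S3.
rdf.write.mode("overwrite").parquet("s3://my-bucket/events/parquet/")
```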
Is there something I'm missing here?