Another goose chase
We have been experimenting with Apache Spark for the last few months. I am very excited about Spark; it makes designing data pipelines, um... somewhat elegant. Today I was trying to test a simple ML workflow and ended up spending the entire day troubleshooting how to build and deploy Spark instead.
We have two different Hadoop clusters, one for experimentation and one for real project work. I have Spark 1.2 deployed on both clusters, but for some reason Spark 1.2 did not work with Hive's Parquet tables on one of them. Let's call the cluster where Hive's Parquet tables worked "working" and the other one "not working".
Whenever I tried to query a Parquet table using the Hive context on the "not working" cluster, I got a class-not-found error for the ParquetHiveSerDe class. While comparing the classpath and configuration for Spark across both clusters (and Spark 1.1), I found that this SerDe class is included in the Spark assembly fat jar. When listing the jar file (via jar tf) on the "working" cluster, I could clearly see the SerDe class. On the "not working" cluster, I instead got:
java.util.zip.ZipException: invalid CEN header (bad signature)
    at java.util.zip.ZipFile.open(Native Method)
    at java.util.zip.ZipFile.<init>(ZipFile.java:132)
    at java.util.zip.ZipFile.<init>(ZipFile.java:93)
    at sun.tools.jar.Main.list(Main.java:997)
    at sun.tools.jar.Main.run(Main.java:242)
    at sun.tools.jar.Main.main(Main.java:1167)
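In hindsight, a JDK-independent way to list the jar might have saved some time. The sketch below is only an illustration, assuming Python 3 is available on the node; the helper name find_class_entries is made up for this post. Python's zipfile module does not rely on the installed Java tools, so if it can list the jar while jar tf fails, the file itself is probably not corrupted.

```python
import zipfile


def find_class_entries(jar_path, class_name):
    """Return the jar entries whose file name is <class_name>.class.

    A jar is a zip archive, so the standard zipfile module can list it
    without going through the JDK's jar tool.
    """
    target = class_name + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return [
            name
            for name in jar.namelist()
            # Match the class at the archive root or inside any package directory.
            if name == target or name.endswith("/" + target)
        ]
```

Calling find_class_entries with the path to the assembly jar and "ParquetHiveSerDe" should return the matching entry on both clusters if the jar was copied intact, and an empty list if the class is really missing.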
My initial Google searches on jar exceptions led me to think my jar file was getting corrupted while copying the dist/ directory to the "not working" cluster. The same jar file, however, worked fine on the "working" cluster. After zipping and moving the files in different ways (scp, USB drive, etc.), I still was not able to read the Spark assembly fat jar. I even recompiled Spark in case there was some Java version incompatibility.
After a few hours I finally figured out that the problem was not in the jar file at all. It turns out a CDH upgrade had badly messed up my Java symlinks. Even though java -version showed Java 7, my $JAVA_HOME symlinks pointed to Java 6, and the jar executable was also pointing to Java 6. This JIRA helped me nail down the issue:
https://issues.apache.org/jira/browse/SPARK-1703
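As far as I understand the JIRA, an assembly jar built with Java 7 can contain more entries than the Java 6 tools are able to read, which would explain the ZipException. A quick consistency check of the Java setup could look like the sketch below. It is a rough illustration, not a definitive diagnostic: the function names are made up, it assumes Python 3 on a Unix-like system, and it only compares resolved symlink targets rather than actual Java versions.

```python
import os
import shutil


def resolve_tool(name):
    """Return the symlink-resolved path of an executable on PATH, or None."""
    path = shutil.which(name)
    return os.path.realpath(path) if path else None


def jdk_report(java_home=None):
    """Collect where JAVA_HOME, java and jar really point after following symlinks."""
    if java_home is None:
        java_home = os.environ.get("JAVA_HOME")
    return {
        "JAVA_HOME": os.path.realpath(java_home) if java_home else None,
        "java": resolve_tool("java"),
        "jar": resolve_tool("jar"),
    }


def is_under(path, directory):
    """True if path lies inside directory (both absolute and already resolved)."""
    return os.path.commonpath([path, directory]) == directory


def mismatched_tools(report):
    """Names of tools on PATH that do not live under the resolved JAVA_HOME."""
    home = report["JAVA_HOME"]
    if home is None:
        return []
    return [
        name
        for name in ("java", "jar")
        if report[name] is not None and not is_under(report[name], home)
    ]
```

Printing jdk_report() and mismatched_tools(jdk_report()) on each cluster shows whether java, jar and JAVA_HOME resolve into the same JDK directory. In my case the mismatch would have shown up here, although an empty result does not guarantee that everything is configured correctly.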
Once I fixed the Java home links, my errors were gone. Somewhat of a bust. Hopefully this helps someone else avoid wasting half a day on this problem.



















