Hive on Spark is not working

Problem: in Hive CLI, the simple command doesn’t return a result.

Solution: make sure you have at least one worker (or slave) for Spark Master

hive> select count(*) from subset1_data_stream_with_cgi;

Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2016-06-30 15:09:54,526    Stage-0_0: 0/1    Stage-1_0: 0/1
2016-06-30 15:09:57,545    Stage-0_0: 0/1    Stage-1_0: 0/1
2016-06-30 15:10:00,561    Stage-0_0: 0/1    Stage-1_0: 0/1

Continue reading “Hive on Spark is not working”

Hive CLI doesn’t start

Problem: Hive CLI turned off suddenly, and I cannot start Hive CLI again

Error message:

java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=/mnt/storage/DATA/hadoop/metastore_db;create=true, username = APP

Diagnosis: since Derby database allow only 1 connection to its database, it creates a *.lck in the folder databaseName above. So to this folder, and delete those *.lck file.

After I deleted dbex.lck and db.lck, then hive can start as usual.

PySpark and Jupyter

Problem: how to run PySpark in Jupyter notebook.

Some assumption before starting:

  • You have Anaconda installed.
  • You have Spark installed. District Data Lab has an exceptional article on how to get started with Spark in Python. It’s long, but detailed.
  • pyspark is in the $PATHvariable.

There are 2 solutions:

  1. The first one, it modified the environment variable that pyspark read. Then the jupyter/ipython notebook with pyspark environment would be started instead of pyspark console.
  2. The second one is installing the separate spark kernel for Jupyter. This way is more flexible, because the spark-kernel from IBM This solution is better because this spark kernel can run code in Scala, Python, Java, SparkSQL.

Continue reading “PySpark and Jupyter”