Problem: in the Hive CLI, a simple query never returns a result.
Solution: make sure at least one worker (slave) is registered with the Spark Master.
hive> select count(*) from subset1_data_stream_with_cgi;
Status: Running (Hive on Spark job)
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2016-06-30 15:09:54,526 Stage-0_0: 0/1 Stage-1_0: 0/1
2016-06-30 15:09:57,545 Stage-0_0: 0/1 Stage-1_0: 0/1
2016-06-30 15:10:00,561 Stage-0_0: 0/1 Stage-1_0: 0/1
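When no worker is registered, the stages above sit at 0/1 forever because there are no executors to run the tasks. A minimal sketch of the check and fix, assuming a Spark standalone cluster (the master hostname and port are example values, adjust to your setup):

```shell
# Check the Spark Master web UI (default port 8080) for registered workers:
#   http://spark-master:8080  -> "Workers" section should list at least one entry

# Start a worker and attach it to the master (Spark standalone mode):
$SPARK_HOME/sbin/start-slave.sh spark://spark-master:7077

# Then re-run the Hive query; the stages should start making progress.
```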
Problem: Hive CLI turned off suddenly, and I cannot start Hive CLI again
java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=/mnt/storage/DATA/hadoop/metastore_db;create=true, username = APP
Diagnosis: Derby allows only one connection to its database, and it enforces this by creating *.lck files in the databaseName folder shown above. Go to that folder and delete the *.lck files.
After I deleted dbex.lck and db.lck, Hive started as usual.
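A sketch of the fix, using the metastore path from the error message above (adjust databaseName to your own setup, and make sure no other process is still holding the metastore before deleting):

```shell
# Derby writes its lock files next to the database directory.
cd /mnt/storage/DATA/hadoop/metastore_db

# Remove the stale locks left by the crashed Hive CLI (db.lck, dbex.lck).
rm -f *.lck
```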
Problem: how to run PySpark in Jupyter notebook.
Some assumptions before starting:
- You have Anaconda installed.
- You have Spark installed. District Data Lab has an exceptional article on how to get started with Spark in Python. It’s long, but detailed.
pyspark is in the
There are 2 solutions:
- The first modifies the environment variables that pyspark reads, so that a Jupyter/IPython notebook with a PySpark environment starts instead of the pyspark console.
- The second installs a separate Spark kernel for Jupyter. This way is more flexible: the spark-kernel from IBM can run code in Scala, Python, Java, and SparkSQL.
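For the first solution, a sketch of the environment variables involved (these are the variables the pyspark launcher honors; it assumes jupyter is already on your PATH from the Anaconda install):

```shell
# Tell pyspark to launch Jupyter as its driver Python instead of the plain REPL.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

# Starting pyspark now opens a Jupyter notebook with the SparkContext (sc) prepared.
pyspark
```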
3 levels of testing
- Scenario / Large Tests: test the whole application
- Functional / Medium Tests: test the interaction between classes/modules
- Unit / Small Tests: test the logic of the application, the outcome of functions in individual classes.
Simple take-away: separate object creation from the business logic.
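The take-away above can be sketched in Python (all names here are illustrative): when a class receives its dependencies instead of constructing them, a small/unit test can swap in a fake and exercise the logic alone.

```python
class RateStore:
    """Production dependency: imagine this queries a real database."""
    def get_rate(self, currency):
        raise NotImplementedError("would hit a real database")


class PriceCalculator:
    """Business logic only; the dependency is injected, not created here."""
    def __init__(self, store):
        self.store = store

    def to_usd(self, amount, currency):
        return amount * self.store.get_rate(currency)


# Unit (small) test: substitute a fake store, no database needed.
class FakeStore:
    def get_rate(self, currency):
        return {"EUR": 2}[currency]


calc = PriceCalculator(FakeStore())
print(calc.to_usd(100, "EUR"))  # -> 200
```

Had `PriceCalculator` built its own `RateStore` inside `__init__`, every test of `to_usd` would need a live database, which is exactly what separating creation from logic avoids.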