On November 16, we hosted the webinar "Modernize your Existing EDW with IBM Big SQL and Hortonworks Data Platform" with speakers Carter Shanklin and Roni Fontaine from Hortonworks and Nagapriya Tiruthani from IBM. The webinar provided an overview of how organizations can modernize their existing data warehouse solutions by offloading data into Apache Hadoop and Apache Hive. It also covered best practices and use cases for offloading and porting workloads from Oracle, Db2 and Netezza, as well as use cases for Hive and/or Db2 Big SQL. To get access to the slides, go here.
Some great questions came up during the webinar. As promised, here is a brief capture of that Q&A.
1. Are Big SQL and Fluid Query separate offerings?
No, Big SQL includes Fluid Query technology (aka Federation) to connect to remote data sources.
2. Does Big SQL support in-memory databases?
Yes, Big SQL federates to in-memory databases. SAP HANA was tested in-house using JDBC. The Spark connector can also be used to access NoSQL or in-memory databases.
3. Does Big SQL have its own security implementation or does it simply use the security features in RDBMS and Hadoop?
Yes, Big SQL has role-based access control (RBAC), which enables granular security settings on data through row filtering and column masking.
4. What is the best approach for ingesting data into Hadoop: using Big SQL for ingestion, or bringing Big SQL in once the data is in Hadoop for ELT processes?
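To make row filtering and column masking concrete, here is a minimal conceptual sketch in Python. It is not Big SQL syntax or its implementation; the `User` class, the role names and the `ssn` column are all hypothetical and exist only to illustrate the idea that one table can look different depending on the roles held by the person querying it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    # Hypothetical user record: a name plus the roles granted to it.
    name: str
    roles: frozenset


def apply_policies(rows, user):
    """Return the rows this user may see, with sensitive columns masked."""
    visible = []
    for row in rows:
        # Row filter: non-admins see only rows for a region they hold a role for.
        region_role = "REGION_" + row["region"]
        if "ADMIN" not in user.roles and region_role not in user.roles:
            continue
        out = dict(row)
        # Column mask: only AUDITOR or ADMIN sees the full value.
        if not ({"AUDITOR", "ADMIN"} & user.roles):
            out["ssn"] = "XXX-XX-" + row["ssn"][-4:]
        visible.append(out)
    return visible


rows = [
    {"id": 1, "region": "EAST", "ssn": "123-45-6789"},
    {"id": 2, "region": "WEST", "ssn": "987-65-4321"},
]
analyst = User("ana", frozenset({"REGION_EAST"}))
admin = User("root", frozenset({"ADMIN"}))

analyst_view = apply_policies(rows, analyst)
admin_view = apply_policies(rows, admin)
print(analyst_view)
print(admin_view)
```

The analyst sees a single row with a masked value, while the admin sees both rows unmasked. In a real engine these rules are declared once on the table and enforced for every query, rather than being applied in application code.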
We have a couple of blogs that can help you understand data ingestion and ETL using Big SQL. ETL processing in Big SQL: https://developer.ibm.com/hadoop/2016/07/28/useful-tips-on-etl-processing-in-big-sql/ and Ingest using Big SQL: https://developer.ibm.com/hadoop/2017/09/18/big-sql-ingest-adding-files-directly-to-hdfs/
5. From the architecture of Big SQL, I noticed that it uses Slider to leverage YARN. But Slider is going to be deprecated, so how will Big SQL run as a YARN process?
Although the Slider project is deprecated, it is being merged into YARN. Big SQL will therefore integrate with YARN directly to manage resources for long-running processes.
6. Big SQL currently has a limit of around 65,000 tables. Is there a plan for Big SQL to remove that limitation?
We are exploring options to remove some of the Db2-imposed limits on Big SQL.
7. Can we use Big SQL as an ETL tool to load data from Oracle to Hadoop?
Yes, you can use LOAD or INSERT...SELECT to offload data from Oracle to Hadoop.
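The INSERT...SELECT pattern can be sketched with Python's built-in `sqlite3` module. This is only an illustration of the shape of the statement: an attached in-memory SQLite database stands in for the remote Oracle source and the main database stands in for the Hadoop-side table. The table names, columns and filter are hypothetical, and real Big SQL DDL and federation setup differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A second in-memory database plays the role of the remote source system.
conn.execute("ATTACH DATABASE ':memory:' AS src")
conn.execute("CREATE TABLE src.orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO src.orders VALUES (?, ?, ?)",
    [(1, 10.0, "OPEN"), (2, 25.5, "CLOSED"), (3, 7.25, "CLOSED")],
)

# The main database plays the role of the offload target.
conn.execute("CREATE TABLE main.orders_offload (id INTEGER, amount REAL)")

# One statement reads from the source and writes to the target,
# selecting and filtering columns on the way.
conn.execute(
    "INSERT INTO main.orders_offload (id, amount) "
    "SELECT id, amount FROM src.orders WHERE status = 'CLOSED'"
)
conn.commit()

offloaded = conn.execute(
    "SELECT id, amount FROM orders_offload ORDER BY id"
).fetchall()
print(offloaded)
```

The appeal of this pattern for offloading is that the transformation (column selection, filtering, type conversion) is expressed in the same SQL statement that moves the data.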
8. Do I need Big SQL if I have Hive LLAP with SQOOP/Flume/Kafka/Spark Streaming integration?
If you only want to query data that lives in Hadoop, Hive LLAP might be adequate. If you want to combine data by federating to different sources or run complex queries with high concurrency, Big SQL will be a better fit.
9. Why do I need Big SQL when Hive can do everything I need?
Big SQL has its own unique set of capabilities. It can federate all your data behind a single SQL engine, it is compatible with Oracle, and it provides performance optimization for highly complex workloads. Hive doesn’t offer Oracle compatibility or federation, but it has its own unique capabilities around EDW Optimization use cases. If federation is important to you, it is worth looking at Big SQL to use alongside Hive.
10. Do Hive and Big SQL run on the same cluster?
Yes. Big SQL has an Ambari management pack and is fully managed through Ambari. You can use the management pack to deploy Big SQL side by side with Hive on the same cluster.
11. When would I use Hive versus Druid?
Druid is a very interesting technology. It does not have a SQL interface, so we created a Hive-Druid integration that lets you do the analytics in SQL. To get SQL analytics on streaming data, you can use Druid as the place to land the streaming data and Hive as the analytics layer on top. For that use case, it’s essential to use both technologies.
12. Does Druid integrate with Storm like Hive does?
Druid is typically integrated with Storm via Kafka: Storm processes the data and writes it to Kafka, while Druid reads and indexes the data landed in Kafka for fast analytics. Hortonworks DataFlow (HDF) includes Streaming Analytics Manager, which provides a drag-and-drop UI to make this end-to-end process simple.
13. Do I have to use the Druid API or the Hive API once historical and real-time data has been loaded?
For querying data, the Hive SQL API can be used to query data across both Hive and Druid, including joins across Hive and Druid data.
14. Can Big SQL and Hive share data nodes on the same cluster? And what will be the impact?
Yes, Big SQL and Hive both run within YARN in the Hadoop cluster and can run at the same time. This will have a performance impact, as Hive and Big SQL compete for CPU, memory and I/O resources. Mission-critical applications often need greater separation, which can be controlled using YARN capacity management features.
15. Can I perform CDC in Hive with the ACID feature?
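As a rough sketch of what that separation can look like, the Python helper below generates YARN Capacity Scheduler properties for two queues. The property names follow the standard `yarn.scheduler.capacity.*` scheme, but the queue names `bigsql` and `hive` and the 60/40 split are hypothetical, and how each engine is mapped to a queue depends on your deployment.

```python
def capacity_scheduler_props(shares):
    """Build Capacity Scheduler properties from {queue_name: percent}."""
    # Sibling queue capacities under root are expected to add up to 100.
    if sum(shares.values()) != 100:
        raise ValueError("queue capacities under root must sum to 100")

    props = {"yarn.scheduler.capacity.root.queues": ",".join(shares)}
    for queue, percent in shares.items():
        prefix = "yarn.scheduler.capacity.root." + queue
        # Guaranteed share of cluster resources for this queue.
        props[prefix + ".capacity"] = str(percent)
        # Setting the ceiling equal to the guarantee prevents a queue from
        # borrowing idle capacity from its neighbor (strict separation).
        props[prefix + ".maximum-capacity"] = str(percent)
    return props


# Hypothetical split: 60% for the Big SQL queue, 40% for the Hive queue.
props = capacity_scheduler_props({"bigsql": 60, "hive": 40})
for name, value in props.items():
    print(name, "=", value)
```

Raising `maximum-capacity` above `capacity` trades strict isolation for better utilization, since a busy queue may then use resources the other queue is not using.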
Yes. See https://hortonworks.com/blog/update-hive-tables-easy-way-2/ by Carter Shanklin for more information.
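The logic behind applying CDC records can be sketched in plain Python. This is a conceptual model only, assuming a change feed where each record carries a key and an operation code; the record layout and the `I`/`U`/`D` codes are hypothetical. The three branches correspond to the matched-delete, matched-update and not-matched-insert cases that a SQL MERGE statement expresses.

```python
def apply_changes(target, changes):
    """Apply ordered change records to a table modeled as {id: value}."""
    merged = dict(target)  # leave the caller's table untouched
    for change in changes:
        key = change["id"]
        if change["op"] == "D":
            # Matched and flagged as deleted: remove the row.
            merged.pop(key, None)
        elif key in merged:
            # Matched: overwrite the existing row with the new value.
            merged[key] = change["value"]
        else:
            # Not matched: insert a new row.
            merged[key] = change["value"]
    return merged


table = {1: "alice", 2: "bob"}
changes = [
    {"id": 2, "op": "U", "value": "robert"},  # update an existing row
    {"id": 3, "op": "I", "value": "carol"},   # insert a new row
    {"id": 1, "op": "D", "value": None},      # delete an existing row
]

result = apply_changes(table, changes)
print(result)
```

With ACID tables, the same three outcomes are applied to the target table in a single transactional statement instead of row by row in application code.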
A short video on IBM Db2 Big SQL: https://www.youtube.com/watch?v=fMaEeNsyrgE
To learn more about HDP and IBM Db2 Big SQL and also try the Sandbox with tutorials, go to https://hortonworks.com/partners/ibm-bigsql/