At the recent Data Day Texas, Paige Roberts of Syncsort caught up with Joey Echeverria, an architect at Splunk and author of the O’Reilly book Hadoop Security. In the first of this three-part blog series, Roberts and Echeverria discuss some common Hadoop security methods and why they’re so desperately needed.
Paige: Let’s start with an introduction. Tell us about yourself.
Paige: Okay, so that’s two radically different things there. Let’s start with the Hadoop Security aspect.
Joey: Hadoop is an interesting system because it’s fundamentally built on the premise that users send unsigned code to be run on large clusters of machines to do arbitrary data processing. In that kind of world, most of the usual security controls that you rely on in a single-server operating system don’t really work the same way. Even if you lock down access to the local file system, which you can’t do completely because Hadoop itself needs access to the local file system, you still have to access the distributed file system.
With distributed systems, there’s often no central source of truth for identity and authentication information, so you have to add that from outside. The way Hadoop does that is with Kerberos, which gives you authentication and identity. Because Kerberos can be a relatively expensive protocol, Hadoop implements its own authentication tokens, which are basically proxies for your Kerberos ticket. You use your Kerberos ticket to get an authentication token, and the rest of the time you use the authentication token, so it’s a little bit lighter weight. Those are the rough basics, but the details get very, very complicated, very, very quickly, depending on the use case and exactly which Hadoop-related projects you’re trying to deploy.
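To make the ticket-for-token exchange concrete, here is a minimal Python sketch of the general pattern: authenticate once with the expensive protocol, receive a signed, short-lived token, and present that token on later requests. Everything in it (the `ToyTokenService` class, its method names, the token layout) is hypothetical and invented for illustration. It is not Hadoop’s implementation or API; Hadoop’s own documentation calls its version of these tokens delegation tokens.

```python
import hashlib
import hmac
import secrets
import time


class ToyTokenService:
    """Hypothetical service that swaps a verified login for a cheap token."""

    def __init__(self, lifetime_seconds=3600):
        self._secret = secrets.token_bytes(32)  # known only to the service
        self._lifetime = lifetime_seconds

    def _sign(self, payload):
        # Keyed hash, so only the holder of the secret can mint valid tokens.
        return hmac.new(self._secret, payload.encode(), hashlib.sha256).hexdigest()

    def issue_token(self, authenticated_user, now=None):
        # Assumption: the caller has already proven its identity with the
        # expensive protocol (Kerberos, in Hadoop's case) before this call.
        now = time.time() if now is None else now
        payload = f"{authenticated_user}|{int(now + self._lifetime)}"
        return f"{payload}|{self._sign(payload)}"

    def verify_token(self, token, now=None):
        """Return the user name for a valid token, or None otherwise."""
        now = time.time() if now is None else now
        parts = token.rsplit("|", 2)
        if len(parts) != 3:
            return None  # malformed token
        user, expires, signature = parts
        expected = self._sign(f"{user}|{expires}")
        if not hmac.compare_digest(signature.encode(), expected.encode()):
            return None  # forged or tampered token
        if now >= int(expires):
            return None  # expired; the client must authenticate again
        return user
```

Verifying a token here costs one keyed hash and no round trip to a central authentication server, which is the sense in which the token is lighter weight than repeating the full exchange on every request.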
Paige: I know that security is a big concern. Obviously, if you’ve got a data lake, the last thing you want is for someone to break in there and raid all your data. My question is, what’s your best defense?
Joey: It depends on what threats you’re trying to mitigate. Standard best practice is always, always, always to put clusters behind corporate firewalls. Never have them open to the public internet. Even if you are implementing other security controls, you want them on that private network, so that you need to use a VPN or actually be at a physical office location to access them. That’s number one.
The other thing that you’re going to want to do is implement auditing right away. The reason is that at some point, no matter how good your security controls are, someone is going to break in, and you want evidence of that break-in and of what they touched. You also want to do authentication, which I just talked about with Kerberos, and then you want to set up your authorization rules. The Hadoop file system has POSIX-based file permissions that you can use. Then there are other systems that allow you to do more fine-grained security control, such as role-based access controls.
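As a rough illustration of what POSIX-style file permissions mean in practice, the following Python sketch evaluates an owner/group/other permission mode for a single file. The function name `is_allowed` and the example users and groups are hypothetical. The sketch deliberately ignores superusers, ACLs, and directory traversal checks, so it is a simplification and not the file system’s actual code.

```python
READ, WRITE, EXECUTE = 4, 2, 1


def is_allowed(user, user_groups, owner, group, mode, wanted):
    """Check `wanted` permission bits against a POSIX-style mode like 0o640."""
    if user == owner:
        bits = (mode >> 6) & 0o7   # owner bits apply, and only those
    elif group in user_groups:
        bits = (mode >> 3) & 0o7   # group bits apply
    else:
        bits = mode & 0o7          # everyone else
    return (bits & wanted) == wanted
```

For a file owned by `etl` with group `analysts` and mode `0o640`, the owner can read and write, a member of `analysts` can only read, and everyone else is denied. Exactly one of the three classes applies to any given user, which is part of why finer-grained tools are layered on top.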
Joey: Yeah, Ranger lets you do that. Apache Sentry lets you do that. They usually integrate with the higher-level query engines. So Apache Impala and Apache Hive can both use those deeper policy engines.
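The general idea behind role-based access control at the query-engine level can be sketched in a few lines of Python: privileges are attached to roles, roles are attached to groups, and a request is allowed if any of the user’s groups leads to a matching privilege. The names below (`Privilege`, `ROLE_PRIVILEGES`, `GROUP_ROLES`, `is_authorized`) and the policy data are all hypothetical. This is not the Apache Ranger or Apache Sentry API, and it does not reproduce either project’s policy model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Privilege:
    action: str    # e.g. "select" or "insert"
    database: str
    table: str     # "*" matches every table in the database


# Hypothetical policy data, invented for illustration only.
ROLE_PRIVILEGES = {
    "analyst": {Privilege("select", "sales", "*")},
    "etl": {
        Privilege("select", "sales", "*"),
        Privilege("insert", "sales", "orders"),
    },
}
GROUP_ROLES = {"bi_team": {"analyst"}, "pipeline": {"etl"}}


def is_authorized(user_groups, action, database, table):
    """Allow the request if any group -> role -> privilege chain matches."""
    for group in user_groups:
        for role in GROUP_ROLES.get(group, ()):
            for priv in ROLE_PRIVILEGES.get(role, ()):
                if (priv.action == action
                        and priv.database == database
                        and priv.table in ("*", table)):
                    return True
    return False  # default deny
```

Unlike file permissions, a policy like this is expressed in terms the query engine understands (databases, tables, actions), which is why such engines are usually where these checks are enforced.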
Make sure to check out part two when Echeverria gets into more detail regarding different methods of security when dealing with Hadoop.