I’m based in France, but I often travel for work, including to Silicon Valley. I’ve noticed that the machine learning concerns in Silicon Valley tend to be different from those elsewhere in the U.S. — and outside of the U.S.
So, I’ve compiled a few concrete suggestions for those who hear about the machine learning efforts in Silicon Valley but work elsewhere. These suggestions consider where machine learning and data science are headed on a large scale, as opposed to the fascinating (but often narrow) research happening in Silicon Valley.
Contribute to open source
Even if contributing to open source isn’t a big focus where you work, you’re probably still pulling down open source code. That’s completely fine. Not everybody is going to be a contributor, especially at first. But I’d still encourage you to contribute even a little bit. That’ll teach you how decisions are made and what the code can and can’t do. Those details will help steer your own coding efforts.
Some governments outside of the U.S. have been hostile to GitHub in the past, but if you’re comfortable, create a profile and learn from some tutorials. Scan Apache’s open source projects and choose one to explore. Play around with Jupyter notebooks, and reach out to other contributors. Then, when contributing to open source becomes more important in your own dev community, you’ll be ready.
Get started with cloud
Here again, maybe you work within a company or region that hasn’t yet embraced the cloud. Maybe the data is sensitive or maybe you’ve got an in-house system that’s working well. But sooner or later, you’ll probably need to work with the cloud, so plan in advance.
Start with a simple account with a public cloud vendor. I work for IBM, so I’ll recommend ours: Data Science Experience. (Registration is free.) Next, find some publicly available datasets, load them and start exploring simple algorithms and tools. If you’re new to machine learning, my advice is not to dive straight into deep learning or other advanced approaches. Instead, start with decision trees and linear models, and work your way up.
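To give a feel for what starting simple can look like, here is a minimal sketch in plain Python that fits a one-feature linear model and a depth-1 decision tree (a "stump") and compares their errors. The numbers are synthetic and made up for illustration, not a real public dataset, and the function names are my own. In practice you would more likely use a library such as scikit-learn on data you have actually downloaded.

```python
# Toy illustration only: the data below is synthetic, not a real public dataset.

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_stump(xs, ys):
    """Depth-1 regression tree: returns (threshold, left_mean, right_mean)."""
    best = None
    points = sorted(zip(xs, ys))
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # no threshold fits between two equal x values
        left = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Squared error if each side predicts its own mean
        error = sum((y - left_mean) ** 2 for y in left)
        error += sum((y - right_mean) ** 2 for y in right)
        threshold = (points[i - 1][0] + points[i][0]) / 2
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on the given points."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]  # roughly y = 2x

slope, intercept = fit_line(xs, ys)
threshold, left_mean, right_mean = fit_stump(xs, ys)

line_error = mse(lambda x: slope * x + intercept, xs, ys)
stump_error = mse(lambda x: left_mean if x < threshold else right_mean, xs, ys)
print(f"linear model MSE: {line_error:.3f}")
print(f"decision stump MSE: {stump_error:.3f}")
```

On data that is close to a straight line, the linear model should win easily. Swapping in data with a sharp step is a quick way to see when a tree does better.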
Cloud is about speed. Consider your bottlenecks at work and try running experiments on the cloud to find where things could go faster. Obviously, don’t use your own company’s data, but find public data that’s similar. Chances are your company is going to move toward cloud eventually. If you’re already familiar with the ideas and tools, you can help everyone transition more smoothly.
Keep learning Python
If you’re already using Python for machine learning, you might be frustrated by how limited Python support still is in some tools, especially for large data sets.
But many data scientists love Python, so more open source projects are improving their support. You’ll be able to do more as time goes on. Hang in there.
Pay attention to data gravity
It almost always makes sense to move computation to the data rather than moving data to the computation. Same for tools. Bring tools to the data instead of bringing data to the tools.
Simple advice, but it’s hard to follow when you need to set up a data management architecture quickly or cheaply. If the people managing your data have security or data integrity concerns, spend time making your case. In the long term, it makes a huge difference, especially with machine learning, where computation is getting more and more intense.
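As a small illustration of moving computation to the data, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names and rows are all hypothetical. It computes the same per-sensor average twice: once by pulling every row into Python, and once by letting the database do the work and return only the summary.

```python
import sqlite3

# Hypothetical table and column names, with made-up rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)],
)

# Data to the computation: every row leaves the database, then Python averages.
rows = conn.execute("SELECT sensor, value FROM readings").fetchall()
totals = {}
for sensor, value in rows:
    total, count = totals.get(sensor, (0.0, 0))
    totals[sensor] = (total + value, count + 1)
pulled = {sensor: total / count for sensor, (total, count) in totals.items()}

# Computation to the data: the database averages and returns one row per sensor.
summary = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"
).fetchall()
pushed = dict(summary)

print(f"rows transferred: {len(rows)} vs {len(summary)}")
print(pulled == pushed)
```

With five rows the difference is trivial, but the number of rows transferred in the first approach grows with the table, while the second grows only with the number of sensors.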
Know your data
Maybe this seems obvious, but I’ll mention two specific things: fragility and bias. Both are still huge challenges, even in Silicon Valley, and probably always will be.
Fragility is about how models respond to changes in data. What happens if a field changes meaning or stops being populated? We sometimes think machine learning systems can run indefinitely once we’ve created and trained them. Not true. We need to incorporate ways for them to signal that their inputs have changed. If we don’t, they can quickly lead us astray.
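One way a system could signal that its inputs have changed is to store a small summary of each field at training time and compare every new batch against it. The sketch below is a minimal version of that idea for a single numeric field. The function names, the thresholds and the sample values are all assumptions of mine, and real monitoring would need checks suited to your own data.

```python
def summarize(values):
    """Record the share of missing values and the mean of the present ones."""
    present = [v for v in values if v is not None]
    return {
        "missing_rate": 1 - len(present) / len(values),
        "mean": sum(present) / len(present) if present else None,
    }


def input_warnings(baseline, batch, max_missing_increase=0.2, max_mean_shift=0.5):
    """Compare a new batch of one field against the training-time summary.

    The thresholds are arbitrary placeholders; sensible values depend on the data.
    """
    current = summarize(batch)
    warnings = []
    if current["missing_rate"] - baseline["missing_rate"] > max_missing_increase:
        warnings.append("field is populated less often than in training")
    if current["mean"] is None:
        warnings.append("field is empty")
    elif baseline["mean"]:  # skip the relative check if the baseline mean is 0 or unknown
        shift = abs(current["mean"] - baseline["mean"]) / abs(baseline["mean"])
        if shift > max_mean_shift:
            warnings.append("field values have shifted; its meaning may have changed")
    return warnings


# Made-up example: an "age" field recorded in years at training time.
baseline = summarize([34, 41, 29, None, 52, 38])

similar = input_warnings(baseline, [36, 40, 31, 45, None, 39])
sparse = input_warnings(baseline, [None, None, 33, None, None, 40])
in_months = input_warnings(baseline, [410, 350, 290, 520, 380, 445])

for name, result in [("similar", similar), ("sparse", sparse), ("in_months", in_months)]:
    print(name, result)
```

The three batches cover a field that still looks like the training data, one that has mostly stopped being populated, and one whose meaning has changed (here, ages that appear to be recorded in months instead of years).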
Bias in machine learning is too big a topic to cover here. I’ll just encourage you to research the issues and take them seriously. One thing worth saying: biases span the range from obvious to subtle. You’ll probably find and correct the obvious ones, but don’t get overconfident. Subtle biases are just as destructive, and they’re everywhere.
OK. I hope some of that is helpful. For more good advice, check out the new Machine Learning for Dummies book, which is completely free from IBM.
I’ll close by repeating the first tip: Please jump into the world of open source if you haven’t already. We need your intelligence, your opinions and your enthusiasm. The open source code, tools and community will push you forward like nothing else.