IBM’s Data Science Experience (DSX) comes in multiple flavors: cloud, desktop, and local. In this post we cover an IoT trucking demo on DSX local, i.e. running on top of Hortonworks Data Platform (HDP). We train and deploy a model, and then we use that model to score simulated incoming trucking data in Apache NiFi. We closely follow a data science lifecycle process as we discuss all the steps.
Fig 1. Data science lifecycle
Step 1 – Problem Definition
Imagine a trucking company that dispatches trucks across the country. The trucks are outfitted with sensors that collect data, for instance the location of the driver, weather conditions, and recent events such as speeding, weaving out of the lane, or following too closely. This data is generated once per second and streamed back to the company’s servers.
Fig 1. Sample input data
The company needs a way to process this stream of data and analyze it so that it can make sure trucks are traveling safely and that drivers are not likely to commit a violation anytime soon. All of this has to be done in real time.
Step 2 – ETL & Feature Extraction
Fig 2. Input features and output label correlation matrix
For predicting violations, we simulate trucking events in terms of location, miles driven, and weather conditions. We perform multiple feature engineering steps and examine correlations between different features.
The first video covers the following:
- Fetching data from HDFS
- Feature engineering
- Data visualization
Step 3 – Learning & Model Deployment
Fig 3. Model testing in DSX’s UI
Once the data is ready, we build a predictive model. In our example we use the Spark ML Random Forest classifier. Classification is a statistical technique that assigns a class to each driver: violation or normal. We train the model on a small dataset containing historical data and evaluate the model on several different metrics: accuracy, precision, and area under ROC curve. Finally, we deploy the model and test it both in the DSX UI and via RESTful API calls.
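To make the three evaluation metrics concrete, here is a minimal sketch in plain Python. It is not the Spark ML evaluator code used in the video; the function names and the tiny label and score lists are made up for illustration, and the 0.5 decision threshold is an assumption.

```python
def accuracy(labels, predictions):
    """Fraction of examples whose predicted class matches the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def precision(labels, predictions, positive=1):
    """Of all examples predicted as violations, the fraction that really were."""
    flagged = [y for y, p in zip(labels, predictions) if p == positive]
    if not flagged:
        return 0.0
    return sum(1 for y in flagged if y == positive) / len(flagged)


def roc_auc(labels, scores, positive=1):
    """Probability that a random violation is scored above a random normal
    example, counting ties as one half (pairwise definition of ROC AUC)."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative data only: 1 = violation, 0 = normal.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]                # model's violation probability
predictions = [1 if s >= 0.5 else 0 for s in scores]   # assumed 0.5 threshold

print(accuracy(labels, predictions))
print(precision(labels, predictions))
print(roc_auc(labels, scores))
```

Accuracy and precision depend on the chosen threshold, while area under ROC curve is computed from the raw scores, which is one reason to report all three.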
The second video covers the following:
- Building a Random Forest classifier in Spark ML
- Saving the model in a Machine Learning repository
- Deploying the model online via UI
- Testing the model via UI and RESTful API
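Testing the deployed model over REST boils down to sending a JSON record to a scoring endpoint. The sketch below only builds such a request with the Python standard library and does not send it. The URL, the bearer token, the payload layout, and the feature names are all placeholders, not the actual DSX API; copy the real values from your own deployment's details page.

```python
import json
import urllib.request

# Hypothetical values: replace with the endpoint and token of your deployment.
SCORING_URL = "https://dsx-local.example.com/v1/score/trucking-model"
API_TOKEN = "<your-token>"


def build_scoring_request(event, url=SCORING_URL, token=API_TOKEN):
    """Wrap one trucking event in a JSON POST request (payload shape is assumed)."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
    )


# Feature names are illustrative, not the demo's actual schema.
event = {"hours_driven": 42, "miles_driven": 2100, "foggy": 1, "rainy": 0}
request = build_scoring_request(event)

# Sending is left commented out because the endpoint above does not exist:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```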
Step 4 – Simulating End-to-end Data Flow
Fig 4. Simulated trucking data flow in Apache NiFi
With the model accessible via RESTful API calls, we simulate an end-to-end flow in Apache NiFi. Here we have multiple processors that simulate the data (by randomly selecting a combination of acceptable values) and call the model to decide whether a violation is likely to occur. Depending on the model prediction we write the results to a plain-text file.
The third video covers the following:
- Simulating trucking data
- Calling the model via RESTful API
- Routing data based on the API response: violation or no violation
- Storing results
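The four steps above can be mimicked outside NiFi in a few lines of Python, which may help when reasoning about what each processor does. This is only a sketch: the feature names and value lists are invented, and `score_event` is a toy rule standing in for the REST call to the deployed model.

```python
import random
import tempfile
from pathlib import Path

# Hypothetical "acceptable values" for each simulated field.
ACCEPTABLE_VALUES = {
    "hours_driven": [10, 30, 50, 70],
    "miles_driven": [500, 1500, 2500, 3500],
    "foggy": [0, 1],
    "rainy": [0, 1],
}


def simulate_event(rng):
    """Pick one acceptable value per field at random."""
    return {name: rng.choice(values) for name, values in ACCEPTABLE_VALUES.items()}


def score_event(event):
    """Stand-in for the model's REST endpoint (toy rule, not the trained model)."""
    risky = event["hours_driven"] >= 50 and (event["foggy"] or event["rainy"])
    return "violation" if risky else "normal"


def route_event(event, prediction, out_dir):
    """Append the event to a plain-text file named after the prediction."""
    target = Path(out_dir) / (prediction + ".txt")
    with target.open("a", encoding="utf-8") as handle:
        handle.write(",".join(f"{k}={v}" for k, v in event.items()) + "\n")
    return target


def run_flow(n_events, out_dir, seed=0):
    """Simulate, score, route, and store n_events; return counts per route."""
    rng = random.Random(seed)
    counts = {"violation": 0, "normal": 0}
    for _ in range(n_events):
        event = simulate_event(rng)
        prediction = score_event(event)
        route_event(event, prediction, out_dir)
        counts[prediction] += 1
    return counts


with tempfile.TemporaryDirectory() as tmp:
    counts = run_flow(20, tmp)
    lines_written = sum(
        len(path.read_text(encoding="utf-8").splitlines())
        for path in Path(tmp).glob("*.txt")
    )
```

In the NiFi flow each of these functions corresponds to one or more processors; swapping `score_event` for a real HTTP call is the only change needed to score against a deployed model.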
The next step would be to attach dashboards that would allow more advanced monitoring and trigger alerts for the trucking fleet. These alerts would be useful both to the trucking management and to the individual drivers, who could take corrective action to reduce the probability of a violation.
If we deployed this to production, we would replace the simulated data with actual sensor and weather data. We would also make sure we simplified the model by removing redundant features, i.e. features that are highly positively correlated, e.g. hours and miles driven, and do not provide any additional useful information.
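One simple way to spot such redundant features is to compute pairwise Pearson correlations and flag pairs above a cutoff. The sketch below uses plain Python; the sample values are invented, the 0.9 threshold is an arbitrary assumption, and it assumes no column is constant (a constant column would make the denominator zero).

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def redundant_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with correlation >= threshold."""
    pairs = []
    for (name_a, xs), (name_b, ys) in combinations(features.items(), 2):
        r = pearson(xs, ys)
        if r >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Illustrative values only: miles grow almost linearly with hours driven.
features = {
    "hours_driven": [10, 20, 30, 40, 50],
    "miles_driven": [510, 990, 1530, 2010, 2480],
    "foggy": [0, 1, 0, 0, 1],
}

pairs = redundant_pairs(features)
for name_a, name_b, r in pairs:
    print(f"{name_a} and {name_b} are highly correlated (r = {r:.3f})")
```

For each flagged pair, one of the two features is a candidate for removal before retraining.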
For Random Forest classifiers, normalizing data is not necessary, although for other classifiers this may be a necessary step.
Fig 5. Example of unbalanced data in the training data set
Finally, we would keep monitoring model health as new data came in, making sure that the model still performed within acceptable limits on metrics such as area under ROC curve.