IBM’s Data Science Experience (DSX) comes in multiple flavors: cloud, desktop, and local. In this post we cover an IoT trucking use case on DSX Local, i.e. running on top of Hortonworks Data Platform (HDP). We train and deploy a model, and then use that model to score simulated incoming trucking data in NiFi. Along the way we closely follow a data science lifecycle process, discussing each step.

Fig 1. Data science lifecycle

Step 1 – Problem Definition

Imagine a trucking company that dispatches trucks across the country. The trucks are outfitted with sensors that collect data such as the driver’s location, weather conditions, and recent events like speeding, weaving out of the lane, or following too closely. This data is generated once per second and streamed back to the company’s servers.

Fig 2. Sample input data

The company needs a way to process this stream of data and run analysis on it, so that it can make sure trucks are traveling safely and that drivers are not likely to commit violations anytime soon. And all of this has to be done in real time.

Step 2 – ETL & Feature Extraction

Fig 3. Input features and output label correlation matrix

For predicting violations, we simulate trucking events in terms of location, miles driven, and weather conditions. We perform multiple feature engineering steps and examine correlations between different features.

The first video covers the following:

  • Fetching data from HDFS
  • Feature engineering
  • Data visualization
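
As an illustration of these steps, here is a minimal PySpark sketch of what the ETL and feature-engineering portion might look like. The HDFS path, column names, and engineered features are assumptions for illustration, not the exact schema used in the notebook.

```python
# Minimal PySpark sketch of Step 2; path and columns are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName("trucking-etl").getOrCreate()

# Fetch the raw trucking events from HDFS (hypothetical path).
events = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///user/dsx/trucking/events.csv"))

# Simple feature engineering: a speed proxy and a numeric label.
events = (events
          .withColumn("miles_per_hour", F.col("miles_driven") / F.col("hours_driven"))
          .withColumn("label", F.when(F.col("violation") == "yes", 1.0).otherwise(0.0)))

# Assemble features and compute the Pearson correlation matrix (cf. Fig 3).
feature_cols = ["hours_driven", "miles_driven", "miles_per_hour",
                "is_foggy", "is_rainy", "label"]
vec = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(events)
corr = Correlation.corr(vec, "features").head()[0]
print(corr.toArray())
```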

Step 3 – Learning & Model Deployment

Fig 4. Model testing in DSX’s UI

Once the data is ready, we build a predictive model. In our example we use the Spark ML Random Forest classifier. Classification is a statistical technique that assigns a class to each driver: violation or normal. We train the model on a small dataset of historical data and evaluate it on several metrics: accuracy, precision, and area under the ROC curve. Finally, we deploy the model, test it in the DSX UI, and make RESTful API calls against it.

The second video covers the following:

  • Building a Random Forest classifier in Spark ML
  • Saving the model in a Machine Learning repository
  • Deploying the model online via UI
  • Testing the model via UI and RESTful API
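
Below is a minimal sketch of how such a Random Forest classifier could be trained and evaluated with Spark ML, continuing from the DataFrame in the previous sketch. The split ratio, hyperparameters, and save path are assumptions; the actual notebook deploys the model through DSX’s machine learning repository and UI rather than a plain HDFS save.

```python
# Minimal Spark ML sketch of Step 3; hyperparameters and paths are assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

train, test = events.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(
    inputCols=["hours_driven", "miles_driven", "miles_per_hour", "is_foggy", "is_rainy"],
    outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)
model = Pipeline(stages=[assembler, rf]).fit(train)

# Evaluate on the held-out set with the metrics mentioned above.
predictions = model.transform(test)
auc = BinaryClassificationEvaluator(
    labelCol="label", metricName="areaUnderROC").evaluate(predictions)
accuracy = MulticlassClassificationEvaluator(
    labelCol="label", metricName="accuracy").evaluate(predictions)
precision = MulticlassClassificationEvaluator(
    labelCol="label", metricName="weightedPrecision").evaluate(predictions)
print(f"AUC={auc:.3f} accuracy={accuracy:.3f} precision={precision:.3f}")

# Persist the fitted pipeline so it can later be registered and deployed.
model.write().overwrite().save("hdfs:///user/dsx/trucking/rf_model")
```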

Step 4 – Simulating End-to-end Data Flow

Fig 5. Simulated trucking data flow in Apache NiFi

With the model accessible via RESTful API calls, we simulate an end-to-end flow in Apache NiFi. Here we have multiple processors that simulate data (by randomly selecting a combination of acceptable values) and call the model to decide whether a violation is likely to occur. Depending on the model’s prediction, we write the results to a plain-text file.

The third video covers the following:

  • Simulating trucking data
  • Calling the model via RESTful API
  • Routing data based on the API response: violation or no violation
  • Storing results
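
For illustration, the following Python sketch mimics the decision each NiFi processor chain makes: simulate an event, call the deployed model over REST, and route the result. The scoring URL, authorization header, and response schema are hypothetical placeholders; in the actual flow these steps are NiFi processors, not Python code.

```python
# Hypothetical sketch of the simulate -> score -> route decision made in NiFi.
import json
import random
import requests

SCORING_URL = "https://dsx.example.com/v3/scoring/online/<deployment-id>"  # placeholder
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

# Simulate one trucking event from a set of acceptable values.
event = {
    "hours_driven": random.randint(1, 11),
    "miles_driven": random.randint(10, 600),
    "is_foggy": random.choice([0, 1]),
    "is_rainy": random.choice([0, 1]),
}

# Call the deployed model; the response schema here is an assumption.
response = requests.post(SCORING_URL, headers=HEADERS, data=json.dumps(event))
prediction = response.json().get("prediction")

# Route on the model's answer, mirroring the routing step in the NiFi flow.
output_file = "violations.txt" if prediction == 1 else "normal.txt"
with open(output_file, "a") as f:
    f.write(json.dumps(event) + "\n")
```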

Closing Thoughts

The next step would be to attach dashboards that allow more monitoring and trigger alerts for the trucking fleet. These alerts would be useful both to trucking management and to individual drivers, who could take corrective action to reduce the probability of a violation.

If we deployed this to production, we would replace the simulated data with actual sensor and weather data. We would also simplify the model by removing redundant features, i.e. features that are highly positively correlated with one another (e.g. hours and miles driven) and therefore provide no additional useful information.
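
A minimal sketch of that redundancy check, reusing the correlation matrix from the Step 2 sketch above; the 0.9 threshold is an arbitrary illustrative choice.

```python
# Drop one feature from every highly correlated predictor pair (threshold assumed).
import itertools

threshold = 0.9
matrix = corr.toArray()  # Pearson correlation matrix computed in the Step 2 sketch
to_drop = set()
for i, j in itertools.combinations(range(len(feature_cols)), 2):
    if "label" in (feature_cols[i], feature_cols[j]):
        continue  # only prune predictor-predictor redundancy, never the label
    if abs(matrix[i][j]) > threshold:
        to_drop.add(feature_cols[j])  # keep the first feature of the pair

print("Redundant features to remove:", to_drop)
events = events.drop(*to_drop)
```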

For Random Forest classifiers, normalizing data is not necessary, although for other classifiers this may be a necessary step.

Fig 6. Example of unbalanced data in the dataset

Finally, we would keep monitoring model health as new data came in, making sure that our models are still performing with acceptable metrics, e.g. area under the ROC curve.
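
A minimal sketch of such a health check, assuming newly labeled events are periodically landed in HDFS; the path and AUC threshold are illustrative assumptions.

```python
# Periodic model-health check; paths, threshold, and alerting hook are assumptions.
from pyspark.ml import PipelineModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator

AUC_THRESHOLD = 0.75  # illustrative value agreed with the business

model = PipelineModel.load("hdfs:///user/dsx/trucking/rf_model")
# New events with confirmed labels and the same engineered columns as training data.
new_events = spark.read.parquet("hdfs:///user/dsx/trucking/labeled_new_events")

scored = model.transform(new_events)
auc = BinaryClassificationEvaluator(
    labelCol="label", metricName="areaUnderROC").evaluate(scored)

if auc < AUC_THRESHOLD:
    # In production this would raise an alert or trigger retraining.
    print(f"Model health check failed: AUC dropped to {auc:.3f}")
else:
    print(f"Model healthy: AUC = {auc:.3f}")
```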

Resources


