Watson Studio Local is now part of IBM Cloud Pak for Data.
Summary
This code pattern shows how data scientists can leverage remote Spark clusters and compute environments to train and deploy a spam filter model. The model is built using natural language processing and machine learning algorithms and is used to classify whether a given text is spam or not.
Description
This code pattern demonstrates how data scientists can use remote Spark clusters and compute environments from Hortonworks Data Platform (HDP) to train and deploy a spam filter model using Watson Studio Local.
A spam filter is a classification model built using natural language processing and machine learning algorithms. The model is trained on an SMS spam collection dataset to classify whether a given text is spam or ham (not spam).
This code pattern provides several examples that tackle this problem, using both local (Watson Studio Local) and remote (HDP cluster) resources.
After completing this code pattern, you'll understand how to:
Load data into Spark DataFrames and use Spark's machine learning library (MLlib) to develop, train, and deploy the Spam Filter Model (see the PySpark sketch after this list).
Load the data into pandas DataFrames and use the scikit-learn machine learning library to develop, train, and deploy the Spam Filter Model (see the scikit-learn sketch below).
Use the sparkmagics library to connect to the remote Spark service in the HDP cluster via the Hadoop Integration service (see the connection sketch below).
Use the sparkmagics library to push the Python virtual environment containing the scikit-learn library to the remote HDP cluster via the Hadoop Integration service.
Package the Spam Filter Model as a Python egg and distribute the egg to the remote HDP cluster via the Hadoop Integration service (see the packaging sketch below).
Run the Spam Filter Model (both the PySpark and scikit-learn versions) in the remote HDP cluster using the remote Spark context and the remote Python virtual environment, all from within IBM Watson Studio Local.
Save the Spam Filter Model in the remote HDP cluster, import it back into Watson Studio Local, and batch score and evaluate the model (see the save/load sketch below).
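For illustration, the PySpark training step might look like the following minimal sketch. The file name, column names, and the choice of logistic regression are assumptions for this sketch, not necessarily what the pattern's notebooks use:

```python
# Minimal PySpark sketch: train a spam/ham classifier with an MLlib pipeline.
# File and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("SpamFilter").getOrCreate()

# Assumes a two-column CSV: a "spam"/"ham" label and the message text.
df = spark.read.csv("SMSSpamCollection.csv", header=True).toDF("label_text", "text")

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="label_text", outputCol="label"),  # spam/ham -> 1.0/0.0
    Tokenizer(inputCol="text", outputCol="words"),            # split text into tokens
    HashingTF(inputCol="words", outputCol="tf"),              # term frequencies
    IDF(inputCol="tf", outputCol="features"),                 # TF-IDF features
    LogisticRegression(maxIter=10),                           # the classifier
])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)
```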
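The scikit-learn version follows the same shape with pandas and a TF-IDF pipeline; again, file and column names are placeholders:

```python
# Minimal scikit-learn sketch of the same spam/ham classifier.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Assumes a tab-separated file with a label column and a text column.
df = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "text"])

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42)

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),   # text -> TF-IDF features
    ("lr", LogisticRegression()),   # spam/ham classifier
])
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))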
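Connecting a notebook to the remote Spark service, and shipping a Python environment along with the session, typically looks something like the sketch below. The Livy URL, session name, HDFS paths, and configuration values are placeholders, and the Hadoop Integration service in Watson Studio Local may wrap these steps in its own helpers:

```python
# Sketch of connecting to remote Spark via Livy with sparkmagic cell magics.
# Each %-block below is its own notebook cell; all names are placeholders.

# Cell 1: load the sparkmagic extension.
%load_ext sparkmagic.magics

# Cell 2: (optional) session configuration. Here, ship a zipped Python virtual
# environment (previously uploaded to HDFS) so executors can import scikit-learn.
%%spark config
{"conf": {
    "spark.yarn.dist.archives": "hdfs:///user/demo/scikit_env.zip#scikit_env",
    "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./scikit_env/bin/python"
}}

# Cell 3: create the remote PySpark session against the Livy endpoint.
%spark add -s spam_session -l python -u http://<hdp-edge-node>:8998

# Cell 4: anything in a %%spark cell now runs on the HDP cluster.
%%spark
print(sc.version)
```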
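Packaging the model code as a Python egg is standard setuptools; the project name, egg file name, and HDFS path here are hypothetical:

```python
# setup.py -- package the Spam Filter code as a Python egg ("spamfilter" is a
# hypothetical package name).
from setuptools import setup, find_packages

setup(
    name="spamfilter",
    version="0.1",
    packages=find_packages(),
)

# Build the egg:   python setup.py bdist_egg   (produces dist/spamfilter-0.1-*.egg)
# In the remote session, make the egg importable on the driver and executors:
#   sc.addPyFile("hdfs:///user/demo/spamfilter-0.1-py2.7.egg")
```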
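Saving the trained model on the cluster and scoring with it later might look like this sketch, assuming the PipelineModel from the PySpark sketch above and placeholder HDFS paths:

```python
# Sketch: persist the trained pipeline model to HDFS, then reload it for
# batch scoring. Paths and DataFrame names are placeholders.
from pyspark.ml import PipelineModel

model.write().overwrite().save("hdfs:///user/demo/models/spam_filter")

# Later (e.g., after importing the model back into Watson Studio Local):
reloaded = PipelineModel.load("hdfs:///user/demo/models/spam_filter")

# new_messages_df is a placeholder DataFrame with the input columns the
# pipeline expects; transform() scores it in batch.
scored = reloaded.transform(new_messages_df)
```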
Flow
The spam collection dataset is loaded into Watson Studio Local as an asset.
The user interacts with the Jupyter notebooks by running them in Watson Studio Local.
Watson Studio Local can either use the resources available locally or utilize HDP cluster resources by connecting to Apache Livy, which is part of the Hadoop Integration service.
Livy connects with the HDP cluster to run Apache Spark or access HDFS files (see the sketch below).
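Under the covers, Livy exposes a REST API; creating a remote PySpark session amounts to a single POST. This sketch uses a placeholder endpoint and is not necessarily the exact call the Hadoop Integration service makes:

```python
# Sketch: create a remote PySpark session through Livy's REST API.
# The endpoint is a placeholder for the HDP edge node running Livy.
import requests

livy = "http://<hdp-edge-node>:8998"
resp = requests.post(livy + "/sessions", json={"kind": "pyspark"})
print(resp.status_code, resp.json())  # e.g. 201 {"id": 0, "state": "starting", ...}
```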
Instructions
Find the detailed instructions in the README file. These steps will show you how to:
Clone the repo.
Create a project in IBM Watson Studio Local.
Create project assets.
Commit changes to Watson Studio Local Master Repository.
Run the notebooks listed in each example.