Often, we see budding data science enthusiasts struggle with coding. PyCaret is your one-stop solution! It is the ultimate low-code machine learning library you have been waiting for!

PyCaret is “an open-source, low-code, end-to-end machine learning library in Python.”

Simply put, you can design and deploy your machine learning model within seconds in any notebook of your choice! This powerful library neatly automates multiple steps in the machine learning pipeline and is very interactive too. Compared to other leading machine learning libraries, the number of lines of code needed to build an end-to-end machine learning pipeline is far smaller. PyCaret also integrates seamlessly with all the leading ML libraries and tools.

This article walks through some of PyCaret’s unique features and helps you get started with the library. Let’s get started!

Installing PyCaret

PyCaret works in Jupyter, AWS, Colab, and similar environments. Installation is straightforward and done using pip.

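A minimal sketch; in Jupyter or Colab the leading “!” runs the command through the shell, while in a terminal you would drop it:

```python
# Install PyCaret from PyPI (inside a notebook cell)
!pip install pycaret
```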

Importing Data in PyCaret 

Importing data in PyCaret is done in two ways – importing external data using pandas, or using the built-in datasets from the PyCaret repository. You can look at the available datasets in PyCaret here.

Example :

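A minimal sketch of both approaches; the CSV file name is purely illustrative:

```python
import pandas as pd
from pycaret.datasets import get_data

# Option 1: use a built-in dataset from the PyCaret repository
data = get_data('diabetes')

# Option 2: import external data with pandas (hypothetical file)
# data = pd.read_csv('my_data.csv')
```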

Environment Setup

The most important step in PyCaret is setting up an environment. This is mandatory before we proceed to model building, and it takes two steps.

  • Initially, import the machine learning module that matches your task (classification, regression, clustering, and so on) from the PyCaret library.
  • Then set up the environment using the setup() function, where the data and the target can be passed as arguments. A combined example of both steps follows.

Example :
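
A minimal sketch, assuming the classification module and the built-in 'diabetes' dataset, whose target column is 'Class variable':

```python
# Step 1: import the task-specific module
from pycaret.classification import setup
from pycaret.datasets import get_data

data = get_data('diabetes')

# Step 2: initialize the environment; PyCaret infers feature types
# and builds the preprocessing pipeline from this one call
clf = setup(data=data, target='Class variable')
```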

Once setup runs, it outputs a summary of what it inferred from the dataset: which features are categorical or numeric, how many values are missing, and many more properties.

If the inferences from the setup function are correct, press Enter in the dialog box that appears below the output; otherwise, enter ‘quit’, manually pass extra arguments to correct the feature types, and run the setup command again.

Once the setup is complete, we receive the message Setup Successfully Completed!

Session ID 

A random number known as the session_id can be passed as an argument to the setup function. This helps reproduce the experiment later, whether in the same environment or a different one.

Example :

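A minimal sketch; the value 123 is arbitrary:

```python
# Fixing session_id makes the whole experiment reproducible
clf = setup(data=data, target='Class variable', session_id=123)
```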

We have our data and environment ready. Now let’s jump into the most time-consuming phase of the machine learning process.

Data Pre-Processing 

About 80 percent of the time in a machine learning project is spent pre-processing and cleaning the data. PyCaret can pre-process the data automatically: it provides a library of over 25 preprocessing steps that can be used to prepare your data. From feature transformation to advanced feature engineering, PyCaret automates it all.

Missing Value Imputation

One technique is missing value imputation. Rather than simply omitting rows with ‘NaN’ values, PyCaret imputes the empty or NaN values.

The default strategy is to impute missing values with the mean for numerical features and a constant value for categorical features.

These can also be imputed with other measures, such as the median for numerical features or the mode for categorical features, using the numeric_imputation and categorical_imputation arguments of the setup function.
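
A sketch overriding both imputation defaults in setup(), continuing with the same data:

```python
from pycaret.classification import setup

# Use the median for numeric gaps and the mode for categorical gaps
clf = setup(data=data, target='Class variable',
            numeric_imputation='median',
            categorical_imputation='mode')
```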

Note : 

In classification tasks, PyCaret automatically encodes the target variable if it is a categorical feature.

Normalization 

Scaling the numerical features to a common scale is a standard part of the data cleaning process. Normalizing the variables is easily done in PyCaret using the normalize parameter within the setup function. Its default value is False, and PyCaret normalizes based on the z-score measure.

However, this can be changed using the normalize_method parameter.
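
A sketch; 'zscore' is the default, and 'minmax' is shown as the alternative:

```python
clf = setup(data=data, target='Class variable',
            normalize=True,             # enable scaling (default False)
            normalize_method='minmax')  # rescale features to the [0, 1] range
```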

Transformation

Transformation is another important step, sometimes applied to make the data closer to a Gaussian distribution. This is made simple using the transformation parameter (default value = False), and the method of transformation can be changed using the transformation_method parameter.
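
A sketch; 'yeo-johnson' is the default transformation method and 'quantile' the alternative:

```python
clf = setup(data=data, target='Class variable',
            transformation=True,               # default False
            transformation_method='quantile')  # instead of 'yeo-johnson'
```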

Feature Selection

PyCaret reduces our work in analysing which features contribute most to predicting the target variable. The feature_selection option automatically ignores and removes the less important variables in the dataset. Its default value is False, and it can be enabled as a parameter within the setup function.

We can also manually remove specific variables of our choice using the ignore_features parameter, which takes a list of variable names to be ignored.
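
A sketch combining both options; the ignored column name is hypothetical:

```python
clf = setup(data=data, target='Class variable',
            feature_selection=True,          # drop weak features automatically
            ignore_features=['patient_id'])  # hypothetical ID column to exclude
```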

Multicollinearity

Among the many variables in a dataset, a few may be highly correlated with others, and it is essential to remove this multicollinearity. PyCaret’s remove_multicollinearity option does the work for us: of two highly correlated variables, the one less correlated with the target is removed (default value = False).
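
A sketch; multicollinearity_threshold sets how correlated two features must be before one is dropped:

```python
clf = setup(data=data, target='Class variable',
            remove_multicollinearity=True,
            multicollinearity_threshold=0.9)  # drop one of any pair above 0.9
```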

Note : These tasks are a part of the pipeline in PyCaret developed when we set up an environment.

Model Building 

Now the data has been processed automatically within the setup environment.

Models available in the library

PyCaret supports multiple supervised and unsupervised machine learning models, and each module comes with its own catalogue of estimators.
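
If you are on PyCaret 2.0 or later, the models() helper lists the catalogue for the imported module; a quick sketch:

```python
from pycaret.classification import models

# Returns a DataFrame of model IDs (abbreviations) and names
models()
```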

Model Building flow

Train-Test Split

Splitting a dataset into a training set and a testing set is standard in every machine learning process: the model is trained on the train set and evaluated on the test set.

By default, the data is split into train and test sets in a 70:30 ratio. This can be changed using the train_size argument in the setup environment.
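
A sketch changing the split to 80:20:

```python
clf = setup(data=data, target='Class variable',
            train_size=0.8)  # 80 percent train, 20 percent test
```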

Comparing Models

A useful feature in PyCaret is the compare_models function, which trains and compares all available models of the imported module (regression or classification) against common evaluation metrics using k-fold cross-validation.

This produces a table of model results across the evaluation metrics, sorted with the best-performing model at the top and the weakest at the bottom.

For classification problems, the table is sorted by Accuracy; for regression problems, by R2 score. The default value of k in k-fold is 10, which can be changed using the fold argument of the compare_models() function.
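
A minimal sketch; in recent PyCaret versions compare_models also returns the best-performing model object:

```python
from pycaret.classification import compare_models

# Cross-validate every available classifier with 5 folds and print the leaderboard
best = compare_models(fold=5)
```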

Output of compare_models

Note : compare_models is a function which can be used only for classification and regression tasks.

Creating a Model 

After comparing the results of the different algorithms for a classification or regression problem, creating a model from the best-performing algorithm is very simple in PyCaret.

A model can be created using the create_model function, which takes a single mandatory parameter: the model abbreviation, passed as a string.

Output : 

For supervised learning modules (regression/classification), the output is a table of k-fold cross-validated scores together with the trained model object. For unsupervised learning modules, only the trained model object is produced.

Example : 

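A sketch creating a logistic regression with the 'lr' abbreviation:

```python
from pycaret.classification import create_model

# Trains the model with 10-fold cross-validation (the default)
lr = create_model('lr')
```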

create_model also accepts a fold argument for changing the number of folds and a round argument for rounding the displayed metrics to the specified number of decimals.

Tuning a Model 

Once we have created a model, tuning the hyper-parameters of any machine learning algorithm is easily done in PyCaret using the tune_model function. It takes a mandatory string parameter: the model abbreviation.

The output of tune_model is the same as that of create_model.

Parameters in tune_model : 

  • The number of search iterations can be passed manually using n_iter ( default value = 10 iterations ).
  • The fold parameter can be changed from its default value of 10 using fold.
  • The decimal values in the table can be rounded off using the round argument.
  • The model can be optimized for a specific metric ( the default is R2 for regression and Accuracy for classification ) passed as an argument within tune_model, as the sketch below shows.
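
A sketch following the article’s string-abbreviation usage (PyCaret 1.x style; newer releases expect a trained model object instead):

```python
from pycaret.classification import tune_model

# Random search over 20 hyper-parameter candidates, optimizing AUC
tuned_dt = tune_model('dt', n_iter=20, optimize='AUC')
```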

Ensemble Model

ensemble_model is used for ensembling a trained model; it takes the trained model object as its mandatory parameter.

Bagging or Boosting can be passed as the method within the ensemble_model function.

The output is the same as described for the create_model function.
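
A sketch bagging a decision tree; method='Boosting' is the alternative:

```python
from pycaret.classification import create_model, ensemble_model

dt = create_model('dt')                           # base estimator
bagged_dt = ensemble_model(dt, method='Bagging')  # wrap it in a bagging ensemble
```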

To know more about ensemble models, read Digital Tesseract’s Boosting Explained and the official site of PyCaret here.

Blend Models

The blend_models feature combines different machine learning models and works as an ensemble technique ( a voting system, or an average of all model outcomes ). By default, blend_models uses all the available models in the classification or regression module. However, specific trained models can also be passed as an estimator_list within blend_models.
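
A sketch blending three trained classifiers via estimator_list:

```python
from pycaret.classification import create_model, blend_models

lr = create_model('lr')
dt = create_model('dt')
knn = create_model('knn')

# Combine the three estimators through voting
blender = blend_models(estimator_list=[lr, dt, knn])
```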

Stack Models

PyCaret has many interesting features that make predictions more accurate. One such feature is the stack_models function. Stacking uses meta-learning: a meta-model learns from the predictions of multiple base estimators, which often improves the final predictions. The desired meta-model can be passed using the meta_model argument ( default model = linear ).

Multiple-layer stacking is also possible using the create_stacknet function, which takes estimator_list as a list of lists.
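
A sketch stacking the three models trained above under the default linear meta-model:

```python
from pycaret.classification import stack_models

# Base estimators feed their predictions into the meta-model
stacker = stack_models(estimator_list=[lr, dt, knn])
```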

Please check the official site of PyCaret on Stack Models here

Model Analysis

Analysing models is among the most time-consuming parts of any machine learning process. PyCaret makes our work easier in this phase too.

Plot Model 

Plotting a model against different metrics can be done using the plot_model feature. The parameters of plot_model are the trained model object and the type of plot.

Types of plot include the confusion matrix, AUC, residual plots, and more.

To get an idea on different types of plots that can be passed within plot_model, see here.
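
A sketch using the tuned model from earlier; the plot names follow PyCaret’s conventions:

```python
from pycaret.classification import plot_model

plot_model(tuned_dt, plot='auc')               # ROC curves with AUC scores
plot_model(tuned_dt, plot='confusion_matrix')  # confusion matrix on the holdout set
```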

Interpret Model

Interpretation of a model can be done using the interpret_model function.

Similar to the plot_model function, interpret_model also takes two parameters: a trained model object and the type of plot as a string.
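
A sketch; interpret_model is SHAP-based and works with tree-based models such as the tuned decision tree above:

```python
from pycaret.classification import interpret_model

# SHAP summary plot of overall feature importance
interpret_model(tuned_dt, plot='summary')
```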

Model Testing

We have successfully built our model and interpreted the results using the features above. In order to see how the model performs, predictions should be made on either the test set or new data.

predict_model predicts on the test data held out by the pipeline, taking the trained model object as its parameter.

If predictions are needed on new data rather than the test set, predict_model takes the data as an additional parameter.

Example :

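A sketch of both cases; new_data stands in for any unseen DataFrame with the same columns:

```python
from pycaret.classification import predict_model

# Score the holdout (test) split created by setup()
holdout_preds = predict_model(tuned_dt)

# Score genuinely new data (new_data is a hypothetical DataFrame)
# new_preds = predict_model(tuned_dt, data=new_data)
```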

Finalizing a model 

finalize_model is a function that trains the chosen model one last time on the entire dataset. finalize_model takes a trained model object as its parameter.

Example :

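A sketch finalizing the tuned model:

```python
from pycaret.classification import finalize_model

# Refit on the full dataset (train + holdout) before deployment
final_dt = finalize_model(tuned_dt)
```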

Saving a Model and Experiment

save_model is a function that saves a model and its entire pipeline as a pickle file. It takes two parameters: the trained model object and the name to save it under, as a string. The saved model can later be loaded back using the load_model function.

Example :

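A sketch; the file name is arbitrary and gets a .pkl extension on disk:

```python
from pycaret.classification import save_model, load_model

# Persist the model together with its preprocessing pipeline
save_model(final_dt, 'final_dt_model')

# Restore it later, for example in a production notebook
loaded_dt = load_model('final_dt_model')
```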

Saving an Experiment

The transformation pipeline, the models built inside the pipeline, and all the output of the experiment can be saved as a single pickle file using PyCaret’s save_experiment function.

Similar to loading a saved model, an experiment can be loaded back using the load_experiment function.
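
A sketch, assuming a PyCaret 1.x release that still ships save_experiment/load_experiment:

```python
from pycaret.classification import save_experiment, load_experiment

# Save every output of this session as one pickle file
save_experiment(experiment_name='diabetes_experiment')

# Reload the entire experiment later
experiment = load_experiment('diabetes_experiment')
```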

Conclusion 

This blog is an introduction to PyCaret, an interesting library that handles the end-to-end machine learning process. As promised, PyCaret is a low-code machine learning library where most steps come down to a single short line of code. Pre-processing, analysis, and model interpretation all become very easy with PyCaret. Stay tuned to Digital Tesseract for many more interesting blogs.

Reference Article 

Official Site of PyCaret – https://pycaret.org/

Tags : pycaret, low-code library