A common challenge data scientists face in practice is sharing insights from their analysis with business stakeholders. What if we could build a web app and share intuitive dashboards with stakeholders around the world? Are you worried about the skills and effort required to build an app? Worry no more! Streamlit is your answer!
Streamlit is an open-source app framework that helps data scientists and machine learning engineers create beautiful apps with very few lines of Python code. There are no special prerequisites for working with Streamlit: everything is done in plain Python.
In this article, we will build a machine learning dataset explorer app with Streamlit that helps us understand the features and distributions of some standard machine learning datasets through analysis and visualization.
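But before that, let’s create a Python virtual environment and install the required packages. If the ‘virtualenv’ tool is not available yet, install it first with ‘pip install virtualenv’.
Then create a virtualenv named ‘exploreapp’ (or any name you’d like) by running ‘virtualenv exploreapp’.
Activate the virtual environment on Linux or macOS with ‘source exploreapp/bin/activate’.
Activate the virtual environment on Windows with ‘exploreapp\Scripts\activate’.
Once done, you will see (exploreapp) at the left of your terminal prompt, which means the virtual environment is active.
Now we will install the packages needed to build our app with ‘pip install’. This article relies on Streamlit, pandas, Matplotlib, and seaborn.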
With all the packages installed, we can check that Streamlit is working by running ‘streamlit hello’ in the terminal.
It should open Streamlit’s built-in demo app in the browser.
You can stop the app by pressing ‘CTRL + C’ in your terminal. The same shortcut works on Windows, Linux, and macOS (on a Mac it is the Control key, not Command).
Now we will create a Python file named ‘app.py’, which will contain all the code needed to run our app. After that, create a folder named ‘datasets’ in the same directory and copy all the CSV files (you will find them in the attachments below) into it. Now open ‘app.py’ in any text editor.
Import Required Packages
Create the main function:
The ‘title’ method is used to create a heading in our app and it takes a string as an argument.
Now call the main function:
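As a rough illustration of the two steps above, here is a minimal sketch of a ‘main’ function that only sets a title. The title text and the ‘APP_TITLE’ name are assumptions made for this example, not the original listing.

```python
APP_TITLE = "Machine Learning Dataset Explorer"  # assumed title, pick your own


def main():
    # In app.py this import normally sits at the top of the file with the
    # other imports; it is kept inside the function here so the sketch can
    # be loaded even where Streamlit is not installed.
    import streamlit as st

    # st.title takes the heading text as a string.
    st.title(APP_TITLE)
```

In ‘app.py’ the function is then called at the bottom of the file, commonly as ‘main()’ under an ‘if __name__ == "__main__":’ guard.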
To check that our app is working, open the terminal and run ‘streamlit run app.py’.
The command prints a URL pointing to the app. Open it in the browser and you’ll see something like this. Hurray, we have the app up and running! That was simple, wasn’t it?
Now we will add some functionality to the main function. The first thing we need is a way for the user to choose a dataset from our collection. To do so, add the following code to the main function.
Here the ‘st.write()’ method displays whatever we pass to it as an argument, whether that is a string, a number, or even a dataframe. It works much like the ‘print()’ function in Python, and it supports Markdown as well.
In the next line we used a function called ‘select_dataset_file()’, which returns the location of the dataset chosen by the user. To implement it, we create a ‘selectbox’ widget, which prompts the user to choose one item from a list.
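One possible shape for this function is sketched below. It assumes the CSV files live in the ‘datasets’ folder created earlier. The helper ‘list_dataset_files’, the ‘DATASET_DIR’ constant, and the widget label are names invented for this example.

```python
import os

DATASET_DIR = "datasets"  # the folder created next to app.py


def list_dataset_files(folder=DATASET_DIR):
    """Return the sorted names of all CSV files in `folder`."""
    return sorted(
        name for name in os.listdir(folder) if name.lower().endswith(".csv")
    )


def select_dataset_file(folder=DATASET_DIR):
    """Show a selectbox of CSV files and return the path of the chosen one."""
    # Lazy import: keeps list_dataset_files usable without Streamlit.
    import streamlit as st

    # st.selectbox returns the option the user picked.
    chosen = st.selectbox("Select a dataset", list_dataset_files(folder))
    return os.path.join(folder, chosen)
```

The sketch does not handle an empty folder. In that case ‘selectbox’ has nothing to return, so a real app would want to check for it first.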
Next, we called another function, ‘check_dataset_category()’, which takes the filename chosen by the user and returns the category of that dataset. Here we use three categories based on the type of machine learning task that can be performed on the data: Regression, Classification, and Timeseries. Each category calls for different analysis and visualization techniques. For example, we don’t need to count the samples per class in sequential data, nor do we need a time-series plot for a classification dataset.
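A simple way to implement such a lookup is a dictionary from category to file names, as in the sketch below. The file names are placeholders chosen for illustration, since the actual dataset names are not listed here. The "Unknown" fallback is also an assumption.

```python
import os

# Hypothetical file names: replace them with the CSV files that are
# actually present in your `datasets` folder.
DATASET_CATEGORIES = {
    "Regression": ["housing.csv"],
    "Classification": ["iris.csv"],
    "Timeseries": ["air_passengers.csv"],
}


def check_dataset_category(dataset_path):
    """Return the category registered for the given dataset file."""
    filename = os.path.basename(dataset_path)
    for category, filenames in DATASET_CATEGORIES.items():
        if filename in filenames:
            return category
    # Assumed fallback for files that have not been registered yet.
    return "Unknown"
```

With this layout, supporting a new dataset only means adding its file name to the matching list.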
Now let’s get back to our main function. The next line uses an ‘info’ widget, which displays an informational message. We use it to show the category of the dataset the user has chosen.
With that done, your app should look something like this:
Using the dataset filename chosen by the user, we read the dataset with pandas and store it in a variable as a dataframe. We then use a ‘checkbox’ widget which, when enabled, shows the dataframe to the user. To make it more interactive, we also add a ‘number_input’ widget that lets the user choose how many rows they want to see. The parameters used in ‘number_input’ are a ‘label’, which describes what the number is for (we passed an empty string there and used Markdown above it for visual aesthetics), a ‘min_value’, which is the smallest value the user can enter, and a ‘value’, which is the initial value of the ‘number_input’.
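The preview logic described above could look roughly like the following sketch. The function name, the widget labels, and the default of 5 rows are assumptions made for the example. Recent Streamlit releases may warn about an empty label.

```python
def show_dataset_preview(df):
    """Optionally display the first rows of `df`, with a user-chosen count."""
    # Lazy import: lets this sketch load without Streamlit installed.
    import streamlit as st

    if st.checkbox("Show Dataset"):
        # Markdown text stands in for the widget label, as in the article.
        st.markdown("Number of rows to view")
        # min_value and value are ints, so the widget returns an int.
        rows = st.number_input("", min_value=1, value=5)
        st.dataframe(df.head(rows))
```

Inside ‘main()’ it could be called as ‘show_dataset_preview(pd.read_csv(dataset_path))’, where ‘dataset_path’ stands for the location returned by ‘select_dataset_file()’.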
Similarly, we will create checkboxes to:
- Show list of columns
- Show datatypes of each column
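The two checkboxes above might be written as in this sketch. The function name and the labels are illustrative. Converting the dtypes to strings is a precaution added here so they display as plain text, not something the article states.

```python
def show_columns_and_dtypes(df):
    """Add checkboxes that reveal the column names and column datatypes."""
    # Lazy import: lets this sketch load without Streamlit installed.
    import streamlit as st

    if st.checkbox("Show list of columns"):
        st.write(list(df.columns))

    if st.checkbox("Show datatypes of each column"):
        # Cast to str so each dtype is shown as plain text.
        st.write(df.dtypes.astype(str))
```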
If a user wants to see only specific columns of the dataframe, we can use a widget called ‘multiselect’, which lets them pick multiple items from a list. To do so, we create a ‘checkbox’ to enable this functionality, then get the list of columns from the dataframe and pass it to the ‘multiselect’ widget.
There’s one more method we used up there: ‘dataframe’, which displays dataframes in the app.
Let’s add a few more things to our app:
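The column picker can be sketched as follows. The function name and the labels are assumptions. The check for an empty selection is added so nothing is drawn until the user has picked at least one column.

```python
def show_selected_columns(df):
    """Let the user pick columns and display only those."""
    # Lazy import: lets this sketch load without Streamlit installed.
    import streamlit as st

    if st.checkbox("Show specific columns"):
        # st.multiselect returns the chosen options as a list.
        columns = st.multiselect("Select columns", list(df.columns))
        if columns:
            st.dataframe(df[columns])
```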
Category specific analysis
Now we will perform some analysis based on the category of the selected dataset. For a classification dataset, we can show the value count of each class and display a pie chart showing the percentage of samples that belong to each class. To plot the pie chart, we used the ‘plot’ method of a pandas dataframe, passed the result to Streamlit’s ‘write’ method, and called the ‘pyplot()’ method to display the plot.
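A sketch of this class analysis follows. It differs from the description in one respect: it builds an explicit Matplotlib figure and passes it to ‘st.pyplot’, because newer Streamlit versions discourage calling ‘pyplot()’ without a figure. The function name and the ‘target_column’ parameter (the column holding the class labels) are assumptions.

```python
def show_class_distribution(df, target_column):
    """Show per-class counts and a pie chart of the class shares."""
    # Lazy imports: let this sketch load without the plotting stack.
    import matplotlib.pyplot as plt
    import streamlit as st

    # Number of samples per class, most frequent first.
    counts = df[target_column].value_counts()
    st.write(counts)

    fig, ax = plt.subplots()
    # autopct prints each slice's percentage on the chart.
    counts.plot.pie(autopct="%1.1f%%", ax=ax)
    ax.set_ylabel("")  # clear the y-axis label that pandas adds
    st.pyplot(fig)
```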
For Regression and Classification datasets, we can also show a correlation plot, which reveals how strongly the features are related to one another, using seaborn’s heatmap plot.
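The correlation plot could be produced along these lines. The function name, the colour map, and the choice to restrict the calculation to numeric columns are assumptions for this sketch. The last one is made because text columns such as class labels have no correlation coefficient.

```python
def show_correlation_heatmap(df):
    """Draw a seaborn heatmap of the correlations between numeric columns."""
    # Lazy imports: let this sketch load without the plotting stack.
    import matplotlib.pyplot as plt
    import seaborn as sns
    import streamlit as st

    # Keep numeric columns only, then compute pairwise correlations.
    corr = df.select_dtypes(include="number").corr()

    fig, ax = plt.subplots()
    # annot=True writes the coefficient into each cell.
    sns.heatmap(corr, annot=True, cmap="coolwarm", ax=ax)
    st.pyplot(fig)
```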
Data visualization in Streamlit is very easy: we can either use built-in methods such as ‘line_chart’, ‘area_chart’, and ‘bar_chart’, or pass our custom plots to Streamlit via the ‘write’ method.
For Timeseries data, we need to plot the values against time, so the best representation is a line chart, area chart, or bar chart. To do so, we first check the category of the dataset we are working with, then extract the column containing the time data, and use a ‘selectbox’ widget to let the user choose the type of plot they want. Finally, we create a ‘button’ widget that, when pressed, generates the chart for the chosen plot type.
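The selectbox-plus-button flow described above can be sketched with a small lookup table from plot name to built-in chart method. The function name, the labels, and the ‘time_column’ parameter (the name of the column holding the time values) are assumptions. The category check is expected to happen before this function is called.

```python
# Maps the option shown to the user to the Streamlit chart method name.
CHART_METHODS = {
    "Line Chart": "line_chart",
    "Area Chart": "area_chart",
    "Bar Chart": "bar_chart",
}


def show_timeseries_chart(df, time_column):
    """Plot the dataset against time with the chart type the user picks."""
    # Lazy import: lets this sketch load without Streamlit installed.
    import streamlit as st

    # Use the time column as the index so it becomes the x-axis.
    series = df.set_index(time_column)

    plot_type = st.selectbox("Select plot type", list(CHART_METHODS))
    # The chart is only drawn once the button is pressed.
    if st.button("Generate plot"):
        draw = getattr(st, CHART_METHODS[plot_type])
        draw(series)
```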
For classification datasets, we can visualize the classes and see their distributions and ranges with plots such as box plots, KDE plots, violin plots, and swarm plots. Here we use seaborn plots and pass them to Streamlit’s ‘write’ function.
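As an illustration, the four plot types can be dispatched from a single function like the one sketched below. The function name and its parameters (‘target_column’ for the class labels, ‘feature_column’ for the numeric feature to plot, ‘plot_type’ for the user’s choice) are assumptions. The figure is shown with ‘st.pyplot’, though a Matplotlib figure can also be handed to ‘st.write’.

```python
# Plot names offered to the user, mapped to seaborn function names.
PLOT_FUNCTIONS = {
    "Box Plot": "boxplot",
    "Violin Plot": "violinplot",
    "Swarm Plot": "swarmplot",
}


def show_class_plot(df, target_column, feature_column, plot_type):
    """Plot one feature per class using the chosen seaborn plot type."""
    # Lazy imports: let this sketch load without the plotting stack.
    import matplotlib.pyplot as plt
    import seaborn as sns
    import streamlit as st

    fig, ax = plt.subplots()
    if plot_type == "KDE Plot":
        # One density curve per class, coloured by the class label.
        sns.kdeplot(data=df, x=feature_column, hue=target_column, ax=ax)
    else:
        # Classes on the x-axis, feature values on the y-axis.
        plot = getattr(sns, PLOT_FUNCTIONS[plot_type])
        plot(data=df, x=target_column, y=feature_column, ax=ax)
    st.pyplot(fig)
```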
With that, our data explorer app is complete. As you can see, Streamlit offers a new and interesting way to interact with our datasets. You can not only perform data analysis easily and efficiently but also deploy machine learning models with just a few lines of code. You can also host the apps you’ve built with Streamlit on platforms such as Heroku and AWS. Check the official site for good use cases!
You can find the source code and datasets used in this app in the attachments below. Why not add some more datasets and try different analyses and visualizations on them? Just don’t forget to add each new dataset file name to the ‘check_dataset_category()’ function.
A sample demo of the app we have just created!
Hope you enjoyed this blog and learned something new! I would love to hear your feedback!
You can download the code from this link: