“Data is the new oil” is a quote every machine learning enthusiast must have heard at least once! It is true, and this oil is one of the most essential raw materials for building a meaningful machine learning model. Aspiring data scientists often struggle to find the right data for building machine learning solutions, and this article intends to help them on this front!
This article presents the top 5 dataset repositories you can use to get your data science journey started! We have also included a bonus repository for all the NLP enthusiasts. Stop searching for data and start building your model!
We start with the most obvious and popular option: KAGGLE! Kaggle is one of the most useful resources for data science students who need vast amounts of data. Millions of people and enterprises upload data for different use cases to the portal, and most of it is available for free. For each hosted dataset, you also get plenty of reference notebooks and a community to interact with! What else can you ask for? All you have to do is register! If you would like to contribute, you can also add your own datasets to the portal. Apart from datasets, there are multiple machine learning competitions that will help your learning journey! Did you know that Kaggle is owned by Google?
UCI MACHINE LEARNING REPOSITORY
You may not know it, but the majority of the datasets data science students use come from UCI! The repository was started in 1987 by graduate students at UC Irvine and now has more than 497 datasets covering multiple branches of machine learning. You can find good multivariate and univariate time series datasets, and some of the data is already clean! All the datasets are free, and a citation is all that is requested for using datasets from this repository. The portal also lets you filter datasets by machine learning branch, type of data, domain, etc., which makes our life easier!
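Many classic UCI datasets are distributed as plain comma-separated files without a header row, so the column names have to be supplied by hand from the dataset's description page. Here is a minimal sketch of loading such a file with Python's standard library. The helper name `read_headerless_csv`, the column names, and the three sample rows (written in the layout of the Iris data file) are our own illustrative assumptions, not something the UCI portal provides.

```python
import csv
import io

# Column names are an assumption taken from the dataset description;
# the raw file itself carries no header row.
IRIS_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# A few sample rows in the Iris file layout, including a blank line,
# which such files sometimes contain.
SAMPLE_DATA = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)


def read_headerless_csv(text, columns, numeric):
    """Parse headerless CSV text into a list of dicts keyed by `columns`.

    Columns listed in `numeric` are converted to float.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
        row = dict(zip(columns, fields))
        for name in numeric:
            row[name] = float(row[name])
        rows.append(row)
    return rows


rows = read_headerless_csv(SAMPLE_DATA, IRIS_COLUMNS, numeric=IRIS_COLUMNS[:4])
print(len(rows), "rows loaded; first species:", rows[0]["species"])
```

For a real file you would pass the contents of the downloaded file instead of `SAMPLE_DATA`. If pandas is available, `pandas.read_csv(path, header=None, names=IRIS_COLUMNS)` does the same job in one line.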
WORLD BANK DATASET
Next in line is one of the best sources for civic, geographic, financial, human resources, and healthcare datasets from multiple countries! This repository is from the World Bank, whose access to resources around the world has helped it build some of the most unique and useful datasets. The datasets can be used free of charge with minimal restrictions. The portal also lets you search by country, indicator, year, etc. You can download data for a single country or in bulk!
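Besides the portal, the World Bank also exposes its indicators through a public web API, which is handy when you want the country and indicator search inside a script. Here is a minimal sketch using only the standard library. The helper names are our own, the indicator code `SP.POP.TOTL` (total population) is just an example, and the values in `SAMPLE_PAYLOAD` are placeholders that only mimic the shape of a response; they are not real figures.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, start_year, end_year, per_page=100):
    """Build the request URL for one indicator, one country and a year range."""
    query = urlencode(
        {"format": "json", "date": f"{start_year}:{end_year}", "per_page": per_page},
        safe=":",  # keep the year range readable instead of percent-encoding ':'
    )
    return f"{API_ROOT}/country/{country}/indicator/{indicator}?{query}"


def parse_indicator_response(payload):
    """Turn a decoded JSON response into (country, year, value) tuples.

    A successful response is a two-element list: paging metadata first,
    then the list of records. Records without a value are skipped.
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []
    rows = [
        (record["country"]["value"], record["date"], record["value"])
        for record in payload[1]
        if record.get("value") is not None
    ]
    return sorted(rows, key=lambda row: row[1])


def fetch_indicator(country, indicator, start_year, end_year):
    """Download and parse one indicator (needs network access)."""
    url = build_indicator_url(country, indicator, start_year, end_year)
    with urlopen(url, timeout=30) as response:
        return parse_indicator_response(json.load(response))


# Placeholder payload that mimics the response shape; values are NOT real data.
SAMPLE_PAYLOAD = [
    {"page": 1, "pages": 1, "per_page": 100, "total": 3},
    [
        {"country": {"id": "IN", "value": "India"}, "date": "2020", "value": 2.0},
        {"country": {"id": "IN", "value": "India"}, "date": "2019", "value": 1.0},
        {"country": {"id": "IN", "value": "India"}, "date": "2018", "value": None},
    ],
]

print(build_indicator_url("IN", "SP.POP.TOTL", 2015, 2020))
print(parse_indicator_response(SAMPLE_PAYLOAD))
```

With network access, `fetch_indicator("IN", "SP.POP.TOTL", 2015, 2020)` should return the same kind of tuples with real values. Keeping the URL building and the parsing separate from the download makes each piece easy to check offline.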
INDIAN GOVT OPEN DATASET
It’s a go-to repository for data science enthusiasts trying to build an India-specific use case! It is one of the largest open data repositories, created by the Government of India. It hosts datasets from over 173 government departments and has data from all states across the country! As of this writing, it has more than 8,000 data catalogs. Some of them are unique, and you can’t find this data elsewhere. You can filter data by ministry, department, state, etc. In addition, you can find blogs, articles, and events that are useful. Hackathons are conducted for students, researchers, and communities!
Are you an AWS enthusiast? Then this is for you! The Registry of Open Data on AWS hosts some very interesting datasets covering many domains, including datasets for NLP and computer vision. If you are looking to build something on AWS, you can use these datasets directly from the cloud (S3).
Hope you enjoyed this article, and as promised, here is your bonus repository:
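To give an idea of what "directly from S3" can look like, here is a minimal sketch that builds the HTTPS URLs for a bucket and parses an S3 object listing, using only the standard library. It assumes the bucket allows anonymous public reads, which is not true of every dataset in the registry, so check the dataset's registry page first. The bucket name `example-open-data-bucket`, the region, the object keys, and the helper names are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode

# XML namespace used by S3 listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def bucket_endpoint(bucket, region):
    """Virtual-hosted-style endpoint for a bucket."""
    return f"https://{bucket}.s3.{region}.amazonaws.com"


def list_objects_url(bucket, region, prefix=""):
    """URL of a ListObjectsV2 request restricted to keys under `prefix`."""
    query = urlencode({"list-type": "2", "prefix": prefix})
    return f"{bucket_endpoint(bucket, region)}/?{query}"


def object_url(bucket, region, key):
    """Download URL for a single object; '/' in the key is kept as-is."""
    return f"{bucket_endpoint(bucket, region)}/{quote(key)}"


def parse_listing(xml_text):
    """Extract (key, size_in_bytes) pairs from a listing response body."""
    root = ET.fromstring(xml_text)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


# Hand-written listing in the response layout; bucket and keys are hypothetical.
SAMPLE_LISTING = """
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-open-data-bucket</Name>
  <Prefix>images/</Prefix>
  <Contents><Key>images/cat 1.jpg</Key><Size>2048</Size></Contents>
  <Contents><Key>images/dog.jpg</Key><Size>4096</Size></Contents>
</ListBucketResult>
"""

print(list_objects_url("example-open-data-bucket", "us-east-1", "images/"))
for key, size in parse_listing(SAMPLE_LISTING):
    print(size, object_url("example-open-data-bucket", "us-east-1", key))
```

In practice most people would reach for the AWS CLI or an SDK such as boto3 rather than raw URLs; the sketch only shows that a public bucket is ultimately a set of addressable files you can list and download.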
Your Bonus – A dream NLP repository:
Are you an NLP enthusiast? Then this is for you! It is a one-stop repository for anyone looking for NLP datasets. The portal hosts a range of datasets for multiple NLP applications. That is not all: you can also find very good reference notebooks and papers to start with.
I know we promised you one bonus repository, but we couldn’t resist sharing one more! If you are looking for a facial images dataset, the following is your go-to place:
I hope you enjoyed this article. We have shared 5+2 awesome dataset repositories that will help you get your data science journey started! You will find very interesting datasets in these repositories that will satisfy learners and experts alike. What are you waiting for? Get started with your model!
Stay tuned to Digital Tesseract for such interesting articles!
Other Useful Repositories:
- Common Datasets: https://lionbridge.ai/datasets/
- Computer Vision Datasets: https://www.visualdata.io/discovery
- Microsoft Open Data: https://msropendata.com/
- Common Datasets: https://github.com/awesomedata/awesome-public-datasets
- Google Dataset Search: https://datasetsearch.research.google.com/