List of Machine Learning Datasets from Different Domains

Author

Shruti

Created

July 20, 2024

Updated

August 11, 2024

Here is a list of open-source datasets from different domains.

1. Datasets for General Machine Learning:

a. Classification:

Iris Species: This dataset is the best known database to be found in the pattern recognition literature. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
Mushroom Classification: This data set includes descriptions of hypothetical samples corresponding to 23 species of mushrooms. Each species is identified as definitely edible or poisonous.
Titanic: Machine Learning from Disaster: This is the legendary Titanic ML competition, the best first challenge for diving into ML competitions. The objective is to predict which passengers survived the sinking of the Titanic.
+ many more: List of machine learning classification datasets from UCI repository.

b. Regression:

Student Performance Dataset: This dataset covers student achievement in secondary education at two schools.
Bike Sharing dataset: Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic.
House Prices: Advanced Regression Techniques: This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set.
+ many more: List of machine learning regression datasets from UCI repository.

c. Clustering:

Mall Customers Clustering Analysis: This dataset was created for learning customer segmentation concepts, also known as market basket analysis.
Credit Card Customer Data: Customer Credit Card Information Dataset which can be used for Identifying Loyal Customers, Customer Segmentation, Targeted Marketing and other such use cases in the Marketing Industry.
Customer Personality Analysis: Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.
+many more:

ImageNet: ImageNet is an image database organized according to the WordNet hierarchy, with many images in each category. It is one of the most widely used datasets for image classification.
CIFAR-10: The CIFAR-10 dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
Google’s Open Images: The images are very diverse and often contain complex scenes with several objects. The dataset contains image-level label annotations, object bounding boxes, object segmentations, visual relationships, localized narratives, and more.
Fashion-MNIST: Fashion-MNIST is a dataset of Zalando’s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples, each associated with a label from 10 classes.
MIT indoor Image dataset: This database contains 67 Indoor categories, and a total of 15620 images. The number of images varies across categories, but there are at least 100 images per category.
Microsoft COCO 2017 Dataset: COCO is a large-scale object detection, segmentation, and captioning dataset of many object types.
+ many more:
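
To make the CIFAR-10 numbers above concrete, here is a minimal sketch that checks the quoted counts and splits one record into colour planes. It assumes the binary distribution of CIFAR-10, where each record is 1 label byte followed by 3,072 pixel bytes (1,024 red, then 1,024 green, then 1,024 blue); check the dataset page before relying on this layout. The `parse_record` helper and the synthetic record are illustrative only and not part of any official loader.

```python
# Assumed layout of one CIFAR-10 binary record: 1 label byte + 32*32*3 pixel bytes.
PLANE_BYTES = 32 * 32
RECORD_BYTES = 1 + 3 * PLANE_BYTES


def parse_record(record: bytes):
    """Split one record into (label, red, green, blue) byte planes."""
    if len(record) != RECORD_BYTES:
        raise ValueError(f"expected {RECORD_BYTES} bytes, got {len(record)}")
    label = record[0]
    pixels = record[1:]
    red = pixels[:PLANE_BYTES]
    green = pixels[PLANE_BYTES:2 * PLANE_BYTES]
    blue = pixels[2 * PLANE_BYTES:]
    return label, red, green, blue


# Synthetic stand-in for a real record: label 3, full red, no green, half blue.
fake_record = (
    bytes([3])
    + bytes([255]) * PLANE_BYTES
    + bytes(PLANE_BYTES)
    + bytes([128]) * PLANE_BYTES
)
label, red, green, blue = parse_record(fake_record)
print(label, red[0], green[0], blue[0])  # 3 255 0 128

# Sanity check on the counts quoted in the description.
NUM_CLASSES, IMAGES_PER_CLASS = 10, 6000
TRAIN_IMAGES, TEST_IMAGES = 50000, 10000
assert NUM_CLASSES * IMAGES_PER_CLASS == TRAIN_IMAGES + TEST_IMAGES == 60000
```

In practice most people load CIFAR-10 through a deep learning framework's dataset utilities; the sketch only shows what the raw numbers imply.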

Yelp review: The Yelp dataset is a subset of Yelp’s businesses, reviews, and user data for personal, educational, and academic use.
Amazon Reviews: This dataset consists of reviews from Amazon. It includes ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review.
SMS Spam collection Dataset: The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.
TensorFlow 2.0 Question Answering: The goal is to predict short and long answer responses to real questions about Wikipedia articles. The dataset is provided by Google’s Natural Questions, but contains its own unique private test set.
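
As a quick illustration of working with the SMS Spam Collection, the sketch below counts ham and spam labels. It assumes the commonly distributed file layout of one message per line in the form `label<TAB>message`; the `parse_sms_lines` helper and the three sample messages are made up for the example and do not come from the dataset.

```python
from collections import Counter


def parse_sms_lines(lines):
    """Yield (label, text) pairs from 'label<TAB>message' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, _, text = line.partition("\t")
        yield label, text


# Hypothetical sample lines standing in for the real file contents.
sample_lines = [
    "ham\tSee you at lunch tomorrow?\n",
    "spam\tYou have won a prize! Reply now to claim.\n",
    "ham\tRunning late, start without me.\n",
]

messages = list(parse_sms_lines(sample_lines))
counts = Counter(label for label, _ in messages)
print(counts["ham"], counts["spam"])  # 2 1
```

With the real file, you would pass an open file handle instead of `sample_lines`; checking the class balance first is useful because spam datasets are typically skewed toward ham.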

Goodreads Book dataset: This dataset contains ratings for ten thousand popular books. There are 100 reviews for each book, and ratings range from one to five.
MovieLens 20M Dataset: The datasets describe ratings and free-text tagging activities from MovieLens, a movie recommendation service.
Influencers in Social Networks: Predict which people are influential in a social network.
Netflix Movies and TV Shows: This dataset lists over 8,000 movies and TV shows available on Netflix as of mid-2021.

National Stock Exchange data set: The data is the price history and trading volumes of the fifty stocks in the NIFTY 50 index. All datasets are at a day level, with pricing and trading values split across .csv files.
Climate Change Earth Surface Temperature Data: This dataset has global temperatures since 1750.
+many more:

IMDB dataset: This is a dataset for binary sentiment classification containing a set of 25,000 highly polar movie reviews for training, and 25,000 reviews for testing.
Twitter US airline Sentiment: A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as “late flight” or “rude service”).
Multi Domain Sentiment dataset: The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains). Some domains (books and DVDs) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars).
Amazon Product data: This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
Stanford Sentiment dataset: This dataset has a collection of movie reviews.

COVID-19 Open Research Dataset: It contains various datasets related to COVID-19.
Blood Cell Images: The diagnosis of blood-based diseases often involves identifying and characterizing patient blood samples. Automated methods to detect and classify blood cell subtypes have important medical applications.

2000 HUB5 English: This dataset consists of transcripts of 40 English telephone conversations.
LibriSpeech: The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobooks that are a part of the LibriVox project.
Spoken Wikipedia Corpora: This dataset is a corpus of aligned Spoken Wikipedia articles from the English, German, and Dutch Wikipedia.
Free Spoken Digit Data set: Free Spoken Digit Dataset (FSDD) is a simple audio/speech dataset consisting of recordings of spoken digits in wav files at 8kHz.
TIMIT: The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems.
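
Since the Free Spoken Digit Dataset ships WAV files at 8 kHz, a first step with any of its recordings is usually to confirm the sample rate and duration. The sketch below does this with Python's standard `wave` module. It builds a synthetic mono 16-bit tone in memory as a stand-in for a real recording; `make_tone` and `describe_wav` are hypothetical helper names, and the bit depth and channel count are assumptions rather than facts about the dataset.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000  # 8 kHz, as stated in the dataset description


def make_tone(seconds=0.5, freq=440.0):
    """Build an in-memory mono 16-bit WAV as a stand-in for a real recording."""
    n_frames = int(SAMPLE_RATE * seconds)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)))
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        writer.setframerate(SAMPLE_RATE)
        writer.writeframes(frames)
    buf.seek(0)
    return buf


def describe_wav(fileobj):
    """Return (sample_rate_hz, duration_seconds) for a WAV file or file object."""
    with wave.open(fileobj, "rb") as reader:
        rate = reader.getframerate()
        return rate, reader.getnframes() / rate


rate, duration = describe_wav(make_tone())
print(rate, duration)  # 8000 0.5
```

For a real recording, `describe_wav` can be given a file path instead of the in-memory buffer.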

Are you looking for any other specific dataset? Here are a few links where you can search:

Happy Learning !!