There are countless sources from which one might obtain data. All of our devices, and anything connected to those devices, are sources of data. When you are employed as a data scientist, your employer may specify the data sources you are to use, but oftentimes you will have leeway to choose your own. Here, I’ll cover some places you may wish to search for data should the need or curiosity arise.
It’s hard to believe that Barack Obama appointed the first Chief Information Officer of the United States only 11 years ago, when the newly created post was given to Vivek Kundra. One of Kundra’s first acts was to create data.gov to provide access to high-value, machine-readable data sets. It’s grown over the years to include over two hundred thousand data sets from federal, state, and local governments. Data sets are often available in multiple formats: CSV, JSON, HTML, and Excel, to name a few. It’s definitely the first place to look for public data, ranging from the census to COVID to local crime data.
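Since the same data set is often published in more than one format, it can help to see that CSV and JSON end up as the same rows once parsed. Here is a minimal sketch using only Python's standard library. The `load_records` helper and the sample rows are my own hypothetical illustration, not anything provided by data.gov, and the JSON branch assumes the file is a simple array of flat objects.

```python
import csv
import io
import json


def load_records(text, fmt):
    """Parse raw CSV or JSON text into a list of dictionaries, one per row."""
    if fmt == "csv":
        # DictReader treats the first line as column names; every value is a string.
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    if fmt == "json":
        # Assumption: the JSON payload is an array of flat objects.
        return json.loads(text)
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical sample rows, not taken from any real data set.
csv_text = "city,year,incidents\nSpringfield,2019,42\nShelbyville,2019,17\n"
json_text = (
    '[{"city": "Springfield", "year": "2019", "incidents": "42"},'
    ' {"city": "Shelbyville", "year": "2019", "incidents": "17"}]'
)

csv_rows = load_records(csv_text, "csv")
json_rows = load_records(json_text, "json")

# Both formats describe the same two rows.
print(csv_rows == json_rows)
```

In practice you would read the text from a downloaded file rather than a string literal, and you would probably convert numeric columns from strings to numbers before doing any analysis.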
Next, I’ll discuss the Registry of Open Data on Amazon Web Services (AWS). There aren’t that many open sets here, but the sets hosted openly on AWS can be really interesting and of high quality. Some data sets in the collection at present include the Cancer Genome Atlas; Landsat 8, a collection of satellite imagery of all land on Earth; and Common Crawl, a collection of web crawl data for more than 25 billion web pages.
The University of California at Irvine’s Machine Learning Repository is also an excellent place to find high-quality data sets. You won’t find the most current and newsworthy data here, as many of the sets are quite old. However, the sets are organized for students of data science, so you can search for them in a number of different ways:
- Machine Learning Task: sets are organized by classification, regression, clustering, or other, so you can find sets that work well with the type of modeling you want to implement
- Data Type: Univariate, Multivariate, Sequential, and Time Series are a few of the kinds of data for which you will find sets here
- Attribute Type: Categorical, Numerical, or Mixed
- Subject Area: data sets related to a number of science and business subject domains
Finally, there is Kaggle. If you haven’t discovered Kaggle, discover it now. Kaggle is a community of developers and data scientists, owned by Google, that aims to foster the development of machine learning practitioners. It hosts competitions aimed at solving real-world machine learning problems and offers prize money for winning solutions. A by-product of this is that Kaggle has become a publisher of data sets and a great place to find data on a variety of subjects. It also hosts courses in Python, machine learning, SQL, and other important data science tools, so it’s a one-stop shop for your development down the path of data science.
There will be times when a satisfactory data set isn’t available, and in those cases there are a number of different ways you can generate data. In my next blog post, I’ll review a number of common methods you can use to create data sets that fit your needs. Stay tuned!