This is a collection of datasets that we have used in our research and have made freely available. In many cases the datasets have also been deposited in the UCI Machine Learning Repository. We have tried to provide each dataset in several formats, including one readable by Excel, so that the data can be loaded on different software platforms. Click the corresponding icon for each dataset to download it in that format.
Please cite the relevant papers if you use these datasets in your research.
This study looked into the problem of mapping dysphonia measures (speech signal characteristics) to a standard clinical metric of Parkinson’s disease symptom severity. The dataset comprises 5875 samples and 16 features to predict a real-valued response (regression problem). It can also be used as a multi-class classification problem if the response is rounded to the nearest integer. More details can be found in the IEEE Transactions on Biomedical Engineering 2010 paper. Please include the following citation if you use it in your work:
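Returning to the multi-class option mentioned for this dataset, the rounding step can be sketched as follows. This is a minimal illustration only: the function name `to_class_labels` and the example scores are hypothetical and not taken from the dataset, and halves are assumed to round up (Python's built-in `round` rounds halves to the nearest even integer instead).

```python
import math
from collections import Counter


def to_class_labels(responses):
    """Round each real-valued response to the nearest integer.

    Assumption: halves round up (12.5 -> 13), unlike Python's built-in round().
    """
    return [math.floor(y + 0.5) for y in responses]


# Hypothetical severity scores, made up for illustration (not from the dataset).
example_scores = [7.2, 7.6, 12.5, 30.49, 30.5]

labels = to_class_labels(example_scores)

# Class frequencies are worth checking, since rounding can leave some classes sparse.
class_counts = Counter(labels)
print(labels)
print(class_counts)
```

Each distinct integer then acts as one class label, so the number of classes depends on the range of the response.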
This study looked into the problem of assessing heating load and cooling load (that is, energy efficiency) as a function of some building parameters. The dataset comprises 768 samples and 8 features to predict two real-valued responses (regression problem). It can also be used as a multi-class classification problem if the responses are rounded to the nearest integer. More details can be found in my Energy and Buildings 2012 paper. See also this Supplementary Material with additional information. Please include the following citation if you use it in your work:
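Because this dataset has two responses rather than one, a first step in most workflows is to separate the 8 features from the heating and cooling loads. The sketch below assumes, purely for illustration, that each row stores the 8 features first, followed by heating load and then cooling load; the actual column order should be checked against the downloaded file. The function name and the sample rows are hypothetical.

```python
def split_features_responses(rows, n_features=8):
    """Split rows into features and two response columns.

    Assumption: each row is laid out as
    [feature_1, ..., feature_8, heating_load, cooling_load].
    """
    features, heating, cooling = [], [], []
    for row in rows:
        if len(row) != n_features + 2:
            raise ValueError("expected %d values per row" % (n_features + 2))
        features.append(row[:n_features])
        heating.append(row[n_features])
        cooling.append(row[n_features + 1])
    return features, heating, cooling


# Two made-up rows for illustration (not taken from the dataset).
example_rows = [
    [0.9, 520.0, 290.0, 115.0, 7.0, 2, 0.0, 0, 15.5, 21.3],
    [0.7, 680.0, 320.0, 180.0, 3.5, 4, 0.1, 3, 11.2, 14.6],
]

X, y_heating, y_cooling = split_features_responses(example_rows)
print(len(X), len(X[0]), y_heating, y_cooling)
```

The two responses can then be modelled separately, each as its own regression problem.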
This study uses 309 speech signal processing algorithms to characterize 126 signals from 14 individuals collected during voice rehabilitation. The aim is to replicate the experts’ assessment denoting whether these voice signals are considered “acceptable” or “unacceptable” (binary classification problem). More details can be found in the IEEE Transactions on Neural Systems and Rehabilitation Engineering 2014 paper. Please include the following citation if you use it in your work:
The accurate estimation of the fundamental frequency (F0) is a well-known, challenging problem in the speech signal processing research community. Unfortunately, it is difficult to obtain objective ground truth values with contemporary approaches, which rely on electroglottography (EGG). Here, we used a sophisticated, state-of-the-art physiological model of voice production to construct sustained /a/ vowels for which the exact ground truth F0 values are known. We benchmarked 10 established F0 estimation algorithms and proposed a novel fusion approach to further improve F0 estimates. We would like to encourage researchers to use this database when evaluating F0 estimation algorithms, so that results in this application can be benchmarked. More details can be found in the Journal of the Acoustical Society of America 2014 paper. (Note that here I am providing 130 *.wav files, and the ground truth values are provided in an Excel spreadsheet.) Please include the following citation if you use it in your work:
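For readers who want to benchmark their own F0 estimator against the ground truth in this database, a comparison could look like the sketch below. It uses mean absolute error as one plausible metric; this is an assumption for illustration and not necessarily the metric used in the paper. The estimator names and all F0 values are made up, and loading the *.wav files and the spreadsheet is left out.

```python
def mean_absolute_error(estimates, ground_truth):
    """Average absolute difference (in Hz) between estimated and true F0 values."""
    if len(estimates) != len(ground_truth):
        raise ValueError("estimates and ground truth must have the same length")
    if not ground_truth:
        raise ValueError("at least one F0 value is required")
    total = sum(abs(e - g) for e, g in zip(estimates, ground_truth))
    return total / len(ground_truth)


# Hypothetical F0 contours in Hz, made up for illustration (not from the database).
true_f0 = [120.0, 121.0, 119.0, 120.0]
estimator_outputs = {
    "estimator_a": [121.0, 121.0, 118.0, 122.0],
    "estimator_b": [120.0, 123.0, 119.0, 126.0],
}

# Score every estimator against the same ground truth, then rank by error.
scores = {
    name: mean_absolute_error(f0, true_f0)
    for name, f0 in estimator_outputs.items()
}
ranking = sorted(scores, key=scores.get)
print(scores)
print(ranking)
```

Scoring all estimators against the same ground truth values keeps the comparison consistent across algorithms.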