SOFTWARE

This page offers specialised software we use for our research, which may prove useful to researchers working on related areas. We use different software platforms to work on data analytics, primarily Matlab, R, and Python. Simply click on the corresponding icon of each dataset to download in that platform-compatible format.

In all cases the provided source code is heavily annotated, facilitating further experimentation. The code is provided "as is" with no guarantees that will work well for your needs. Each function includes references and explanation of the key ideas.

In all cases the source code is provided under the standard GPL license and is free for academic use. Please contact us for commercial use.

Please cite the relevant publications if you make use of this code. We would be grateful to receive feedback regarding bugs, suggestions to improve the code, or additional functions/options to be included in forthcoming versions of the functions/toolboxes.

Actigraphy Analysis Toolbox (version 1.0)

This toolbox contains functions which can be used to analyze actigraphy data collected using the Geneactiv smartwatch. We process raw 3D accelerometry data to characterize activity, sleep, and circadian variability patterns. The toolbox contains many well-known established approaches, and also additional novel patterns.

(manuscript under review, the Matlab source code will become available upon publication)

Automatic Stationary Speech Signal Detection in Sustained Vowels

This function identifies the most stationary signal segment of a sustained vowel (by default 2 seconds), which can then be used to characterize the underlying phonation, e.g. using the Voice Analysis Toolbox.

(manuscript under review, the Matlab source code will become available upon publication)

Estimation of mutual information (vanilla kde-based approach)

This function computes the mutual information, which can be thought of as a more general method compared to correlation coefficients in order to quantify the association between two random variables (vectors). Most freely available implementations of mutual information estimation rely on prior sub-optimal intermediate steps such as estimating probability densities using histogram techniques; here I provide a simple proof of concept approach estimating densities relying on kernel density estimation before computing the mutual information. Note there are more sophisticated and accurate approaches for computing the mutual information, but this (one might say naïve) implementation is simple, easy to understand, and computationally fairly efficient. The mutual information is not upper bounded which makes its direct interpretation difficult; for this reason I am also providing a normalised version. The normalised mutual information ranges between 0 and 1, where 0 denotes no association between the two random variables (that is, they are independent) and 1 denotes perfect association (knowledge of one random variable allows perfect prediction of the other). The function has been created with the standard goal in data analysis of determining the univariate association of each feature (attribute) with the outcome (target) we aim to predict. Please include the following citation if you use it in your work:

A. Tsanas: Accurate telemonitoring of Parkinson’s disease symptom severity using nonlinear speech signal processing and statistical machine learning, D.Phil. thesis, Oxford Centre for Industrial and Applied Mathematics, University of Oxford, UK, 2012

Alternatively, if you prefer a journal paper citation:

A. Tsanas, M.A. Little, P.E. McSharry, L.O. Ramig: "Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson’s disease symptom severity", Journal of the Royal Society Interface, Vol. 8, pp. 842-855, June 2011

Information fusion using adaptive Kalman filtering

The concept of combining multiple experts or sensors is fascinating. One approach to achieve this is to use the adaptive Kalman filter, where we use some confidence metrics (also known as signal quality indices) to account for the variable confidence in the estimates of each expert (or sensor). This variability in confidence may be due to the particular characteristics of the measures quantity, or inherent limitations of the expert (sensor) at accurately providing estimates at the given instance (this can also be assessed with respect to the estimates of the other experts). The function provided here is for the application of estimating the fundamental frequency of sustained vowels, but the framework is generic and the user could adapt appropriately the function with the confidence metrics for different applications. More details can be found in my JASA2014 paper. Please include the following citation if you use it in your work:

A. Tsanas, M. Zañartu, M.A. Little, C. Fox, L.O. Ramig, G.D. Clifford: “Robust fundamental frequency estimation in sustained vowels: detailed algorithmic comparisons and information fusion with adaptive Kalman filtering”, Journal of the Acoustical Society of America, Vol. 135, pp. 2885-2901, 2014

Mapping UPDRS to Hoehn & Yahr

This function serves to map the commonly used Parkinson’s disease symptom severity rating scale UPDRS to a Parkinson’s disease symptom severity stage called H&Y. I am grateful to Dr. A. Kramer for developing the SAS code. More details can be found in my Parkinsonism and Related Disorders 2012 paper. Please include the following citation if you use it in your work:

A. Tsanas, M.A. Little, P.E. McSharry, B.K. Scanlon, S. Papapetropoulos: "Statistical analysis and mapping of the Unified Parkinson’s Disease Rating Scale to Hoehn and Yahr staging", Parkinsonism and Related Disorders, Vol. 18 (5), pp. 697-699, 2012

In addition to my Matlab implementation, see also the implementation in SAS, kindly provided by Dr Andrew Kramer: SAS file.

Relevance, redundancy and complementarity trade-off (RRCT):
a new robust feature selection algorithm

This is the implementation of the RRCT algorithm in MATLAB and Python (the MATLAB version is faster and is what I used in the manuscript; I would be grateful if anyone who improves on the computational speed of the Python implementation contacts me). In short, RRCT is a computationally efficient approach towards tackling feature selection using an information theoretic transformation of the underlying statistical relationships expressed using the two first central order moments. In my work I demonstrated that RRCT is very competitive against 19 state-of-art filter feature selection algorithms across 12 diverse datasets. Please include the following citation if you use it in your work:

A. Tsanas: "Relevance, redundancy and complementarity trade-off (RRCT): a generic, efficient, robust feature selection tool", Patterns, (in press), 2022

Simple correlation-based feature selection: mRMR-Spearman

This function is a simplified, computationally efficient, robust approach for feature selection using the minimum Redundancy Maximum Relevance (mRMR) principle. The original paper by Peng et al. (2005) used the mutual informationcriterion (its computation is extremely demanding if done via proper density estimation, and problematic if done via crude histograms as in the open source code provided by Peng) to select features. Instead, here I opt for the Spearman correlation coefficient criterion which allows for a fast and computationally inexpensive feature selection algorithm (hence I call this technique mRMRSpearman). More details can be found in my simple methodological guide for data analysis book chapter. Please include the following citation if you use it in your work:

A. Tsanas, M.A. Little, P.E. McSharry: "A methodology for the analysis of medical data", Chapter 7 in Handbook of Systems and Complexity in Health, pp. 113-125, Eds. J.P. Sturmberg, and C.M. Martin, Springer, 2013

Stage-independent, single-lead EEG sleep spindle detector

A robust, efficient, and minimal requirement sleep staging-independent spindle detector by processing a single EEG (most algorithms in the research literature require the use of additional EEG leads, and frequently also the hypnogram). More details can be found in my Frontiers in Human Neuroscience 2015 paper. Please include the following citation if you use it in your work:

A. Tsanas, G.D. Clifford: “Stage-independent, single lead EEG sleep spindle detection using the continuous wavelet transform and local weighted smoothing”, Frontiers in Human Neuroscience 9:181, 2015

Statistical Machine Learning Toolbox (version 1.0)

This toolbox contains functions which can be used in a variety of data analysis applications. The toolbox tackles problems in the general field of statistical machine learning, including functions for data visualisation, feature selection, regression and classification, using a wide range of available and refined methods. It is fairly basic for now, but I intend to keep updating it with additional functions for methodological concepts. Please include the following citation if you use it in your work, or look at specific functions within the toolbox for appropriate referencing and citing purposes:

A. Tsanas: Accurate telemonitoring of Parkinson’s disease symptom severity using nonlinear speech signal processing and statistical machine learning, D.Phil. thesis, Oxford Centre for Industrial and Applied Mathematics, University of Oxford, UK, 2012

Time-series analysis: features using wavelet decomposition

This function aims to characterize time series (extracting a feature vector) using standard wavelet decomposition. It was originally proposed to analyze properties of voice signals, but the technique is generic and could be, in principle, applied to any time series. More details can be found in my Nonlinear Theory and its Applications 2010 paper. Please include the following citation if you use it in your work:

A. Tsanas, M.A. Little, P.E. McSharry, L.O. Ramig: “New nonlinear markers and insights into speech signal degradation for effective tracking of Parkinson’s disease symptom severity”, International Symposium on Nonlinear Theory and its Applications (NOLTA), pp. 457-460, Krakow, Poland, 5-8 September 2010 (won the student paper award)

Voice Analysis Toolbox (version 1.0)

This toolbox presents a number of speech signal processing algorithms, aiming at the objective characterization of voice, and in particular the assessment of voice disorders. These algorithms are mainly directed at quantifying amplitude (shimmer variants), frequency (jitter variants) and increased noise (signal-to-noise measures). Note that the toolbox has been developed and has only been validated in settings with the sustained vowel /a/. The algorithmic tools herein may be generalizable to other sustained vowels, but they are definitely not appropriate for conversational speech. The toolbox was developed in a series of journal and conference studies, and the most important are highlighted below. Please include the following citations if you use this toolbox in your work:

Voting mechanism for feature selection

Applying most feature selection algorithms on perturbed versions of a dataset will likely result in different feature subsets being selected. This function takes as input a matrix with the computed feature subsets across L repetitions, where each repetition holds a perturbed (e.g. bootstrapped) version of the original design matrix, and votes for the most appropriate final feature ranking. More details can be found in my TNSRE2014 paper. Please include the following citation if you use it in your work: