

This page offers specialised software that we use in our research and that may prove useful to researchers working in related areas. We work on data analytics across several software platforms, primarily Matlab, R, and Python. Click the corresponding icon for each entry to download it in the format compatible with that platform.

In all cases the provided source code is heavily annotated to facilitate further experimentation. The code is provided "as is", with no guarantee that it will work well for your needs. Each function includes references and an explanation of the key ideas.


In all cases the source code is provided under the standard GPL license and is free for academic use. Please contact us for commercial use.

Copyright © DARTH group, 2022

Please cite the relevant publications if you make use of this code. We would be grateful to receive feedback regarding bugs, suggestions to improve the code, or additional functions/options to be included in forthcoming versions of the functions/toolboxes.

This toolbox contains functions which can be used to analyze actigraphy data collected using the GENEActiv smartwatch. We process raw 3D accelerometry data to characterize activity, sleep, and circadian variability patterns. The toolbox implements many well-established approaches, as well as additional novel ones.

This function identifies the most stationary signal segment of a sustained vowel (by default 2 seconds), which can then be used to characterize the underlying phonation, e.g. using the Voice Analysis Toolbox.

This function computes the mutual information, which can be thought of as a more general alternative to correlation coefficients for quantifying the association between two random variables (vectors). Most freely available implementations of mutual information estimation rely on sub-optimal intermediate steps, such as estimating probability densities with histograms; here I provide a simple proof-of-concept approach that estimates the densities with kernel density estimation before computing the mutual information. Note that there are more sophisticated and accurate approaches for computing the mutual information, but this (one might say naïve) implementation is simple, easy to understand, and computationally fairly efficient. The mutual information has no upper bound, which makes its direct interpretation difficult; for this reason I am also providing a normalised version. The normalised mutual information ranges between 0 and 1, where 0 denotes no association between the two random variables (that is, they are independent) and 1 denotes perfect association (knowledge of one random variable allows perfect prediction of the other). The function was created with the standard data analysis goal of determining the univariate association of each feature (attribute) with the outcome (target) we aim to predict. Please include the following citation if you use it in your work:


Alternatively, if you prefer a journal paper citation:
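As a rough illustration of the kernel-density route to mutual information described above, here is a minimal Python sketch. It is not the function distributed on this page. The names `kde_mutual_information` and `normalised_mi` are hypothetical, the bandwidth is the normal-reference rule of thumb, and the mapping to the 0 to 1 range uses Linfoot's informational coefficient of correlation, which is an assumption and may differ from the normalisation used in the actual function.

```python
import math
import statistics


def rule_of_thumb_bandwidth(values):
    """Normal-reference bandwidth for a 1-D Gaussian kernel (an assumption)."""
    return 1.06 * statistics.stdev(values) * len(values) ** (-1 / 5)


def gaussian_kernel(u):
    """Standard normal density evaluated at u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde_mutual_information(x, y):
    """Resubstitution estimate of I(X;Y) in nats using Gaussian product kernels.

    Densities are evaluated at the observed sample points and the log-ratio
    log p(x,y) / (p(x) p(y)) is averaged over the sample.
    """
    n = len(x)
    hx, hy = rule_of_thumb_bandwidth(x), rule_of_thumb_bandwidth(y)
    total = 0.0
    for i in range(n):
        kx = [gaussian_kernel((x[i] - x[j]) / hx) / hx for j in range(n)]
        ky = [gaussian_kernel((y[i] - y[j]) / hy) / hy for j in range(n)]
        p_x = sum(kx) / n
        p_y = sum(ky) / n
        p_xy = sum(a * b for a, b in zip(kx, ky)) / n
        total += math.log(p_xy / (p_x * p_y))
    # The sample estimate can dip slightly below zero; clip it.
    return max(total / n, 0.0)


def normalised_mi(mi):
    """Map mutual information (nats) to [0, 1) via sqrt(1 - exp(-2 I)).

    This is Linfoot's coefficient, used here purely as an illustrative choice.
    """
    return math.sqrt(1.0 - math.exp(-2.0 * mi))
```

The double loop makes this quadratic in the number of samples, so it is only meant for small vectors; the same marginal bandwidths are reused for the joint density to keep the sketch short.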

The concept of combining multiple experts or sensors is fascinating. One approach to achieve this is to use the adaptive Kalman filter, where we use some confidence metrics (also known as signal quality indices) to account for the variable confidence in the estimates of each expert (or sensor). This variability in confidence may be due to the particular characteristics of the measured quantity, or to inherent limitations of the expert (sensor) in providing accurate estimates at the given instance (this can also be assessed with respect to the estimates of the other experts). The function provided here is for the application of estimating the fundamental frequency of sustained vowels, but the framework is generic and the user could adapt the function appropriately with confidence metrics for different applications. More details can be found in my JASA2014 paper. Please include the following citation if you use it in your work:
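The general idea of weighting each sensor by a confidence metric can be sketched with a scalar Kalman filter, shown below in Python. This is a simplified illustration and not the published algorithm: the function name `fuse_step`, the random-walk state model, the parameters `process_var` and `base_meas_var`, and the rule that the measurement noise variance equals `base_meas_var / quality` are all assumptions made for the example.

```python
def fuse_step(estimate, variance, measurements, qualities,
              process_var=1.0, base_meas_var=4.0):
    """One predict/update cycle fusing several sensors into one estimate.

    qualities are signal quality indices in (0, 1]; a lower quality inflates
    the assumed measurement noise, so that sensor pulls the estimate less.
    """
    # Predict: random-walk model, so only the uncertainty grows.
    variance = variance + process_var

    # Update: process the sensors one at a time (sequential scalar updates).
    for z, q in zip(measurements, qualities):
        if q <= 0.0:
            continue  # a sensor with zero confidence is ignored entirely
        meas_var = base_meas_var / q          # assumed confidence-to-noise rule
        gain = variance / (variance + meas_var)
        estimate = estimate + gain * (z - estimate)
        variance = (1.0 - gain) * variance
    return estimate, variance


# Example: a trusted sensor reads 100 Hz, an unreliable one reads 200 Hz.
f0, p = fuse_step(150.0, 100.0, measurements=[100.0, 200.0],
                  qualities=[1.0, 0.01])
```

In the example the fused value stays close to the trusted reading because the low quality index inflates the second sensor's noise variance by a factor of 100.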

This function maps the commonly used Parkinson’s disease symptom severity rating scale, the Unified Parkinson’s Disease Rating Scale (UPDRS), to the Parkinson’s disease symptom severity stage known as Hoehn and Yahr (H&Y). I am grateful to Dr. A. Kramer for developing the SAS code. More details can be found in my Parkinsonism and Related Disorders 2012 paper. Please include the following citation if you use it in your work:

In addition to my Matlab implementation, see also the implementation in SAS, kindly provided by Dr Andrew Kramer: SAS file.

This is the implementation of the RRCT algorithm in MATLAB and Python. The MATLAB version is faster and is the one I used in the manuscript; I would be grateful to hear from anyone who improves the computational speed of the Python implementation. In short, RRCT is a computationally efficient approach to feature selection that uses an information-theoretic transformation of the underlying statistical relationships, expressed using the first two central moments. In my work I demonstrated that RRCT is very competitive against 19 state-of-the-art filter feature selection algorithms across 12 diverse datasets. Please include the following citation if you use it in your work:

This function is a simplified, computationally efficient, and robust approach to feature selection using the minimum Redundancy Maximum Relevance (mRMR) principle. The original paper by Peng et al. (2005) selected features using the mutual information criterion, whose computation is extremely demanding if done via proper density estimation and problematic if done via crude histograms, as in the open source code provided by Peng. Instead, I opt here for the Spearman correlation coefficient, which allows for a fast and computationally inexpensive feature selection algorithm (hence I call this technique mRMRSpearman). More details can be found in my simple methodological guide for data analysis book chapter. Please include the following citation if you use it in your work:
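The mRMR principle with a Spearman criterion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the distributed mRMRSpearman function: the name `mrmr_spearman` is hypothetical, relevance is taken as the absolute Spearman correlation with the target, redundancy as the mean absolute Spearman correlation with the already selected features, and the two are combined by simple subtraction.

```python
import math


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def mrmr_spearman(features, target, k):
    """Greedily pick k feature indices; features is a list of columns."""
    relevance = [abs(spearman(column, target)) for column in features]
    selected = [max(range(len(features)), key=lambda j: relevance[j])]
    while len(selected) < min(k, len(features)):
        best, best_score = None, -math.inf
        for j in range(len(features)):
            if j in selected:
                continue
            redundancy = sum(abs(spearman(features[j], features[s]))
                             for s in selected) / len(selected)
            score = relevance[j] - redundancy  # relevance minus redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

With a target built from two nearly uncorrelated features plus a near-duplicate of the first, the sketch selects the first feature and then skips the near-duplicate in favour of the less relevant but non-redundant one.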

This function is a robust and efficient sleep spindle detector with minimal requirements: it is independent of sleep staging and processes a single EEG lead (most algorithms in the research literature require additional EEG leads, and frequently also the hypnogram). More details can be found in my Frontiers in Human Neuroscience 2015 paper. Please include the following citation if you use it in your work:

This toolbox contains functions which can be used in a variety of data analysis applications. The toolbox tackles problems in the general field of statistical machine learning, including functions for data visualisation, feature selection, regression and classification, using a wide range of available and refined methods. It is fairly basic for now, but I intend to keep updating it with additional functions for methodological concepts. Please include the following citation if you use it in your work, or look at specific functions within the toolbox for appropriate referencing and citing purposes:

This function aims to characterize time series (extracting a feature vector) using standard wavelet decomposition. It was originally proposed to analyze properties of voice signals, but the technique is generic and could be, in principle, applied to any time series. More details can be found in my Nonlinear Theory and its Applications 2010 paper. Please include the following citation if you use it in your work:
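To make the idea of turning a wavelet decomposition into a feature vector concrete, here is a small Python sketch. It is only an illustration: the Haar wavelet, the use of per-level energies as features, and the name `wavelet_energy_features` are assumptions chosen for brevity, and the function described above may use different wavelets and different summary measures.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform.

    Returns (approximation, detail) coefficients. If the input has odd
    length, the last sample is dropped.
    """
    scale = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * scale for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * scale for i in pairs]
    return approx, detail


def wavelet_energy_features(signal, levels):
    """Feature vector: detail energy at each level, then the final
    approximation energy."""
    features = []
    approx = list(signal)
    for _ in range(levels):
        if len(approx) < 2:
            break  # nothing left to decompose
        approx, detail = haar_step(approx)
        features.append(sum(d * d for d in detail))
    features.append(sum(a * a for a in approx))
    return features


features = wavelet_energy_features([1, 2, 3, 4, 5, 6, 7, 8], levels=3)
```

Because the Haar transform is orthonormal, the features of a signal whose length is a power of two sum to the total signal energy, which gives a quick sanity check.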

This toolbox presents a number of speech signal processing algorithms, aiming at the objective characterization of voice, and in particular the assessment of voice disorders. These algorithms are mainly directed at quantifying amplitude (shimmer variants), frequency (jitter variants) and increased noise (signal-to-noise measures). Note that the toolbox has been developed and has only been validated in settings with the sustained vowel /a/. The algorithmic tools herein may be generalizable to other sustained vowels, but they are definitely not appropriate for conversational speech. The toolbox was developed in a series of journal and conference studies, and the most important are highlighted below. Please include the following citations if you use this toolbox in your work:
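For readers unfamiliar with jitter and shimmer, the simplest ("local") variants can be written down in a few lines of Python. This sketch assumes the glottal cycle durations and peak amplitudes have already been extracted from the sustained vowel, which is the hard part and is not shown. The function names are hypothetical, and the toolbox itself implements many more variants than this one.

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive values, expressed as a
    percentage of the mean value."""
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def local_jitter_percent(periods):
    """Cycle-to-cycle variation of the pitch periods (in seconds)."""
    return relative_perturbation(periods)


def local_shimmer_percent(amplitudes):
    """Cycle-to-cycle variation of the peak amplitudes."""
    return relative_perturbation(amplitudes)


# Example: periods alternating between 10 ms and 11 ms.
jitter = local_jitter_percent([0.010, 0.011, 0.010, 0.011])
```

A perfectly periodic signal gives zero for both measures; larger values indicate more cycle-to-cycle instability in frequency or amplitude.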

Applying most feature selection algorithms on perturbed versions of a dataset will likely result in different feature subsets being selected. This function takes as input a matrix with the computed feature subsets across L repetitions, where each repetition holds a perturbed (e.g. bootstrapped) version of the original design matrix, and votes for the most appropriate final feature ranking. More details can be found in my TNSRE2014 paper. Please include the following citation if you use it in your work:
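One simple way to combine feature rankings across repetitions is a Borda count, sketched below in Python. Note that this is a stand-in chosen for illustration and is not the voting scheme described in the paper; the name `borda_ranking` is hypothetical. Each row of the input holds the feature indices selected in one repetition, in order of selection.

```python
from collections import defaultdict


def borda_ranking(subsets):
    """Aggregate L ranked feature subsets into a single ranking.

    A feature in position p (0-based) of a row of length K earns K - p
    points; features are returned by decreasing total score, with ties
    broken by the smaller feature index.
    """
    scores = defaultdict(float)
    for ranking in subsets:
        k = len(ranking)
        for position, feature in enumerate(ranking):
            scores[feature] += k - position
    return sorted(scores, key=lambda f: (-scores[f], f))


# Three repetitions (e.g. bootstrap samples), each ranking three features.
final = borda_ranking([[2, 0, 1], [2, 1, 0], [0, 2, 1]])
```

Here feature 2 is ranked first twice and second once, so it comes out on top of the aggregated ranking.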

Additional external links by members (and former members) of the group
  • github
  • github
  • github
  • github
  • github