Page 2, second column, first paragraph states: "All features were normalized by gender using z-scores."
This is an error. All non-pitch features were in fact left as raw values. Only pitch mean and pitch max were rescaled, so that male and female values would lie in the same range. Pitch range is approximately [75, 500] for female speakers and [50, 300] for male speakers. For pitch-related variables to be comparable across genders, female pitch tracks were mapped linearly onto the male range as follows:
new_pitch_value = K * original_pitch_value + D
with K, D such that 500K + D = 300 and 75K + D = 50, which gives K = 10/17 (about 0.588) and D = 100/17 (about 5.88).
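
As a sanity check on the transformation above, the two constraints can be solved and applied in a few lines of Python. This is only an illustrative sketch, not the code used for the paper: the function names solve_linear_map and rescale_female_pitch are hypothetical, and the endpoint values are the approximate ranges quoted above.

```python
from fractions import Fraction

def solve_linear_map(src_lo, src_hi, dst_lo, dst_hi):
    """Return (K, D) with K*src_lo + D == dst_lo and K*src_hi + D == dst_hi."""
    # Subtracting the two constraints eliminates D and yields the slope K.
    k = Fraction(dst_hi - dst_lo, src_hi - src_lo)
    # Substitute K back into the lower-endpoint constraint to get D.
    d = dst_lo - k * src_lo
    return k, d

# Female range [75, 500] mapped onto male range [50, 300] (values from the text).
K, D = solve_linear_map(75, 500, 50, 300)

def rescale_female_pitch(original_pitch_value):
    """Apply new_pitch_value = K * original_pitch_value + D (hypothetical helper)."""
    return float(K) * original_pitch_value + float(D)

if __name__ == "__main__":
    print("K =", K, "D =", D)
    print(rescale_female_pitch(75), rescale_female_pitch(500))
```

Exact fractions are used for K and D so that the endpoint constraints can be verified without rounding error; the conversion to float happens only when a pitch value is rescaled.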